Genome-wide covariation analysis sheds light on the evolution of SARS-CoV-2
Researchers in the United States have computed genome-wide covariation within severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) to investigate interactions that could be of considerable importance in the prevention, diagnosis, and treatment of coronavirus disease 2019 (COVID-19).
When the researchers considered the level of variability within both the full genome and different virus clades, they found nucleotide variability differed between encoding regions of the full genome and between different clades.
Evan Cresswell-Clay and Vipul Periwal from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) in Bethesda, Maryland, say future extensions of this analysis will provide several avenues of investigation.
As the database of SARS-CoV-2 genomes grows, increasing variability will yield further insights into important interactions within the genome.
Furthermore, the availability of this data over time will allow for chronological compartmentalization of genome datasets that could be used to study the temporal evolution of the virus.
The analysis can also be applied to other diseases for which more data is becoming available, says the team.
A pre-print version of the research paper is available on the bioRxiv* server, while the article undergoes peer review.
More about the SARS-CoV-2 genome
The genome of the SARS-CoV-2 virus, the agent responsible for the COVID-19 pandemic, was first characterized in December 2019.
The genome is around 30 kilobases in length and contains several open reading frames (ORFs), including ORF1ab, ORF3a, ORF6, ORF7a, ORF7b, ORF8, and ORF10. These ORFs encode non-structural proteins, while specific genomic regions encode four structural proteins, of which the spike protein is the largest.
The spike protein is the surface structure SARS-CoV-2 uses to bind to and infect host cells. The other three structural proteins are the envelope (E) and membrane (M) proteins, which form the viral envelope, and the nucleocapsid (N) protein, which is involved in viral assembly.
The EpiCoV database
The design of vaccines and therapies depends on the structure and mutational stability of proteins encoded in the ORFs of the genome.
While the reference genome is used for most studies, a growing body of available data can be used to monitor variations in the genome and analyze the virus’s evolution.
This data, assembled by GISAID (the Global Initiative on Sharing All Influenza Data), has enabled different SARS-CoV-2 strains to be documented in a new database called EpiCoV.
Since the first viral strain was entered on 10th January 2020, the database has grown to include 292,000 submissions.
Now, Cresswell-Clay and Periwal have used 137,636 of these documented strains to analyze the evolution of SARS-CoV-2.
“The variation of the virus’s genetic structure is of considerable medical and biological importance for prevention, diagnosis, and therapy,” writes the team.
Comparative RNA sequence analysis has long been used to study co-evolution via covariance of nucleotide mutations. However, separating the indirect and direct interactions that lead to such covariation has been challenging, say the researchers.
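To make the idea of covariation concrete, the sketch below scores how strongly two positions in a set of aligned sequences vary together, using mutual information. This is a common covariation measure chosen here for illustration; it is not necessarily the statistic used in the preprint, and the four-sequence alignment is invented. Note that a score like this cannot tell direct from indirect coupling, which is exactly the difficulty the researchers describe.

```python
from collections import Counter
from math import log2


def column(alignment, i):
    """Return the nucleotides found at position i across all sequences."""
    return [seq[i] for seq in alignment]


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    count_i = Counter(column(alignment, i))
    count_j = Counter(column(alignment, j))
    count_ij = Counter(zip(column(alignment, i), column(alignment, j)))
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Invented toy alignment: positions 0 and 1 always change together,
# while position 2 varies independently of position 0.
toy_alignment = ["ACG", "ACT", "GTG", "GTT"]

mi_linked = mutual_information(toy_alignment, 0, 1)
mi_unlinked = mutual_information(toy_alignment, 0, 2)
print(mi_linked, mi_unlinked)
```

In this toy case the linked pair scores one full bit and the unlinked pair scores zero. Real alignments are far noisier, and finite-sample effects inflate scores for rarely varying positions.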
What did the current study involve?
The team used an optimization method called Expectation Reflection together with Direct Coupling Analysis to compute the genome-wide covariation within SARS-CoV-2 and infer direct interactions within the viral genome.
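The distinction between direct and indirect interactions can be shown with three sites. The sketch below is not Expectation Reflection or Direct Coupling Analysis; it uses partial correlation, a much simpler statistical tool that captures the same intuition. The counts are invented so that a hypothetical site A influences site B, and B influences site C, with no direct A-C link.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def partial_correlation(x, y, z):
    """Correlation between x and y after controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Invented counts of (A, B, C) mutation states (1 = mutated, 0 = reference).
# B matches A in 90% of genomes and C matches B in 90% of genomes.
counts = {
    (0, 0, 0): 81, (0, 0, 1): 9, (0, 1, 1): 9, (0, 1, 0): 1,
    (1, 1, 1): 81, (1, 1, 0): 9, (1, 0, 0): 9, (1, 0, 1): 1,
}
rows = [state for state, k in counts.items() for _ in range(k)]
site_a = [r[0] for r in rows]
site_b = [r[1] for r in rows]
site_c = [r[2] for r in rows]

raw_ac = pearson(site_a, site_c)                         # looks like a strong link
direct_ac = partial_correlation(site_a, site_c, site_b)  # vanishes once B is accounted for
print(round(raw_ac, 2), round(direct_ac, 2))
```

Sites A and C appear strongly correlated, yet the association disappears once B is taken into account, marking it as indirect. Genome-wide methods face the same problem across thousands of positions at once, which is why dedicated inference techniques are needed.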
“These interactions may also provide information on protein-protein interaction,” writes the team. “Additionally, this analysis could be useful in vaccine development, aiding in efforts to mitigate ‘escape pathways’ for the virus to use in future strains.”
The team identified genome interactions both within individual encoding regions and between different encoding regions throughout the genome.
The ORF1ab and Spike regions showed the most significant variability within the dataset.
Genome-wide interaction maps also captured the clade-determining positions of all available clades, while interaction maps of individual clades revealed clade-specific co-evolution of nucleotide positions.
Nucleotide variability differed both between encoding regions of the full genome and between clades. Region-specific incidence of variation was also inconsistent between clades: the same region could show different levels of variability in different clades.
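One way to picture this kind of comparison is to compute a per-position variability score and average it over each region, separately for each clade. The sketch below uses Shannon entropy as the variability measure, which is an assumption made for illustration rather than the preprint's stated metric. The clade names, region names, coordinates, and sequences are all hypothetical.

```python
from collections import Counter
from math import log2


def site_entropy(nucleotides):
    """Shannon entropy (bits) of the nucleotides observed at one position."""
    n = len(nucleotides)
    return sum(-(c / n) * log2(c / n) for c in Counter(nucleotides).values())


def mean_entropy_by_region(sequences, regions):
    """Average per-site entropy for each (start, end) region, end exclusive."""
    result = {}
    for name, (start, end) in regions.items():
        entropies = [
            site_entropy([seq[i] for seq in sequences]) for i in range(start, end)
        ]
        result[name] = sum(entropies) / len(entropies)
    return result


# Hypothetical regions and clades; not real genomic coordinates or data.
regions = {"region_1": (0, 3), "region_2": (3, 6)}
clades = {
    "clade_X": ["AAACCC", "AATCCC", "AAACCC", "AATCCC"],
    "clade_Y": ["AAACCC", "AAACCG", "AAACGC", "AAACGG"],
}

by_clade = {
    clade: mean_entropy_by_region(seqs, regions) for clade, seqs in clades.items()
}
for clade, scores in by_clade.items():
    print(clade, scores)
```

In this toy data, clade_X varies only in region_1 and clade_Y only in region_2, mirroring the finding that the most variable regions need not be the same from one clade to the next.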
The analysis could help future research
Cresswell-Clay and Periwal say future extensions of this analysis could provide several research opportunities.
“First, as the database of SARS-CoV-2 genomes grows, the incidence and overall variability will increase, yielding further insights into genome interactions,” writes the team.
The increased availability of data over time will enable chronological compartmentalization of genome datasets and comparison of interaction maps across the temporal evolution of the virus.
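Chronological compartmentalization could be as simple as grouping genomes by collection month before running the analysis on each group. The sketch below assumes each record is a (collection date, sequence) pair; that record layout, the dates, and the sequences are invented for illustration and do not reflect the actual database format.

```python
from collections import defaultdict
from datetime import date


def bin_by_month(records):
    """Group sequences into (year, month) bins, returned in chronological order."""
    bins = defaultdict(list)
    for collected, sequence in records:
        bins[(collected.year, collected.month)].append(sequence)
    return dict(sorted(bins.items()))


# Invented records: (collection date, aligned sequence).
records = [
    (date(2020, 3, 14), "ACGT"),
    (date(2020, 1, 10), "ACGA"),
    (date(2020, 3, 2), "ACTT"),
]

monthly = bin_by_month(records)
for period, sequences in monthly.items():
    print(period, len(sequences))
```

Each bin could then be passed through the same covariation analysis, so that interaction maps from successive periods can be compared.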
“Second, this analysis can also be applied to diseases for which there is more data available, as the importance of genome interactions is not SARS-CoV-2 specific,” say the researchers.
*Important Notice
bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.