LinearTurboFold: Linear time global prediction for the conserved structure of RNA homologues of SARS-CoV-2 | NASA

2021-12-13 18:12:56 By : Ms. Amily Sun

View all hidden authors and organizations

Edited by Michael Waterman, Department of Biological Sciences, University of Southern California, Los Angeles, California; received September 13, 2021; accepted November 5, 2021

The conserved RNA structure is essential for the design of diagnostic and therapeutic tools for many diseases, including COVID-19. However, the existing algorithms are too slow to simulate the global structure of the full-length RNA virus genome. We proposed LinearTurboFold, which is a linear time algorithm that is several orders of magnitude faster. As far as we know, this is the first to fold and align the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) at the same time. The whole genome method of the body, the longest known RNA virus (~30 kb). Our work enables unprecedented global structural analysis and captures remote interactions that cannot be achieved by existing algorithms but are critical to RNA function. LinearTurboFold is a universal technology for full-length genome research that can help fight current and future epidemics.

The continuous emergence of COVID-19 variants has reduced the effectiveness of existing vaccines and test kits. Therefore, it is important to identify the conserved structure in the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome as a potential target for anti-mutation diagnosis and treatment. However, the algorithms that predict these conserved structures that fold and align multiple RNA homologues at the same time scale at most with the cube of the sequence length, so it is not feasible for coronaviruses, which have the longest genome among RNA viruses (~ 30,000 nt). Therefore, existing SARS-CoV-2 structural modeling work resorts to single-sequence folding and local folding methods with short window sizes, which inevitably ignore long-range interactions that are critical to RNA function. Here, we introduce LinearTurboFold, which is an effective algorithm for folding RNA homologues. The algorithm is linearly proportional to the sequence length and can perform unprecedented global structural analysis of SARS-CoV-2. Surprisingly, on a set of SARS-CoV-2 and SARS-related genomes, LinearTurboFold's pure computer prediction is not only close to the experimental guidance model of the local structure, but also goes far beyond the end-to-end pairing between captures. they. The 5'and 3'untranslated regions (UTR) (approximately 29,800 nt apart) are perfectly matched to pure experimental work. In addition, LinearTurboFold recognizes undiscovered conserved structures and conserved accessible regions as small molecule drugs, antisense oligonucleotides, small interfering RNA (siRNA), CRISPR-Cas13 guide RNA and RT that are designed to be highly efficient and insensitive to mutations. -Potential targets for PCR primers. LinearTurboFold is a universal technology that can also be applied to other RNA viruses and full-length genome research, and will become a useful tool against current and future epidemics.

RNA plays an important role in many cellular processes (1, 2). In order to maintain their functions, the secondary structure of RNA homologues is conserved during evolution (3⇓ –5). These conservative structures provide key goals for diagnosis and treatment. Therefore, it is necessary to develop fast and accurate calculation methods to identify structurally conserved regions.

Generally, conservative structures involve compensatory base pair changes, in which two positions in the primary sequence are mutated during evolution, but one base pair is still retained; for example, AU or CG pairs replace GC pairs in homologous sequences . These compensatory changes provide strong evidence for an evolutionarily conserved structure (6⇓ ⇓ ⇓ –10). At the same time, when the structure is unknown, they can make sequence alignment more difficult. Initially, the process of determining conserved structures, called comparative sequence analysis, was manual and required a lot of insight to identify conserved structures. A notable early achievement was the determination of the conserved transfer RNA (tRNA) secondary structure (11). Compared with the crystal structure of ribosomal RNA, comparative analysis has also been proven to have an accuracy of 97%, where the model has been carefully improved over time (12).

In order to automate comparative analysis, three different algorithmic methods have been developed (13, 14). The first, the "joint folding and alignment" method, seeks to predict the structure and structural alignment of two or more sequences at the same time. This was first proposed by Sankoff (15) using a dynamic programming algorithm. The main limitation of this method is that the algorithm runs for k sequences with an average sequence length of n in O(n3k). Some software packages provide implementations of Sankoff algorithms (16⇓ ⇓ ⇓ ⇓ –21), which use simplification to reduce running time. The second "align and fold" method is to align the input sequence and predict the conserved structure that can be identified across the sequence in the alignment. This was described by Waterman (22) and subsequently improved and popularized by RNAalifold (23). The third "fold and then align" method is to predict the reasonable structure of the sequence, and then align the structure to determine the sequence alignment and the best conserved structure. Waterman (24) described this and implemented it in RNAforester (25) and MARNA (26) (SI appendix, Figure S1).

As an alternative, TurboFold II (27) is an extension of TurboFold (28) and provides a more computationally efficient method to align and fold sequences. Taking multiple unaligned sequences as input, TurboFold II iteratively optimizes alignment and structure prediction to make them closer to each other and converge to a conservative structure. When tested on RNA families with known structures and alignments, TurboFold II is more accurate than other methods (16, 18, 23, 29, 30).

However, TurboFold II’s tertiary run time and secondary memory usage prevent it from extending to longer sequences, such as the full-length Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) genome, which contains approximately 30,000 nucleosides Acid; in fact, there is no joint alignment and folding method that can be extended to these genomes, which are the longest among RNA viruses. As a (not very principled) solution, most of the existing work on modeling the SARS-CoV-2 structure (31⇓ ⇓ ⇓ ⇓ –36) uses partial folding methods (37, 38) and The sliding window and the limited pairing distance give up all long-distance interactions and only consider one SARS-CoV-2 genome (Figure 1 B and C), ignoring the signals available in multiple homologous sequences. In order to meet this challenge, we designed a linearized version of TurboFold II, LinearTurboFold (Figure 1A), which is a global homology folding algorithm that scales linearly with the length of the sequence. As far as we know, this linear running time allows the first joint folding alignment algorithm to be extended to the full-length coronavirus genome. There are no restrictions on window size or pairing distance. It takes about 13 hours to analyze a group of 25 SARS-CoV homologs. . It also leads to a significant improvement in the accuracy of secondary structure prediction and alignment accuracy equal to or higher than all benchmarks.

(A) LinearTurboFold frame. Like TurboFold II, LinearTurboFold takes multiple unaligned homologous sequences as input, and outputs a secondary structure and a multiple sequence alignment (MSA) for each sequence. But unlike TurboFold II, LinearTurboFold uses two linearizations to ensure linear running time: linearization alignment calculation (module 1) to predict the posterior coincidence probability of all sequence pairs (the first four parts of the method) (red squares) ) And linearized partition function calculation (module 2) to estimate the base pairing probabilities (yellow triangles) of all sequences (methods, extrinsic information calculations and methods, and LinearPartition that uses extrinsic information to estimate base pairing probabilities). These two modules use each other's information and iteratively improve predictions (SI Appendix, Figure S2). After multiple iterations, module 3 generates the final multiple sequence alignment (methods, MSA generation, and secondary structure prediction), and module 4 predicts secondary structure. Module 5 can randomly sample the structure. (B and C) Previous research (31⇓ ⇓ ⇓ ⇓ –36) [except for pure experimental work by Ziv et al. (39)] Use a partial folding method with limited window size and maximum pairing distance. B shows the partial folding of the SARS-CoV-2 genome of Huston et al. (32), it uses a 3,000 nt window, which is 300 nt ahead. It also limits the distance between nucleotides that can form base pairs to 500. Some studies also use homologous sequences to identify conserved structures (32⇓ ⇓ ⇓ –36), but they only predict the structure of a genome and use sequence alignment to identify mutations. In contrast, LinearTurboFold is a global folding method that has no restrictions on sequence length or pairing distance. It folds and aligns homologues to obtain a conserved structure. Therefore, LinearTurboFold can even capture long-distance interactions in the entire genome (B and the long arc in Figure 3).

In a set of 25 SARS-CoV-2 and SARS-related homologous genomes, LinearTurboFold predicts close to the canonical structure (40) and the structure of several well-researched regions modeled with experimental data (32⇓ –34). Due to global folding instead of local folding, LinearTurboFold has discovered long-range interactions involving 5'and 3'untranslated regions (UTR) (approximately 29,800 nt apart), which is consistent with recent purely experimental work (39), but still far away And the partial folding method used in existing research (Figure 1 B and C). In short, our computer simulation method of folding multiple homologues can obtain results similar to an experimental guidance model of a genome, and sometimes even more accurate. In addition, LinearTurboFold has identified conserved structures supported by compensatory mutations, which are potential targets for small molecule drugs (41) and antisense oligonucleotides (ASO) (36). We further identified 1) regions conserved at the sequence level; 2) at least 15 nt long; and 3) as ASO (42), small interfering RNA (siRNA) (43), CRISPR-Cas13 guide RNA (gRNA) (44) and The potential targets of RT-PCR primers are accessible (that is, they may be completely unpaired) (45). LinearTurboFold is a universal technology that can also be applied to other RNA viruses (such as influenza, Ebola, HIV, Zika virus, etc.) and full-length genome research.

The framework of LinearTurboFold has two main aspects (Figure 1A): linearized structure perception pair-wise alignment estimation (module 1) and linearized homology structure prediction (module 2). LinearTurboFold iteratively improves alignment and structure prediction. Specifically, it merges the predicted base pairing probabilities (from Module 2) to update the pairing probabilities to form a structural alignment, and integrates structural information from homologous sequences through estimation. To modify the base pairing probability alignment probability of each sequence (from module 1) to detect conserved structures. After many iterations, LinearTurboFold generates the final multiple sequence alignment (MSA) based on the latest pairwise probability (module 3), and uses the latest pairwise probability (module 4) to predict the secondary structure.

LinearTurboFold uses two main linearization modules to achieve linear time with respect to sequence length: our recent work, LinearPartition (46) (Figure 1A, module 2), which approximates the RNA distribution function (47) and base pairing in linear time Probability, and the new algorithm, LinearAlignment (Module 1). LinearAlignment uses the same beam search heuristic (48) used by LinearPartition to align two sequences in linear time through a Hidden Markov Model (HMM). Finally, LinearTurboFold uses an accurate and linear time method called ThreshKnot (49) (Module 4) to assemble the secondary structure from the final base pairing probability.

LinearTurboFold also integrates a linear-time random sampling algorithm called LinearSampling (50) (Module 5), which independently samples the structure according to the homogenous perception partition function, and then calculates the probability of unmatched regions. This is an important attribute, for example, siRNA sequence design (43). Therefore, the overall end-to-end running time of LinearTurboFold has a linear relationship with the sequence length (the first seven parts of the method). Adjusted the number of iterations and other hyperparameters on the training set. As mentioned earlier (27, 28), the improvement after three iterations is negligible, so the optimal number of iterations is set to 3. On the test set, it is observed that LinearTurboFold has achieved the most substantial improvements in structural prediction and MSA accuracy in the first iteration, and continues to benefit from the next two iterations (SI Appendix, Figure S5), which is consistent with observations The results are consistent on the training set. After about 3 iterations, the structure prediction and MSA accuracy remained stable with little fluctuation. In order to better prove the improvement in each iteration, we visualized the base pairing probability and alignment coincidence probability of a set of five tRNAs from LinearTurboFold in the iteration (SI Appendix, Figures S6 and S7).

In order to evaluate the efficiency of LinearTurboFold for sequence length, we collected a dataset consisting of 7 RNA families ranging in length from 210 to 30,000 nt, including 5 families from the RNAStrAlign dataset (27) and 23S ribosomal RNA , HIV genome and SARS-CoV genome, the calculation of each family uses five homologous sequences (method, efficiency and scalability data set). Figure 2A compares the running time of LinearTurboFold and TurboFold II, and two Sankoff-style synchronous folding and alignment algorithms, LocARNA (local RNA alignment) and MXSCARNA. Obviously, LinearTurboFold linearly expands with the sequence length n, and is much faster than other super-linear expansion algorithms. Linearization in LinearTurboFold brings an order of magnitude speedup compared to the three TurboFold II. It only takes 12 minutes on the HIV family (average length 9,686 nt), while TurboFold II requires 3.1 days (372 times acceleration). More importantly, LinearTurboFold only takes 40 minutes on five SARS-CoV sequences, while all other benchmarks cannot be extended. Regarding memory usage (Figure 2B), LinearTurboFold consumes linear memory space with sequence length, while other benchmarks use two or more memory. In Figure 2 C and D, we also use a 16S ribosomal RNA (rRNA) group with a length of approximately 1,500 nt, showing the running time and memory usage for the number of homologs. The apparent complexity of LinearTurboFold relative to the group size k is higher than TurboFold II, because the running time of the latter is O(kn3+k2n2) and is dominated by the O(kn3) distribution function calculation, so scaling O(k1.4) is a rule of thumb . In contrast, LinearTurboFold linearizes the partition function and alignment module, so its overall running time becomes O(kn+k2n) and is dominated by the O(k2n) alignment module, so it scales O(k2) in practice. A similar analysis applies to memory usage (Figure 2E). *

End-to-end scalability and accuracy comparison. (A and B) Comparison of end-to-end runtime and memory usage with sequence length between benchmark test and LinearTurboFold. LinearTurboFold uses a beam size of 100 in the partition function and HMM alignment calculations, and runs all data sets through three iterations. (C and D) End-to-end runtime and memory usage compared to group size. As far as we know, LinearTurboFold is the first joint folding alignment algorithm that has been extended to the full-length coronavirus genome (~30,000 nt) due to its linear running time. (E) Comparison of runtime and space complexity between TurboFold II and LinearTurboFold. The main terms are shown in bold. (F and G) F1 accuracy scores for structure prediction and multiple sequence alignment (SI appendix, Table S1). LocARNA and MXSCARNA are Sankoff-style synchronous folding and alignment algorithms for homologous sequences. As a negative control, LinearPartition and Vienna RNAfold predict the structure of each sequence respectively; LinearAlignment and MAFFT generate sequence-level alignments; RNAalifold folding pre-aligned sequences (for example, from MAFFT) and predicted conservative structures. If P <0.05, the statistical significance between the baseline and LinearTurboFold (two-tailed permutation test) is marked with one star (⋆) ​​at the top of the corresponding bar, if P <0.01, then two stars (⋆⋆) ). The benchmarks with significantly lower accuracy than LinearTurboFold are marked with a black star, while benchmarks higher than LinearTurboFold are marked with a dark red star. In general, in terms of structural prediction, LinearTurboFold's accuracy is significantly higher than all evaluation benchmarks. In terms of multi-sequence alignment, its accuracy is comparable to TurboFold II and significantly higher than other methods (SI Appendix, Table S1).

We next compare the accuracy of secondary structure prediction and MSA between LinearTurboFold and several benchmark methods (methods, benchmarks). In addition to Sankoff-style LocARNA and MXSCARNA, we also considered three types of negative controls: 1) Single sequence folding (based on partition function), Vienna RNAfold (38) (-p mode) and LinearPartition; 2) Sequence alignment only, MAFFT (29) and LinearAlignment (a separate version of the alignment method developed for this work, but without structural information); and 3) an alignment and then folding method that predicts the common structure from MSA (SI Appendix, Figure S1 ), MAFFT + RNAalifold (23).

For secondary structure prediction, LinearTurboFold, TurboFold II and LocARNA achieve higher F1 scores than single sequence folding methods (Vienna RNAfold and LinearPartition) (Figure 2F), which indicates that folding with homology information performs better than folding sequences alone. Overall, LinearTurboFold's performance in structural forecasting is significantly better than all other benchmarks. For the accuracy of MSA (Figure 2G), the structural alignment from LinearTurboFold achieves higher accuracy on all four families than just the sequence alignment (LinearAlignment and MAFFT), especially for families with low sequence identity. On average, the performance of LinearTurboFold is comparable to TurboFold II, and it is significantly better than other benchmarks in terms of alignment. We also noticed that the structure prediction accuracy of the align-and-fold method (MAFFT + RNAalifold) largely depends on the alignment accuracy, and the sequence identity is low (for example, signal recognition particle [SRP] RNA) and when the sequence It is best when the identity is high (for example, 16S rRNA) (Figure 2 F and G).

RNA sequences with conserved secondary structure play an important biological role and provide potential targets. The current COVID-19 outbreak raises urgent requirements to determine potential targets for diagnosis and treatment. In view of the strong scalability and high precision, we use LinearTurboFold on a set of full-length SARS-CoV-2 and SARS-related (SARSr) genomes to obtain the global structure and identify highly conserved structural regions.

We used the greedy algorithm to select the 16 most diverse genomes (method, SARS-CoV- 2 data set). We have further expanded this group by adding nine SARS-related homologous genomes (five human SARS-CoV-1 and four bat coronaviruses) (53). In total, we constructed a data set containing 25 full-length genomes, which contained 16 SARS-CoV-2 and 9 SARS-related sequences (SI Appendix, Figure S9). The average pairwise sequence identity of 16 SARS-CoV-2 and a total of 25 genomes was 99.9% and 89.6%, respectively. LinearTurboFold requires approximately 13 hours and 43 GB on 25 genomes.

In order to evaluate the reliability of LinearTurboFold predictions, we first compare them with the SHAPE guided model of Huston et al. (32), which is used for regions with well-characterized structures in β-coronavirus. For the extended 5'and 3'UTR, LinearTurboFold's prediction is close to the SHAPE guided structure (Figure 3A and B), that is, both recognize the stem loop (SL) 1 to 2 and 4 to 7'UTR in the extension 5 And the raised stem loop (BSL), SL1, and the long raised stem for the hypervariable region (HVR), including the stem loop II-like motif (S2M) in the 3'UTR. Interestingly, in our model, the high unpaired probability of the stem in SL4b indicates the possibility of a single strand as an alternative structure, which is supported by experimental studies (33, 36). In addition, the compensation mutation LinearTurboFold found in UTR strongly supports the evolutionary conservation of the structure (Figure 3A).

The secondary structure prediction of SARS-CoV-2 extends 5'and 3'UTR. (A) LinearTurboFold forecast. Nucleotides and base pairs are colored by unpaired probability and base paired probability, respectively. The compensation mutations extracted by LinearTurboFold are annotated with the replacement pairs in the red box (see SI appendix, Table S2 for more completely conservative pairs with covariant changes). (B) The SHAPE boot model of Huston et al. (32) (window size 3,000 nt sliding 300 nt, maximum pairing distance 500 nt). Nucleotides are colored reactively by SHAPE. The dashed box contains the different structures between A and B. Our model is close to Huston et al. (32), but the main difference is that LinearTurboFold prediction involves boxes in 5'and 3'UTR (solid A), which is exactly the same as the interaction detected by Ziv et al. (39) Use COMRADES experimental technique (C). The local folding method used in the previous experimentally guided model was unable to capture this remote interaction (Figure 1B). The similarity between models A and B and the complete agreement between A and C indicate that our computer simulation method of folding multiple homologues can obtain results similar to or even more accurate than experimentally guided single-genome predictions. As a negative control (SI Appendix, Figure S10), the RNAalifold method cannot predict this long-range interaction. Although the single sequence folding algorithm (LinearPartition) predicts the long-distance 5'-3' interaction, the position is different from the LinearTurboFold model and the experimental results of Ziv et al. (39).

The most important difference between the prediction of LinearTurboFold and the experimental guidance model of Huston et al. (32) is that LinearTurboFold found 5'UTR (SL3, 60 to 82 nt) and 3'UTR (final area, 29,845 to 29,868 nt). They In the model of Huston et al. (32), it folds itself locally. Interestingly, this 5'-3' interaction perfectly matches the interaction found in pure experimental work by Ziv et al. (39) Use COMRADES technology to capture remote base pairing interactions (Figure 3C). These end-to-end interactions have been fully confirmed by theoretical and experimental studies (54⇓ –56). They are common in natural RNA, but are far beyond the local folding method used in the existing SARS-CoV-2 modeling studies. Range of secondary structure (32⇓ ⇓ –35). In contrast, LinearTurboFold can globally predict secondary structure without any window size or base pairing distance restrictions, enabling it to discover long-distance interactions across the entire genome. The similarity between our prediction and experimental work shows that our computer method of folding multiple homologues can obtain results similar to or even more accurate than experimentally guided single-genome predictions.

Due to three factors, LinearTurboFold can model these end-to-end interactions: 1) linearization, 2) LinearPartition's better modeling capabilities on long sequences and long distance pairs, and 3) homology folding and soft alignment. Linearization not only enables LinearTurboFold to scale to longer sequences, but also improves the accuracy of remote interaction modeling that benefits from LinearPartition (46). In addition, homology folding is also critical. We observed that LinearPartition can simulate the same end-to-end interaction detected by Ziv et al. (39) 8 of 25 sequences (4 of 16 SARS-CoV-2 and 4 of 9 SARS-related sequences; SI appendix, Figures S12A and S13, left column). However, for other sequences, LinearPartition either cannot predict end-to-end interactions or predicts them in the wrong position. On the other hand, LinearTurboFold propagates the correct structural information from these eight sequences to other homologues, resulting in all SARS-CoV-2 sequences having the same end-to-end pair (SI appendix, Figures S12B and S13, right column). In contrast, the align-and-fold method (MAFFT + RNAalifold) relies on input hard alignment and predicts a single consensus structure of all homologs, and cannot predict this long-range interaction (SI appendix, Figure S10B).

Frameshift stimulation element (FSE) is another area with obvious characteristics. For the extended FSE area, the LinearTurboFold prediction consists of two substructures (Figure 4A): the 5'part includes an attenuator hairpin and a stem, which are connected by a long inner ring (16 nt), including slippage, and 3' The part includes three stem loops. We observe that the structure of the 5'part of our prediction is consistent with the structure in the experimentally guided model (32, 33, 35) (Figure 4 BD). In the attenuator hairpin, the small inner loop motif (UU) was previously selected as a small molecule binder, which can stabilize the folded state of the attenuator hairpin and weaken the frameshift (41). For the long inner loop including the smooth site, we will show its high accessibility and conservation in the next section (Figure 5), which makes it a perfect candidate for drug design. For the 3'area of ​​FSE, LinearTurboFold successfully predicted stems 1 to 2 of the canonical three-stem pseudoknot (40) (but missed stem 3) (Figure 4E). Compared with experimental guidance models (32, 33, 35), our predictions are closer to the canonical structure (Figure 4 BD); one such model (Figure 4B) recognizes false knots (stem 3), but stem 2 is open . Please note that the experimental guidance models for all these FSE regions are estimated for specific local regions. Therefore, the model is sensitive to context and region boundaries (32, 35, 57) (see SI appendix, Figure S11D-F, for an alternative structure of Figure 4 BD with different regions). In contrast, LinearTurboFold does not encounter this problem because there is no global folding of local windows. Except for SARS-CoV-2, we noticed that the estimated structure of the SARS-CoV-1 reference sequence from LinearTurboFold (Figure 4F) is similar to SARS-CoV-2 (Figure 4A), which is consistent with the observation that the structure of the FSE region is in β Coronavirus is highly conserved (40). Finally, as a negative control, compared with the LinearTurboFold prediction, the single-sequence folding algorithm (linear partition in Figure 4G) and the alignment and then folding method (RNAalifold in the SI appendix, Figure S11G) predicted completely different structures (Figure 4G). 4A) (LinearPartition/RNAalifold did not find 39/61% of the pairs from the LinearTurboFold model).

(AD) SARS-CoV-2 Extended Frameshift Stimulation Element (FSE) region (13,425 to 13,545 nt) secondary structure prediction. (A) LinearTurboFold forecast. (BD) Experimental guidance predictions from the literature (32, 33, 35). Due to the use of the local folding method (SI appendix, Figure S11), these predictions are sensitive to context and region boundaries. (E) The typical pseudoknot structure of the comparative analysis between SARS-CoV-1 and SARS-CoV-2 genomes (40). For the FSE 5'area (attenuator hairpin, inner ring with slippage and stem) shown in the dashed box, LinearTurboFold prediction (A) is consistent with BD; for the 3'area of ​​FSE shown in the dashed box, it is the same as BD In contrast, our prediction (predicted stems 1 to 2 but lacks stem 3) is closer to the canonical structure in E. (F) LinearTurboFold forecast of SARS-CoV-1. (G) SARS-CoV-2 single sequence folding algorithm (LinearPartition) prediction is very different from LinearTurboFold. As another negative control, the align-and-fold method (RNAalifold) predicted a rather different structure (SI appendix, Figure S11G). (H) Five examples from 59 fully conserved structures in 25 genomes (SI Appendix, Table S3), 26 of which are different from previous work (31, 32).

Description of accessible and conservative areas recognized by LinearTurboFold. (A and B) With the help of both alignment and folding, LinearTurboFold is used to identify structurally conserved accessible regions. In all 16 SARS-CoV-2 genomes, areas with a length of at least 15 nt and an accessibility of at least 0.5 are colored against a blue background. The structure is coded in dot bracket notation. "(" and ")" indicate nucleotide pairing in the 3'and 5'directions, respectively. "." Indicates unpaired nucleotides. The locations of the three different subfamilies (SARS-CoV-2, SARS-CoV-1 and BCoV) where mutations occur compared to the SARS-CoV-2 reference sequence are underlined. (C) Accessible and conserved regions are not only accessible in the SARS-CoV-2 genome (pink circles), but also conserved (at the sequence level) in the SARS-CoV-2 and SARS-related genomes (green circles) of. (D) Two examples of 33 accessible and conserved regions discovered by LinearTurboFold. Areas 16 and 29 correspond to the accessible areas in A and B, respectively. Area 16 is also the long inner ring, including the slippage in the FSE area (H). The conservation of these regions in the nine SARS-related genomes is the number of mutation sites. The conservation of approximately 2 million SARS-CoV-2 data sets is shown in terms of the average sequence identity and the percentage of perfect matches with the reference sequence. (E and F) Even if the sequence identity is high (gray rectangle with diagonal lines), the single sequence folding algorithm can predict very different structures. These two regions are completely conserved in the SARS-CoV-2 genome, but due to mutations outside the region, they still fold into different structures. In contrast, LinearTurboFold folds all sequences into the same structure due to homologous signals in the corresponding regions in A and B. (G) The positions of these 33 regions (red bars) in the entire genome (SI Appendix, Table S5). All accessible and conserved regions are potential targets for siRNA, ASO, CRISPR-Cas13 gRNA and RT-PCR primers.

In addition to the well-studied UTR and FSE regions, LinearTurboFold found 50 conserved structures with the same structure in 25 genomes. Compared with previous studies, 26 regions are different (31, 32) (Figure 4H and SI Appendix, Table S3). These different structures are potential targets for small molecule drugs (41) and ASO (36, 58). LinearTurboFold can also restore fully conserved base pairs through compensatory mutations (SI appendix, Table S2), which means that highly conserved structural regions of its function may not have been explored. We provide a complete multiple sequence alignment and prediction structure of 25 genomes from LinearTurboFold (data set S1; see the SI appendix for the format, Figure S14).

Studies have shown that siRNA silencing efficiency, ASO inhibition efficiency, CRISPR-Cas13 knockdown efficiency and RT-PCR primer binding efficiency are all related to the accessibility of the target region (43⇓ –45, 59), that is, the target site is completely unpaired . However, most existing work on designing siRNA, ASO, CRISPR-Cas13 gRNA and RT-PCR primers does not take this feature into account (60, 61) (SI Appendix, Table S4). Here, LinearTurboFold can provide more principled design candidates by identifying accessible regions of the target genome. In addition to accessibility, emerging variants around the world have reduced the effectiveness of existing vaccines and test kits (SI Appendix, Table S4), indicating that sequence conservation is another key aspect of treatment and diagnostic design. LinearTurboFold is a tool for structural alignment and homology folding. It can identify (in order) conserved and (structurally) accessible regions. It not only utilizes SARS-CoV-2 variants, but also homology Sequences, such as SARS-CoV-1 and bat coronavirus genomes, identify conserved regions from a historical and evolutionary perspective.

In order to obtain unstructured regions, Rangan et al. (31) A threshold is imposed on the unpaired probability of each location, which is a rough approximation because the probabilities are not independent of each other. In contrast, the widely used random sampling algorithm (50, 62) constructs a representative set of structures by sampling independent secondary structures according to the probability in the Boltzmann distribution. Therefore, the accessibility of an area can be approximated as the fraction of the sampling structure where the area is a single chain. LinearTurboFold uses LinearSampling (50) to generate 10,000 independent structures for each genome according to the modified partition function after iterative refinement (Figure 1A, Module 5) and calculates the accessibility of regions with a length of at least 15 nt. Then, we defined accessible regions with at least 0.5 accessibility in all 16 SARS-CoV-2 genomes (Figure 5A and B). We also measured the free energy of opening the target area [i,j] (63), denoted as ΔGu[i,j]=−RT(log Zu[i,j]−log Z)=−RTlog Pu[i,j ], where Z is the partition function that sums up the equilibrium constants of all possible secondary structures, Zu[i,j] is the partition function of all structures in the region [i,j] that are completely unpaired, and R is the universal gas constant , T is the thermodynamic temperature. Therefore Pu[i,j] is the unpaired probability of the target area, which can be approximated by sampling through s0/s, where s is the sample size and s0 is the number of samples in the target area that are single-stranded. Regions where the free energy change is close to zero require less free energy to open, so it is easier to bind to siRNA, ASO, CRISPR-Cas13 gRNA and RT-PCR primers.

Next, in order to identify highly conserved regions in the SARS-CoV-2 and SARS-related genomes, compared with the SARS-CoV-2 reference, we require these regions to contain at most three mutation sites on the 9 SARS-related genomes Sequence, because historically conserved sites are unlikely to change in the future (64), and the average sequence identity with the reference sequence on the large SARS-CoV-2 data set is at least 0.999 (here we used about 2 million SARS-CoV-2 genome submitted to GISAID as of June 30, 2021; method, SARS-CoV-2 dataset). Finally, we identified 33 accessible and conserved regions (Figure 5G and SI appendix, Table S5), which are not only structurally accessible in the SARS-CoV-2 genome, but also in SARS-CoV-2 and The SARS-related genome is also highly conserved (Figure 5C). Because specificity is also a key factor affecting siRNA efficiency (65), we used BLAST against the human transcript data set to search for these regions (SI appendix, Table S5). Finally, we also listed the GC content of each region. Among these regions, region 16 corresponds to the inner loop containing the smooth site in the extended FSE region, and it is conserved at the structural and sequence levels (Figure 5 D and H). In addition to the SARS-CoV-2 genome, the SARS-CoV-1 reference sequence (NC_004718.3) and the bat coronavirus (BCoV) (MG772934.1) and other SARS-related genomes also formed similar structures around the slippage (Figure. 5A). By removing the constraints on the protection of SARS-related genomes, we identified 38 additional candidate regions (SI Appendix, Table S6), which are accessible but highly conserved only on SARS-CoV-2 variants.

We also designed a negative control by using LinearSampling to separately analyze the SARS-CoV-2 reference sequence, which can also predict the accessible area. However, with the exception of one region in the M gene, these regions are not structurally conserved in the other 15 SARS-CoV-2 genomes, resulting in very different accessibility (SI Appendix, Table S7). The reason for this difference is that even with high sequence identity (over 99.9%), the single sequence folding algorithm still predicts a very different structure of the SARS-CoV-2 genome (Figure 5 E and F). These two regions (in the nsp11 and N genes) are completely conserved in the 16 SARS-CoV-2 genomes, but due to mutations outside the region, they still fold into completely different structures; therefore, the accessibility is either very low ( nsp11), or in a wide range (N) (Figure 5D). Instead, by folding each sequence that has a base-pairing propensity inferred from all homologous sequences to solve this problem, the LinearTurboFold structure predictions are more consistent with each other, so conserved structures can be detected (Figure 5A and B).

The continuous emergence of new SARS-CoV-2 variants is reducing the effectiveness of existing vaccines and test kits. In order to solve this problem, there is an urgent need to identify conservative structures as promising targets for treatment and diagnosis, although current and future mutations are still effective. Here, we propose LinearTurboFold, an end-to-end linear time algorithm for structural alignment and conservative structure prediction of RNA homologues. As far as we know, this is the first one that can be extended to full-length SARS. Joint Folding Alignment Algorithm-CoV-2 genome does not have any restrictions on base pairing distance. We also proved that LinearTurboFold significantly improves the accuracy of secondary structure prediction and alignment accuracy that is equal to or higher than that of all benchmarks.

Unlike the existing work on SARS-CoV-2 that uses local folding and single-sequence folding solutions, LinearTurboFold can perform unprecedented global structural analysis of the SARS-CoV-2 genome; in particular, it can capture long-range interactions, especially the entire The interaction between 5'and 3'UTR in the genome, which is a perfect match with recent pure experimental work. On a set of SARS-CoV-2 and SARS-related homologues, LinearTurboFold not only determined the conserved structure supported by compensatory mutations and experimental studies, but also identified accessible and conserved regions as effective small molecule drugs and siRNAs. , ASO, CRISPR-Cas13 gRNA and RT-PCR primers. LinearTurboFold is widely applicable to the analysis and full-length genome analysis of other RNA viruses (influenza, Ebola, HIV, Zika virus, etc.).

We use pair-HMM (pair-HMM) to align two sequences (51, 66). The model includes three actions (h): aligning two nucleotides (ALN) from two sequences, inserting a nucleotide in the first sequence and no corresponding nucleotide in the other sequence (INS1 ), and insert a nucleotide in the second sequence without the corresponding nucleotide in the first sequence (INS2). Then we define A(x,y) as the set of all possible alignments of two sequences, an alignment a∈A(x,y) is the step (h, i, j) with m + 2 steps Sequence, where (h, i, j) represents the alignment step of action h at position pair (i, j). Therefore, for the l-th step al=(hl,il,jl)∈a, the values ​​of il and jl depend on the position il-1 and jl-1 of the action hl and al-1: al={(ALN, il−1 +1,jl−1+1),hl=ALN(INS1,il−1+1,jl−1),hl=INS1(INS2,il−1,jl−1+1),hl=INS2 (ALN, 0,0) as the first step, and (ALN,|x|+1,|y|+1) as the last step. For two sequences {ACAAGU, AACUG}, a possible alignment {–ACAAGU, AAC–UG} can be specified as {(ALN,0,0)→(INS2,0,1)→(ALN, 1,2 )→( ALN,2,3)→(INS1,3,3)→(INS1,4,3)→(ALN,5,4)→(ALN,6,5)→(ALN,7,6)) , Where a gap symbol (-) indicates a nucleotide insertion in another sequence at the corresponding position (SI Appendix, Figure S3). The action hl in each step (hl, il, jl) corresponds to the line segment starting from the previous node (il-1, jl-1) and stopping at the node (il, jl). Therefore, when hl is INS1, INS2, or ALN, the line segment is horizontal, vertical, or diagonal toward the upper right corner (SI Appendix, Figure S3).

We initialize the first step with the state ALN with probability 1; therefore pπ(ALN)=1. pt(h2|h1) is the transition probability from state h1 to h2, and pe((c1,c2)|h1) is the probability that the state h1 sends out the character pair (c1, c2), and its value comes from {A, G, C , U, -}. Both launch and transfer probabilities come from TurboFold II. The function e() generates a character pair based on al and two sequences of nucleotides: e(x,y,al)={(xil,yjl),hl=ALN(xil,-),hl=INS1(-, yjl),hl=INS2, where xi and yj are the i-th and j-th nucleotides of the sequence x and y, respectively. Please note that the first step a0=(ALN,0,0) and the last step am+1=(ALN,|x|+1,|y|+1) are not emitted.

We indicate that the forward probability αi,jh includes the probability of partial alignment of x and y to positions i and j and all alignments through steps (h, i, j): αi,jh=∑a∈A(x,y)∃ k,ak=(h,i,j)p(x,y,a[:k])=pπ(h0)·∏l=1kpt(hl|hl−1)pe(e(x) ,y,al )|hl), where a[:k] represents the partial alignment from the starting node to the k-th step, and ak=(h,i,j). For example, α3,3ALN, α3,3INS1 and α3,3INS2 correspond to the area enclosed by the blue dotted line (SI Appendix, Figure S3B-D). Similarly, the backward probability βi,jh assembles the partial alignment probability a[k+1:] from the (k+1)th step to the last step: βi,jh=∑a∈A(x,y)∃ k, ak=(h,i,j)p(x,y,a[k+1:])={∏l=k+1mpt(hl|hl−1)pe(e(x,y,al) |hl )}·Pt(hm+1|hm).

For example, β3,3ALN, β3,3INS1 and β3,3INS2 are the areas enclosed by yellow dotted lines (SI appendix, Figure 3 BD). Therefore, the probability of observing two sequences p(x,y) is α|x|+1,|y|+1ALN or β0,0ALN.

If the alignment path passes through the node (i, j), the nucleotide positions i and j in the two sequences x and y are said to be coincident in alignment a (denoted as i∼j) (51). Since the node (i, j) can be reached by three actions H={ALN,INS1,INS2}, the coincidence probability of the position pair (i, j) given two sequences is p(i∼j|x,y )=1p (x,y)∑a∈A(x,y)∃h,(h,i,j)∈ap(x,y,a), [1] where p(x,y,a) is Align the two sequences of a, p(x,y) is the probability of observing the two sequences, that is, the sum of the probabilities of all possible alignments: p(x,y)=∑a∈A(x,y) p(x,y,a).

The coincidence probability of positions i and j (Equation 1) can be calculated by p(i∼j|x,y)=∑hαi,jh·βi,jhα|x|+1,|y|+1ALN.

Unlike the previous method (51) filling all nodes in the alignment matrix by column (SI Appendix, Figure S3), LinearAlignment scans the matrix according to the number of steps s, which is i and j (s = i+j) for x [1,i] and y[1,j] are partially aligned. As shown in the pseudo code (SI appendix, Figure S4), the forward phase starts from the node (0, 0) in the state ALN with probability 1, and then iterates from 0 to the number of steps s to |x|+|y| -1. For each step number s in H with a specific state h, we first collect all nodes (i, j) with step number s and αi,jh, which means that the position pair (i, j) has passed state h before . Then each node switches to the next node through their state, and updates the corresponding forward probabilities αi+1,jINS1,αi,j+1INS2 and αi+1,j+1ALN respectively.

The current alignment algorithm is still an exhaustive search algorithm, and all the time and space of |x|×|y| are quadratic. node. In order to reduce running time, LinearAlignment uses a beam search heuristic algorithm (48) and reserves a limited number of promising nodes at each step. For each step number s with state h, LinearAlignment first applies the beam search method on B(s, h), B(s, h) is the set of all nodes (i, j) with step number s, and there is αi,jh (SI appendix, Figure S4, line 6). This algorithm only saves the first balign nodes with the highest forward score in B(s, h), and these nodes are then allowed to transition to the next state. Here balign is the beam size specified by the user, and the default value is 100. There are O(balignn) nodes in total, because the length of s is |x|+|y| and each step counts to keep balign nodes. For simplicity, we use the alignment example (SI appendix, Figure S3A) to demonstrate the topology order and beam search method, while the forward-backward algorithm uses the same idea by summing the probabilities of all possible alignments.

After the forward phase, the backward phase (SI appendix, Figure S4) is executed in linear time to automatically calculate the coincidence probability, because only the linear number of nodes in B(s, h) are stored. Therefore, by pruning low-scoring candidates at each step of the forward algorithm, we reduce the running time from O(n2) to O(balignn) to align the two sequences. For k input homologous sequences, LinearTurboFold calculates the posterior coincidence probability of each pair of sequences through LinearAlignment, which takes O(k2balignn) running time in total.

In order to encourage the alignment of pairs in accordance with the estimated secondary structure, LinearTurboFold predicts structural alignment by combining the conformation of the secondary structure. PMcomp (67) first proposed a matching score to measure the structural similarity of position pairs between a pair of sequences, and TurboFold II adjusted it as a priori. Based on the base pair probability Px(i,j) estimated from the distribution function of sequence x, position i can be paired or unpaired with upstream or downstream bases, and the corresponding probability is Px, >(i)=∑j<iPx (i,j), Px,<(i)=∑j>iPx(i,j) and Px,o(i)=1-Px,>(i)-Px,<(i). The matching scores mx,y(i,j) of the two positions i and j from the two sequences x and y are based on the probabilities of these three structural tendencies from the last iteration (t –1): mx,y(t) (i,j)=α1[Px,>(t−1)(i)·Py,>(t−1)(j)Px,<(t−1)(i)·Py,<(t−1 ) (j)] +α2Px,o(t−1)(i)·Py,o(t−1)(j)+α3, where α1, α2 and α3 are the weight parameters trained in TurboFold II. The forward-backward phrase uses the matching score as the prior integral when aligning two nucleotides (SI Appendix, Figure S4, lines 10 and 12).

Before the HMM alignment calculation, TurboFold II pre-calculates the matching scores of all O(n2) position pairs of the sequence pair. However, after applying beam trimming in LinearAlignment, only the linear number of pairs O(balignn) survived. In order to reduce redundant time and space usage, LinearTurboFold calculates the corresponding matching scores of coincident pairs at the first visit in LinearAlignment. In general, for k homologous sequences, LinearTurboFold applies the beam search heuristic to the paired HMM alignment and only calculates the matching score of the position, and the running time of the entire module of the paired posterior coincidence probability calculation is from O(k2n2) Reduce to O (k2balignn) required pairs.

In order to update the partition function of each sequence with structural information from homologs, TurboFold (28) introduces extrinsic information to simulate the base pairing tendency caused by other sequences in the input set S. The extrinsic information of a. The base pair (i, j) in the sequence x (i, j) maps the estimated base pairing probabilities of other sequences to the target sequence through the coincident nucleotides between each pair of sequences: ∑y ∈{S∖x}(1−sx,y) ∑k,lpy(t−1)(k,l)·px,y(t)(i∼k)·px,y(t)(j∼l ), where py(t−1)(k,l) is the base pair probability of the base pair (k, l) in the sequence y from the (t-1)th iteration. px,y(t)(i∼k) and px,y(t)(j∼l) are the posterior coincidence probability of position pairs (i, k) and (j, l), respectively, from the (t) time Iteration. The extrinsic information ex(t)(i,j) first adds all the base pair probabilities of the comparable pairs from another sequence to the coincidence probability, and then iterates all other sequences. sx,y is the sequence identity of the sequence x and y. Sequences with low identity contribute more to external information than sequences with higher identity. Sequence identity is defined as the fraction of nucleotides that are aligned and identical in the alignment.

The classic distribution function algorithm scales three times with the sequence length. Slowly extend it to longer sequences. In order to solve this bottleneck, our recent LinearPartition (46) algorithm approximates the partition function and basis pair probability matrix calculation in linear time. Compared with the traditional cubic partition function calculation, LinearPartition is significantly faster and has a better correlation with the real structure. Therefore, LinearTurboFold uses LinearPartition to predict base pair probabilities instead of the traditional O(n3) time partition function.

TurboFold introduces the external information ex(t)(i,j) in the distribution function as the pseudo free energy term of each base pair (i, j). Similarly, in LinearPartition, for each interval [i,j], that is, the subsequence xi...xj and its associated partition function Q(i, j), modify the partition function to Q~(i,j)=Q (i,j)ex(t)(i,j)λ If (xi, xj) is an allowed pair, where λ represents the contribution of external information relative to internal information. Specifically, at each step j, in all possible spans [i,j], where xi and xj are paired, we replace the original partition function with Q(i,j)ex(t)(i,j) Q(i, j) )λ is multiplied by external information. Then LinearTurboFold applies the beam pruning heuristic on the modified partition function Q~(i,j) instead of the original partition.

Similarly, TurboFold II obtains all the extrinsic information of O(n2) base pairs before the calculation of the allocation function of each sequence, while only a linear number of base pairs are stored in LinearPartition. Therefore, LinearTurboFold only needs the extrinsic information of those promising base pairs accessed in LinearPartition. In general, for k homologous sequences, LinearTurboFold applies beam size folding to the allocation function calculation and only calculates the base pairs used for preservation.

After multiple iterations, TurboFold II uses probabilistic consistency transformation to construct multiple sequence alignments, generates a guide tree, and performs asymptotic alignment on the paired posterior coincidence probability (30). By discarding aligned pairs whose probability is less than the threshold (the default is 0.01), the whole process is speeded up with the help of a sparse matrix. Since LinearAlignment uses the beam search method and only saves a linear number of coincident pairs, the MSA generation in LinearTurboFold consumes linear running time directly based on the sequence length.

The estimated base pair probabilities are input into downstream methods to predict secondary structure. In order to maintain the end-to-end linear time property, LinearTurboFold uses ThreshKnot (49), which is a threshold version of ProbKnot (68), and only considers base pairs with the probability of exceeding the threshold θ (default θ=0.3). We evaluated the performance of ThreshKnot and the Maximum Expected Accuracy (MEA) structure with different hyperparameters (θ and γ). On the sampled RNAStrAlign training set, ThreshKnot is closer to the upper right corner than MEA, which indicates that ThreshKnot always has a higher sensitivity than MEA under a given positive predictive value (PPV) (SI Appendix, Figure S8).

Four data sets were constructed and used to measure efficiency and scalability. In order to evaluate the efficiency and scalability of LinearTurboFold with sequence length, we collected homologous RNA sequence groups with a sequence length ranging from 200 to 29,903 nt and a fixed group size of 5. Sequences were obtained from the RNAStrAlign dataset (27), the comparative RNA Web (CRW) site (69), the Los Alamos HIV database (https://www.hiv.lanl.gov/) and the SARS-related β-coronavirus (SARS-related ) (53). RNAStrAlign is aggregated and released together with TurboFold II. It is an RNA alignment and structure database. The sequences in RNAStrAlign are classified into families, ie collections of homologs, and some families are further divided into subfamilies. Each subfamily or family includes multiple sequence alignments and true structures of all sequences. Randomly select 20 groups from small subunit ribosomal RNA (Alphaproteobacteria subfamily), SRP RNA (protozoan subfamily), RNase P RNA (bacterial type A subfamily) and telomerase RNA family, each with 5 homologs . For longer sequences, we sampled five sets of 23S rRNA (sequence length from 2,700 to 2,926 nt) from CRW sites, and HIV-1 gene sequence from Los Alamos HIV database (sequence length from 9,597 to 9,738 nt) ), and SARS related sequence (sequence length from 29,484 to 29,903 nt). All sequences in a group belong to the same subfamily or subtype. We sampled 5 groups for each family, and got a total of 35 groups. Due to runtime and memory limitations, we did not run TurboFold II on the SARS-CoV-2 group (Figure 2A and B).

In order to evaluate the running time and memory usage of LinearTurboFold with a group size, we fixed the sequence length at about 1,500 nt, and selected the small subunit ribosomal RNA (Alphaproteobacteria subfamily) with group sizes of 5, 10, 15 and 20 Five sets of sequences were sampled, respectively (Figure 2 C and D). We used a Linux machine (CentOS 7.7.1908) with 2.30-GHz Intel Xeon E5-2695 v3 CPU and 755 GB of memory and gcc 4.8.5 for benchmark testing.

We constructed a test set from the RNAStrAlign dataset to measure and compare the performance between LinearTurboFold and other methods. Randomly select 5 homologous sequences from the small subunit rRNA (Alphaproteobacteria subfamily), SRP RNA (protozoan subfamily), RNase P RNA (bacterial type A subfamily) and telomerase RNA family in the RNAStrAlign data set 60 groups of input sequences. We deleted the sequence of small subunit rRNA shorter than 1,200 nt to filter out subdomains, and according to the TurboFold II paper, deleted the sequence of SRP RNA shorter than 200 nt to filter out less reliable sequences. We resampled the test set five times and showed the average PPV, sensitivity, and F1 score of the five samples (Figure 2 F and G).

A training set of RNAStrAlign was constructed to compare the accuracy between MEA and ThreshKnot. From the 5S ribosomal RNA (eubacterial subfamily), group I introns (IC1 subfamily), transfer messenger RNA and tRNA families in the RNAStrAlign data set, 40 groups of three, five, and seven homologs were randomly selected. We choose θ = 0.1, 0.2, 0.3, 0.4, and 0.5 for ThreshKnot, and γ = 1, 1.5, 2, 2.5, 3, 3.5, 4, 8, and 16 for MEA. We report the average secondary structure prediction accuracy (PPV and sensitivity) of all training series (SI appendix, Figure S8).

The Sankoff algorithm (15) uses dynamic programming to fold and align two or more sequences at the same time. For k input sequences with an average length of n, it requires O(n3k) time and O(n2k) space. LocARNA (16) and MXSCARNA (18) are Sankoff style algorithms.

LocARNA takes O(n2(n2+k2)) time and O(n2+k2) space by restricting the alignable area. MXSCARNA aligns multiple sequences step by step, as an extension of the pairwise alignment algorithm SCARNA (70), with an improved scoring function. SCARNA first aligns stem segment candidates, and then removes inconsistent matches in post-processing to generate sequence alignments. MXSCARNA reduces the running time to O(k3n2) and the space to O(k2n2), and the search space for folding and alignment is limited. Both MXSCARNA and LocARNA use pre-calculated base pair probabilities for each sequence as structural input. All benchmark tests use the default options and hyperparameters run on the RNAStrAlign test set. TurboFold II iterates 3 times, and then predicts the secondary structure through MEA (γ = 1). LinearTurboFold also runs three iterations using the default beam size (balign=bfolding=100) in LinearAlignment and LinearPartition, and then uses ThreshKnot (θ=0.3) to predict the structure.

We use the paired two-tailed permutation test (71) to measure significant differences. By convention, the number of repetitions is 10,000, and the significance threshold α is 0.05.

We used two large SARS-CoV-2 data sets. The first data set is used to map a representative sample of the most diverse SARS-CoV-2 genome. We downloaded all genomes submitted to GISAID (52) before December 29, 2020 (downloaded on December 29, 2020), and filtered out low-quality genomes (more than 5% of unknown characters and degenerate bases). Shorter than 29,500 nt, or frame errors) in the coding region), and compared with the SARS-CoV-2 reference sequence (NC_0405512.2), we also discarded genomes with more than 600 mutations (72). After preprocessing, the data set contains approximately 258,000 genomes. In order to identify representative sample groups with more variable mutations, we designed a greedy algorithm to select the 16 most diverse genomes that were found at least twice in 258,000 genomes. The general idea of ​​the greedy algorithm is that, compared with the selected sample, the genome with the most new mutations is selected one by one, which only contains the initial reference sequence.

The second, larger data set is used to evaluate the area protection for updated variants. We performed the same preprocessing (downloaded on July 25, 2021) on the first dataset of all genomes submitted to GISAID before June 30, 2021. This produced a data set of approximately 2 million genomes, which was used to evaluate the protection in Figure 5 and Tables S4-S6 in the SI Appendix.

The code, data and complete results of our 25 SARS-CoV-2 and SARS-related genomes are published on GitHub, https://github.com/LinearFold/LinearTurboFold, and our web server is located at http://linearfold.org/ Linear turbo folding. The previously published data were used in this work (27, 53).

We thank the anonymous reviewers for their suggestions, Professor Zhang Qiangfeng (Tsinghua University) for the discussion, and Evan Yang (Emory University) for his contribution to the web server. This work was partially supported by the National Institutes of Health grant R01 GM132185 (to DHM) and National Science Foundation grants IIS-1817231 and IIS-2009071 (to LH).

Author contributions: research on SL, HZ, DHM and LH design; SL and HZ research; SL, HZ, LZ, KL, BL, DHM and LH analysis data; SL, HZ, DHM and LH written papers; KL production A web server was created; DHM and LH conceived and directed the project.

The author declares no competing interests.

This article is directly contributed by PNAS.

This article contains online support information https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2116269118/-/DCSupplemental.

↵*Theoretically, the alignment part requires O(k2n2) space. However, in practice, TurboFold II discards the positions whose alignment coincidence probability is less than the threshold, and only keeps the linear number of positions (51).

↵†The average sequence identity on approximately 2 million data sets (downloaded on July 25, 2021) is 0.9987.

This open access article is distributed under Creative Commons Attribution 4.0 (CC BY).

Thank you for your interest in advertising on PNAS.

Note: We only ask you to provide your email address so that the people you recommend the page to know that you want them to see it and that it is not spam. We do not capture any email addresses.

Feedback privacy/legal

Copyright © 2021 National Academy of Sciences. Online ISSN 1091-6490. PNAS is a partner of CHORUS, COPE, CrossRef, ORCID and Research4Life.