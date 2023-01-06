Analysis of Publicly Deposited SARS-CoV-2 Genome Sequences

In total, 2,212,827 sequences were retrieved from GISAID (deposited as of June 30, 2021; https://www.gisaid.org/, last accessed July 5, 2021). For downstream analysis, we retained only GISAID entries that corresponded to a human host and whose exact collection date (YYYY-mm-dd) was after December 1, 2019. This yielded 2,128,574 SARS-CoV-2 genome sequences (comprising 1291 unique PANGO lineages and 11,805 unique Spike mutations) from GISAID,16 spanning 188 countries and territories. To exclude potential sequencing artifacts, we excluded mutations present in <100 sequences, resulting in 1045 unique Spike protein mutations.
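The filtering steps described above can be sketched as follows. This is a minimal illustration only: the record layout (the `host`, `collection_date`, and `spike_mutations` fields) and the helper names are hypothetical, not the GISAID export format or the authors' code.

```python
from collections import Counter
from datetime import date, datetime

EARLIEST_COLLECTION = date(2019, 12, 1)   # collection date must be after this
MIN_SEQUENCES_PER_MUTATION = 100          # rarer mutations are dropped


def parse_exact_date(text):
    """Return a date for strings that parse as year-month-day, else None."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except (ValueError, TypeError):
        return None  # incomplete dates such as '2020-03' are rejected


def keep_record(record):
    """Keep human-host records with an exact collection date after the cutoff."""
    collected = parse_exact_date(record.get("collection_date"))
    return (
        record.get("host") == "Human"
        and collected is not None
        and collected > EARLIEST_COLLECTION
    )


def frequent_mutations(records, min_count=MIN_SEQUENCES_PER_MUTATION):
    """Return the mutations observed in at least `min_count` sequences."""
    counts = Counter()
    for record in records:
        # Count each mutation at most once per sequence.
        counts.update(set(record["spike_mutations"]))
    return {mutation for mutation, n in counts.items() if n >= min_count}
```

In this sketch the host label `"Human"` and the strict "after" comparison are assumptions about how the criteria would be encoded.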

Identification of surge-associated SARS-CoV-2 mutations

To identify mutations that are temporally associated with surges in COVID-19 cases during the pandemic, we assessed monthly mutation prevalence and test positivity over 3-month intervals in each country. For each of the 1045 mutations, the monthly mutation prevalence in a given country was calculated as follows:

$$\text{Mutation prevalence} = \frac{\text{Number of sequences with a mutation in a given month}}{\text{Total number of sequences deposited in that month}} \times 100$$
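As a minimal sketch of this calculation, the function below computes the prevalence of one mutation per country and month. The record fields (`country`, `month`, `spike_mutations`) are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def monthly_prevalence(records, mutation):
    """Percent of sequences per (country, month) that carry `mutation`."""
    totals = defaultdict(int)    # all sequences in the country-month
    carriers = defaultdict(int)  # sequences carrying the mutation
    for record in records:
        key = (record["country"], record["month"])  # month as 'YYYY-mm'
        totals[key] += 1
        if mutation in record["spike_mutations"]:
            carriers[key] += 1
    return {key: 100.0 * carriers[key] / totals[key] for key in totals}
```

Country-months with no sequences never appear in the result, so the denominator is always positive.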

PCR test positivity data were obtained from the OWID resource17,31 (https://github.com/owid/covid-19-data/tree/master/public/data, retrieved June 30, 2021). For each country, the monthly test positivity was calculated as follows:

$$\text{Test positivity} = \frac{\text{New cases in a given month (smoothed)}}{\text{New tests in that month (smoothed)}} \times 100$$
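A minimal sketch of the monthly test positivity calculation is given below. It assumes daily rows keyed by the OWID column names `location`, `date`, `new_cases_smoothed`, and `new_tests_smoothed`. Aggregating to months by summing the daily smoothed values is an assumption made here, since the text does not state how the monthly totals were formed.

```python
from collections import defaultdict


def monthly_test_positivity(daily_rows):
    """Percent test positivity per (location, month) from daily smoothed counts."""
    cases = defaultdict(float)
    tests = defaultdict(float)
    for row in daily_rows:
        # Skip days on which either smoothed value is missing.
        if row["new_cases_smoothed"] is None or row["new_tests_smoothed"] is None:
            continue
        key = (row["location"], row["date"][:7])  # 'YYYY-mm-dd' -> 'YYYY-mm'
        cases[key] += row["new_cases_smoothed"]
        tests[key] += row["new_tests_smoothed"]
    # Months without any reported tests are omitted to avoid division by zero.
    return {key: 100.0 * cases[key] / tests[key] for key in tests if tests[key] > 0}
```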

To identify surge-associated mutations, monthly mutation prevalence (for each mutation) and monthly test positivity were each classified as increasing (monotonically), decreasing (monotonically), or mixed over sliding 3-month intervals across the course of the pandemic. Mutations that had a monotonically increasing prevalence during an interval in which test positivity was also monotonically increasing were defined as ‘surge-associated mutations’. There were 89 such mutations.
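The trend classification over sliding 3-month windows can be sketched as follows. The function names are hypothetical, and treating "monotonically increasing" as strictly increasing is an assumption, as the text does not say how ties between months were handled.

```python
def classify_trend(values):
    """Label a short series as 'increasing', 'decreasing', or 'mixed'."""
    if len(values) < 2:
        return "mixed"  # a trend needs at least two points
    pairs = list(zip(values, values[1:]))
    if all(later > earlier for earlier, later in pairs):
        return "increasing"
    if all(later < earlier for earlier, later in pairs):
        return "decreasing"
    return "mixed"


def surge_windows(prevalence, positivity, window=3):
    """Start indices of windows in which both monthly series increase."""
    n = min(len(prevalence), len(positivity))
    return [
        start
        for start in range(n - window + 1)
        if classify_trend(prevalence[start:start + window]) == "increasing"
        and classify_trend(positivity[start:start + window]) == "increasing"
    ]
```

Under this sketch, a mutation would count as surge-associated in a country if `surge_windows` returns at least one window for its prevalence series paired with that country's positivity series.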

Surge-associated mutations versus mutations in CDC variants of interest and concern

To test the value of our method, we obtained the set of CDC variants of interest and concern as of July 13, 2021.8 At that time, there were four variants of concern (Alpha_B.1.1.7, Beta_B.1.351, Delta_B.1.617.2, and Gamma_P.1), seven variants of interest (Epsilon_B.1.427, Epsilon_B.1.429, Eta_B.1.525, Iota_B.1.526, Kappa_B.1.617.1, B.1.617.3, and Zeta_P.2), and no variants of high consequence. Across the 11 classified variants, there were 59 unique mutations (53 positions), of which 18 were found only in variants of interest, 29 were found only in variants of concern, and 12 were found in both variants of interest and variants of concern. After identifying surge-associated mutations as described above, we determined the proportion of mutations in CDC-classified variants that was captured by this approach.
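The final comparison amounts to a set intersection. The sketch below assumes that both mutation sets have already been assembled; the function name and the example sets in the usage note are illustrative, not the study's data.

```python
def capture_summary(variant_mutations, surge_associated):
    """Return the captured mutations and the fraction of variant mutations captured."""
    if not variant_mutations:
        raise ValueError("variant_mutations must not be empty")
    captured = variant_mutations & surge_associated
    return captured, len(captured) / len(variant_mutations)
```

For example, `capture_summary({"D614G", "N501Y"}, {"D614G"})` returns the captured set `{"D614G"}` together with the fraction 0.5.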

Assessment of mutation types for enrichment of surge-associated mutations

After identifying 92 surge-associated mutations, we tested whether any mutation type (deletion, insertion, or substitution) was enriched among the surge-associated mutations. To that end, we created a 2 × 3 contingency table showing the number of surge-associated and non-surge-associated mutations in each category. To determine whether one or more groups showed statistically significant enrichment, chi-square p values were calculated using the scipy.stats.chi2_contingency function of the SciPy package (v1.7.0) in Python v3.9.5. Post hoc Fisher’s exact tests were performed by constructing 2 × 2 contingency tables comparing each mutation type with all other mutation types. Odds ratios and corresponding 95% confidence intervals were then calculated using the scipy.stats.fisher_exact function and statsmodels.stats.contingency_tables.Table2x2, respectively, in Python v3.9.5.
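The enrichment analysis can be sketched as below. The counts are hypothetical placeholders, not the study's data. To keep the example dependent on SciPy alone, the 95% confidence interval is computed by hand with the log odds ratio (Woolf) approximation, which stands in for the statsmodels Table2x2 call named in the text.

```python
import math

from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical counts for illustration only.
# Rows: surge-associated, not surge-associated.
# Columns: deletion, insertion, substitution.
COUNTS = [
    [20, 2, 70],
    [60, 10, 880],
]


def omnibus_test(table):
    """Chi-square test of independence on the full 2 x 3 table."""
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return chi2, p_value, dof


def one_vs_rest(table, column):
    """Post hoc Fisher's exact test: one mutation type against all others."""
    a = table[0][column]       # surge-associated, this type
    b = sum(table[0]) - a      # surge-associated, other types
    c = table[1][column]       # not surge-associated, this type
    d = sum(table[1]) - c      # not surge-associated, other types
    odds_ratio, p_value = fisher_exact([[a, b], [c, d]])
    # Woolf interval; requires all four cells to be non-zero.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - 1.96 * se)
    high = math.exp(math.log(odds_ratio) + 1.96 * se)
    return odds_ratio, p_value, (low, high)
```

Calling `one_vs_rest(COUNTS, 0)` tests deletions against insertions and substitutions combined; the same call with column 1 or 2 covers the other two types.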

Identification of recurrently deleted regions of Spike proteins

Recurrently deleted regions (RDRs) were previously defined as four sites within the NTD that account for 90% of all Spike protein deletions, based on 146,795 SARS-CoV-2 sequences deposited with GISAID between December 1, 2019 and October 24, 2020.15

To formally identify the RDRs that emerged over the course of the pandemic, we analyzed the monthly distribution of Spike protein deletion counts for each amino acid (that is, the number of sequences in which a deletion of a particular amino acid was observed in a given month). For each month, we calculated the 95th percentile of the deletion count distribution. We then assigned each residue R to one of three categories (‘yes’, ‘no’, or ‘possible’) according to the criteria shown schematically in Table S2.
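The monthly 95th-percentile threshold can be sketched as follows. The input is a hypothetical mapping from residue position to the number of sequences with a deletion at that position in one month. The interpolation method (`"inclusive"`) is an assumption, since the text does not state which percentile definition was used, and the classification criteria of Table S2 are not reproduced here.

```python
from statistics import quantiles


def deletion_count_threshold(deletion_counts, percentile=95):
    """Percentile of the per-residue deletion counts for one month.

    `deletion_counts` maps residue position -> number of sequences with a
    deletion at that position; at least two residues are required.
    """
    values = list(deletion_counts.values())
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99.
    return quantiles(values, n=100, method="inclusive")[percentile - 1]
```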

Once each residue was classified in this way, any residue P falling into the ‘possible’ category was further analyzed to convert its label to ‘yes’ or ‘no’. Specifically, we stepped outward from P in each direction until the first residue labeled ‘yes’ or ‘no’ was encountered (that is, other residues labeled ‘possible’ were ignored). The ‘possible’ label was converted to ‘yes’ if a residue labeled ‘yes’ was encountered before a residue labeled ‘no’ in either direction. The ‘possible’ label was converted to ‘no’ if a residue labeled ‘no’ was encountered before a residue labeled ‘yes’ in both directions. With each residue classified as ‘yes’ or ‘no’, we merged windows of consecutive residues labeled ‘yes’ to define an updated set of Spike protein RDRs for the month.
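The label resolution and merging steps can be sketched as follows, assuming the per-residue labels for one month are given as a list ordered by position. The function names are hypothetical. The handling of a search that runs off the end of the protein without meeting a definite label is also an assumption: such a direction is treated here as contributing no ‘yes’.

```python
def nearest_definite(labels, start, step):
    """First 'yes'/'no' label found walking from `start` in direction `step`."""
    i = start + step
    while 0 <= i < len(labels):
        if labels[i] != "possible":
            return labels[i]
        i += step
    return None  # reached the end of the sequence without a definite label


def resolve_possible(labels):
    """Convert every 'possible' label to 'yes' or 'no'."""
    resolved = list(labels)
    for i, label in enumerate(labels):
        if label == "possible":
            # Search the ORIGINAL labels so other 'possible' residues are ignored.
            found = (nearest_definite(labels, i, -1), nearest_definite(labels, i, +1))
            resolved[i] = "yes" if "yes" in found else "no"
    return resolved


def merge_yes_windows(labels, first_position=1):
    """Merge runs of consecutive 'yes' labels into (start, end) regions."""
    regions, start = [], None
    for offset, label in enumerate(labels):
        position = first_position + offset
        if label == "yes":
            if start is None:
                start = position
        elif start is not None:
            regions.append((start, position - 1))
            start = None
    if start is not None:
        regions.append((start, first_position + len(labels) - 1))
    return regions
```

Applying `merge_yes_windows(resolve_possible(labels))` to one month's labels gives that month's candidate RDRs as inclusive residue ranges.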

Temporal analysis of expansion of recurrent deletion regions

To assess the expansion of regions undergoing deletions over time, we generated a time series tile plot showing each month in which a particular residue was identified as part of an RDR (based on all GISAID sequences deposited in that month). Plotted residues were defined based on the RDR definitions above, building on the previously defined regions.15 The plot distinguishes amino acids that were included in a previously defined RDR from amino acids that (1) are part of a newly emerging RDR or (2) represent a contiguous extension of a previously defined RDR.

Structural analysis of SARS-CoV-2 Spike protein

Structural analysis and illustrations were performed with PyMOL (version 2.3.4). The cryo-electron microscopy structure of the Spike protein characterizing its interaction with the neutralizing antibody 4A8 (PDB identifier: 7C2L)18 was obtained from the PDB.

Amplicon sequencing of SARS-CoV-2 genomes from breakthrough infected individuals

This is a retrospective study of individuals undergoing polymerase chain reaction (PCR) testing for suspected SARS-CoV-2 infection at the Mayo Clinic and hospitals affiliated with the Mayo Health System.

SARS-CoV-2 RNA-positive upper respiratory tract swab specimens from patients with vaccine breakthrough infection or COVID-19 reinfection were sequenced using the commercially available Ion AmpliSeq SARS-CoV-2 Research Panel (Life Technologies Corp., South San Francisco, CA), which is based on the “sequencing-by-synthesis” method. This assay amplifies 237 sequences of 125-275 base pairs in length, covering 99% of the SARS-CoV-2 genome. Viral RNA was first manually extracted and purified from these clinical specimens using the MagMAX™ Viral/Pathogen Nucleic Acid Isolation Kit (Life Technologies Corp.), followed by automated reverse transcription PCR (RT-PCR) of viral sequences, DNA library preparation (enzymatic shearing, adapter ligation, purification, and normalization), DNA template preparation, and sequencing on the automated Genexus™ Integrated Sequencer (Life Technologies Corp.) using Genexus™ software version 6.2.1. A no-template control and a positive SARS-CoV-2 control were included in each assay run for quality control purposes. The latest version of the web-based application tool Pangolin32 was used for assignment of SARS-CoV-2 lineages. Nextclade33 was used for viral clade assignment, phylogenetic analysis, and S codon mutation calling, in comparison with the wild-type reference sequence of SARS-CoV-2 Wuhan-Hu-1 (lineage B, clade 19A).

SARS-CoV-2 sequences are available in the GISAID database (https://gisaid.org/). The database identifiers are as follows: EPI_ISL_12916271, EPI_ISL_12916270, EPI_ISL_12916273, EPI_ISL_12916272, EPI_ISL_12916275, EPI_ISL_12916310, EPI_ISL_12916274, EPI_ISL_12916277, EPI_ISL_12916276, EPI_ISL_12916313, EPI_ISL_12916279, EPI_ISL_12916314, EPI_ISL_12916278, EPI_ISL_12916311, EPI_ISL_12916312, EPI_ISL_12916317, EPI_ISL_12916318, EPI_ISL_12916315, EPI_ISL_12916316, EPI_ISL_12916319, EPI_ISL_12916260, EPI_ISL_12916262, EPI_ISL_12916261, EPI_ISL_12916264, EPI_ISL_12916263, EPI_ISL_12916266, EPI_ISL_12916265, EPI_ISL_12916302, EPI_ISL_12916268, EPI_ISL_12916303, EPI_ISL_12916267, EPI_ISL_12916300, EPI_ISL_12916301, EPI_ISL_12916269, EPI_ISL_12916306, EPI_ISL_12916307, EPI_ISL_12916304, EPI_ISL_12916305, EPI_ISL_12916308, EPI_ISL_12916309, EPI_ISL_12916251, EPI_ISL_12916250, EPI_ISL_12916253, EPI_ISL_12916252, EPI_ISL_12916255, EPI_ISL_12916254, EPI_ISL_12916257, EPI_ISL_12916256, EPI_ISL_12916259, EPI_ISL_12916258, EPI_ISL_12916240, EPI_ISL_12916242, EPI_ISL_12916241, EPI_ISL_12916244, EPI_ISL_12916243, EPI_ISL_12916246, EPI_ISL_12916245, EPI_ISL_12916248, EPI_ISL_12916247, EPI_ISL_12916249, EPI_ISL_12916239, EPI_ISL_12916238, EPI_ISL_12916290, EPI_ISL_12916291, EPI_ISL_12916294, EPI_ISL_12916295, EPI_ISL_12916292, EPI_ISL_12916293, EPI_ISL_12916298, EPI_ISL_12916331, EPI_ISL_12916332, EPI_ISL_12916299, EPI_ISL_12916296, EPI_ISL_12916297, EPI_ISL_12916330, EPI_ISL_12916335, EPI_ISL_12916336, EPI_ISL_12916333, EPI_ISL_12916334, EPI_ISL_12916339, EPI_ISL_12916337, EPI_ISL_12916338, EPI_ISL_12916280, EPI_ISL_12916283, EPI_ISL_12916284, EPI_ISL_12916281, EPI_ISL_12916282, EPI_ISL_12916320, EPI_ISL_12916287, EPI_ISL_12916321, EPI_ISL_12916288, EPI_ISL_12916285, EPI_ISL_12916286, EPI_ISL_12916324, EPI_ISL_12916325, EPI_ISL_12916289, EPI_ISL_12916322, EPI_ISL_12916323, EPI_ISL_12916328, EPI_ISL_12916329, EPI_ISL_12916326, EPI_ISL_12916327.

Ethical approval

This study was reviewed by the Mayo Clinic IRB and was determined to be exempt from human subjects research requirements because it involved only secondary use of anonymized data for analysis (45 CFR 46.104d, Category 4).