Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, UK.
Department of Zoology, University of Oxford, Oxford, UK.
BMC Genomics. 2022 Feb 11;23(1):121. doi: 10.1186/s12864-022-08358-2.
More than 2 million SARS-CoV-2 genome sequences have been generated and shared since the start of the COVID-19 pandemic, and they constitute a vital information source for outbreak control, disease surveillance, and public health policy. The Pango dynamic nomenclature is a widely used system for classifying and naming genetically distinct lineages of SARS-CoV-2, including variants of concern, and is based on the analysis of complete or near-complete virus genomes. However, for several reasons, nucleotide sequences may be generated that cover only the spike gene of SARS-CoV-2. It is therefore important to understand how much information about Pango lineage status is contained in spike-only nucleotide sequences. Here we explore how spike-only nucleotide sequences might be reliably designated and assigned to Pango lineages. We survey the genetic diversity of such sequences and investigate the information they contain about Pango lineage status.
Although many lineages, including the main variants of concern, can be identified clearly using spike-only sequences, some spike-only sequences are shared among tens or hundreds of Pango lineages. To facilitate the classification of SARS-CoV-2 lineages using subgenomic sequences, we introduce the notion of assigning such sequences to a "lineage set", which represents the range of Pango lineages that are consistent with the observed mutations in a given spike sequence.
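The lineage-set idea can be illustrated with a minimal Python sketch. It assumes that a reference collection of whole genomes with known Pango lineages is available, reduces each genome to its spike sequence, and records every lineage observed with that spike sequence. The function names, the exact-match rule, and the toy sequences and lineage labels are hypothetical and chosen only for illustration; they are not taken from the paper, whose actual procedure may differ (for example in how ambiguous bases or partial spike coverage are handled).

```python
from collections import defaultdict


def build_lineage_sets(records):
    """Group lineages by spike sequence.

    records: iterable of (spike_sequence, pango_lineage) pairs, assumed to
    come from whole genomes whose lineage is already known.
    Returns a dict mapping each distinct spike sequence to the frozenset of
    lineages in which it was observed (its "lineage set").
    """
    grouped = defaultdict(set)
    for spike_sequence, lineage in records:
        grouped[spike_sequence].add(lineage)
    return {spike: frozenset(lineages) for spike, lineages in grouped.items()}


def assign_lineage_set(spike_sequence, lineage_sets):
    """Look up the lineage set for a spike-only sequence.

    Assumption: exact string match against the reference collection.
    An unseen spike sequence yields an empty set.
    """
    return lineage_sets.get(spike_sequence, frozenset())


# Toy data: placeholder sequences and placeholder lineage labels.
records = [
    ("ACGT", "lineage_1"),
    ("ACGT", "lineage_2"),  # same spike sequence seen in a second lineage
    ("ACGA", "lineage_3"),  # spike sequence unique to one lineage
]
lineage_sets = build_lineage_sets(records)
```

In this toy collection, `assign_lineage_set("ACGA", lineage_sets)` returns a set with a single lineage, which corresponds to a lineage identifiable from spike alone, whereas `assign_lineage_set("ACGT", lineage_sets)` returns two lineages, which corresponds to a spike sequence shared among several lineages.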
We find that many lineages, including the main variants of concern, can be reliably identified from the spike gene alone, and we define lineage sets to represent the lineage precision that can be achieved using spike-only nucleotide sequences. These data provide a foundation for the development of software tools that can assign newly generated spike nucleotide sequences to Pango lineage sets.