Petabase-scale sequence alignment catalyses viral discovery.

Affiliations

Independent researcher, Corte Madera, CA, USA.

Independent researcher, Vancouver, British Columbia, Canada.

Publication information

Nature. 2022 Feb;602(7895):142-147. doi: 10.1038/s41586-021-04332-2. Epub 2022 Jan 26.

Abstract

Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, which (at the time of writing) exceeds 20 petabases and is growing exponentially. Here we developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA-dependent RNA polymerase and identified well over 10^5 novel RNA viruses, thereby expanding the number of known species by roughly an order of magnitude. We characterized novel viruses related to coronaviruses, hepatitis delta virus and huge phages, respectively, and analysed their environmental reservoirs. To catalyse the ongoing revolution of viral discovery, we established a free and comprehensive database of these data and tools. Expanding the known sequence diversity of viruses can reveal the evolutionary origins of emerging pathogens and improve pathogen surveillance for the anticipation and mitigation of future pandemics.
