Welcome to the big leaves: Best practices for improving genome annotation in non-model plant genomes.

Author information

Vuruputoor Vidya S, Monyak Daniel, Fetter Karl C, Webster Cynthia, Bhattarai Akriti, Shrestha Bikash, Zaman Sumaira, Bennett Jeremy, McEvoy Susan L, Caballero Madison, Wegrzyn Jill L

Affiliation

Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269, USA.

Publication information

Appl Plant Sci. 2023 Aug 8;11(4):e11533. doi: 10.1002/aps3.11533. eCollection 2023 Jul-Aug.

Abstract

PREMISE

Robust standards to evaluate quality and completeness are lacking in eukaryotic structural genome annotation, as genome annotation software is developed using model organisms and typically lacks benchmarking to comprehensively evaluate the quality and accuracy of the final predictions. The annotation of plant genomes is particularly challenging due to their large sizes, abundant transposable elements, and variable ploidies. This study investigates the impact of genome quality, complexity, sequence read input, and method on protein-coding gene predictions.

METHODS

The impact of repeat masking, long-read and short-read inputs, and de novo and genome-guided protein evidence was examined in the context of the popular BRAKER and MAKER workflows for five plant genomes. The annotations were benchmarked for structural traits and sequence similarity.

RESULTS

Benchmarks that reflect gene structures, reciprocal similarity search alignments, and mono-exonic/multi-exonic gene counts provide a more complete view of annotation accuracy. Transcripts derived from RNA-read alignments alone are not sufficient for genome annotation. Gene prediction workflows that combine evidence-based and ab initio approaches are recommended, and a combination of short and long reads can improve genome annotation. Adding protein evidence from de novo assemblies, genome-guided transcriptome assemblies, or full-length proteins from OrthoDB generates more putative false positives as implemented in the current workflows. Post-processing with functional and structural filters is highly recommended.
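As an illustration of one of the structural benchmarks mentioned above, the mono-exonic/multi-exonic count can be sketched in a few lines of Python. This is a minimal sketch, not the authors' pipeline: the function name `mono_multi_counts` and the toy GFF3 records are hypothetical, exons are counted per transcript (via the `Parent` attribute) rather than per gene, and the input is assumed to be a well-formed, tab-delimited GFF3 file.

```python
from collections import defaultdict


def mono_multi_counts(gff3_lines):
    """Return (mono_exonic, multi_exonic) transcript counts from GFF3 lines.

    Hypothetical helper for illustration; counts exon features per Parent ID.
    """
    exons_per_transcript = defaultdict(int)
    for line in gff3_lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # GFF3 records have 9 tab-separated columns; column 3 is the feature type.
        if len(fields) != 9 or fields[2] != "exon":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            pair.strip().split("=", 1)
            for pair in fields[8].split(";")
            if "=" in pair
        )
        # An exon may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons_per_transcript[parent] += 1
    mono = sum(1 for n in exons_per_transcript.values() if n == 1)
    multi = sum(1 for n in exons_per_transcript.values() if n > 1)
    return mono, multi


# Toy records (made up for this sketch): t1 has 2 exons, t2 has 1, t3 has 3.
example = [
    "##gff-version 3",
    "chr1\tsrc\texon\t100\t200\t.\t+\t.\tID=e1;Parent=t1",
    "chr1\tsrc\texon\t300\t400\t.\t+\t.\tID=e2;Parent=t1",
    "chr1\tsrc\texon\t900\t1500\t.\t-\t.\tID=e3;Parent=t2",
    "chr2\tsrc\texon\t10\t50\t.\t+\t.\tID=e4;Parent=t3",
    "chr2\tsrc\texon\t80\t120\t.\t+\t.\tID=e5;Parent=t3",
    "chr2\tsrc\texon\t200\t260\t.\t+\t.\tID=e6;Parent=t3",
]

mono, multi = mono_multi_counts(example)
ratio = mono / multi if multi else float("nan")
print(f"mono-exonic: {mono}, multi-exonic: {multi}, mono:multi ratio: {ratio:.2f}")
```

Comparing this ratio across annotation runs is one way to spot runs with a suspicious excess of single-exon predictions. What counts as a plausible ratio depends on the species, and this sketch sets no threshold.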

DISCUSSION

While the annotation of non-model plant genomes remains complex, this study provides recommendations for inputs and methodological approaches. We discuss a set of best practices to generate an optimal plant genome annotation and present a more robust set of metrics to evaluate the resulting predictions.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f46/10439824/c2462c58aeae/APS3-11-e11533-g003.jpg
