Igolkina Anna A, Vorbrugg Sebastian, Rabanal Fernando A, Liu Hai-Jun, Ashkenazy Haim, Kornienko Aleksandra E, Fitz Joffrey, Collenberg Max, Kubica Christian, Mollá Morales Almudena, Jaegle Benjamin, Wrightsman Travis, Voloshin Vitaly, Bezlepsky Alexander D, Llaca Victor, Nizhynska Viktoria, Reichardt Ilka, Bezrukov Ilja, Lanz Christa, Bemm Felix, Flood Pádraic J, Nemomissa Sileshi, Hancock Angela, Guo Ya-Long, Kersey Paul, Weigel Detlef, Nordborg Magnus
Gregor Mendel Institute, Austrian Academy of Sciences, Vienna, Austria.
Max Planck Institute for Biology Tübingen, Tübingen, Germany.
Nat Genet. 2025 Aug 19. doi: 10.1038/s41588-025-02293-0.
Making sense of whole-genome polymorphism data is challenging, but it is essential for overcoming the biases in SNP data. Here we analyze 27 genomes of Arabidopsis thaliana to illustrate these issues. Genome size variation is mostly due to tandem repeat regions that are difficult to assemble. However, while the rest of the genome varies little in length, it is full of structural variants, mostly due to transposon insertions. Because of this, the pangenome coordinate system grows rapidly with sample size and, even for n = 27, ends up 70% larger than any single genome. Finally, we show how short-read data are biased by read mapping: SNP calling is biased by the choice of reference genome, and both transcriptome and methylome profiling results are affected by mapping reads to a reference genome rather than to the genome of the assayed individual.
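The claim that the pangenome coordinate system outgrows any single genome can be illustrated with a toy model. The sketch below is not the authors' method: it assumes every genome shares one core of fixed length and differs only by insertions, each of which gets its own stretch of pangenome coordinates. The function names, insertion IDs, and lengths are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import Dict, List

# Toy assumption: a genome is a shared core plus a set of insertions,
# given as a dict mapping a hypothetical insertion ID to its length in bp.
Insertions = Dict[str, int]


def genome_length(core_len: int, insertions: Insertions) -> int:
    """Length of one genome: the shared core plus its own insertions."""
    return core_len + sum(insertions.values())


def pangenome_length(core_len: int, genomes: List[Insertions]) -> int:
    """Length of the toy pangenome coordinate system.

    Every distinct insertion seen in any genome occupies its own
    coordinates, so the total is the core plus the union of insertions.
    """
    union: Insertions = {}
    for insertions in genomes:
        union.update(insertions)
    return core_len + sum(union.values())


def growth_curve(core_len: int, genomes: List[Insertions]) -> List[int]:
    """Pangenome length after adding the first 1, 2, ..., n genomes."""
    return [
        pangenome_length(core_len, genomes[:n])
        for n in range(1, len(genomes) + 1)
    ]


# Made-up example data: three genomes with partly shared insertions.
CORE_LEN = 100
GENOMES: List[Insertions] = [
    {"te1": 10, "te2": 5},
    {"te2": 5, "te3": 20},
    {"te4": 15},
]

if __name__ == "__main__":
    print("single genomes:", [genome_length(CORE_LEN, g) for g in GENOMES])
    print("pangenome growth:", growth_curve(CORE_LEN, GENOMES))
```

In this toy data set each genome stays close to the core length, yet the coordinate system keeps growing as genomes are added, because insertions private to one genome still need their own coordinates. Real pangenome construction also has to handle deletions, inversions, and nested or overlapping variants, which this sketch ignores.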