Department of Biological Sciences and Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235.
Laboratory of Genetics, Department of Energy (DOE) Great Lakes Bioenergy Research Center, Center for Genomic Science Innovation, J. F. Crow Institute for the Study of Evolution, Wisconsin Energy Institute, University of Wisconsin-Madison, Madison, WI 53726.
Proc Natl Acad Sci U S A. 2024 Apr 30;121(18):e2315314121. doi: 10.1073/pnas.2315314121. Epub 2024 Apr 26.
How genomic differences contribute to phenotypic differences is a major question in biology. The recently characterized genomes, isolation environments, and qualitative patterns of growth on 122 sources and conditions of 1,154 strains from 1,049 fungal species (nearly all known) in the yeast subphylum Saccharomycotina provide a powerful, yet complex, dataset for addressing this question. We used a random forest algorithm trained on these genomic, metabolic, and environmental data to predict growth on several carbon sources with high accuracy. Known structural genes involved in assimilation of these sources and presence/absence patterns of growth on other sources were important features contributing to prediction accuracy. By further examining growth on galactose, we found that it can be predicted with high accuracy from either genomic (92.2%) or growth data (82.6%) but not from isolation environment data (65.6%). Prediction accuracy was even higher (93.3%) when we combined genomic and growth data. After the GALactose utilization genes, the most important feature for predicting growth on galactose was growth on galactitol, raising the hypothesis that several species in two orders, Serinales and Pichiales (containing the emerging pathogen Candida auris and the genus Ogataea, respectively), have an alternative galactose utilization pathway because they lack the GAL genes. Growth and biochemical assays confirmed that several of these species utilize galactose through an alternative oxidoreductive D-galactose pathway, rather than the canonical GAL pathway. Machine learning approaches are powerful for investigating the evolution of the yeast genotype-phenotype map, and their application will uncover novel biology, even in well-studied traits.
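To make the comparison of feature sets concrete, the following is a minimal sketch of a random forest (bootstrap samples plus a random subset of features at each split) applied to binary presence/absence data. It is not the authors' pipeline: the data are synthetic, the feature names and the rule that generates the labels are assumptions chosen only to mimic the pattern described in the abstract, and the forest is a small from-scratch implementation using the Python standard library.

```python
import random
from collections import Counter

# Hypothetical feature names; columns 0-4 are "genomic", column 5 is "growth".
FEATURES = ["GAL1", "GAL7", "GAL10", "unrelated_gene_a", "unrelated_gene_b",
            "grows_on_galactitol"]

def make_toy_data(n, seed):
    """Synthetic strains. ASSUMED rule (illustration only): a strain grows on
    galactose if it has the GAL genes or if it grows on galactitol."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        gal = int(rng.random() < 0.5)         # GAL genes gained/lost together
        galactitol = int(rng.random() < 0.3)  # stand-in for an alternative pathway
        rows.append([gal, gal, gal, rng.randint(0, 1), rng.randint(0, 1), galactitol])
        labels.append(int(gal or galactitol))
    return rows, labels

def gini(labels):
    """Gini impurity of a non-empty list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def build_tree(rows, labels, n_try, rng, depth=0, max_depth=6):
    """Grow one tree; a leaf is a 0/1 label, a node is (feature, child0, child1)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    best = None
    order = rng.sample(range(len(rows[0])), len(rows[0]))  # random feature order
    for tried, f in enumerate(order, start=1):
        # Examine at least n_try features; keep going until one valid split exists.
        if tried > n_try and best is not None:
            break
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] == 1]
        if not left or not right:
            continue  # feature is constant in this node
        score = len(left) * gini(left) + len(right) * gini(right)
        if best is None or score < best[0]:
            best = (score, f)
    if best is None:
        return majority
    f = best[1]
    children = []
    for value in (0, 1):
        keep = [i for i, x in enumerate(rows) if x[f] == value]
        children.append(build_tree([rows[i] for i in keep],
                                   [labels[i] for i in keep],
                                   n_try, rng, depth + 1, max_depth))
    return (f, children[0], children[1])

def predict_tree(tree, x):
    while isinstance(tree, tuple):
        f, child0, child1 = tree
        tree = child1 if x[f] == 1 else child0
    return tree

def fit_forest(rows, labels, n_trees=50, seed=0):
    rng = random.Random(seed)
    n_try = max(1, round(len(rows[0]) ** 0.5))  # about sqrt(#features) per split
    forest = []
    for _ in range(n_trees):
        boot = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(build_tree([rows[i] for i in boot],
                                 [labels[i] for i in boot], n_try, rng))
    return forest

def predict_forest(forest, x):
    votes = sum(predict_tree(t, x) for t in forest)  # majority vote over trees
    return 1 if 2 * votes >= len(forest) else 0

def accuracy(cols):
    """Train on one synthetic set, score on another, using only columns `cols`."""
    train_x, train_y = make_toy_data(300, seed=1)
    test_x, test_y = make_toy_data(200, seed=2)
    pick = lambda rows: [[r[c] for c in cols] for r in rows]
    forest = fit_forest(pick(train_x), train_y)
    hits = sum(predict_forest(forest, x) == y
               for x, y in zip(pick(test_x), test_y))
    return hits / len(test_y)

acc_genomic = accuracy([0, 1, 2, 3, 4])      # gene presence/absence only
acc_combined = accuracy([0, 1, 2, 3, 4, 5])  # genes plus growth on galactitol
print(f"genomic only: {acc_genomic:.3f}  genomic + growth: {acc_combined:.3f}")
```

In this toy setting, strains that lack the GAL genes but still grow on galactose cannot be separated using gene presence alone, so adding the galactitol growth column should raise accuracy. The numbers it prints reflect only the assumed generating rule and say nothing about the accuracies reported in the study.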