Khaiwal Sakshi, De Chiara Matteo, Barré Benjamin P, Barrio-Hernandez Inigo, Stenberg Simon, Beltrao Pedro, Warringer Jonas, Liti Gianni
CNRS, INSERM, IRCAN, Côte d'Azur University, Nice, France.
Institute of Molecular Systems Biology, ETH Zürich, Zürich, 8093, Switzerland.
Mol Syst Biol. 2025 Nov;21(11):1466-1489. doi: 10.1038/s44320-025-00136-y. Epub 2025 Sep 1.
Most organismal traits result from the complex interplay of many genetic and environmental factors, which makes them difficult to predict. Here, we used machine learning (ML) models to explore phenotype prediction for 223 traits measured across 1011 genome-sequenced Saccharomyces cerevisiae strains isolated worldwide. We benchmarked an ML pipeline with multiple linear and non-linear models to predict phenotypes from genotypes and gene expression, and identified gradient boosting machines as the best-performing model. Gene function disruption scores and gene presence/absence emerged as the best predictors, suggesting a considerable contribution of the accessory genome to controlling phenotypes. Prediction accuracy varied broadly among phenotypes, with stress resistance being easier to predict than growth across nutrients. ML identified relevant genomic features linked to phenotypes, including high-impact variants with established relationships to phenotypes, even though these variants are rare in the population. Near-perfect accuracies were achieved when other phenomics data, mostly measured in similar conditions, were used as predictors, suggesting that useful information can be conveyed across phenotypes. Overall, our study underscores the power of ML to interpret the functional outcome of genetic variants.
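As a rough illustration of predicting a phenotype from gene presence/absence with gradient boosting, the sketch below fits a deliberately simplified booster (one-split decision stumps, squared-error loss) to synthetic data. It is not the authors' pipeline: the data, the two "causal" genes, the effect sizes, and all function names (`fit_stump`, `fit_gbm`, `predict`, `r_squared`) are assumptions made for the example, and a real analysis would use an established gradient boosting library with proper cross-validation.

```python
import random

def fit_stump(X, residuals):
    """Pick the binary feature whose split most reduces squared error."""
    n, total = len(X), sum(residuals)
    best = None  # (gain, feature index, mean if absent, mean if present)
    for j in range(len(X[0])):
        n1 = sum(row[j] for row in X)
        n0 = n - n1
        if n1 == 0 or n0 == 0:
            continue  # feature is constant, no split possible
        sum1 = sum(r for row, r in zip(X, residuals) if row[j] == 1)
        sum0 = total - sum1
        # Drop in squared error versus predicting the overall residual mean.
        gain = sum1 ** 2 / n1 + sum0 ** 2 / n0 - total ** 2 / n
        if best is None or gain > best[0]:
            best = (gain, j, sum0 / n0, sum1 / n1)
    return best

def fit_gbm(X, y, n_rounds=100, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits the residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps, importance = [], [0.0] * len(X[0])
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        best = fit_stump(X, residuals)
        if best is None:
            break
        gain, j, m0, m1 = best
        stumps.append((j, learning_rate * m0, learning_rate * m1))
        importance[j] += gain  # accumulated error reduction per feature
        pred = [p + learning_rate * (m1 if row[j] else m0)
                for p, row in zip(pred, X)]
    return base, stumps, importance

def predict(model, X):
    base, stumps, _ = model
    return [base + sum(v1 if row[j] else v0 for j, v0, v1 in stumps)
            for row in X]

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot

# Synthetic data (an assumption for illustration): 200 strains, 20 accessory
# genes, and a phenotype driven by genes 0 and 3 plus Gaussian noise.
rng = random.Random(0)
X = [[rng.randint(0, 1) for _ in range(20)] for _ in range(200)]
y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0, 0.3) for row in X]

X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

model = fit_gbm(X_train, y_train)
r2 = r_squared(y_test, predict(model, X_test))

# Rank genes by accumulated error reduction and keep the two strongest.
importance = model[2]
top_two = sorted(range(20), key=lambda j: importance[j], reverse=True)[:2]
print("held-out R^2:", round(r2, 3), "top genes:", top_two)
```

The accumulated gain per feature plays the role of a feature-importance score here, which is one simple way a fitted model can point back to the genomic features most associated with a trait.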