Department of Anthropology and Human Genetics, School of Life Sciences, Fudan University, Shanghai 200438, China; Human Phenome Institute, Fudan University, Shanghai 200438, China.
Criminal Investigation Department of Yunnan Province, Kunming 650021, Yunnan, China.
Forensic Sci Int Genet. 2022 Mar;57:102659. doi: 10.1016/j.fsigen.2021.102659. Epub 2021 Dec 29.
Improving the resolution of the widely used Y-chromosomal short tandem repeat (Y-STR) datasets is of great importance for forensic investigators, yet current approaches are largely limited to adding more Y-STR loci. In this research, a regional Y-DNA database was investigated to improve Y-STR haplotype resolution using a Y-SNP Pedigree Tagging System that includes 24 Y-chromosomal single nucleotide polymorphism (Y-SNP) loci. This pilot study was conducted in the Han population of Zhaoyang, Yunnan, China, and 3473 unrelated male individuals were enrolled. Based on the male haplogroup data under different panels, matched or near-matching (NM) Y-STR haplotype pairs from different haplogroups indicated the critical role of haplogroups in improving regional Y-STR haplotype resolution. A classic median-joining network analysis was performed using Y-STR or Y-STR/Y-SNP data to reconstruct population substructures, which revealed the ability of Y-SNPs to correct misclassifications from Y-STRs. Additionally, population substructures were reconstructed using multiple unsupervised or supervised dimensionality reduction methods, which indicated the potential of Y-STR haplotypes for predicting Y-SNP haplogroups. Haplogroup prediction models were built using nine publicly accessible machine-learning (ML) approaches. The best prediction accuracy reached 99.71% for major haplogroups and 98.54% for detailed haplogroups. Potential influences on prediction accuracy were assessed by adjusting the number of Y-STR loci, selecting Y-STR loci with various mutabilities, and performing data processing. ML-based predictors generally achieved better prediction accuracy than two available predictors (Nevgen and EA-YPredictor).
Three tree models developed on the Yfiler Plus panel with unprocessed input data showed strong generalization ability in classifying various Chinese Han subgroups (validation dataset). In conclusion, this study revealed the significance and application prospects of Y-SNP haplogroups in improving regional Y-STR databases. Y-SNP haplogroups can be used to discriminate NM Y-STR haplotype pairs, and developing haplogroup prediction tools for forensic Y-STR databases is important for improving the accuracy of biogeographic ancestry inferences.
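The idea that Y-SNP haplogroups can separate matched or near-matching Y-STR haplotype pairs can be illustrated with a minimal sketch. It is not the study's pipeline. It assumes that "near-matching" means two haplotypes differ at no more than a chosen number of loci (the `max_mismatches` parameter is hypothetical, since the abstract does not state the threshold), and the sample IDs, haplogroup labels, and allele values are invented toy data.

```python
from itertools import combinations


def count_mismatches(hap_a, hap_b):
    """Count the loci at which two Y-STR haplotypes carry different alleles."""
    return sum(1 for locus in hap_a if hap_a[locus] != hap_b[locus])


def cross_haplogroup_pairs(samples, max_mismatches=1):
    """Return matched or near-matching pairs that belong to different haplogroups.

    samples maps a sample ID to (haplogroup, {locus: allele}).
    max_mismatches is an assumed near-match threshold, not a value from the study.
    """
    pairs = []
    for (id_a, (hg_a, hap_a)), (id_b, (hg_b, hap_b)) in combinations(samples.items(), 2):
        diff = count_mismatches(hap_a, hap_b)
        # The Y-STR profiles alone cannot separate these two samples,
        # but their Y-SNP haplogroups can.
        if diff <= max_mismatches and hg_a != hg_b:
            pairs.append((id_a, id_b, diff))
    return pairs


# Toy data (invented for illustration only).
samples = {
    "S1": ("O2a", {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS392": 13}),
    "S2": ("C2", {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS392": 13}),
    "S3": ("O2a", {"DYS19": 15, "DYS390": 24, "DYS391": 10, "DYS392": 13}),
    "S4": ("N1", {"DYS19": 14, "DYS390": 25, "DYS391": 11, "DYS392": 14}),
}

for id_a, id_b, diff in cross_haplogroup_pairs(samples):
    print(f"{id_a} vs {id_b}: {diff} mismatching loci, different haplogroups")
```

In this toy set, S1 and S2 share an identical Y-STR profile and S2 and S3 differ at one locus, yet each pair spans two haplogroups, so a Y-SNP label would distinguish them where the Y-STR profile alone would not.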