Feng Yuanyuan, Shi Junchao, Li Zhanwei, Li Yongqian, Yang Jiaxi, Huang Shisheng, Zheng Jinfang, Han Wei, Qiao Yunbo, Zhang Jun, Liu Qi, Yang Yao, Hu Chunyi, Wu Lina, Zhang Xiaokang, Tang Jin, Huang Xingxu, Ma Peixiang
Research Center for Life Sciences Computing, Zhejiang Lab, Hangzhou, China.
Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
Nat Commun. 2025 Aug 23;16(1):7877. doi: 10.1038/s41467-025-63160-4.
CRISPR-Cas systems have revolutionized life science. Metagenomes contain millions of unknown Cas proteins, and traditional mining relies on protein sequence alignments. In this work, we employ an evolutionary-scale language model (ESM) to learn information beyond sequences. Trained with CRISPR-Cas data, ESM accurately identifies Cas proteins without alignment. Limited experimental data restrict feature prediction, but integrating ESM with machine learning enables trans-cleavage activity prediction for uncharacterized Cas12a. We discover 7 undocumented Cas12a subtypes with unique CRISPR loci. Structural analyses reveal 8 subtypes of Cas1, Cas2, and Cas4. Cas12a subtypes display distinct 3D folds. Cryo-EM analyses unveil unique RNA interactions with the uncharacterized Cas12a. These proteins show distinct double-strand and single-strand DNA cleavage preferences and broad PAM recognition. Finally, we establish a specific detection strategy for an oncogene SNP that lacks a traditional Cas12a PAM. This study highlights the potential of language models in exploring undocumented Cas protein function via gene cluster classification.
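The alignment-free idea in the abstract (map each protein to a fixed-length vector, then classify in vector space) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The `embed` function below is a stand-in that uses amino-acid composition instead of ESM embeddings, the nearest-centroid classifier is an assumed choice, and the sequences and labels (`family_A`, `family_B`) are made-up placeholders, not real Cas proteins.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(seq):
    """Stand-in embedding (assumption): amino-acid frequency vector.

    A real pipeline would use a protein language model embedding here.
    """
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fit(labelled):
    """Build one centroid per label from (sequence, label) pairs."""
    by_label = {}
    for seq, label in labelled:
        by_label.setdefault(label, []).append(embed(seq))
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def predict(centroids, seq):
    """Assign the label whose centroid is most similar; no alignment used."""
    vec = embed(seq)
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Toy, made-up sequences (placeholders, not real proteins).
TOY_TRAINING = [
    ("KKKKLKKEKK", "family_A"),
    ("KKRKKLKKKE", "family_A"),
    ("GGSGGAGGSG", "family_B"),
    ("GSGGGAGGGS", "family_B"),
]

model = fit(TOY_TRAINING)
print(predict(model, "KKEKKKLKRK"))
print(predict(model, "GGGSGAGGGG"))
```

Swapping `embed` for a learned embedding leaves the rest of the sketch unchanged, which is the practical appeal of classifying in embedding space instead of by pairwise alignment.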