School of Life Sciences, Shanghai University, Shanghai, 200444, People's Republic of China.
Department of Pharmacy, Shanghai Children's Medical Center, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China.
Protein J. 2024 Oct;43(5):983-996. doi: 10.1007/s10930-024-10230-z. Epub 2024 Sep 7.
Protein solubility is a critical parameter that determines the stability, activity, and functionality of proteins, with broad implications in biotechnology and biochemistry. Accurate prediction and control of protein solubility are essential for successful protein expression and purification in research and industrial settings. This study gathered data on soluble and insoluble proteins. The proteins were mapped to STRING and characterized by functional and structural features. All functional/structural features were integrated into a 5768-dimensional binary vector that encodes each protein. Seven feature-ranking algorithms were employed to analyze the functional/structural features, yielding seven feature lists. Each list was then subjected to incremental feature selection with four classification algorithms in turn, to build effective classification models and to identify the functional/structural features most important for classification. Several essential functional/structural features that differentiate soluble from insoluble proteins were identified, including GO:0009987 (intercellular communication) and GO:0022613 (ribonucleoprotein complex biogenesis). The best classification model, a support vector machine using 295 optimized functional/structural features, achieved an F1 score of 0.825 and can serve as a powerful tool for differentiating soluble proteins from insoluble proteins.
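The incremental feature selection procedure described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not name the seven ranking algorithms or three of the four classifiers, so the sketch substitutes a simple class-frequency difference for the ranking step and a nearest-centroid classifier for the support vector machine. All function names, the toy data generator, and its parameters are hypothetical and exist only for illustration.

```python
import random

def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), with label 1 (soluble) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def rank_features(X, y):
    """Stand-in ranking (assumption): order features by
    |P(feature=1 | soluble) - P(feature=1 | insoluble)|, largest first."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    def score(j):
        return abs(sum(r[j] for r in pos) / len(pos)
                   - sum(r[j] for r in neg) / len(neg))
    return sorted(range(len(X[0])), key=score, reverse=True)

def centroid_predict(X_train, y_train, X_test, features):
    """Stand-in classifier (assumption): nearest class centroid on the chosen features."""
    def centroid(label):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        return [sum(r[j] for r in rows) / len(rows) for j in features]
    c1, c0 = centroid(1), centroid(0)
    preds = []
    for r in X_test:
        d1 = sum((r[j] - c) ** 2 for j, c in zip(features, c1))
        d0 = sum((r[j] - c) ** 2 for j, c in zip(features, c0))
        preds.append(1 if d1 < d0 else 0)
    return preds

def incremental_feature_selection(X_train, y_train, X_val, y_val, step=1):
    """Evaluate the top-k ranked features for growing k; keep the k with the best F1."""
    ranked = rank_features(X_train, y_train)
    best_k, best_f1 = 0, -1.0
    for k in range(step, len(ranked) + 1, step):
        preds = centroid_predict(X_train, y_train, X_val, ranked[:k])
        score = f1_score(y_val, preds)
        if score > best_f1:
            best_k, best_f1 = k, score
    return ranked[:best_k], best_f1

def make_toy_data(n, n_features=20, informative=(0, 1, 2), seed=0):
    """Hypothetical toy data: binary vectors where a few features track the label."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.randint(0, 1) for _ in range(n_features)]
        for j in informative:
            row[j] = label if rng.random() < 0.9 else 1 - label
        X.append(row)
        y.append(label)
    return X, y

X_train, y_train = make_toy_data(200, seed=0)
X_val, y_val = make_toy_data(100, seed=1)
selected, best_f1 = incremental_feature_selection(X_train, y_train, X_val, y_val)
print(f"selected {len(selected)} features, validation F1 = {best_f1:.3f}")
```

In the study this loop would run once per ranked list and per classifier (seven lists, four algorithms), with the feature subset and classifier giving the highest F1 retained. A single held-out split is used here for brevity; how the authors evaluated each subset is not stated in the abstract.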