Scalable transcriptomics analysis with Dask: applications in data science and machine learning.

Author information

Department of Computer Science, Faculty of Sciences, University of Porto, Rua do Campo Alegre, 4169-007, Porto, Portugal.

Laboratory of Artificial Intelligence and Decision Support, INESC TEC, Rua Dr. Roberto Frias, 4200-465, Porto, Portugal.

Publication information

BMC Bioinformatics. 2022 Nov 30;23(1):514. doi: 10.1186/s12859-022-05065-3.

Abstract

BACKGROUND

Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science, and machine learning in particular, has many applications in gene expression analysis. However, as the dimensionality of genomics datasets grows, scalable solutions become necessary.

METHODS

In this paper, we review the main steps and bottlenecks in machine learning pipelines, as well as the main concepts behind scalable data science, including concurrent and parallel programming. We discuss the benefits of the Dask framework and how it can be integrated with the Python scientific environment to perform data analysis in computational biology and bioinformatics.
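As a rough illustration of the chunked, parallel style of computation mentioned above, the sketch below splits a small expression matrix into row chunks, processes the chunks concurrently, and combines the partial results. It deliberately uses only the Python standard library rather than Dask; the function names (`chunk_sums`, `per_gene_mean`), the chunk size, and the toy values are assumptions made for illustration and are not taken from the paper or its repository.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_sums(chunk):
    """Return per-gene (column) sums and the number of samples (rows) in one chunk."""
    n_genes = len(chunk[0])
    sums = [0.0] * n_genes
    for sample in chunk:
        for j, value in enumerate(sample):
            sums[j] += value
    return sums, len(chunk)


def per_gene_mean(matrix, chunk_size=2, max_workers=2):
    """Compute per-gene means by mapping over row chunks, then reducing the partials."""
    # Split the samples into fixed-size row chunks (hypothetical chunk size).
    chunks = [matrix[i:i + chunk_size] for i in range(0, len(matrix), chunk_size)]

    # Map step: each chunk is summarised independently, so chunks can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(chunk_sums, chunks))

    # Reduce step: add up the partial sums and sample counts, then divide.
    n_genes = len(matrix[0])
    totals = [0.0] * n_genes
    n_samples = 0
    for sums, count in partials:
        n_samples += count
        for j in range(n_genes):
            totals[j] += sums[j]
    return [total / n_samples for total in totals]


# Toy expression matrix: 4 samples (rows) x 3 genes (columns); values are made up.
expression = [
    [1.0, 10.0, 100.0],
    [3.0, 30.0, 300.0],
    [5.0, 50.0, 500.0],
    [7.0, 70.0, 700.0],
]
means = per_gene_mean(expression)
print(means)
```

Threads in standard CPython do not speed up pure-Python arithmetic like this, so the example only shows the split, map, and reduce pattern. Frameworks such as Dask apply the same pattern to chunked arrays and data frames and schedule the resulting tasks across cores or machines.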

RESULTS

This review illustrates the role of Dask in boosting data science applications through different case studies. Detailed documentation and code for these procedures are available at https://github.com/martaccmoreno/gexp-ml-dask.

CONCLUSION

By showing when and how Dask can be used in transcriptomics analysis, this review will serve as an entry point to help genomic data scientists develop more scalable data analysis procedures.
