Dutta Anirban, Haque Mohammed Monzoorul, Bose Tungadri, Reddy C V S K, Mande Sharmila S
Bio-Sciences R&D Division, TCS Innovation Labs, Tata Consultancy Services Limited, 54-B, Hadapsar Industrial Estate, Pune 411013, Maharashtra, India.
J Bioinform Comput Biol. 2015 Jun;13(3):1541003. doi: 10.1142/S0219720015410036. Epub 2015 Feb 8.
Sequence data repositories archive and disseminate fastq data in compressed format. Although GZIP has relatively lower compression efficiency, data repositories continue to prefer it over the available specialized fastq compression algorithms. Ease of deployment, high processing speed, and portability are the reasons for this preference. This study presents FQC, a fastq compression method that, in addition to providing significantly higher compression gains than GZIP, incorporates the features necessary for universal adoption by data repositories and end-users. This study also proposes a novel archival strategy that allows sequence repositories to simultaneously store and disseminate lossless as well as (multiple) lossy variants of fastq files without requiring any additional storage. For academic users, Linux, Windows, and Mac implementations (both 32-bit and 64-bit) of FQC are freely available for download at: https://metagenomics.atc.tcs.com/compression/FQC .