BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale.
- Author(s): Piñeiro C; Pichel JC
- Source:
GigaScience [Gigascience] 2022 Dec 28; Vol. 12. Date of Electronic Publication: 2023 Jul 31.
- Publication Type:
Journal Article; Research Support, Non-U.S. Gov't
- Language:
English
- Additional Information
- Source:
Publisher: Oxford University Press Country of Publication: United States NLM ID: 101596872 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 2047-217X (Electronic) Linking ISSN: 2047217X NLM ISO Abbreviation: Gigascience Subsets: MEDLINE
- Publication Information:
Publication: 2017- : New York : Oxford University Press
Original Publication: London : BioMed Central
- Abstract:
Background: High-throughput sequencing technologies have led to an unprecedented explosion in the amount of sequencing data available, which is typically stored in FASTA and FASTQ files. Several tools in the literature process and manipulate these types of files with the aim of transforming sequence data into biological knowledge. However, because they rely on sequential processing, none of them is well suited to efficiently processing very large files, which are likely to reach the order of terabytes in the coming years. Only some routines of the well-known seqkit tool are partly parallelized, and even then its scalability is limited to a few threads on a single computing node.
Results: Our approach, BigSeqKit, takes advantage of a high-performance computing-Big Data framework to parallelize and optimize the commands included in seqkit, with the aim of speeding up the manipulation of FASTA/FASTQ files. In most cases, it is tens to hundreds of times faster than several state-of-the-art tools. At the same time, our toolkit is easy to install and use on any kind of hardware platform (local server or cluster), and its routines can be used as a bioinformatics library or from the command line.
Conclusions: BigSeqKit is a very complete and ultra-fast toolkit to process and manipulate large FASTA and FASTQ files. It is publicly available at https://github.com/citiususc/BigSeqKit.
(© The Author(s) 2023. Published by Oxford University Press GigaScience.)
- Contributed Indexing:
Keywords: Big Data; FASTA/FASTQ files; Parallelism; Performance
- Publication Date:
Date Created: 20230731 Date Completed: 20230801 Latest Revision: 20241118
- Publication Date:
20250114
- Accession Number:
PMID: 37522758; PMCID: PMC10388699; DOI: 10.1093/gigascience/giad062