
fastq-and-furious

(... because it turned out that the performance bottleneck for an algorithm-focused implementation of bottom-sketches (MinHash sketches) was the parsing of FASTQ files).

![build status](https://travis-ci.org/lgautier/fastq-and-furious.svg?branch=master) ![pypi version](https://img.shields.io/pypi/v/fastq-and-furious.svg)

Efficient handling of FASTQ files(*) from Python (*: multi-line FASTQ is not supported; an exception will be raised when parsing such a file).

![parser throughput comparison](throughput.png)

Installation

Python >= 3.5 (with headers) and a C99-compliant compiler are the only requirements.

pip install fastq-and-furious

or

pip install git+https://github.com/lgautier/fastq-and-furious.git

Should C extensions be unavailable (no C compiler available, PyPy, other reasons), the Python-only version can still be used (see the section Performance below).

To run the tests after installation, one will need a clone of the repository. From the root of the repository, run:

python -m pytest --cov=fastqandfurious --cov-report xml --cov-report term tests.py

If coverage is not installed or is not working for you, the tests can be run without coverage analysis with:

python -m pytest tests.py

Performance

A small utility is included to try it out on your own system and files (there are options, listed with the flag --help).

The two modes are "speed" and "compare": the former benchmarks the speed of different parsers, and the latter compares their output (being fast is of little use if the output is not correct).

Speed

python -m fastqandfurious.demo.benchmark speed

Note that third-party libraries parsing FASTQ files are required in order to run the full benchmark.
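
Those libraries can presumably be installed with pip; the package names below are assumed from the parser names used in the benchmark tables:

    pip install biopython screed ngs_plumbing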

With a gzip-compressed FASTQ file of 60MB (compressed size) containing 273,639 entries, the benchmark results are (throughput is measured on the DNA sequences in the file; headers and quality strings are not counted):

| parser | throughput | notes |
|--------|-----------:|-------|
| screed | 23.27MB/s | |
| biopython | 10.18MB/s | |
| biopython_fastqiterator | 50.19MB/s | Bio.SeqIO.QualityIO.FastqGeneralIterator |
| biopython_adapter | 25.5MB/s | fastqandfurious creating biopython objects |
| ngs_plumbing | 32.98MB/s | |
| fastqandfurious | 51.27MB/s | pure python |
| fastqandfurious_c | 67.38MB/s | parsing individual entries with C extension |
| fastqandfurious_c_index | 56.81MB/s | w/ index of entry positions |

With a gzip-compressed FASTQ file of 700MB (compressed size) containing 20,853,696 entries, the benchmark results are (again, throughput is measured on the DNA sequences only):

| parser | throughput |
|--------|-----------:|
| screed | 3.63MB/s |
| biopython | 3.57MB/s |
| biopython_fastqiterator | 14.09MB/s |
| biopython_adapter | 4.10MB/s |
| ngs_plumbing | 5.77MB/s |
| fastqandfurious | 11.51MB/s |
| fastqandfurious_c | 21.88MB/s |
| fastqandfurious_c_index | 14.86MB/s |

Compare

To compare the output of two parsers, for example biopython and our parser:

python -m fastqandfurious.demo.benchmark compare biopython fastqandfurious

Documentation

The documentation is currently a little sparse. The docstring for fastqandfurious.fastqandfurious.readfastq_iter() is a good starting point. The code for the benchmark (in fastqandfurious.demo.benchmark, see above) is also a good source of information, as it shows how this parser is used compared with the other parsers benchmarked.

Short short doc

In a nutshell, the reader takes a file-like object, a buffer size (in bytes), and a function called when yielding entries (to produce the "entry" objects):

    from fastqandfurious import fastqandfurious
    from fastqandfurious.fastqandfurious import entryfunc

    bufsize = 20000
    with open("a/fastq/file.fq", "rb") as fh:
        it = fastqandfurious.readfastq_iter(fh, bufsize, entryfunc)
        for sequence in it:
            # do something
            pass
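
For illustration, an entry function only slices the buffer at the six positions it is given (begin/end offsets for the name/ID, the sequence, and the quality string). A minimal sketch, assuming the two-argument signature used in the length-filter example further below (other examples in this document use a third globaloffset argument):

    # Hypothetical entry function returning a (name, sequence, quality)
    # tuple of bytes. posarray holds offsets into buf:
    #   [0]:[1] name/ID, [2]:[3] sequence, [4]:[5] quality.
    def tuple_entryfunc(buf, posarray):
        return (buf[posarray[0]:posarray[1]],
                buf[posarray[2]:posarray[3]],
                buf[posarray[4]:posarray[5]])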

That design decouples decompression from parsing. Reading data from a gzip-compressed file works the same way:

    import gzip

    from fastqandfurious import fastqandfurious
    from fastqandfurious.fastqandfurious import entryfunc

    bufsize = 20000
    with gzip.open("a/fastq/file.fq", "rb") as fh:
        it = fastqandfurious.readfastq_iter(fh, bufsize, entryfunc)
        for entry in it:
            # do something
            pass
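
Since the parser only needs a binary file-like object, any decompressor exposing that interface should work the same way; for example (a sketch, not benchmarked here), bz2 from the standard library:

    import bz2

    from fastqandfurious import fastqandfurious
    from fastqandfurious.fastqandfurious import entryfunc

    bufsize = 20000
    with bz2.open("a/fastq/file.fq.bz2", "rb") as fh:
        for entry in fastqandfurious.readfastq_iter(fh, bufsize, entryfunc):
            # do something
            pass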

Instant faster code (just add ~~water~~ adapter code)

That design also makes it possible to drop the parser into an existing code base, or to keep working with the library you are most familiar with, by writing a short adapter (and observing immediate performance gains; see the benchmark above).

For biopython, it could look like:

    from array import array

    from Bio.SeqRecord import SeqRecord

    from fastqandfurious import fastqandfurious
    from fastqandfurious._fastqandfurious import arrayadd_b

    def biopython_entryfunc(buf, posarray, globaloffset):
        name = buf[posarray[0]:posarray[1]].decode('ascii')
        quality = array('b')
        quality.frombytes(buf[posarray[4]:posarray[5]])
        arrayadd_b(quality, -33)
        entry = SeqRecord(seq=buf[posarray[2]:posarray[3]].decode('ascii'),
                          id=name,
                          name=name,
                          letter_annotations={'phred_quality': quality})
        return entry

    bufsize = 20000
    with open("a/fastq/file.fq", "rb") as fh:
        it = fastqandfurious.readfastq_iter(fh, bufsize, biopython_entryfunc)
        for entry in it:
            # do something
            pass
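
Because the adapter yields regular SeqRecord objects, downstream biopython-style code should not need to change. For example (a sketch reusing biopython_entryfunc and bufsize from above), computing the mean Phred quality of each record:

    from statistics import mean

    with open("a/fastq/file.fq", "rb") as fh:
        it = fastqandfurious.readfastq_iter(fh, bufsize, biopython_entryfunc)
        for entry in it:
            # letter_annotations already holds offset-adjusted quality values
            avg_quality = mean(entry.letter_annotations['phred_quality'])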

Performance by design

The design also offers various performance gains by building entry components only when they are needed. For example, a filter on read length could be written as:

    def lengthfilter_entryfunc(buf, posarray):
        LENGTH_THRESHOLD = 25
        # Keep only reads of at least LENGTH_THRESHOLD bases.
        if posarray[3] - posarray[2] >= LENGTH_THRESHOLD:
            return buf[posarray[2]:posarray[3]]
        else:
            return None

    bufsize = 20000
    with open("a/fastq/file.fq", "rb") as fh:
        it = fastqandfurious.readfastq_iter(fh, bufsize, lengthfilter_entryfunc)
        for sequence in it:
            if sequence is None:
                # do nothing
                pass
            else:
                # do something
                pass

Fetching the positions of the elements of an entry (name/ID, sequence, quality) also makes it possible to store the positions associated with a FASTQ file for future use (see the code for fastqandfurious_c_index in fastqandfurious.demo.benchmark).

This is essentially like storing a table of positions:

| name_beg | name_end | seq_beg | seq_end | quality_beg | quality_end |
|---------:|---------:|--------:|--------:|------------:|------------:|
| 0 | 20 | 22 | 172 | 24 | 274 |
| 275 | 295 | 296 | 446 | 448 | 598 |

Whenever the FASTQ must be used again, that table can be used to quickly extract data elements without having to parse them.

Unfortunately, the benchmark does not show improvements over fastqandfurious_c when doing so, but the approach opens the door to masking strategies that avoid writing a new FASTQ file after each filtering or read-trimming step: excluding reads, or trimming either end of a read, can be done by simply deleting rows or modifying values in the table.
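
A minimal sketch of building such a table, assuming the three-argument entry-function signature used in the biopython adapter above (where globaloffset translates buffer-local positions into absolute file positions; the list-of-tuples storage is an arbitrary choice for illustration):

    from fastqandfurious import fastqandfurious

    index_rows = []

    def index_entryfunc(buf, posarray, globaloffset):
        # Record absolute file offsets for name, sequence, and quality.
        index_rows.append(tuple(p + globaloffset for p in posarray[:6]))
        return None

    bufsize = 20000
    with open("a/fastq/file.fq", "rb") as fh:
        for _ in fastqandfurious.readfastq_iter(fh, bufsize, index_entryfunc):
            pass
    # index_rows is now a table of (name_beg, name_end, seq_beg, seq_end,
    # quality_beg, quality_end) rows that can be saved for later reuse.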

If one wants the quality as a sequence of integers adjusted for the usual offset of 33, there is also a C utility:

    from array import array

    from fastqandfurious._fastqandfurious import arrayadd_b

    # Within an entry function, where buf and posarray are available:
    quality = array('b')
    quality.frombytes(buf[posarray[4]:posarray[5]])
    arrayadd_b(quality, -33)
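
For comparison, the pure-Python equivalent of that helper is a plain comprehension over the quality bytes, which the C utility is meant to speed up (a sketch):

    # bytes iterate as integers, so the offset can be subtracted directly
    quality = [b - 33 for b in buf[posarray[4]:posarray[5]]]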

The design is obviously also offering various performance gains by allowing to only build entry components as needed. For example, writing a filter on read length could be done with:

def lengthfilter_entryfunc(buf, posarray):

LENGTH_THRESHOLD = 25

if posarray[3] - posarray[2] < LENGTH_THRESHOLD:

return buf[posarray[2]:posarray[3]]

else:

return None

with open("a/fastq/file.fq", "rb") as fh:

it = fastqandfurious.readfastq_iter(fh, bufsize, lengthfilter_entryfunc)

for sequence in it:

if sequence is None:

# do nothing

else:

# do something

pass
