一些常见的生信分析数据文件格式,可参考该网址:
https://learn.gencore.bio.nyu.edu/ngs-file-formats/fastaa-format/
【1】fastq2fasta
使用awk转化fastq格式为fasta:
awk '{if((NR+3)%4==0)printf ">"$1;if((NR+2)%4==0)print "\n"$1}' ${fq} > ${fa}
【2】sam2fastq
sam/bam转换为fastq格式:samtools fastq
# 转换成1个fq
samtools fastq ${sam} > ${fq}
# 如果是paried reads比对的bam, 默认在fasta的read末尾加"/1"(read1)或"/2"(read2), 如果要去除"/1"或"/2"信息,加`-n`
samtools fastq -n ${sam} > ${fq}
# 转换成2个fq: 慎用,fq的R1与R2的read ID不对应?
samtools fastq -1 ${fq1} -2 ${fq2} ${bam}
-n
: 输出不标记"/1"或 “/2”, Read1、Read2的标记
【3】sam2fasta
sam/bam转换为fasta格式:samtools fasta
samtools fasta -n ${sam} > ${fa}
【4】fastq2sam
fastq转为sam,相当于比对,使用bwa
:
bwa mem [options] <idxbase> ${fq1} [${fq2}] > ${sam}
【5】fastq<=>fasta互相转换,使用python编写脚本
① fastq转fasta,比较简单,主要去除质量值一行即可。
# fastq2fasta.py
import sys
import gzip
fq_in = sys.argv[1]
fa_out = sys.argv[2]
reads_count = sys.argv[3] # if set [-1], means output all reads
with gzip.open(fq_in, 'r') as f, open(fa_out, 'w') as pf:
cnt = 0
while True:
rd_id = f.readline()
if not rd_id or cnt==int(reads_count):
break
seq = f.readline()
tmp = f.readline()
qual = f.readline()
pf.write('>'+rd_id+seq)
cnt+=1
使用方法:
python fastq2fasta.py ${in_fq} ${out_fa}
② fasta转fastq,就是需要增加质量值信息,自定义用F
质量值补充。示例脚本增加了一个参数是,定义fastq的最长长度。
#fasta2fastq.py
import sys
fa_f = sys.argv[1]
fq_f = sys.argv[2]
len_max = int(sys.argv[3]) # fq len max: 150bp
with open(fa_f, 'r') as f, open(fq_f, 'w') as pf:
while True:
name = f.readline().strip()
if not name:
break
if name.startswith('>'):
tmp_name = name.strip('>')
read_id = '@'+tmp_name
seq = f.readline().strip()
if len(seq) <= len_max:
read_info = '{rd_id}\n{seq}\n+\n{qual}\n'.format(rd_id=read_id, seq=seq, qual='F'*len(seq))
else:
read_info = ''
cnt = int(len(seq) / len_max)
for i in range(cnt):
tmp_id = read_id + '_' + str(i)
tmp_seq = seq[i*len_max:(i+1)*len_max]
read_info += '{rd_id}\n{seq}\n+\n{qual}\n'.format(rd_id=tmp_id, seq=tmp_seq, qual='F'*len_max)
lseq = len(seq) % len_max
# print('lseq:', lseq)
if lseq != 0:
tmp_id = read_id + '_' + str(cnt)
tmp_seq =seq[-lseq:]
read_info += '{rd_id}\n{seq}\n+\n{qual}\n'.format(rd_id=tmp_id, seq=tmp_seq, qual='F'*lseq)
# print(name, read_info)
pf.write(read_info)
使用方法:
python fasta2fastq.py ${in_fa} ${out_fq} 150
上述代码是,将fasta转换为单端的fastq。类似的,可以转换成双端的fastq。
更多:
fasta索引文件.fai格式:samtools构建的索引.fai文件格式,不理解OFFSET
的解释是什么意思 😦