RSeQC软件包提供了许多有用的模块,可以全面评估高通量序列数据,尤其是RNA序列数据。一些基本模块快速检查序列质量、核苷酸组成偏差、PCR偏差和GC偏差,而RNA序列特定模块评估测序饱和度、映射读取分布、覆盖均匀性、链特异性、转录水平RNA完整性等。
1. 安装
RSeQC是python包,有很多模块:
- bam2fq.py
- bam2wig.py
- bam_stat.py
- clipping_profile.py
- deletion_profile.py
- divide_bam.py
- FPKM_count.py
- geneBody_coverage.py
- geneBody_coverage2.py
- infer_experiment.py
- inner_distance.py
- insertion_profile.py
- junction_annotation.py
- junction_saturation.py
- mismatch_profile.py
- normalize_bigwig.py
- overlay_bigwig.py
- read_distribution.py
- read_duplication.py
- read_GC.py
- read_hexamer.py
- read_NVC.py
- read_quality.py
- RNA_fragment_size.py
- RPKM_count.py
- RPKM_saturation.py
- spilt_bam.py
- split_paired_bam.py
- tin.py
pip install RSeQC
2. bam_stat.py:统计BAM/SAM文件的比对情况
bam_stat.py -i test_rna_seq.bam
3. bam2fq.py: bam转fq
bam2fq.py -i test_rna_seq.bam -o test_bam2fq_out
4. geneBody_coverage.py
#bed文件可以有gtf/gff文件转换而来。genomic features通常使用bed 或者gff文件表示,两者最基本的
#信息就是染色体或Contig的ID或编号、DNA的正负链信息以及在染色体上的起始和终止位置数值。两种文件的区#别在于,BED文件中起始坐标为0,结束坐标至少是1,GFF中起始坐标是1而结束坐标至少是1。把BED转成对应
#的GFF格式(仅保留两者相同信息)
# cat hg38.refGene.gtf|awk '{if ($3=="transcript") print $0}'>hg38.refTranscripts.gtf
# cat hg38.refGene.gtf|awk '{print $1,$4-1,$5-1,$12,$6,$7}'|tr -d ';' >hg38.bed
# 下载基因model
# https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/
# wget https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/hg38_RefSeq.bed.gz/ ./
# 解压
gunzip hg38_RefSeq.bed.gz
#注: 染色体编号要一致: 有的为 1 VS 有的为 chr1
# 下载的染色体编号以chr开头
sed -n 's/^chr//p' hg38_RefSeq.bed > hg38_RefSeq_2.bed # 去除chr
# 如果有多个bam文件
samtools test_merged_rna_seq.bam test1_rna_seq.bam test2_rna_seq.bam test3_rna_seq.bam
samtools sort test_merged_rna_seq.bam >test_merged_rna_seq.sorted.bam
samtools index test_merged_rna_seq.sorted.bam
geneBody_coverage.py -i test_merged_rna_seq.sorted.bam -r hg38_RefSeq_2.bed -o test_merged_geneBody_coverage
# 注:太慢了,有待改进! (多线程,启用c代码)
5. bam2wig.py: bam转wig
# 下载染色体大小文件
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/fetchChromSizes
bash fetchChromSizes hg38 > hg38.chrom.sizes
# 注:bam文件中染色体编号没有chr
sed -n 's/^chr//p' hg38.chrom.sizes > hg38.chrom.sizes_2 # 去除chr
samtools sort test_rna_seq.bam > test_rna_seq.sort.bam
samtools index test_rna_seq.sort.bam
bam2wig.py -i test_rna_seq.sorted.bam -s hg38.chrom.sizes_2 -o TestOut -u
# -u, --skip-multi-hits
## wig转二进制的BigWig
#wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/wigToBigWig
#chmod +x wigToBigWig
#./wigToBigWig TestOut.wig hg38.chrom.sizes_2 TestOutBigWig.bw
##bedGraph 转 bigwig
#wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig
#./bedGraphToBigWig in.bedGraph chrom.sizes out.bw
6. clipping_profile.py :计算截短核苷酸在读数上的分布
clipping_profile.py -i test_merged_rna_seq.sorted.bam -s "PE" -o out
7. inner_distance.py:计算reads对之间的内部距离
inner_distance.py -i test_merged_rna_seq.sorted.bam -o output -r hg38_RefSeq_2.bed
参考:
https://pythonhosted.org/RSeQC/#usage-information
https://cloud.tencent.com/developer/article/1771487?ivk_sa=1024320u