Processing raw sequencing data

Latest recommended article published on 2023-08-03 16:15:26

sunwanying123


Article link: https://blog.csdn.net/sunwanying123/article/details/81330161


FASTQC

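Basic usage:

fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN

The command consists of the options, followed by any number (N) of input files.

-o, --outdir directory where FastQC stores its report files; the report file names are derived from the input file names

--extract by default the report is packed into a single zip file; with this option the zip file is also unpacked after it is created

-t, --threads number of threads to run; each thread uses 250 MB of memory, and more threads means faster processing

-c, --contaminants a file of possible contaminant sequences in the format Name [Tab] Sequence; when given, FastQC evaluates contamination during the run and includes it in the overrepresented-sequence statistics. Rarely needed

-a, --adapters also a file in the format Name [Tab] Sequence, holding the sequencing adapter sequences; if omitted, current versions of FastQC use a built-in list of common adapters to check whether adapter remains in the reads

-q, --quiet quiet mode; without this option the program reports its progress in real time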

Source: https://www.jianshu.com/p/a1eb03d63083
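To make the option list concrete, here is a minimal sketch that assembles a FastQC command line from the options described above. The helper name `build_fastqc_command`, the file names, and the output directory are hypothetical; only the flags `-o`, `-t`, and `--extract` come from the notes.

```python
import shlex


def build_fastqc_command(seqfiles, outdir, threads=1, extract=False):
    """Assemble a FastQC command as an argument list (hypothetical helper)."""
    # FastQC does not create the output directory, so it must already exist.
    cmd = ["fastqc", "-o", outdir, "-t", str(threads)]
    if extract:
        # Also unpack the zipped report after it is written.
        cmd.append("--extract")
    # Input files always come last; any number of them may be given.
    cmd.extend(seqfiles)
    return cmd


# Illustrative file names, not taken from the original post.
cmd = build_fastqc_command(
    ["sample_R1.fastq.gz", "sample_R2.fastq.gz"], "qc_reports", threads=4
)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` on a machine where FastQC is installed.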

FASTP

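A tool for filtering out low-quality reads.

-A, --disable_adapter_trimming disable adapter trimming (it is enabled by default)

-a, --adapter_sequence the adapter sequence

-f, --trim_front1 number of bases to trim from the front of read1 (int [=0])

-t, --trim_tail1 number of bases to trim from the tail of read1

-g, --trim_poly_g force polyG tail trimming; by default it is enabled automatically for Illumina NextSeq/NovaSeq data

--poly_g_min_len minimum length for detecting a polyG tail, 10 by default

-G, --disable_trim_poly_g disable polyG tail trimming

-x, --trim_poly_x enable polyX trimming at the 3' end

-Q, --disable_quality_filtering disable quality filtering (it is enabled by default)

-q, --qualified_quality_phred quality value at which a base counts as qualified; the default is 15, so bases below Q15 count as low quality

-u, --unqualified_percent_limit percentage of bases in a read allowed to be low quality, default 40%

-n, --n_base_limit a read is discarded if it contains more than this many N bases [5]

-L, --disable_length_filtering disable length filtering

-l, --length_required [15] reads shorter than this value are discarded

-y, --low_complexity_filter enable the low-complexity filter; complexity is the percentage of bases that differ from the next base

# a 51-bp sequence, with 3 bases that are different from their next base
seq = 'AAAATTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGCCCC'
complexity = 3/(51-1) = 6%

-Y, --complexity_threshold minimum complexity required, default 30%

-c, --correction for PE data, correct mismatched bases in the overlapping region: when one base is high quality and the other is low quality, the low-quality base is corrected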

From <https://github.com/OpenGene/fastp>
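The per-read filters above (-q, -u, -n, -l, -y, -Y) can be sketched in a few lines. This is an illustration of the rules as described in these notes, not fastp's actual implementation; the function names and the exact comparison operators (for example whether a read at exactly 40% low-quality bases passes) are assumptions.

```python
def complexity(seq):
    """Fraction of bases that differ from the next base (base[i] != base[i+1])."""
    if len(seq) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return changes / (len(seq) - 1)


def passes_filters(seq, quals, qualified_phred=15, unqualified_percent_limit=40,
                   n_base_limit=5, length_required=15, complexity_threshold=None):
    """Return True if a read survives the filters (hypothetical helper).

    quals holds one integer Phred score per base. Defaults mirror the values
    in the notes; complexity_threshold=None means the -y filter is off.
    """
    if not seq or len(seq) < length_required:          # -l
        return False
    if seq.count("N") > n_base_limit:                  # -n
        return False
    low = sum(1 for q in quals if q < qualified_phred)  # -q
    if 100.0 * low / len(seq) > unqualified_percent_limit:  # -u
        return False
    if complexity_threshold is not None:               # -y / -Y
        if 100.0 * complexity(seq) < complexity_threshold:
            return False
    return True


read = "ACGT" * 5
print(passes_filters(read, [30] * 20))             # all bases are high quality
print(passes_filters(read, [10] * 9 + [30] * 11))  # 45% of bases are below Q15
```

Keeping each rule as a separate early return makes it easy to see which option rejects a given read.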

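Sliding-window cutting, based on the mean quality within the window. This feature is disabled by default.

-5, --cut_by_quality5 enable per read cutting by quality in front (5'), and trim leading N bases.

-3, --cut_by_quality3 enable per read cutting by quality in tail (3'), and trim trailing N bases.

-W, --cut_window_size the size of the sliding window, default is 4

-M, --cut_mean_quality bases in a window whose mean quality is below this value are cut, default is Q20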
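The sliding-window idea can be sketched as follows: slide a window in from one end, drop bases while the window's mean quality is below the threshold, and stop at the first window that passes. This is a simplified illustration and not fastp's exact algorithm; the function names are hypothetical, N-base trimming is left out, and the handling of reads shorter than one window is an assumption.

```python
def cut_front(quals, window_size=4, mean_quality=20):
    """Number of leading bases to drop under a simple sliding-window rule."""
    n = len(quals)
    if n < window_size:
        # Assumption: reads shorter than one window are left untouched.
        return 0
    start = 0
    while start + window_size <= n:
        window = quals[start:start + window_size]
        if sum(window) / window_size >= mean_quality:
            break            # first window that passes: keep from here on
        start += 1
    else:
        return n             # no window passed: drop the whole read
    return start


def cut_tail(quals, window_size=4, mean_quality=20):
    """Number of trailing bases to drop: the same rule applied from the 3' end."""
    return cut_front(quals[::-1], window_size, mean_quality)


def trim_read(seq, quals, window_size=4, mean_quality=20):
    """Apply the 5' cut and then the 3' cut, returning the trimmed read."""
    left = cut_front(quals, window_size, mean_quality)
    seq, quals = seq[left:], quals[left:]
    right = cut_tail(quals, window_size, mean_quality)
    end = len(seq) - right
    return seq[:end], quals[:end]


quals = [2, 2, 2, 2, 30, 30, 30, 30, 30, 30, 2, 2]
print(trim_read("ACGTACGTACGT", quals))
```

Because the decision is made per window rather than per base, a single low-quality base can survive at the edge of the first passing window, as it does in this example.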

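-U, --umi enable unique molecular identifier (UMI) preprocessing

--umi_loc location of the UMI: index1, index2, read1 (head of read 1), or read2. --umi_len gives the UMI length and applies only when the UMI is at the head of a read. --umi_skip removes the given number of bases immediately after the UMI.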
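As an illustration of the read-head case, the sketch below cuts a UMI of `umi_len` bases from the start of a read, skips `umi_skip` further bases, and stores the UMI in the read name. The helper name and the naming format (UMI appended to the first part of the read name after a colon) are assumptions for illustration; check the fastp README for the exact output format.

```python
def extract_umi_from_read_head(name, seq, qual, umi_len, umi_skip=0):
    """Move a UMI from the head of a read into its name (hypothetical helper)."""
    umi = seq[:umi_len]
    cut = umi_len + umi_skip
    # Assumed format: append the UMI to the first whitespace-separated
    # field of the read name, separated by a colon.
    first, sep, rest = name.partition(" ")
    new_name = f"{first}:{umi}{sep}{rest}"
    return new_name, seq[cut:], qual[cut:]


# Made-up read for illustration: 6-bp UMI followed by 2 bases to skip.
record = extract_umi_from_read_head(
    "@read1 1:N:0:ATCACG", "AAACCCGGGTTTACGT", "I" * 16, umi_len=6, umi_skip=2
)
print(record)
```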

-p, --overrepresentation_analysis enable overrepresented sequence analysis.

-P, --overrepresentation_sampling one in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20. (int [=20])

-j, --json the json format report file name (string [=fastp.json])

-h, --html the html format report file name (string [=fastp.html])

-R, --report_title should be quoted with ' or ", default is "fastp report" (string [=fastp report])

-w, --thread worker thread number, default is 3 (int [=3])

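-s, --split split the output by limiting the total number of files; a sequential number prefix is added to each output name ( 0001.out.fq, 0002.out.fq...), disabled by default (int [=0])

-S, --split_by_lines split the output by limiting the number of lines in each file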

-d, --split_prefix_digits the digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding (int [=4])
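The padding rule for split output names can be sketched like this. The helper name is hypothetical, and the unpadded form produced when the digit count is 0 is an assumption.

```python
def split_output_name(index, base_name, prefix_digits=4):
    """Name of the index-th split file, e.g. 0001.out.fq (hypothetical helper)."""
    if prefix_digits == 0:
        # Padding disabled: assume the bare number is used as the prefix.
        return f"{index}.{base_name}"
    return f"{index:0{prefix_digits}d}.{base_name}"


print([split_output_name(i, "out.fq") for i in (1, 2, 3)])
```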

SICKLE

/sickle pe -t sanger -f aaa.fastq -r aaa2.fastq -o test1.fastq -p test2.fastq -s s.fastq &

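TRIMMOMATIC

The advantage of Trimmomatic is that it can clip not only the adapter sequences of the Illumina sequencing platform but also custom adapter sequences that we specify ourselves, and at the same time it filters low-quality sequence at the ends of reads. It runs multithreaded and is fast, which makes it well suited to preprocessing RNA-seq data. Because the resulting reads vary in length and somewhat more data is discarded, it is not suitable for short reads intended for de novo assembly.

Trimmomatic parameters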

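ILLUMINACLIP: the adapter-clipping parameter. In ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 (path omitted) the fields mean the following. TruSeq3-PE.fa is the FASTA file of adapter sequences. 2 is the maximum number of mismatches allowed in the seed when matching a read against an adapter. 30 is the palindrome clip threshold, used for paired-end data: the two reads of a pair are aligned together with the PE adapter sequences, and if the alignment score reaches 30 the pair is considered to contain adapter, which is clipped at the corresponding position [Note]. 10 is the simple clip threshold and works differently from the 30: regardless of pairing, any read whose alignment against an adapter sequence scores at least 10 is considered to contain adapter, which is removed. Both thresholds are alignment scores, not percentages: each matching base adds about 0.6, so a score of 10 corresponds to roughly 17 matching bases and a score of 30 to roughly 50.

[Note] Sequencing usually reads only part of an adapter, so a 100% match between read and adapter cannot be required; 30 and 10 are the commonly recommended values.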
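In the Trimmomatic manual the two clip thresholds are alignment scores, with each matching base contributing log10(4), about 0.6. The sketch below splits an ILLUMINACLIP step into its fields and estimates how many perfectly matching bases are needed to reach a threshold. The function names are hypothetical, and the parser assumes the adapter path itself contains no colon.

```python
import math

# Score contributed by one matching base, per the Trimmomatic manual.
MATCH_SCORE = math.log10(4)


def parse_illuminaclip(step):
    """Split an ILLUMINACLIP step string into named fields (hypothetical helper)."""
    parts = step.split(":")
    if len(parts) < 5 or parts[0] != "ILLUMINACLIP":
        raise ValueError(f"not an ILLUMINACLIP step: {step!r}")
    return {
        "adapter_fasta": parts[1],
        "seed_mismatches": int(parts[2]),
        "palindrome_clip_threshold": int(parts[3]),
        "simple_clip_threshold": int(parts[4]),
    }


def min_perfect_bases(threshold):
    """Smallest number of perfectly matching bases whose score reaches threshold."""
    return math.ceil(threshold / MATCH_SCORE)


params = parse_illuminaclip("ILLUMINACLIP:TruSeq3-PE.fa:2:30:10")
print(params)
print(min_perfect_bases(params["simple_clip_threshold"]))
print(min_perfect_bases(params["palindrome_clip_threshold"]))
```

Mismatches lower the score, so real alignments containing errors need more matching bases than this estimate.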