prinseq

songyi10

Category: Bioinformatics software usage. Tags: data mining

Article link: https://blog.csdn.net/songyi10/article/details/127578628


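PRINSEQ stands for PReprocessing and INformation of SEQuences. It is available both as an online (web) version and as a standalone command-line version.

PRINSEQ is a tool for filtering, reformatting, and trimming genomic/metagenomic sequence data. It also computes quality statistics for the sequences and outputs them as charts (it feels similar to FastQC, but the two are actually different).

Website: https://prinseq.sourceforge.net/manual.html

Download and installation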

cd /opt/biosoft
mkdir -p PRINSEQ && cd PRINSEQ
wget -c https://sourceforge.net/projects/prinseq/files/standalone/prinseq-lite-0.20.4.tar.gz && tar -zxvf prinseq-lite-0.20.4.tar.gz

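Note that the latest downloadable version dates from 2013; no newer release has been published since.

After downloading and extracting, the prinseq folder contains three scripts: prinseq-graphs-noPCA.pl, prinseq-graphs.pl, and prinseq-lite.pl.

Quick demo

Quality control of a FASTQ file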

perl prinseq-lite.pl -verbose -fastq test.fq -graph_data test.gd -out_good null -out_bad null
perl prinseq-graphs.pl -i test.gd -png_all -o test
perl prinseq-graphs.pl -i test.gd -html_all -o test
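
The demo assumes that a file named test.fq already exists. If you have none at hand, the sketch below writes a tiny two-read FASTQ file (made-up reads, good for a smoke test only) and then runs the same kind of commands on it. PRINSEQ_DIR is a hypothetical variable introduced here for the folder that holds the scripts; the PRINSEQ calls are skipped if the scripts are not found there.

```shell
#!/bin/sh
# PRINSEQ_DIR is a hypothetical helper variable; point it at the folder holding the scripts.
PRINSEQ_DIR="${PRINSEQ_DIR:-.}"

# Write a minimal FASTQ file with two made-up reads (4 lines per read).
cat > test.fq <<'EOF'
@read1
ACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIII
@read2
ACGTNNNNACGTACGTACGA
+
IIII!!!!IIIIIIIIIIII
EOF

if [ -f "$PRINSEQ_DIR/prinseq-lite.pl" ]; then
    # Same idea as the demo: collect graph data, write no good/bad read files.
    perl "$PRINSEQ_DIR/prinseq-lite.pl" -verbose -fastq test.fq \
        -graph_data test.gd -out_good null -out_bad null
    # Turn the graph data into an HTML report named after the prefix "test".
    perl "$PRINSEQ_DIR/prinseq-graphs.pl" -i test.gd -html_all -o test
else
    echo "prinseq-lite.pl not found in $PRINSEQ_DIR; only test.fq was written" >&2
fi
```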

prinseq-lite.pl

This script does the main work: it reformats and quality-controls FASTQ/FASTA files. Its main parameters are shown below.

NAME
    PRINSEQ - PReprocessing and INformation of SEQuence data

VERSION
    PRINSEQ-lite 0.20.4

SYNOPSIS
    perl prinseq-lite.pl [-h] [-help] [-version] [-man] [-verbose] [-fastq
    input_fastq_file] [-fasta input_fasta_file] [-fastq2 input_fastq_file_2]
    [-fasta2 input_fasta_file_2] [-qual input_quality_file] [-min_len
    int_value] [-max_len int_value] [-range_len ranges] [-min_gc int_value]
    [-max_gc int_value] [-range_gc ranges] [-min_qual_score int_value]
    [-max_qual_score int_value] [-min_qual_mean int_value] [-max_qual_mean
    int_value] [-ns_max_p int_value] [-ns_max_n int_value] [-noniupac]
    [-seq_num int_value] [-derep int_value] [-derep_min int_value] [-lc_method
    method_name] [-lc_threshold int_value] [-trim_to_len int_value]
    [-trim_left int_value] [-trim_right int_value] [-trim_left_p int_value]
    [-trim_right_p int_value] [-trim_ns_left int_value] [-trim_ns_right
    int_value] [-trim_tail_left int_value] [-trim_tail_right int_value]
    [-trim_qual_left int_value] [-trim_qual_right int_value] [-trim_qual_type
    type] [-trim_qual_rule rule] [-trim_qual_window int_value]
    [-trim_qual_step int_value] [-seq_case case] [-dna_rna type] [-line_width
    int_value] [-rm_header] [-seq_id id_string] [-out_format int_value]
    [-out_good filename_prefix] [-out_bad filename_prefix] [-phred64]
    [-stats_info] [-stats_len] [-stats_dinuc] [-stats_tag] [-stats_dupl]
    [-stats_ns] [-stats_assembly] [-stats_all] [-aa] [-graph_data file]
    [-graph_stats string] [-qual_noscale] [-no_qual_header] [-exact_only]
    [-log file] [-custom_params string] [-params file] [-seq_id_mappings file]

DESCRIPTION
    PRINSEQ will help you to preprocess your genomic or metagenomic sequence
    data in FASTA (and QUAL) or FASTQ format. The lite version does not
    require any non-core perl modules for processing.

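Parameter details

Input parameters

Option	Meaning
-fastq	Input file in FASTQ format; it must be uncompressed. Use stdin as the file name to read from standard input.
-fasta	Input file in FASTA format; it must be uncompressed. Use stdin as the file name to read from standard input.
-qual	Input quality file in QUAL format; only relevant when the sequences come from a FASTA file (-fasta).
-fastq2	Second (Read 2) input file in FASTQ format for paired-end data. Recognized read ID suffixes are /1 and /2, _L and _R, _left and _right.
-fasta2	Second (Read 2) input file in FASTA format for paired-end data. Recognized read ID suffixes are /1 and /2, _L and _R, _left and _right.
-params	Parameter file holding the PRINSEQ parameters, one parameter per line, with the parameter and its value separated by a tab or space. Comment lines start with #.
-si13	Same as -phred64
-phred64	The quality scores in the input FASTQ file are Phred+64 encoded
-aa	The input is protein (amino acid) sequences. With this option the following parameters have no effect: stats_dinuc, stats_tag, stats_ns, dna_rna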
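
Because the input must be uncompressed, a gzip-compressed file is usually handled by decompressing it on the fly and passing stdin as the file name. The following is a minimal sketch under a few assumptions: prinseq-lite.pl is in the current directory (PRINSEQ is a variable introduced here), reads.fq.gz stands in for your own file, and the output prefix reads_good and the -min_len threshold are arbitrary. The output prefix is given explicitly rather than relying on default naming.

```shell
#!/bin/sh
PRINSEQ="${PRINSEQ:-./prinseq-lite.pl}"   # assumed location of the script

# Stand-in input: one made-up read, gzip-compressed.
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n' | gzip -c > reads.fq.gz

if [ -f "$PRINSEQ" ]; then
    # Decompress on the fly; "stdin" tells prinseq-lite.pl to read standard input.
    gzip -dc reads.fq.gz | perl "$PRINSEQ" -fastq stdin \
        -min_len 5 -out_good reads_good -out_bad null
else
    echo "prinseq-lite.pl not found; skipping the PRINSEQ call" >&2
fi
```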

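Output parameters

Option	Meaning
-out_format	1: FASTA only; 2: FASTA and QUAL; 3: FASTQ; 4: FASTQ and FASTA; 5: all of them (FASTQ, FASTA and QUAL)
-out_good	Output for reads that pass the filters. By default the files are written to the same folder as the input file, with `_prinseq_good_XXXX` added to the name (XXXX is a random string that prevents overwriting earlier files). For paired-end input the names additionally carry _1, _1_singletons, _2, _2_singletons
-out_bad	Output for reads that fail the filters. By default the files are written to the same folder as the input file, with `_prinseq_bad_XXXX` added to the name (XXXX is a random string that prevents overwriting earlier files). For paired-end input the names additionally carry _1, _1_singletons, _2, _2_singletons
-log	Log file
-graph_data	File for the graph data; if no file name is given, `inputname.gd` is written
-graph_stats	Selects which statistics are calculated and included in the graph_data file
-qual_noscale	Do not scale (normalize) the quality data in the plots
-no_qual_header	Write only "+" instead of "+header" on the quality header line of FASTQ output, to reduce file size
-exact_only	Check for exact duplicates only when generating the graph data, which keeps memory use low and is faster
-seq_id_mappings	Text file that records the old and new sequence identifiers when sequences are renamed with -seq_id

With -out_good stdout -out_bad null, only the reads you want are output (to standard output) and the unwanted reads are discarded.
The statistics that can be selected with -graph_stats are:
ld (Length distribution),
gc (GC content distribution),
qd (Base quality distribution),
ns (Occurrence of N),
pt (Poly-A/T tails),
ts (Tag sequence check),
aq (Assembly quality measure),
de (Sequence duplication - exact only),
da (Sequence duplication - exact + 5’/3’),
sc (Sequence complexity),
dn (Dinucleotide odds ratios, includes the PCA plots)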
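
Two hedged usage sketches for the output options: the first streams the passing reads to standard output and compresses them, the second restricts the graph data to a few of the statistics listed above. They assume prinseq-lite.pl is in the current directory (PRINSEQ is a variable introduced here) and that test.fq exists (a one-read stand-in is written if it does not). The -min_len threshold and the output name filtered.fq.gz are arbitrary choices.

```shell
#!/bin/sh
PRINSEQ="${PRINSEQ:-./prinseq-lite.pl}"   # assumed location of the script

# Stand-in input so the sketch can run on its own; use your real data instead.
[ -f test.fq ] || printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n' > test.fq

# Statistics to compute, comma-separated without spaces.
STATS="ld,gc,qd"

if [ -f "$PRINSEQ" ]; then
    # 1) Good reads go to stdout and are gzip-compressed; failing reads are discarded.
    perl "$PRINSEQ" -fastq test.fq -min_len 5 \
        -out_good stdout -out_bad null | gzip -c > filtered.fq.gz

    # 2) Only the length, GC and base-quality distributions go into the graph data.
    perl "$PRINSEQ" -fastq test.fq -graph_data test.gd -graph_stats "$STATS" \
        -out_good null -out_bad null
else
    echo "prinseq-lite.pl not found; skipping the PRINSEQ calls" >&2
fi
```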

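Filter parameters

Parameter	Meaning
-min_len/-max_len	Minimum/maximum sequence length
-range_len	Filter sequences by length range
-min_gc/-max_gc	Minimum/maximum GC content of a sequence
-range_gc	Filter sequences by GC content range
-min_qual_score/-max_qual_score	Minimum/maximum quality score within a sequence
-min_qual_mean/-max_qual_mean	Minimum/maximum mean quality score of a sequence
-ns_max_p	Maximum percentage of Ns in a sequence
-ns_max_n	Maximum number of Ns in a sequence
-seq_num	Number of sequences to keep
-derep	Type of duplicates to filter: 1 exact duplicate; 2 5' duplicate; 3 3' duplicate; 4 reverse complement exact duplicate; 5 reverse complement 5'/3' duplicate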
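
As a hedged illustration of how several filters combine in one call, the sketch below keeps the filter options in one variable. The thresholds are arbitrary examples, not recommendations; PRINSEQ, test.fq and the output prefixes test_good/test_bad are names assumed here. Several -derep types are selected by writing their digits together (14 = types 1 and 4). If the script or the input is missing, the command is only printed.

```shell
#!/bin/sh
PRINSEQ="${PRINSEQ:-./prinseq-lite.pl}"   # assumed location of the script

# Filters, applied together (illustrative thresholds):
#   -min_len 50        drop reads shorter than 50 bp
#   -min_qual_mean 20  drop reads whose mean quality score is below 20
#   -ns_max_n 0        drop reads containing any N
#   -derep 14          drop exact duplicates (1) and reverse-complement exact duplicates (4)
FILTERS="-min_len 50 -min_qual_mean 20 -ns_max_n 0 -derep 14"

if [ -f "$PRINSEQ" ] && [ -f test.fq ]; then
    # $FILTERS is left unquoted on purpose so it splits into separate arguments.
    perl "$PRINSEQ" -fastq test.fq $FILTERS \
        -out_format 3 -out_good test_good -out_bad test_bad
else
    echo "would run: perl $PRINSEQ -fastq test.fq $FILTERS -out_format 3 -out_good test_good -out_bad test_bad"
fi
```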

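Trim parameters

Parameter	Meaning
-trim_to_len	Trim from the 3' end so that the sequence ends up with this length
-trim_left	Number of bases to trim from the left (5') end
-trim_right	Number of bases to trim from the right (3') end
-trim_left_p	Percentage of the sequence length to trim from the left end
-trim_right_p	Percentage of the sequence length to trim from the right end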
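
To make the trimming semantics concrete, here is a small awk sketch that mimics what -trim_left, -trim_right and -trim_to_len do to each read. It is an illustration only and not PRINSEQ's own code: trim_fastq is a hypothetical helper, it assumes a plain 4-line-per-read FASTQ, and it assumes the fixed-length trims are applied before -trim_to_len.

```shell
#!/bin/sh
# Illustration only: this awk filter is NOT PRINSEQ code. It mimics the effect of
# -trim_left, -trim_right and -trim_to_len on a 4-line-per-read FASTQ stream.
trim_fastq() {
    # $1 = bases to cut at the 5-prime end, $2 = bases to cut at the 3-prime end,
    # $3 = maximum final length (extra bases are cut from the 3-prime end)
    awk -v left="$1" -v right="$2" -v maxlen="$3" '
        NR % 4 == 2 || NR % 4 == 0 {   # sequence and quality lines get the same cut
            s = substr($0, left + 1, length($0) - left - right)
            if (length(s) > maxlen) s = substr(s, 1, maxlen)
            print s
            next
        }
        { print }                      # header and "+" lines pass through unchanged
    '
}

# One made-up 12-base read: cut 2 bases on the left, 3 on the right, then cap at 5.
printf '@read1\nAACCGGTTAACC\n+\nABCDEFGHIJKL\n' | trim_fastq 2 3 5
```

With PRINSEQ itself, the roughly corresponding options would be -trim_left 2 -trim_right 3 -trim_to_len 5.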

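Reformat parameters

Parameter	Meaning
-seq_case	Convert the sequence to upper or lower case
-dna_rna	Convert sequences between DNA and RNA
-seq_id	Replace the sequence names in the file with the given string (a counter is appended to keep the names unique)

For example, -seq_id mySeq_10 will generate the IDs (in FASTA format)
>mySeq_101, >mySeq_102, >mySeq_103, …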
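
The renaming scheme in this example can be reproduced with a few lines of awk. This is only a sketch of the naming pattern (prefix plus running counter), not PRINSEQ's implementation; rename_fasta is a hypothetical helper and the two input records are made up.

```shell
#!/bin/sh
# Illustration only (not PRINSEQ code): replace every FASTA header with the
# given prefix followed by a running counter, the pattern -seq_id produces.
rename_fasta() {
    awk -v prefix="$1" '
        /^>/ { n++; print ">" prefix n; next }   # header line: emit the new ID
        { print }                                # sequence line: unchanged
    '
}

# Two made-up records renamed with the prefix from the example above.
printf '>seqA some description\nACGT\n>seqB\nGGCC\n' | rename_fasta mySeq_10

# With PRINSEQ itself the call would look roughly like this (file names are placeholders):
#   perl prinseq-lite.pl -fasta in.fasta -seq_id mySeq_10 -out_good renamed -out_bad null
```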

prinseq-graphs.pl

This script takes the graph data produced by the QC step above and draws the plots.

VERSION
    PRINSEQ-graphs 0.6

SYNOPSIS
    perl prinseq-graphs.pl [-h] [-help] [-version] [-man] [-verbose] [-i
    input_graph_data_file] [-png_all] [-html_all] [-log file]

DESCRIPTION
    PRINSEQ will help you to preprocess your genomic or metagenomic sequence
    data in FASTA (and QUAL) or FASTQ format. The graphs version allows users
    of the lite version to generate graphs similar to the web version.

    ***** INPUT OPTIONS *****
    -i <file>
            Input file containing the graph data generated by the lite
            version.

    ***** OUTPUT OPTIONS *****
    -o <string>
            By default, the output files are created in the same directory as
            the input file with an additional "_prinseq_graphs_XXXX" in their
            name (where XXXX is replaced by random characters to prevent
            overwriting previous files). To change the output filename and
            location, specify the filename using this option. The file
            extension will be added automatically.

    -png_all
            Use this option to generate PNG files with the graphs.

    -html_all
            Use this option to generate a HTML file with the graphs and
            tables.

    -log <file>
            Log file to keep track of parameters, errors, etc. The log file
            name is optional. If no file name is given, the log file name will
            be "inputname.log". If the log file already exists, new content
            will be added to the file.
