用cutadapt软件来对双端测序数据去除接头

转载于:http://www.bio-info-trainee.com/1920.html

一般来讲,我们对测序数据进行QC,就三个大的方向:Quality trimming, Adapter removal, Contaminant filtering,当我们是双端测序数据的时候,去除接头时,也会丢掉太短的reads,就容易导致左右两端测序文件reads数量不平衡,有一个比较好的软件能解决这个问题,我比较喜欢的是cutadapt软件的PE模式来去除接头!尤其是做基因组或者转录组de novo 组装的时候,尤其要去掉接头,去的干干净净!
cutadapt是经典的python软件,但是因为我的linux服务器有点问题 ,可能是root权限问题,没有用pip install cutadapt 安装成功,我懒得搞这些了,其实可以自己去下载cutadapt的源码,然后进入源码文件夹里面 python setup.py install –user 到自己的 ~/.local/bin下面。
所以我用conda安装了cutadapt软件,http://www.bio-info-trainee.com/1906.html 所以我需要 python ~/miniconda2/pkgs/cutadapt-1.10-py27_0/bin/cutadapt –help 才能调用这个软件,不过,问题不大,我也就是试用一下。

首先用fastqc软件对测序数据进行简单检测,看看有什么接头:

clipboard

既然fastqc能探测到你的接头,说明它里面内置了所有的接头序列,在github可以查到: https://github.com/csf-ngs/fastqc/blob/master/Contaminants/contaminant_list.txt
或者:Download common Illumina adapters from    https://github.com/vsbuffalo/scythe/blob/master/illumina_adapters.fa
TruSeq Universal Adapter AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
Illumina Small RNA 3p Adapter 1 ATCTCGTATGCCGTCTTCTGCTTG
最严重的一般是TruSeq Universal Adapter , 而且它检测到你其它的接头可能就是这个 TruSeq Universal Adapter 的一部分而已~! 我们可以先用cutadapt去除试一下
cutadapt软件支持对PE 测序数据的处理,基本的用法是:
cutadapt -a ADAPTER_FWD -A ADAPTER_REV -o out.1.fastq -p out.2.fastq reads.1.fastq reads.2.fastq
-a和-A是左右端测序数据的3端接头,-g和-G是左右端测序数据的5端接头。
支持fastq和fasta格式的gz压缩文件,必要时用-f参数指定测序文件数据格式即可。
我自己的实际例子如下:
python ~/miniconda2/pkgs/cutadapt-1.10-py27_0/bin/cutadapt \
-a AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT -a ATCTCGTATGCCGTCTTCTGCTTG \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -A CAAGCAGAAGACGGCATACGAGAT \
-e 0.1 -O 5 -m 50  -n  2 –pair-filter=both \
-o read_PE1.fq -p read_PE2.fq test1.fastq test2.fastq >& log.txt
参数解释如下:
两个-a 参数后面接的是两种接头,两个-A参加后面接的是同样的两个接头的反向互补序列!
加上 -n 2 是因为有两个接头,我的测序数据里面有可能一个read里面含有两种接头,所以需要检测两次。
-e 0.1 -O 5 -m 50 是标准参数,自己看readme就好了,其中-m 设置为50 是表示去除接头后如果read长度小于50我就不要了,因为我是PE150测序的,这也就是为什么要用PE模式来去除接头,保证过滤后的reads还是数量继续平衡的。
去除了接头之后的文件再用fastqc软件跑一下,很明显可以看到 接头去除成功啦!
当然,这个软件还有很多其它的功能,我就不一一介绍了: 参考: http://cutadapt.readthedocs.io/en/stable/guide.html
帮助文档还是蛮长的:
cutadapt version 1.9.1
Copyright (C) 2010-2015 Marcel Martin < marcel.martin@scilifelab.se>
cutadapt removes adapter sequences from high-throughput sequencing reads.
Usage:
    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq
For paired-end reads:
    cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq
Replace “ADAPTER” with the actual sequence of your 3’ adapter. IUPAC wildcard
characters are supported. The reverse complement is *not* automatically
searched. All reads from input.fastq will be written to output.fastq with the
adapter sequence removed. Adapter matching is error-tolerant. Multiple adapter
sequences can be given (use further -a options), but only the best-matching
adapter will be removed.
Input may also be in FASTA format. Compressed input and output is supported and
auto-detected from the file name (.gz, .xz, .bz2). Use the file name ‘-’ for
standard input/output. Without the -o option, output is sent to standard output.
Citation:
Marcel Martin. Cutadapt removes adapter sequences from high-throughput
sequencing reads. EMBnet.Journal, 17(1):10-12, May 2011.
Use “cutadapt –help” to see all command-line options.
See http://cutadapt.readthedocs.org/ for full documentation.
Options:
  –version             show program’s version number and exit
  -h, –help            show this help message and exit
  –debug               Print debugging information.
  -f FORMAT, –format=FORMAT
                        Input file format; can be either ‘fasta’, ‘fastq’ or
                        ‘sra-fastq’. Ignored when reading csfasta/qual files
                        (default: auto-detect from file name extension).
  Options that influence how the adapters are found:
    Each of the three parameters -a, -b, -g can be used multiple times and
    in any combination to search for an entire set of adapters of possibly
    different types. Only the best matching adapter is trimmed from each
    read (but see the –times option). Instead of giving an adapter
    directly, you can also write file:FILE and the adapter sequences will
    be read from the given FASTA FILE.
    -a ADAPTER, –adapter=ADAPTER
                        Sequence of an adapter that was ligated to the 3’ end.
                        The adapter itself and anything that follows is
                        trimmed. If the adapter sequence ends with the ‘$’
                        character, the adapter is anchored to the end of the
                        read and only found if it is a suffix of the read.
    -g ADAPTER, –front=ADAPTER
                        Sequence of an adapter that was ligated to the 5’ end.
                        If the adapter sequence starts with the character ‘^’,
                        the adapter is ‘anchored’. An anchored adapter must
                        appear in its entirety at the 5’ end of the read (it
                        is a prefix of the read). A non-anchored adapter may
                        appear partially at the 5’ end, or it may occur within
                        the read. If it is found within a read, the sequence
                        preceding the adapter is also trimmed. In all cases,
                        the adapter itself is trimmed.
    -b ADAPTER, –anywhere=ADAPTER
                        Sequence of an adapter that was ligated to the 5’ or
                        3’ end. If the adapter is found within the read or
                        overlapping the 3’ end of the read, the behavior is
                        the same as for the -a option. If the adapter overlaps
                        the 5’ end (beginning of the read), the initial
                        portion of the read matching the adapter is trimmed,
                        but anything that follows is kept.
    -e ERROR_RATE, –error-rate=ERROR_RATE
                        Maximum allowed error rate (no. of errors divided by
                        the length of the matching region) (default: 0.1)
    –no-indels         Do not allow indels in the alignments (allow only
                        mismatches). (default: allow both mismatches and
                        indels)
    -n COUNT, –times=COUNT
                        Remove up to COUNT adapters from each read (default:
                        1)
    -O LENGTH, –overlap=LENGTH
                        Minimum overlap length. If the overlap between the
                        read and the adapter is shorter than LENGTH, the read
                        is not modified. This reduces the no. of bases trimmed
                        purely due to short random adapter matches (default:
                        3).
    –match-read-wildcards
                        Allow IUPAC wildcards in reads (default: False).
    -N, –no-match-adapter-wildcards
                        Do not interpret IUPAC wildcards in adapters.
    –no-trim           Match and redirect reads to output/untrimmed-output as
                        usual, but do not remove adapters.
    –mask-adapter      Mask adapters with ‘N’ characters instead of trimming
                        them.
  Additional read modifications:
    -u LENGTH, –cut=LENGTH
                        Remove LENGTH bases from the beginning or end of each
                        read. If LENGTH is positive, bases are removed from
                        the beginning of each read. If LENGTH is negative,
                        bases are removed from the end of each read. This
                        option can be specified twice if the LENGTHs have
                        different signs.
    -q [5’CUTOFF,]3’CUTOFF, –quality-cutoff=[5’CUTOFF,]3’CUTOFF
                        Trim low-quality bases from 5’ and/or 3’ ends of reads
                        before adapter removal. If one value is given, only
                        the 3’ end is trimmed. If two comma-separated cutoffs
                        are given, the 5’ end is trimmed with the first
                        cutoff, the 3’ end with the second. See documentation
                        for the algorithm. (default: no trimming)
    –quality-base=QUALITY_BASE
                        Assume that quality values in FASTQ are encoded as
                        ascii(quality + QUALITY_BASE). This needs to be set to
                        64 for some old Illumina FASTQ files. Default: 33
    –trim-n            Trim N’s on ends of reads.
    -x PREFIX, –prefix=PREFIX
                        Add this prefix to read names. Use {name} to insert
                        the name of the matching adapter.
    -y SUFFIX, –suffix=SUFFIX
                        Add this suffix to read names; can also include {name}
    –strip-suffix=STRIP_SUFFIX
                        Remove this suffix from read names if present. Can be
                        given multiple times.
    –length-tag=TAG    Search for TAG followed by a decimal number in the
                        description field of the read. Replace the decimal
                        number with the correct length of the trimmed read.
                        For example, use –length-tag ‘length=’ to correct
                        fields like ‘length=123’.
  Options for filtering of processed reads:
    –discard-trimmed, –discard
                        Discard reads that contain an adapter. Also use -O to
                        avoid discarding too many randomly matching reads!
    –discard-untrimmed, –trimmed-only
                        Discard reads that do not contain the adapter.
    -m LENGTH, –minimum-length=LENGTH
                        Discard trimmed reads that are shorter than LENGTH.
                        Reads that are too short even before adapter removal
                        are also discarded. In colorspace, an initial primer
                        is not counted (default: 0).
    -M LENGTH, –maximum-length=LENGTH
                        Discard trimmed reads that are longer than LENGTH.
                        Reads that are too long even before adapter removal
                        are also discarded. In colorspace, an initial primer
                        is not counted (default: no limit).
    –max-n=COUNT       Discard reads with too many N bases. If COUNT is an
                        integer, it is treated as the absolute number of N
                        bases. If it is between 0 and 1, it is treated as the
                        proportion of N’s allowed in a read.
  Options that influence what gets output to where:
    –quiet             Do not print a report at the end.
    -o FILE, –output=FILE
                        Write modified reads to FILE. FASTQ or FASTA format is
                        chosen depending on input. The summary report is sent
                        to standard output. Use ‘{name}’ in FILE to
                        demultiplex reads into multiple files. (default:
                        trimmed reads are written to standard output)
    –info-file=FILE    Write information about each read and its adapter
                        matches into FILE. See the documentation for the file
                        format.
    -r FILE, –rest-file=FILE
                        When the adapter matches in the middle of a read,
                        write the rest (after the adapter) into FILE.
    –wildcard-file=FILE
                        When the adapter has N bases (wildcards), write
                        adapter bases matching wildcard positions to FILE.
                        When there are indels in the alignment, this will
                        often not be accurate.
    –too-short-output=FILE
                        Write reads that are too short (according to length
                        specified by -m) to FILE. (default: discard reads)
    –too-long-output=FILE
                        Write reads that are too long (according to length
                        specified by -M) to FILE. (default: discard reads)
    –untrimmed-output=FILE
                        Write reads that do not contain the adapter to FILE.
                        (default: output to same file as trimmed reads)
  Colorspace options:
    -c, –colorspace    Enable colorspace mode: Also trim the color that is
                        adjacent to the found adapter.
    -d, –double-encode
                        Double-encode colors (map 0,1,2,3,4 to A,C,G,T,N).
    -t, –trim-primer   Trim primer base and the first color (which is the
                        transition to the first nucleotide)
    –strip-f3          Strip the _F3 suffix of read names
    –maq, –bwa        MAQ- and BWA-compatible colorspace output. This
                        enables -c, -d, -t, –strip-f3 and -y ‘/1’.
    –no-zero-cap       Do not change negative quality values to zero in
                        colorspace data. By default, they are changed to zero
                        since many tools have problems with negative
                        qualities.
    -z, –zero-cap      Change negative quality values to zero. This is
                        enabled by default when -c/–colorspace is also
                        enabled. Use the above option to disable it.
  Paired-end options:
    The -A/-G/-B/-U options work like their -a/-b/-g/-u counterparts.
    -A ADAPTER          3’ adapter to be removed from second read in a pair.
    -G ADAPTER          5’ adapter to be removed from second read in a pair.
    -B ADAPTER          5’/3 adapter to be removed from second read in a pair.
    -U LENGTH           Remove LENGTH bases from the beginning or end of each
                        second read (see –cut).
    -p FILE, –paired-output=FILE
                        Write second read in a pair to FILE.
    –pair-filter=(any|both)
                        Which of the reads in a paired-end read have to match
                        the filtering criterion in order for it to be
                        filtered. Default: any.
    –interleaved       Read and write interleaved paired-end reads.
    –untrimmed-paired-output=FILE
                        Write second read in a pair to this FILE when no
                        adapter was found in the first read. Use this option
                        together with –untrimmed-output when trimming paired-
                        end reads. (Default: output to same file as trimmed
                        reads.)
    –too-short-paired-output=FILE
                        Write second read in a pair to this file if pair is
                        too short. Use together with –too-short-output.
    –too-long-paired-output=FILE
                        Write second read in a pair to this file if pair is
                        too long. Use together with –too-long-output.


 

I think assembly is one of the things were you definitely want to remove adapters, if not you’ll have spurious overlaps creating erroneous contigs/transcripts, just because they share an adapter.
去接头对基因组或者转录组组装是非常重要的:
  • 8
    点赞
  • 25
    收藏
    觉得还不错? 一键收藏
  • 2
    评论
### 回答1: GNU Barcode是一款开源的条形码生成软件,它可以在Linux、Unix和Windows等操作系统上运行。barcode_create是GNU Barcode中的一个命令行工具,用于创建不同类型的条形码,例如EAN、UPC、Code 39等。 使用barcode_create命令可以轻松地生成代码,只需指定所需的参数即可。例如,要创建一个EAN-13条形码,只需在命令行中输入"barcode_create -t ean13 -o output.eps 1234567890128"。在此命令中,"-t"参数用于指定条形码类型,"-o"参数是输出文件的名称,"1234567890128"是要编码的数字。 barcode_create支持多种条形码类型和选项,使用起来非常灵活。除了命令行工具外,GNU Barcode还提供了GUI界面和库函数,使得开发者可以方便地在自己的程序中集成条形码生成功能。 总之,GNU Barcode是一款功能强大的开源条形码生成软件,而barcode_create则是其中的一个实用工具。无论是个人用户还是企业开发者,都可以使用GNU Barcode以及其相关工具来轻松创建和管理条形码。 ### 回答2: gnu barcode及其子命令barcode_create是一个免费的开源的条形码生成工具,它在Linux操作系统上运行,可以生成多种条形码类型,包括EAN-13、UPC-A、CODE-128等。 使用barcode_create命令,用户可以通过命令行输入数据和条形码类型参数来生成指定类型的条形码文件。在生成条形码时,用户还可以设置条形码大小、颜色等参数。 同时,gnu barcode提供了C、C++、Java、Perl和Python等多种编程语言的API,方便开发人员在应用程序中集成条形码生成功能。 总之,gnu barcode barcode_create是一个实用、灵活、易用的条形码生成工具,适用于Linux用户和开发人员。 ### 回答3: GNU Barcode是一款开源的条形码生成器,允许用户生成不同类型的条形码,如EAN、UPC、Code 128等。Barcode_create是GNU Barcode中用于创建条形码的命令行工具。该命令可以接收用户输入的参数,生成相应的条形码,并输出到终端或指定的文件中。 使用barcode_create需要按照一定的格式输入参数,其中最基本的参数是要生成的条形码类型和相应的编码数据。例如,要生成一个EAN-13码,可以使用下面的命令: barcode_create -b ean13 -o barcode.ps "5901234123457" 其中,-b参数指定了要生成的条形码类型,-o参数指定了输出文件的名称,"5901234123457"则是EAN-13的编码数据。除了这些基本参数外,barcode_create还支持多种其他参数,如条形码高度、缩放比例、边距等。 总的来说,GNU Barcode和barcode_create是一种非常实用的工具,能够方便地帮助用户生成各种类型的条形码。在各种需要使用条形码的场景中,如商业货物管理、图书编目、票务管理等,都可以使用GNU Barcode来生成相应的条形码。
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值