基于RNA-seq的基因表达分析

最新推荐文章于 2025-08-06 12:40:08 发布

原创

最新推荐文章于 2025-08-06 12:40:08 发布 · 1.6w 阅读

16 ·

CC 4.0 BY-SA版权

文章标签：

#数据 #RNA-seq #软件 #基因表达

本文详细介绍了基于RNA-seq的基因表达分析流程，从序列预处理到均一化表达量计算。使用AdapterRemoval和bbduk2进行数据清洁，hisat2进行序列比对，featureCounts统计表达counts，并通过FPKM进行均一化，以评估基因表达水平。

我的青春
最近在做一些小麦基因的表达分析，想到使用RNA-seq的数据进行生物信息学分析，并且比我做实验用的组织还要多。

序列预处理

下载数据之后，首先要对数据进行低质量序列和载体序列等污染序列去除，我这里结合了两个软件AdapterRemoval和bbduk2, bbduk2是bbmap中的一个子程序。

AdapterRemoval --file1 input1.fastq.gz --file2 input2.fastq.gz --qualitybase 33 --trimns --minlength 40 --threads 10 --adapter-list ~/adapterremoval-2.1.7/benchmark/adapters/adapters.fasta --output1 output1.fastq.gz --output2 output2.fastq.gz

可在终端键入AdapterRemoval，即可看见详细参数。如下

AdapterRemoval ver. 2.1.7

This program searches for and removes remnant adapter sequences from
your read data.  The program can analyze both single end and paired end
data.  For detailed explanation of the parameters, please refer to the
man page.  For comments, suggestions  and feedback please contact Stinus
Lindgreen (stinus@binf.ku.dk) and Mikkel Schubert (MikkelSch@gmail.com).

If you use the program, please cite the paper:
    Schubert, Lindgreen, and Orlando (2016). AdapterRemoval v2: rapid
    adapter trimming, identification, and read merging.
    BMC Research Notes, 12;9(1):88.

    http://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-016-1900-2


Arguments:                           Description:
  --help                             Display this message.
  --version                          Print the version string.

  --file1 FILE                       Input file containing mate 1 reads or single-ended reads [REQUIRED].
  --file2 FILE                       Input file containing mate 2 reads [OPTIONAL].

FASTQ OPTIONS:
  --qualitybase BASE                 Quality base used to encode Phred scores in input; either 33, 64, or solexa [current: 33].
  --qualitybase-output BASE          Quality base used to encode Phred scores in output; either 33, 64, or solexa. By default, reads will be written in the same format as the that specified using --qualitybase.
  --qualitymax BASE                  Specifies the maximum Phred score expected in input files, and used when writing output. ASCII encoded values are limited to the characters '!' (ASCII = 33) to'~' (ASCII = 126), meaning that possible scores are 0 - 93 with offset 33, and 0 - 62 for offset 64 and Solexa scores [default: 41].
  --mate-separator CHAR              Character separating the mate number (1 or 2) from the read name in FASTQ records [default: '/'].
  --interleaved                      This option enables both the --interleaved-input option and the
                                       --interleaved-output option [current: off].
  --interleaved-input                The (single) input file provided contains both the mate 1 and mate 2 reads, one pair after the other, with one mate 1 reads followed by one mate 2 read. This option is implied by the --interleaved option [current: off].
  --interleaved-output               If set, trimmed paired-end reads are written to a single file containing mate 1 and mate 2 reads, one pair after the other. This option is implied by the --interleaved option [current: off].

OUTPUT FILES:
  --basename BASENAME                Default prefix for all output files for which no filename was explicitly set [current: your_output].
  --settings FILE                    Output file containing information on the parameters used in the run as well as overall statistics on the reads after trimming [default: BASENAME.settings]
  --output1 FILE                     Output file containing trimmed mate1 reads [default: BASENAME.pair1.truncated (PE), BASENAME.truncated (SE), or BASENAME.paired.truncated (interleaved PE)]
  --output2 FILE                     Output file containing trimmed mate 2 reads [default: BASENAME.pair2.truncated (only used in PE mode, but not if --interleaved-output is enabled)]
  --singleton FILE                   Output file to which containing paired reads for which the mate has been discarded [default: BASENAME.singleton.truncated]
  --outputcollapsed FILE             If --collapsed is set, contains overlapping mate-pairs which have been merged into a single read (PE mode) or reads for which the adapter was identified by a minimum overlap, indicating that the entire template molecule is present. This does not include which have subsequently been trimmed due to low-quality or ambiguous nucleotides [default: BASENAME.collapsed]
  --outputcollapsedtruncated FILE    Collapsed reads (see --outputcollapsed) which were trimmed due the presence of low-quality or ambiguous nucleotides [default: BASENAME.collapsed.truncated]
  --discarded FILE                   Contains reads discarded due to the --minlength, --maxlength or --maxns options [default: BASENAME.discarded]

OUTPUT COMPRESSION:
  --gzip                             Enable gzip compression [current: off]
  --gzip-level LEVEL                 Compression level, 0 - 9 [current: 6]
  --bzip2                            Enable bzip2 compression [current: off]
  --bzip2-level LEVEL                Compression level, 0 - 9 [current: 9]

TRIMMING SETTINGS:
  --adapter1 SEQUENCE                Adapter sequence expected to be found in mate 1 reads [current: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG].
  --adapter2 SEQUENCE                Adapter sequence expected to be found in mate 2 reads [current: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT].
  --adapter-list FILENAME            Read table of white-space separated adapters pairs, used as if the first column was supplied to --adapter1, and the second column was supplied to --adapter2; only the first adapter in each pair is required SE trimming mode [current:<not set>].

  --mm MISMATCH_RATE                 Max error-rate when aligning reads and/or adapters. If > 1, the max error-rate is set to 1 / MISMATCH_RATE; if < 0, the defaults are used, otherwise the user-supplied value is used directly. [defaults: 1/3 for trimming; 1/10 when identifing adapters].
  --maxns MAX                        Reads containing more ambiguous bases (N) than this number after trimming are discarded [current: 1000].
  --shift N                          Consider alignments where up to N nucleotides are missing from the 5' termini [current: 2].

  --trimns                           If set, trim ambiguous bases (N) at 5'/3' termini [current: off]
  --trimqualities                    If set, trim bases at 5'/3' termini with quality scores <= to --minquality value [current: off]
  --minquality PHRED                 Inclusive minimum; see --trimqualities for details [current: 2]
  --minlength LENGTH                 Reads shorter than this length are discarded following trimming [current: 15].
  --maxlength LENGTH                 Reads longer than this length are discarded following trimming [current:4294967295].
  --collapse                         When set, paired ended read alignments of --minalignmentlength or more bases are combined into a single consensus sequence, representing the complete insert,and written to either basename.collapsed or basename.collapsed.truncated (if trimmed due to low-quality bases following collapse); for single-ended reads,putative complete inserts are identified as having at least --minalignmentlength bases overlap with the adapter sequence, and are written to the the same files [current: off].
  --minalignmentlength LENGTH        If --collapse is set, paired reads must overlap at least this number of bases to be collapsed, and single-ended reads must overlap at least this number of bases with the adapter to be considered complete template molecules [current:11].
  --minadapteroverlap LENGTH         In single-end mode, reads are only trimmed if the overlap between read and the adapter is at least X bases long, not counting ambiguous nucleotides (N); this is independant of the --minalignmentlength when using --collapse, allowing a conservative selection of putative complete inserts while ensuring that all possible adapter contamination is trimmed [current: 0].

DEMULTIPLEXING:
  --barcode-list FILENAME            List of barcodes or barcode pairs for single or double-indexed demultiplexing. Note that both indexes should be specified for both single-end and paired-end trimming, if double-indexed multiplexing was used, in order to ensure that the demultiplexed reads can be trimmed correctly [current: <not set>].
  --barcode-mm N                     Maximum number of mismatches allowed when counting mismatches in both the mate 1 and the mate 2 barcode for paired reads.
  --barcode-mm-r1 N                  Maximum number of mismatches allowed for the mate 1 barcode; if not set, this value is equal to the '--barcode-mm' value; cannot be higher than the '--barcode-mm value'.
  --barcode-mm-r2 N                  Maximum number of mismatches allowed for the mate 2 barcode; if not set, this value is equal to the '--barcode-mm' value; cannot be higher than the '--barcode-mm value'.

MISC:
  --identify-adapters                Attempt to identify the adapter pair of PE reads, by searching for overlapping reads [current: off].
  --seed SEED                        Sets the RNG seed used when choosing between bases with equal Phred scores when collapsing. Note that runs are not deterministic if more than one thread is used. If not specified, a seed is generated using the current time.
  --threads THREADS                  Maximum number of threads [current: 1]

其中--identify-adapters 参数可以在PE reads中鉴定载体序列
bbduk2的命令如下

/data1/masw/bbmap/bbduk2.sh -da in=ATW_AKOSW_2_1_D0KD1ACXX.IND12.fastq_1.gz IN2=ATW_AKOSW_2_2_D0KD1ACXX.IND12.fastq_1.gz out=ATW_AKOSW_2_1_D0KD1ACXX.IND12.fastq_2.gz out2=ATW_AKOSW_2_2_D0KD1ACXX.IND12.fastq_2.gz stats=1.2.txt k=20 minlength=40 mink=8 hdist=2 ref=/data1/masw/bbmap/resources/sequencing_artifacts.fa.gz tbo entropy=0.5 entropywindow=50 entropyk=5

同样的在终端下键入命令/data1/masw/bbmap/bbduk2.sh 可以查看详细的参数

Written by Brian Bushnell
Last modified June 27, 2016

BBDuk2 is like BBDuk but can kfilter, kmask, and ktrim in a single pass.
It does not replace BBDuk, and is only provided to allow maximally efficient
pipeline integration when multiple steps will be performed.  The syntax is 
slightly different.

Description:  Compares reads to the kmers in a reference dataset, optionally 
allowing an edit distance. Splits the reads into two outputs - those that 
match the reference, and those that don't. Can also trim (remove) the matching 
parts of the reads rather than binning the reads.

Usage:  bbduk2.sh in=<input file> out=<output file> fref=<contaminant files>

Input may be stdin or a fasta or fastq file, compressed or uncompressed.
If you pipe via stdin/stdout, please include the file type; e.g. for gzipped 
fasta input, set in=stdin.fa.gz


Input parameters:
in=<file>           Main input. in=stdin.fq will pipe from stdin.
in2=<file>          Input for 2nd read of pairs in a different file.
fref=<file,file>    Comma-delimited list of fasta reference files for filtering.
rref=<file,file>    Comma-delimited list of fasta reference files for right-trimming.
lref=<file,file>    Comma-delimited list of fasta reference files for left-trimming.
mref=<file,file>    Comma-delimited list of fasta reference files for masking.
fliteral=<seq,seq>  Comma-delimited list of literal sequences for filtering.
rliteral=<seq,seq>  Comma-delimited list of literal sequences for right-trimming.
lliteral=<seq,seq>  Comma-delimited list of literal sequences for left-trimming.
mliteral=<seq,seq>  Comma-delimited list of literal sequences for masking.
touppercase=f       (tuc) Change all bases upper-case.
interleaved=auto    (int) t/f overrides interleaved autodetection.
qin=auto            Input quality offset: 33 (Sanger), 64, or auto.
reads=-1            If positive, quit after processing X reads or pairs.
copyundefined=f     (cu) Process non-AGCT IUPAC reference bases by making all
                    possible unambiguous copies.  Intended for short motifs
                    or adapter barcodes, as time/memory use is exponential.

Output parameters:
out=<file>          (outnonmatch) Write reads here that do not contain 
                    kmers matching the database.  'out=stdout.fq' will pipe 
                    to standard out.
out2=<file>         (outnonmatch2) Use this to write