1
01-rna-seq从头开始 卖萌哥 Linux生信技能树Linux安装软件 Linux实战RNASEQ上游_YoungLeelight的博客-CSDN博客
1.首先构建项目所需目录,这样比较哦清晰。共分为四个目录,
2.原始数据上传测序数据
或者自己通过kingfisher下载sra原始数据或者 fastq.gz数据
下载公共测序数据的另一种姿势(kingfisher) - 简书
生物信息常见文件的格式以及查看方式 | Public Library of Bioinformatics
高速快速下载基因组ref文件
查看文件下载是否完整
md5值检查文件完整性,因为原始文件 太大,需要检查数据完成性 md5值给每个文件一个独特的id,根据id是否相等来检查文件完整性cd 01raw_data/
md5sum *gz>md5.txt #给每个 gz文件都生成md5值,,并且输入到 md5.txt
ls
cat md5.txt #查看内容
md5sum -c md5.txt #比较文件自带的md5值和自己生成的md5值是否相等。若相等则文件相等。 -c参数,check一下
md5sum --help
查看fastq文件内容
-
# zcat查看gzip压缩的文件 # head -n 8 显示前8行文件内容(前8行代表2条序列) zcat filename.fq.gz | head -n 8
3 查看原始数据的质量情况 质控
conda activate rna
#当前目录下
fastqc S*gz
ls -lh
multiqc ./
fastq.gz为原始的测序数据
fastqc.zip 为fastqc质控时候产生的数据
似乎质量不行啊
通过前面这些步骤,已经初步判定数据是否合格。如果不合格,那如何修改?或者把不合格的数据扔掉?trimmgalore质控
4 trim_galore去接头(并行处理) - 简书
trimmgalore批量质控 双端数据
2 分离_1和_2文件
(wes) pc@lab-pc:/project/raw_fq$ ls|grep _1.fastq.gz>gz1
(wes) pc@lab-pc:/project/raw_fq$ ls|grep _2.fastq.gz>gz2
(wes) pc@lab-pc:/project/raw_fq$ paste gz1 gz2>config
(wes) pc@lab-pc:/project/raw_fq$ cat config|head
然后写一个脚本 用于 trimmgalore批量质控 双端数据 同时处理两个参数
$一定不要忘记加上
(rna) tree119 21:48:01 ~/pipeline/download/rnaseq/01_rawdata
$ cat trim.sh
outdir=~/pipeline/download/rnaseq/02cleandata/
cat config |while read id
do
arr=${id}
fq1=${arr[0]}
fq2=${arr[1]}
nohup trim_galore -q 25 --phred33 --length 36 --stringency 3 --paired -o $outdir $fq1 $fq2 &
done
运行即可
bash trim.sh
结果输入到了outdir文件夹下 (这里指的是02cleandata文件夹)
得到的质控过滤后的文件:
$ ls -lh *fq.gz|cut -d" " -f 5-
选取运行过程中其中一对结果展示如下
RUN STATISTICS FOR INPUT FILE: SRR11618782_1.fastq.gz
=============================================
343183 sequences processed in total
The length threshold of paired-end sequences gets evaluated later on (in the validation step)
Writing report to '/home/st8/ssd2/tree119/pipeline/download/rnaseq/01_rawdata/outdir/SRR11618782_2.fastq.gz_trimming_report.txt'
SUMMARISING RUN PARAMETERS
==========================
Input filename: SRR11618782_2.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 1.18
Number of cores used for trimming: 1
Quality Phred score cutoff: 25
Quality encoding type selected: ASCII+33
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 3 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 36 bp
Output file(s) will be GZIP compressed
Cutadapt seems to be reasonably up-to-date. Setting -j -j 1
Writing final adapter and quality trimmed output to SRR11618782_2_trimmed.fq.gz
>>> Now performing quality (cutoff '-q 25') and adapter trimming in a single pass for the adapter sequence: 'CTGTCTCTTATA' from file SRR11618782_2.fastq.gz <<<
This is cutadapt 1.18 with Python 3.7.15
Command line parameters: -j 1 -e 0.1 -q 25 -O 3 -a CTGTCTCTTATA SRR11618610_2.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 2.45 s (18 us/read; 3.36 M reads/minute).
=== Summary ===
Total reads processed: 137,124
Reads with adapters: 4,643 (3.4%)
Reads written (passing filters): 137,124 (100.0%)
Total basepairs processed: 13,712,400 bp
Quality-trimmed: 305,580 bp (2.2%)
Total written (filtered): 13,368,511 bp (97.5%)
=== Adapter 1 ===
Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 4643 times.
No. of allowed errors:
0-9 bp: 0; 10-12 bp: 1
Bases preceding removed adapters:
A: 18.6%
C: 33.4%
G: 24.8%
T: 23.1%
none/other: 0.0%
Overview of removed sequences
length count expect max.err error counts
3 3150 2142.6 0 3150
4 577 535.6 0 577
5 174 133.9 0 174
6 39 33.5 0 39
7 12 8.4 0 12
8 15 2.1 0 15
9 15 0.5 0 13 2
10 29 0.1 1 13 16
11 3 0.0 1 2 1
12 10 0.0 1 8 2
13 11 0.0 1 11
14 13 0.0 1 11 2
15 16 0.0 1 12 4
16 15 0.0 1 10 5
17 8 0.0 1 6 2
18 6 0.0 1 5 1
19 12 0.0 1 11 1
20 10 0.0 1 6 4
21 12 0.0 1 11 1
22 9 0.0 1 8 1
23 16 0.0 1 13 3
24 8 0.0 1 6 2
25 10 0.0 1 9 1
26 7 0.0 1 6 1
27 16 0.0 1 14 2
28 5 0.0 1 4 1
29 11 0.0 1 9 2
30 26 0.0 1 19 7
31 32 0.0 1 29 3
32 10 0.0 1 9 1
33 22 0.0 1 13 9
34 11 0.0 1 11
35 17 0.0 1 9 8
36 7 0.0 1 3 4
37 3 0.0 1 3
38 20 0.0 1 19 1
39 14 0.0 1 13 1
40 7 0.0 1 4 3
41 22 0.0 1 20 2
42 39 0.0 1 37 2
43 5 0.0 1 4 1
44 15 0.0 1 15
45 7 0.0 1 5 2
46 12 0.0 1 9 3
47 1 0.0 1 0 1
48 8 0.0 1 7 1
49 6 0.0 1 6
50 2 0.0 1 1 1
51 3 0.0 1 3
52 3 0.0 1 0 3
53 5 0.0 1 3 2
54 2 0.0 1 0 2
55 1 0.0 1 1
57 8 0.0 1 5 3
58 8 0.0 1 7 1
59 1 0.0 1 1
60 11 0.0 1 8 3
61 26 0.0 1 24 2
62 5 0.0 1 4 1
63 6 0.0 1 3 3
64 8 0.0 1 7 1
65 1 0.0 1 0 1
67 3 0.0 1 1 2
68 3 0.0 1 3
69 1 0.0 1 0 1
70 1 0.0 1 0 1
72 3 0.0 1 0 3
73 1 0.0 1 0 1
74 3 0.0 1 0 3
77 14 0.0 1 2 12
78 1 0.0 1 1
79 5 0.0 1 0 5
81 5 0.0 1 0 5
84 6 0.0 1 1 5
86 2 0.0 1 0 2
88 2 0.0 1 0 2
90 3 0.0 1 0 3
91 2 0.0 1 1 1
92 2 0.0 1 1 1
93 1 0.0 1 0 1
95 1 0.0 1 0 1
98 1 0.0 1 0 1
RUN STATISTICS FOR INPUT FILE: SRR11618610_2.fastq.gz
=============================================
137124 sequences processed in total
The length threshold of paired-end sequences gets evaluated later on (in the validation step)
Validate paired-end files SRR11618610_1_trimmed.fq.gz and SRR11618610_2_trimmed.fq.gz
file_1: SRR11618610_1_trimmed.fq.gz, file_2: SRR11618610_2_trimmed.fq.gz
>>>>> Now validing the length of the 2 paired-end infiles: SRR11618610_1_trimmed.fq.gz and SRR11618610_2_trimmed.fq.gz <<<<<
Writing validated paired-end Read 1 reads to SRR11618610_1_val_1.fq.gz
Writing validated paired-end Read 2 reads to SRR11618610_2_val_2.fq.gz
Total number of sequences analysed: 137124
Number of sequence pairs removed because at least one read was shorter than the length cutoff (36 bp): 2597 (1.89%)
Deleting both intermediate output files SRR11618610_1_trimmed.fq.gz and SRR11618610_2_trimmed.fq.gz
====================================================================================================
This is cutadapt 1.18 with Python 3.7.15
Command line parameters: -j 1 -e 0.1 -q 25 -O 3 -a CTGTCTCTTATA SRR11618782_2.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 4.63 s (13 us/read; 4.44 M reads/minute).
=== Summary ===
Total reads processed: 343,183
Reads with adapters: 31,276 (9.1%)
Reads written (passing filters): 343,183 (100.0%)
Total basepairs processed: 34,318,300 bp
Quality-trimmed: 2,815,627 bp (8.2%)
Total written (filtered): 30,713,243 bp (89.5%)
=== Adapter 1 ===
Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 31276 times.
No. of allowed errors:
0-9 bp: 0; 10-12 bp: 1
Bases preceding removed adapters:
A: 17.2%
C: 45.5%
G: 19.7%
T: 17.6%
none/other: 0.0%
Overview of removed sequences
length count expect max.err error counts
3 6021 5362.2 0 6021
4 1325 1340.6 0 1325
5 542 335.1 0 542
6 372 83.8 0 372
7 358 20.9 0 358
8 348 5.2 0 348
9 291 1.3 0 283 8
10 386 0.3 1 281 105
11 384 0.1 1 305 79
12 490 0.0 1 387 103
13 388 0.0 1 331 57
14 512 0.0 1 406 106
15 371 0.0 1 323 48
16 392 0.0 1 307 85
17 449 0.0 1 370 79
18 354 0.0 1 272 82
19 517 0.0 1 422 95
20 440 0.0 1 371 69
21 380 0.0 1 294 86
22 310 0.0 1 249 61
23 220 0.0 1 199 21
24 291 0.0 1 255 36
25 604 0.0 1 510 94
26 370 0.0 1 306 64
27 489 0.0 1 411 78
28 314 0.0 1 265 49
29 600 0.0 1 491 109
30 432 0.0 1 382 50
31 1108 0.0 1 954 154
32 361 0.0 1 306 55
33 501 0.0 1 418 83
34 509 0.0 1 421 88
35 548 0.0 1 472 76
36 411 0.0 1 346 65
37 665 0.0 1 507 158
38 1168 0.0 1 1053 115
39 2363 0.0 1 2233 130
40 276 0.0 1 235 41
41 537 0.0 1 506 31
42 278 0.0 1 249 29
43 452 0.0 1 428 24
44 368 0.0 1 344 24
45 60 0.0 1 53 7
46 74 0.0 1 65 9
47 129 0.0 1 122 7
48 225 0.0 1 200 25
49 366 0.0 1 346 20
50 163 0.0 1 146 17
51 43 0.0 1 40 3
52 26 0.0 1 20 6
53 79 0.0 1 63 16
54 14 0.0 1 12 2
55 80 0.0 1 61 19
56 35 0.0 1 22 13
57 152 0.0 1 146 6
58 184 0.0 1 179 5
59 121 0.0 1 114 7
60 242 0.0 1 227 15
61 796 0.0 1 756 40
62 146 0.0 1 134 12
63 64 0.0 1 62 2
64 146 0.0 1 134 12
65 66 0.0 1 59 7
66 20 0.0 1 18 2
67 104 0.0 1 85 19
68 65 0.0 1 58 7
69 16 0.0 1 11 5
70 11 0.0 1 10 1
71 11 0.0 1 9 2
72 19 0.0 1 13 6
73 17 0.0 1 14 3
74 40 0.0 1 27 13
75 21 0.0 1 14 7
76 9 0.0 1 8 1
77 18 0.0 1 15 3
78 13 0.0 1 10 3
79 16 0.0 1 10 6
80 11 0.0 1 9 2
81 9 0.0 1 3 6
82 14 0.0 1 14
83 13 0.0 1 12 1
84 17 0.0 1 15 2
85 33 0.0 1 30 3
86 17 0.0 1 16 1
87 20 0.0 1 20
88 5 0.0 1 5
89 13 0.0 1 13
90 6 0.0 1 5 1
91 7 0.0 1 6 1
92 13 0.0 1 2 11
95 1 0.0 1 0 1
96 1 0.0 1 0 1
97 7 0.0 1 1 6
100 3 0.0 1 0 3
5.比对 align索引文件和质控好的测序数据进行比对
下载参考基因组 下载已经建立好索引的基因组
查看下载的基因组是否正确 查看下载内容时候有误
md5sum检查完整性 查看gtf文件是否完整
案例一
md5sum ./* >md5sum.txt
md5sum -c md5sum.txt
结果如下,说明文件是完整的
$ md5sum -c md5sum.txt
./CM000663.1: OK
./CM000664.1: OK
./CM000665.1: OK
./CM000666.1: OK
./CM000667.1: OK
./CM000668.1: OK
./CM000669.1: OK
./CM000670.1: OK
./CM000671.1: OK
./CM000672.1: OK
./CM000673.1: OK
./CM000674.1: OK
./CM000675.1: OK
./CM000676.1: OK
./CM000677.1: OK
./CM000678.1: OK
./CM000679.1: OK
./CM000680.1: OK
./CM000681.1: OK
./CM000682.1: OK
./CM000683.1: OK
./CM000684.1: OK
./CM000685.1: OK
./CM000686.1: OK
./GL000191.1: OK
./GL000192.1: OK
./GL000193.1: OK
./GL000194.1: OK
./GL000195.1: OK
./GL000196.1: OK
./GL000197.1: OK
./GL000198.1: OK
./GL000199.1: OK
./GL000200.1: OK
./GL000201.1: OK
./GL000202.1: OK
./GL000203.1: OK
./GL000204.1: OK
./GL000205.1: OK
./GL000206.1: OK
./GL000207.1: OK
./GL000208.1: OK
./GL000209.1: OK
./GL000210.1: OK
./GL000211.1: OK
./GL000212.1: OK
./GL000213.1: OK
./GL000214.1: OK
./GL000215.1: OK
./GL000216.1: OK
./GL000217.1: OK
./GL000218.1: OK
./GL000219.1: OK
./GL000220.1: OK
./GL000221.1: OK
./GL000222.1: OK
./GL000223.1: OK
./GL000224.1: OK
./GL000225.1: OK
./GL000226.1: OK
./GL000227.1: OK
./GL000228.1: OK
./GL000229.1: OK
./GL000230.1: OK
./GL000231.1: OK
./GL000232.1: OK
./GL000233.1: OK
./GL000234.1: OK
./GL000235.1: OK
./GL000236.1: OK
./GL000237.1: OK
./GL000238.1: OK
./GL000239.1: OK
./GL000240.1: OK
./GL000241.1: OK
./GL000242.1: OK
./GL000243.1: OK
./GL000244.1: OK
./GL000245.1: OK
./GL000246.1: OK
./GL000247.1: OK
./GL000248.1: OK
./GL000249.1: OK
./NC_012920.1: OK
./SRR11832836.sra: OK
md5sum *.gz >md5.txt
cat md5.txt
md5sum -c md5.txt
下载索引文件
wget -c
对基因组文件解压缩 去掉gz后缀
ls *gz |xargs gunzip
对索引文件解压缩
$ cd reference/index
$ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz
$ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/mm10.tar.gz
# 解压得到两个目录,hg19和mm10
$ tar -zxvf *.tar.gz
得到8个索引
基因组注释文件(GFF,GTF)下载的五种方法_白墨石的博客-CSDN博客_基因组注释文件
hg38的gtf文件下载
genecode中参考基因组的名字 更新很快!
新建脚本来运行 制备文件准备工作
489 ls |grep _1.fq.gz
490 ls |grep _1.fq.gz >gz1
491 ls |grep _2.fq.gz >gz2
492 paste gz1 gz2 >file_for_align
493 cat file_for_align
新建脚本来运行 失败
vim align.sh
cat file_for_align |while read id
do
arr=${id}
fq1=${arr[0]}
fq2=${arr[1]}
nohup sh -c " hisat2 -p 2 -x ~/pipeline/download/rnaseq/00ref/index_file/grch38/genome -1 ${fq1} -2 ${fq2} 2>${id%%_*}.log | samtools sort -@ 2 -o ${id%%_*}.bam " &
done
联合多人的代码之后,成功!
ls |grep .fq.gz|cut -d "_" -f 1|
while read id; do nohup sh -c " hisat2 -p 2
-x ~/pipeline/download/rnaseq/00ref/index_file/grch38/genome
-1 ${id}_1_val_1.fq.gz -2 ${id}_1_val_1.fq.gz
2>${id%%_*}.log |
samtools sort -@ 2 -o ${id%%_*}.bam " & done
RNA-seq(5):序列比对:Hisat2 - 简书 别人的代码
#启动miniconda3环境(hisat2所在的环境)
$ source ~/miniconda3/bin/activate
#进入data目录
$cd /mnt/f/rna_seq/aligned
(base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq/aligned$
# 小鼠和人是分开各自比对自己的index
# 人的比对
$ for ((i=56;i<=58;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/hg19/genome -1 /mnt/f/rna_seq/data/SRR35899${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR35899${i}.sra_2.fastq.gz -S SRR35899${i}.sam ;done
# 小鼠比对
$ for ((i=59;i<=62;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/mm10/genome -1 /mnt/f/rna_seq/data/SRR35899${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR35899${i}.sra_2.fastq.gz -S SRR35899${i}.sam; done
#结果一共得到7个sam文件
genome指的是这里的genome.x.ht2的basename
4.定量 featurecounts
gtf=~/pipeline/download/rnaseq/00ref/genom_file/gencode.v41.annotation.gtf
nohup featureCounts -T 5 -p -t exon -g gene_id -a $gtf -o all.id.txt *bam 1>counts.id.log 2>&1 &