第四次考核 Jimmy 学徒考核 Linux安装软件 rnaseq上游分析-2 ascp kingfisher数据下载ena Linux高速下载 Linux下载网页内容 rna-seq上游Linux

生信小博士

已于 2024-06-24 10:02:20 修改

阅读量1.6k

点赞数 4

分类专栏：数据处理 dataframe tibble 数据清洗纸上得来终觉浅文章标签： linux 运维服务器

于 2022-12-04 23:20:31 首次发布

本文链接：https://blog.csdn.net/qq_52813185/article/details/128177352

版权

纸上得来终觉浅同时被 2 个专栏收录

117 篇文章 9 订阅

订阅专栏

数据处理 dataframe tibble 数据清洗

68 篇文章 27 订阅

订阅专栏

第四次考核 Jimmy 学徒考核 Linux安装软件 rnaseq上游分析_YoungLeelight的博客-CSDN博客添加命令 export PATH=$PATH:/path/to/bwa-0.7.12/bin # Add bwa to your PATH by editing ~/.bashrc file (or .bash_profile or .profile file) 在全新服务器配置转录组测序数据处理环境 1.2 配置rna转录组环境1.3 下载并且整理数据库文件数据下载也是老规矩，利用prefetch下载SRA数据，准备进行转录组数据分析。在GSE数据集对应网页，点击SRA_Rhttps://blog.csdn.net/qq_52813185/article/details/128172862?csdn_share_tail=%7B%22type%22%3A%22blog%22%2C%22rType%22%3A%22article%22%2C%22rId%22%3A%22128172862%22%2C%22source%22%3A%22qq_52813185%22%7D

01-rna-seq从头开始卖萌哥 Linux生信技能树Linux安装软件 Linux实战RNASEQ上游_YoungLeelight的博客-CSDN博客

1.首先构建项目所需目录，这样比较哦清晰。共分为四个目录，

2.原始数据上传测序数据
或者自己通过kingfisher下载sra原始数据或者 fastq.gz数据

下载公共测序数据的另一种姿势（kingfisher） - 简书

生物信息常见文件的格式以及查看方式 | Public Library of Bioinformatics

高速快速下载基因组ref文件

查看文件下载是否完整

md5值检查文件完整性，因为原始文件太大，需要检查数据完成性 md5值给每个文件一个独特的id，根据id是否相等来检查文件完整性cd 01raw_data/

 md5sum *gz>md5.txt     #给每个 gz文件都生成md5值，，并且输入到 md5.txt 
     ls
 cat md5.txt          #查看内容

 md5sum -c md5.txt   #比较文件自带的md5值和自己生成的md5值是否相等。若相等则文件相等。  -c参数，check一下
md5sum --help

查看fastq文件内容

# zcat查看gzip压缩的文件

# head -n 8 显示前8行文件内容（前8行代表2条序列）


zcat filename.fq.gz | head -n 8

3 查看原始数据的质量情况质控

conda activate rna
#当前目录下
fastqc S*gz
ls -lh



multiqc ./

fastq.gz为原始的测序数据

fastqc.zip 为fastqc质控时候产生的数据

似乎质量不行啊

通过前面这些步骤，已经初步判定数据是否合格。如果不合格，那如何修改？或者把不合格的数据扔掉？trimmgalore质控

4 trim_galore去接头(并行处理) - 简书

trimmgalore批量质控双端数据

2 分离_1和_2文件

(wes) pc@lab-pc:/project/raw_fq$ ls|grep _1.fastq.gz>gz1
(wes) pc@lab-pc:/project/raw_fq$ ls|grep _2.fastq.gz>gz2
(wes) pc@lab-pc:/project/raw_fq$ paste gz1 gz2>config
(wes) pc@lab-pc:/project/raw_fq$ cat config|head

然后写一个脚本用于 trimmgalore批量质控双端数据同时处理两个参数

$一定不要忘记加上

(rna) tree119 21:48:01 ~/pipeline/download/rnaseq/01_rawdata
$ cat trim.sh 
outdir=~/pipeline/download/rnaseq/02cleandata/


cat config |while read id
do
	arr=${id}
	fq1=${arr[0]}
	fq2=${arr[1]}

	nohup trim_galore  -q 25 --phred33 --length 36 --stringency 3 --paired  -o $outdir  $fq1 $fq2 & 
done

运行即可

bash trim.sh

结果输入到了outdir文件夹下（这里指的是02cleandata文件夹）

得到的质控过滤后的文件：

$ ls -lh  *fq.gz|cut -d" " -f 5-

选取运行过程中其中一对结果展示如下

RUN STATISTICS FOR INPUT FILE: SRR11618782_1.fastq.gz
=============================================
343183 sequences processed in total
The length threshold of paired-end sequences gets evaluated later on (in the validation step)

Writing report to '/home/st8/ssd2/tree119/pipeline/download/rnaseq/01_rawdata/outdir/SRR11618782_2.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
==========================
Input filename: SRR11618782_2.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 1.18
Number of cores used for trimming: 1
Quality Phred score cutoff: 25
Quality encoding type selected: ASCII+33
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 3 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 36 bp
Output file(s) will be GZIP compressed

Cutadapt seems to be reasonably up-to-date. Setting -j -j 1
Writing final adapter and quality trimmed output to SRR11618782_2_trimmed.fq.gz


  >>> Now performing quality (cutoff '-q 25') and adapter trimming in a single pass for the adapter sequence: 'CTGTCTCTTATA' from file SRR11618782_2.fastq.gz <<< 
This is cutadapt 1.18 with Python 3.7.15
Command line parameters: -j 1 -e 0.1 -q 25 -O 3 -a CTGTCTCTTATA SRR11618610_2.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 2.45 s (18 us/read; 3.36 M reads/minute).

=== Summary ===

Total reads processed:                 137,124
Reads with adapters:                     4,643 (3.4%)
Reads written (passing filters):       137,124 (100.0%)

Total basepairs processed:    13,712,400 bp
Quality-trimmed:                 305,580 bp (2.2%)
Total written (filtered):     13,368,511 bp (97.5%)

=== Adapter 1 ===

Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 4643 times.

No. of allowed errors:
0-9 bp: 0; 10-12 bp: 1

Bases preceding removed adapters:
  A: 18.6%
  C: 33.4%
  G: 24.8%
  T: 23.1%
  none/other: 0.0%

Overview of removed sequences
length	count	expect	max.err	error counts
3	3150	2142.6	0	3150
4	577	535.6	0	577
5	174	133.9	0	174
6	39	33.5	0	39
7	12	8.4	0	12
8	15	2.1	0	15
9	15	0.5	0	13 2
10	29	0.1	1	13 16
11	3	0.0	1	2 1
12	10	0.0	1	8 2
13	11	0.0	1	11
14	13	0.0	1	11 2
15	16	0.0	1	12 4
16	15	0.0	1	10 5
17	8	0.0	1	6 2
18	6	0.0	1	5 1
19	12	0.0	1	11 1
20	10	0.0	1	6 4
21	12	0.0	1	11 1
22	9	0.0	1	8 1
23	16	0.0	1	13 3
24	8	0.0	1	6 2
25	10	0.0	1	9 1
26	7	0.0	1	6 1
27	16	0.0	1	14 2
28	5	0.0	1	4 1
29	11	0.0	1	9 2
30	26	0.0	1	19 7
31	32	0.0	1	29 3
32	10	0.0	1	9 1
33	22	0.0	1	13 9
34	11	0.0	1	11
35	17	0.0	1	9 8
36	7	0.0	1	3 4
37	3	0.0	1	3
38	20	0.0	1	19 1
39	14	0.0	1	13 1
40	7	0.0	1	4 3
41	22	0.0	1	20 2
42	39	0.0	1	37 2
43	5	0.0	1	4 1
44	15	0.0	1	15
45	7	0.0	1	5 2
46	12	0.0	1	9 3
47	1	0.0	1	0 1
48	8	0.0	1	7 1
49	6	0.0	1	6
50	2	0.0	1	1 1
51	3	0.0	1	3
52	3	0.0	1	0 3
53	5	0.0	1	3 2
54	2	0.0	1	0 2
55	1	0.0	1	1
57	8	0.0	1	5 3
58	8	0.0	1	7 1
59	1	0.0	1	1
60	11	0.0	1	8 3
61	26	0.0	1	24 2
62	5	0.0	1	4 1
63	6	0.0	1	3 3
64	8	0.0	1	7 1
65	1	0.0	1	0 1
67	3	0.0	1	1 2
68	3	0.0	1	3
69	1	0.0	1	0 1
70	1	0.0	1	0 1
72	3	0.0	1	0 3
73	1	0.0	1	0 1
74	3	0.0	1	0 3
77	14	0.0	1	2 12
78	1	0.0	1	1
79	5	0.0	1	0 5
81	5	0.0	1	0 5
84	6	0.0	1	1 5
86	2	0.0	1	0 2
88	2	0.0	1	0 2
90	3	0.0	1	0 3
91	2	0.0	1	1 1
92	2	0.0	1	1 1
93	1	0.0	1	0 1
95	1	0.0	1	0 1
98	1	0.0	1	0 1


RUN STATISTICS FOR INPUT FILE: SRR11618610_2.fastq.gz
=============================================
137124 sequences processed in total
The length threshold of paired-end sequences gets evaluated later on (in the validation step)

Validate paired-end files SRR11618610_1_trimmed.fq.gz and SRR11618610_2_trimmed.fq.gz
file_1: SRR11618610_1_trimmed.fq.gz, file_2: SRR11618610_2_trimmed.fq.gz


>>>>> Now validing the length of the 2 paired-end infiles: SRR11618610_1_trimmed.fq.gz and SRR11618610_2_trimmed.fq.gz <<<<<
Writing validated paired-end Read 1 reads to SRR11618610_1_val_1.fq.gz
Writing validated paired-end Read 2 reads to SRR11618610_2_val_2.fq.gz

Total number of sequences analysed: 137124

Number of sequence pairs removed because at least one read was shorter than the length cutoff (36 bp): 2597 (1.89%)

Deleting both intermediate output files SRR11618610_1_trimmed.fq.gz and SRR11618610_2_trimmed.fq.gz

====================================================================================================

This is cutadapt 1.18 with Python 3.7.15
Command line parameters: -j 1 -e 0.1 -q 25 -O 3 -a CTGTCTCTTATA SRR11618782_2.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 4.63 s (13 us/read; 4.44 M reads/minute).

=== Summary ===

Total reads processed:                 343,183
Reads with adapters:                    31,276 (9.1%)
Reads written (passing filters):       343,183 (100.0%)

Total basepairs processed:    34,318,300 bp
Quality-trimmed:               2,815,627 bp (8.2%)
Total written (filtered):     30,713,243 bp (89.5%)

=== Adapter 1 ===

Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 31276 times.

No. of allowed errors:
0-9 bp: 0; 10-12 bp: 1

Bases preceding removed adapters:
  A: 17.2%
  C: 45.5%
  G: 19.7%
  T: 17.6%
  none/other: 0.0%

Overview of removed sequences
length	count	expect	max.err	error counts
3	6021	5362.2	0	6021
4	1325	1340.6	0	1325
5	542	335.1	0	542
6	372	83.8	0	372
7	358	20.9	0	358
8	348	5.2	0	348
9	291	1.3	0	283 8
10	386	0.3	1	281 105
11	384	0.1	1	305 79
12	490	0.0	1	387 103
13	388	0.0	1	331 57
14	512	0.0	1	406 106
15	371	0.0	1	323 48
16	392	0.0	1	307 85
17	449	0.0	1	370 79
18	354	0.0	1	272 82
19	517	0.0	1	422 95
20	440	0.0	1	371 69
21	380	0.0	1	294 86
22	310	0.0	1	249 61
23	220	0.0	1	199 21
24	291	0.0	1	255 36
25	604	0.0	1	510 94
26	370	0.0	1	306 64
27	489	0.0	1	411 78
28	314	0.0	1	265 49
29	600	0.0	1	491 109
30	432	0.0	1	382 50
31	1108	0.0	1	954 154
32	361	0.0	1	306 55
33	501	0.0	1	418 83
34	509	0.0	1	421 88
35	548	0.0	1	472 76
36	411	0.0	1	346 65
37	665	0.0	1	507 158
38	1168	0.0	1	1053 115
39	2363	0.0	1	2233 130
40	276	0.0	1	235 41
41	537	0.0	1	506 31
42	278	0.0	1	249 29
43	452	0.0	1	428 24
44	368	0.0	1	344 24
45	60	0.0	1	53 7
46	74	0.0	1	65 9
47	129	0.0	1	122 7
48	225	0.0	1	200 25
49	366	0.0	1	346 20
50	163	0.0	1	146 17
51	43	0.0	1	40 3
52	26	0.0	1	20 6
53	79	0.0	1	63 16
54	14	0.0	1	12 2
55	80	0.0	1	61 19
56	35	0.0	1	22 13
57	152	0.0	1	146 6
58	184	0.0	1	179 5
59	121	0.0	1	114 7
60	242	0.0	1	227 15
61	796	0.0	1	756 40
62	146	0.0	1	134 12
63	64	0.0	1	62 2
64	146	0.0	1	134 12
65	66	0.0	1	59 7
66	20	0.0	1	18 2
67	104	0.0	1	85 19
68	65	0.0	1	58 7
69	16	0.0	1	11 5
70	11	0.0	1	10 1
71	11	0.0	1	9 2
72	19	0.0	1	13 6
73	17	0.0	1	14 3
74	40	0.0	1	27 13
75	21	0.0	1	14 7
76	9	0.0	1	8 1
77	18	0.0	1	15 3
78	13	0.0	1	10 3
79	16	0.0	1	10 6
80	11	0.0	1	9 2
81	9	0.0	1	3 6
82	14	0.0	1	14
83	13	0.0	1	12 1
84	17	0.0	1	15 2
85	33	0.0	1	30 3
86	17	0.0	1	16 1
87	20	0.0	1	20
88	5	0.0	1	5
89	13	0.0	1	13
90	6	0.0	1	5 1
91	7	0.0	1	6 1
92	13	0.0	1	2 11
95	1	0.0	1	0 1
96	1	0.0	1	0 1
97	7	0.0	1	1 6
100	3	0.0	1	0 3

5.比对 align索引文件和质控好的测序数据进行比对

下载参考基因组下载已经建立好索引的基因组

在全新服务器配置转录组测序数据处理环境

查看下载的基因组是否正确查看下载内容时候有误

md5sum检查完整性查看gtf文件是否完整

案例一

md5sum ./* >md5sum.txt

md5sum -c md5sum.txt

结果如下，说明文件是完整的

$ md5sum -c md5sum.txt
./CM000663.1: OK
./CM000664.1: OK
./CM000665.1: OK
./CM000666.1: OK
./CM000667.1: OK
./CM000668.1: OK
./CM000669.1: OK
./CM000670.1: OK
./CM000671.1: OK
./CM000672.1: OK
./CM000673.1: OK
./CM000674.1: OK
./CM000675.1: OK
./CM000676.1: OK
./CM000677.1: OK
./CM000678.1: OK
./CM000679.1: OK
./CM000680.1: OK
./CM000681.1: OK
./CM000682.1: OK
./CM000683.1: OK
./CM000684.1: OK
./CM000685.1: OK
./CM000686.1: OK
./GL000191.1: OK
./GL000192.1: OK
./GL000193.1: OK
./GL000194.1: OK
./GL000195.1: OK
./GL000196.1: OK
./GL000197.1: OK
./GL000198.1: OK
./GL000199.1: OK
./GL000200.1: OK
./GL000201.1: OK
./GL000202.1: OK
./GL000203.1: OK
./GL000204.1: OK
./GL000205.1: OK
./GL000206.1: OK
./GL000207.1: OK
./GL000208.1: OK
./GL000209.1: OK
./GL000210.1: OK
./GL000211.1: OK
./GL000212.1: OK
./GL000213.1: OK
./GL000214.1: OK
./GL000215.1: OK
./GL000216.1: OK
./GL000217.1: OK
./GL000218.1: OK
./GL000219.1: OK
./GL000220.1: OK
./GL000221.1: OK
./GL000222.1: OK
./GL000223.1: OK
./GL000224.1: OK
./GL000225.1: OK
./GL000226.1: OK
./GL000227.1: OK
./GL000228.1: OK
./GL000229.1: OK
./GL000230.1: OK
./GL000231.1: OK
./GL000232.1: OK
./GL000233.1: OK
./GL000234.1: OK
./GL000235.1: OK
./GL000236.1: OK
./GL000237.1: OK
./GL000238.1: OK
./GL000239.1: OK
./GL000240.1: OK
./GL000241.1: OK
./GL000242.1: OK
./GL000243.1: OK
./GL000244.1: OK
./GL000245.1: OK
./GL000246.1: OK
./GL000247.1: OK
./GL000248.1: OK
./GL000249.1: OK
./NC_012920.1: OK
./SRR11832836.sra: OK

 md5sum *.gz >md5.txt
cat md5.txt
md5sum -c md5.txt

下载索引文件

Hisat2官网上人类基因组索引的下载 - 简书

wget -c

在全新服务器配置转录组测序数据处理环境

对基因组文件解压缩去掉gz后缀

ls *gz |xargs gunzip

RNA-seq(5):序列比对：Hisat2 - 简书

对索引文件解压缩

$ cd reference/index
$ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz
$ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/mm10.tar.gz
# 解压得到两个目录，hg19和mm10
$ tar -zxvf *.tar.gz

得到8个索引

基因组注释文件(GFF,GTF)下载的五种方法_白墨石的博客-CSDN博客_基因组注释文件

hg38的gtf文件下载

genecode中参考基因组的名字更新很快！

新建脚本来运行 制备文件准备工作

489  ls |grep _1.fq.gz
  490  ls |grep _1.fq.gz >gz1
  491  ls |grep _2.fq.gz >gz2
  492  paste gz1 gz2 >file_for_align
  493  cat file_for_align

新建脚本来运行失败

vim align.sh


cat file_for_align |while read id
do
	 arr=${id}
	 fq1=${arr[0]}
	 fq2=${arr[1]}
	 

	 nohup sh -c " hisat2 -p 2 -x ~/pipeline/download/rnaseq/00ref/index_file/grch38/genome  -1 ${fq1} -2 ${fq2}  2>${id%%_*}.log  | samtools sort -@ 2 -o ${id%%_*}.bam  " & 

done

联合多人的代码之后，成功！

ls |grep  .fq.gz|cut -d "_" -f 1|
while read id; do nohup sh -c " hisat2 -p 2 
-x ~/pipeline/download/rnaseq/00ref/index_file/grch38/genome  
-1 ${id}_1_val_1.fq.gz  -2 ${id}_1_val_1.fq.gz 
 2>${id%%_*}.log  | 
samtools sort -@ 2 -o ${id%%_*}.bam  " & done

RNA-seq(5):序列比对：Hisat2 - 简书别人的代码

#启动miniconda3环境（hisat2所在的环境）
$ source ~/miniconda3/bin/activate
#进入data目录
$cd /mnt/f/rna_seq/aligned
(base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq/aligned$

# 小鼠和人是分开各自比对自己的index
# 人的比对
$ for ((i=56;i<=58;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/hg19/genome -1 /mnt/f/rna_seq/data/SRR35899${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR35899${i}.sra_2.fastq.gz -S SRR35899${i}.sam ;done
# 小鼠比对
$ for ((i=59;i<=62;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/mm10/genome -1 /mnt/f/rna_seq/data/SRR35899${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR35899${i}.sra_2.fastq.gz -S SRR35899${i}.sam; done
#结果一共得到7个sam文件

genome指的是这里的genome.x.ht2的basename

4.定量 featurecounts

featurecounts的使用说明 - 简书

gtf=~/pipeline/download/rnaseq/00ref/genom_file/gencode.v41.annotation.gtf

nohup featureCounts -T 5 -p -t exon -g gene_id  -a $gtf -o  all.id.txt  *bam  1>counts.id.log 2>&1 &