A. 开发环境及数据准备工作
A1. 开发环境搭建
安装miniconda3,然后在其中创建WGS的开发环境:
# 1. Install miniconda3.
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh .
bash Miniconda3-latest-Linux-x86_64.sh -b -p /home/xuquan/conda/miniconda3
echo "export PATH=/home/xuquan/conda/miniconda3/bin:$PATH" >> ~/.bashrc
source ~/.bashrc
# 2. Config.
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
conda config --set show_channel_urls yes
# 3. Create QxWgsEnv!
conda create -n wgs python=2
source activate wgs
至此,已经创建了WGS数据分析流程的conda开发环境
A2. 准备测试数据
下载参考 E.coli K12 参考基因组:
mkdir /home/xuquan/conda/db
cd /home/xuquan/conda/db
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
gzip -dc GCF_000005845.2_ASM584v2_genomic.fna.gz > E.coli_K12_MG1655.fa
less E.coli_K12_MG1655.fa # E.coli has only one chromesome!
下载测序数据:
(wgs) [xuquan@zhswlcserver:conda]$mkdir rawdata
(wgs) [xuquan@zhswlcserver:conda]$cd rawdata/
(wgs) [xuquan@zhswlcserver:rawdata]$ll
总用量 0
(wgs) [xuquan@zhswlcserver:rawdata]$wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR177/SRR1770413/SRR1770413.sra
--2019-02-14 09:00:47-- ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR177/SRR1770413/SRR1770413.sra
=> ‘SRR1770413.sra’
Resolving ftp-trace.ncbi.nlm.nih.gov... 130.14.250.11, 2607:f220:41e:250::12
Connecting to ftp-trace.ncbi.nlm.nih.gov|130.14.250.11|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /sra/sra-instant/reads/ByRun/sra/SRR/SRR177/SRR1770413 ... done.
==> SIZE SRR1770413.sra ... 199084896
==> PASV ... done. ==> RETR SRR1770413.sra ... done.
Length: 199084896 (190M) (unauthoritative)
SRR1770413.sra 100%[========================================================================================>] 189.86M 1.57MB/s in 76s
2019-02-14 09:02:07 (2.51 MB/s) - ‘SRR1770413.sra’ saved [199084896]
(wgs) [xuquan@zhswlcserver:rawdata]$ll
总用量 194420
-rw-rw-r-- 1 xuquan xuquan 199084896 2月 14 09:02 SRR1770413.sra
A3. 安装samtools
(wgs) [xuquan@zhswlcserver:db]$conda install samtools
遗憾的是,在安装了samtools,并尝试使用其时发现,报错了:
samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory
意思很明显,libcrypto.so.1.0.0 这个包找不到了!用ldd命令看看samtools依赖了哪些包:
(wgs) [xuquan@zhswlcserver:db]$which samtools
~/conda/miniconda3/envs/wgs/bin/samtools
(wgs) [xuquan@zhswlcserver:db]$ldd ~/conda/miniconda3/envs/wgs/bin/samtools
linux-vdso.so.1 => (0x00007ffc4fbf9000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5c59071000)
libz.so.1 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/libz.so.1 (0x00007f5c59348000)
libm.so.6 => /lib64/libm.so.6 (0x00007f5c58d6f000)
libbz2.so.1.0 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/libbz2.so.1.0 (0x00007f5c59334000)
liblzma.so.5 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/liblzma.so.5 (0x00007f5c5930a000)
libdeflate.so => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/libdeflate.so (0x00007f5c592fa000)
libcurl.so.4 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/libcurl.so.4 (0x00007f5c58cf6000)
libcrypto.so.1.0.0 => not found
libncursesw.so.6 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/libncursesw.so.6 (0x00007f5c592bd000)
libtinfow.so.6 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/libtinfow.so.6 (0x00007f5c58cb9000)
libc.so.6 => /lib64/libc.so.6 (0x00007f5c588ec000)
/lib64/ld-linux-x86-64.so.2 (0x00007f5c5928d000)
librt.so.1 => /lib64/librt.so.1 (0x00007f5c586e4000)
libssl.so.1.1 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/./libssl.so.1.1 (0x00007f5c5864f000)
libcrypto.so.1.1 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/./libcrypto.so.1.1 (0x00007f5c58358000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f5c58154000)
果然,libcrypto.so.1.0.0 包是 not found!
于是,尝试在系统中搜索该包:
(wgs) [xuquan@zhswlcserver:conda]$find / -name libcrypto.so.1*
find: ‘/sys/kernel/debug’: 权限不够
find: ‘/database/backup/root/DBI-1.633’: 权限不够
find: ‘/database/backup/root/VirtualBox VMs/qiime/Snapshots’: 权限不够
find: ‘/database/backup/root/VirtualBox VMs/qiime/Logs’: 权限不够
find: ‘/usr/libexec/initscripts/legacy-actions/auditd’: 权限不够
find: ‘/usr/share/rhel/secrets’: 权限不够
find: ‘/usr/share/Pegasus/scripts’: 权限不够
find: ‘/usr/share/polkit-1/rules.d’: 权限不够
/usr/lib64/libcrypto.so.1.0.2k
find: ‘/usr/lib64/Pegasus’: 权限不够
...
发现了 /usr/lib64/libcrypto.so.1.0.2k 这个包,接着使用软连接的形式连接到conda环境中去:
(wgs) [xuquan@zhswlcserver:conda]$ln -s /usr/lib64/libcrypto.so.1.0.2k /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/libcrypto.so.1.0.0
(wgs) [xuquan@zhswlcserver:conda]$samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 1.9 (using htslib 1.9)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
samtools 已经运行成功了,说明问题已经得到解决了!
接着使用samtools对下载的参考基因组序列建一个索引,为方便其他数据分析工具(比如GATK)能够快速地获取fasta上的任何序列做准备。
(wgs) [xuquan@zhswlcserver:db]$cd ~/conda/db/
(wgs) [xuquan@zhswlcserver:db]$ll
总用量 5940
-rw-rw-r-- 1 xuquan xuquan 4699745 2月 13 17:35 E.coli_K12_MG1655.fa
-rw-rw-r-- 1 xuquan xuquan 1379902 2月 13 17:33 GCF_000005845.2_ASM584v2_genomic.fna.gz
(wgs) [xuquan@zhswlcserver:db]$samtools faidx E.coli_K12_MG1655.fa
(wgs) [xuquan@zhswlcserver:db]$ll
总用量 5944
-rw-rw-r-- 1 xuquan xuquan 4699745 2月 13 17:35 E.coli_K12_MG1655.fa
-rw-rw-r-- 1 xuquan xuquan 29 2月 14 08:50 E.coli_K12_MG1655.fa.fai
-rw-rw-r-- 1 xuquan xuquan 1379902 2月 13 17:33 GCF_000005845.2_ASM584v2_genomic.fna.gz
除了方便其他工具之外,我们可以通过这样的索引来获取fasta文件中任意位置的序列或者任意完整的染色体序列。可以很方便地完成对参考序列(或者任意fasta文件)特定区域序列的提取。举个例子:
(wgs) [xuquan@zhswlcserver:db]$samtools faidx E.coli_K12_MG1655.fa NC_000913.3:1000000-1000200
我们就获得了E.coli K12参考序列上的这一段序列:
>NC_000913.3:1000000-1000200
GTGTCAGCTTTCGTGGTGTGCAGCTGGCGTCAGATGACAACATGCTGCCAGACAGCCTGA
AAGGGTTTGCGCCTGTGGTGCGTGGTATCGCCAAAAGCAATGCCCAGATAACGATTAAGC
AAAATGGTTACACCATTTACCAAACTTATGTATCGCCTGGTGCTTTTGAAATTAGTGATC
TCTATTCCACGTCGTCGAGCG
A4. 安装 sratoolkit
从该网站中找到对应的版本进行下载:
(wgs) [xuquan@zhswlcserver:conda]$mkdir softwares
(wgs) [xuquan@zhswlcserver:conda]$cd softwares/
(wgs) [xuquan@zhswlcserver:softwares]$wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.4/sratoolkit.2.9.4-centos_linux64.tar.gz
--2019-02-14 09:07:48-- http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.4/sratoolkit.2.9.4-centos_linux64.tar.gz
Resolving ftp-trace.ncbi.nlm.nih.gov... 130.14.250.13, 2607:f220:41e:250::11
Connecting to ftp-trace.ncbi.nlm.nih.gov|130.14.250.13|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.4/sratoolkit.2.9.4-centos_linux64.tar.gz [following]
--2019-02-14 09:07:49-- https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.4/sratoolkit.2.9.4-centos_linux64.tar.gz
Connecting to ftp-trace.ncbi.nlm.nih.gov|130.14.250.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 107119239 (102M) [application/x-gzip]
Saving to: ‘sratoolkit.2.9.4-centos_linux64.tar.gz’
sratoolkit.2.9.4-centos_linux64.tar.gz 100%[========================================================================================>] 102.16M 1.72MB/s in 44s
2019-02-14 09:08:35 (2.30 MB/s) - ‘sratoolkit.2.9.4-centos_linux64.tar.gz’ saved [107119239/107119239]
(wgs) [xuquan@zhswlcserver:softwares]$ll
总用量 104612
-rw-rw-r-- 1 xuquan xuquan 107119239 2月 6 01:12 sratoolkit.2.9.4-centos_linux64.tar.gz
(wgs) [xuquan@zhswlcserver:softwares]$tar zxvf sratoolkit.2.9.4-centos_linux64.tar.gz
(wgs) [xuquan@zhswlcserver:softwares]$ll
总用量 104612
drwxrwxr-x 5 xuquan xuquan 149 2月 6 00:33 sratoolkit.2.9.4-centos_linux64
-rw-rw-r-- 1 xuquan xuquan 107119239 2月 6 01:12 sratoolkit.2.9.4-centos_linux64.tar.gz
(wgs) [xuquan@zhswlcserver:bin]$echo "export PATH=/home/xuquan/conda/softwares/sratoolkit.2.9.4-centos_linux64/bin:$PATH" >> ~/.bashrc
(wgs) [xuquan@zhswlcserver:bin]$source ~/.bashrc
至此,sratoolkit 已经安装并配置好了,接下来对下载的 SRA 数据转化为 fastq 文件:
(wgs) [xuquan@zhswlcserver:rawdata]$cd ~/conda/rawdata
(wgs) [xuquan@zhswlcserver:rawdata]$fastq-dump --split-files SRR1770413.sra
Read 643253 spots for SRR1770413.sra
Written 643253 spots for SRR1770413.sra
(wgs) [xuquan@zhswlcserver:rawdata]$ll
总用量 1045364
-rw-rw-r-- 1 xuquan xuquan 435681114 2月 14 09:15 SRR1770413_1.fastq
-rw-rw-r-- 1 xuquan xuquan 435681114 2月 14 09:15 SRR1770413_2.fastq
-rw-rw-r-- 1 xuquan xuquan 199084896 2月 14 09:02 SRR1770413.sra
至此,得到这个E.coli K12数据的 read1 和 read2 了!
实际上,fastq-dump 除了可以将已经下载到本地的SRA格式的数据转化为fastq格式的数据外,还可以直接从NCBI数据库下载并转化数据:
fastq-dump --split-files SRR1770413
A5. 安装HTSlib的相关工具
安装该软件的目的主要是为了利用 bgzip,htsfile,tabix等工具:
(wgs) [xuquan@zhswlcserver:softwares]$wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2
(wgs) [xuquan@zhswlcserver:softwares]$tar jxvf htslib-1.9.tar.bz2
(wgs) [xuquan@zhswlcserver:softwares]$mkdir htslib-1.9-install
(wgs) [xuquan@zhswlcserver:htslib-1.9-install]$cd htslib-1.9
(wgs) [xuquan@zhswlcserver:htslib-1.9]$./configure --prefix=/home/xuquan/conda/softwares/htslib-1.9-install
(wgs) [xuquan@zhswlcserver:htslib-1.9]$make
(wgs) [xuquan@zhswlcserver:htslib-1.9]$make install
(wgs) [xuquan@zhswlcserver:htslib-1.9]$cd ../htslib-1.9-install/
(wgs) [xuquan@zhswlcserver:htslib-1.9-install]$ll
总用量 0
drwxr-xr-x 2 xuquan xuquan 60 2月 14 09:26 bin
drwxr-xr-x 3 xuquan xuquan 27 2月 14 09:26 include
drwxr-xr-x 3 xuquan xuquan 115 2月 14 09:26 lib
drwxrwxr-x 3 xuquan xuquan 24 2月 14 09:26 share
(wgs) [xuquan@zhswlcserver:htslib-1.9-install]$cd bin/
(wgs) [xuquan@zhswlcserver:bin]$ll
总用量 7924
-rwxr-xr-x 1 xuquan xuquan 2398296 2月 14 09:26 bgzip
-rwxr-xr-x 1 xuquan xuquan 2820880 2月 14 09:26 htsfile
-rwxr-xr-x 1 xuquan xuquan 2888072 2月 14 09:26 tabix
(wgs) [xuquan@zhswlcserver:bin]$pwd
/home/xuquan/conda/softwares/htslib-1.9-install/bin
(wgs) [xuquan@zhswlcserver:bin]$echo "export PATH=/home/xuquan/conda/softwares/htslib-1.9-install/bin:$PATH" >> ~/.bashrc
(wgs) [xuquan@zhswlcserver:bin]$source ~/.bashrc
至此,htslib-1.9 已经安装并配置完成了,可以直接通过 bgzip 等命令使用相应的工具了:
(wgs) [xuquan@zhswlcserver:rawdata]$cd /home/xuquan/conda/rawdata
(wgs) [xuquan@zhswlcserver:rawdata]$ll
总用量 1045364
-rw-rw-r-- 1 xuquan xuquan 435681114 2月 14 09:15 SRR1770413_1.fastq
-rw-rw-r-- 1 xuquan xuquan 435681114 2月 14 09:15 SRR1770413_2.fastq
-rw-rw-r-- 1 xuquan xuquan 199084896 2月 14 09:02 SRR1770413.sra
(wgs) [xuquan@zhswlcserver:rawdata]$bgzip -f SRR1770413_1.fastq
(wgs) [xuquan@zhswlcserver:rawdata]$bgzip -f SRR1770413_2.fastq
(wgs) [xuquan@zhswlcserver:rawdata]$ll
总用量 453048
-rw-rw-r-- 1 xuquan xuquan 125123648 2月 14 09:34 SRR1770413_1.fastq.gz
-rw-rw-r-- 1 xuquan xuquan 139709727 2月 14 09:36 SRR1770413_2.fastq.gz
-rw-rw-r-- 1 xuquan xuquan 199084896 2月 14 09:02 SRR1770413.sra
至此,利用 bgzip 完成了对两个fastq文件的压缩:
下载完成后,我们最好用bgzip(不推荐gzip)将其压缩为.gz文件,这样可以节省空间,而且也不会对接下来的数据分析产生影响。
B. 数据质控
暂时跳过,后面再补
C. 序列比对
所谓比对,就是把测序数据定位到参考基因组上,确定每一个read在基因组中的位置。这里,我们用目前使用最广泛的BWA来完成这个工作。所以,首先,安装BWA!
C1. 安装 BWA
(wgs) [xuquan@zhswlcserver:conda]$conda install bwa
(wgs) [xuquan@zhswlcserver:conda]$which bwa
~/conda/miniconda3/envs/wgs/bin/bwa
(wgs) [xuquan@zhswlcserver:conda]$bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries
shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome with `bwa index'.
There are three alignment algorithms in BWA: `mem', `bwasw', and
`aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
first. Please `man ./bwa.1' for the manual.
至此,BWA已经安装成功了,接下来可以使用了!
C2. 构建比对索引
在正式比对之前,需要先为参考序列构建BWA比对所需的FM-index(比对索引):
(wgs) [xuquan@zhswlcserver:conda]$cd ~/conda/db/
(wgs) [xuquan@zhswlcserver:db]$ll
总用量 5944
-rw-rw-r-- 1 xuquan xuquan 4699745 2月 13 17:35 E.coli_K12_MG1655.fa
-rw-rw-r-- 1 xuquan xuquan 29 2月 14 08:50 E.coli_K12_MG1655.fa.fai
-rw-rw-r-- 1 xuquan xuquan 1379902 2月 13 17:33 GCF_000005845.2_ASM584v2_genomic.fna.gz
(wgs) [xuquan@zhswlcserver:db]$bwa index E.coli_K12_MG1655.fa
[bwa_index] Pack FASTA... 0.04 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 1.81 seconds elapse.
[bwa_index] Update BWT... 0.03 sec
[bwa_index] Pack forward-only FASTA... 0.04 sec
[bwa_index] Construct SA from BWT and Occ... 0.68 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index E.coli_K12_MG1655.fa
[main] Real time: 2.652 sec; CPU: 2.613 sec
(wgs) [xuquan@zhswlcserver:db]$ll
总用量 13892
-rw-rw-r-- 1 xuquan xuquan 4699745 2月 13 17:35 E.coli_K12_MG1655.fa
-rw-rw-r-- 1 xuquan xuquan 12 2月 14 09:50 E.coli_K12_MG1655.fa.amb
-rw-rw-r-- 1 xuquan xuquan 98 2月 14 09:50 E.coli_K12_MG1655.fa.ann
-rw-rw-r-- 1 xuquan xuquan 4641732 2月 14 09:50 E.coli_K12_MG1655.fa.bwt
-rw-rw-r-- 1 xuquan xuquan 29 2月 14 08:50 E.coli_K12_MG1655.fa.fai
-rw-rw-r-- 1 xuquan xuquan 1160415 2月 14 09:50 E.coli_K12_MG1655.fa.pac
-rw-rw-r-- 1 xuquan xuquan 2320880 2月 14 09:50 E.coli_K12_MG1655.fa.sa
-rw-rw-r-- 1 xuquan xuquan 1379902 2月 13 17:33 GCF_000005845.2_ASM584v2_genomic.fna.gz
由于这里使用的参考基因组序列较短,只需几秒钟就可以完成这个索引文件的构建(对于人类基因组则需要3个小时的时间)。创建完成之后,将多出5份以E.coli_K12_MG1655.fa为前缀的序列索引文件。
C3. 序列比对
(wgs) [xuquan@zhswlcserver:conda]$cd ~/conda
(wgs) [xuquan@zhswlcserver:conda]$mkdir output
(wgs) [xuquan@zhswlcserver:conda]$bwa mem -t 4 -R '@RG\tID:foo\tPL:illumina\tSM:E.coli_K12' /home/xuquan/conda/db/E.coli_K12_MG1655.fa /home/xuquan/conda/rawdata/SRR1770413_1.fastq.gz /home/xuquan/conda/rawdata/SRR1770413_2.fastq.gz > /home/xuquan/conda/output/SRR1770413.aln.sam
...
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 4 -R @RG\tID:foo\tPL:illumina\tSM:E.coli_K12 /home/xuquan/conda/db/E.coli_K12_MG1655.fa /home/xuquan/conda/rawdata/SRR1770413_1.fastq.gz /home/xuquan/conda/rawdata/SRR1770413_2.fastq.gz
[main] Real time: 78.611 sec; CPU: 312.660 sec
-R 设置Read Group信息,虽然“我”在以前的文章中已经反复强调过它的重要性,但这里还是再说一次,它是read数据的组别标识,并且其中的ID,PL和SM信息在正式的项目中是不能缺少的(如果样本包含多个测序文库的话,LB信息也不要省略),另外由于考虑到与GATK的兼容关系,PL(测序平台)信息不能随意指定,必须是:ILLUMINA,SLX,SOLEXA,SOLID,454,LS454,COMPLETE,PACBIO,IONTORRENT,CAPILLARY,HELICOS或UNKNOWN这12个中的一个。
至此,通过BWA mem完成了序列比对,生成了比对后的sam格式的数据文件:
[xuquan@zhswlcserver:output]$cd ~/conda/output/
[xuquan@zhswlcserver:output]$ll
总用量 892060
-rw-rw-r-- 1 xuquan xuquan 913468760 2月 14 10:02 SRR1770413.aln.sam
[xuquan@zhswlcserver:output]$ll -lh
总用量 872M
-rw-rw-r-- 1 xuquan xuquan 872M 2月 14 10:02 SRR1770413.aln.sam
接着使用samtools转化为BAM格式:
(wgs) [xuquan@zhswlcserver:conda]$samtools view -Sb /home/xuquan/conda/output/SRR1770413.aln.sam > /home/xuquan/conda/output/SRR1770413.aln.bam
(wgs) [xuquan@zhswlcserver:conda]$cd output/
(wgs) [xuquan@zhswlcserver:output]$ll
总用量 1170228
-rw-rw-r-- 1 xuquan xuquan 284843215 2月 14 10:08 SRR1770413.aln.bam
-rw-rw-r-- 1 xuquan xuquan 913468760 2月 14 10:02 SRR1770413.aln.sam
(wgs) [xuquan@zhswlcserver:output]$ll -lh
总用量 1.2G
-rw-rw-r-- 1 xuquan xuquan 272M 2月 14 10:08 SRR1770413.aln.bam
-rw-rw-r-- 1 xuquan xuquan 872M 2月 14 10:02 SRR1770413.aln.sam
实际上,生成的sam格式文件在后面不会被用到,而且由于其文件大小较大,因此可以将两步合并为一步:
(wgs) [xuquan@zhswlcserver:output]$bwa mem -t 4 -R '@RG\tID:foo\tPL:illumina\tSM:E.coli_K12' /home/xuquan/conda/db/E.coli_K12_MG1655.fa /home/xuquan/conda/rawdata/SRR1770413_1.fastq.gz /home/xuquan/conda/rawdata/SRR1770413_2.fastq.gz | samtools view -Sb - > /home/xuquan/conda/output/SRR1770413.aln.bam
...
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 4 -R @RG\tID:foo\tPL:illumina\tSM:E.coli_K12 /home/xuquan/conda/db/E.coli_K12_MG1655.fa /home/xuquan/conda/rawdata/SRR1770413_1.fastq.gz /home/xuquan/conda/rawdata/SRR1770413_2.fastq.gz
[main] Real time: 80.262 sec; CPU: 320.189 sec
(wgs) [xuquan@zhswlcserver:output]$ll
总用量 278168
-rw-rw-r-- 1 xuquan xuquan 284843215 2月 14 10:12 SRR1770413.aln.bam
(wgs) [xuquan@zhswlcserver:output]$ll -lh
总用量 272M
-rw-rw-r-- 1 xuquan xuquan 272M 2月 14 10:12 SRR1770413.aln.bam
这样,最后就只产生一个bam格式的比对结果文件了!
C5. 比对结果排序
使用samtools对原始的比对结果按照参考序列位置从小到大进行排序,设置为使用4个线程进行排序。只有这个步骤完成之后才可以继续往下进行!
(wgs) [xuquan@zhswlcserver:output]$samtools sort -@ 4 -m 4G -O bam -o /home/xuquan/conda/output/SRR1770413.aln.sorted.bam /home/xuquan/conda/output/SRR1770413.aln.bam
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
(wgs) [xuquan@zhswlcserver:output]$ll
总用量 480932
-rw-rw-r-- 1 xuquan xuquan 284843215 2月 14 10:12 SRR1770413.aln.bam
-rw-rw-r-- 1 xuquan xuquan 207629239 2月 14 10:16 SRR1770413.aln.sorted.bam
(wgs) [xuquan@zhswlcserver:output]$ll -lh
总用量 470M
-rw-rw-r-- 1 xuquan xuquan 272M 2月 14 10:12 SRR1770413.aln.bam
-rw-rw-r-- 1 xuquan xuquan 199M 2月 14 10:16 SRR1770413.aln.sorted.bam
前一步比对产生的bam文件这时候也可以删除了:
(wgs) [xuquan@zhswlcserver:output]$rm -rf /home/xuquan/conda/output/SRR1770413.aln.bam
(wgs) [xuquan@zhswlcserver:output]$ll -lh
总用量 199M
-rw-rw-r-- 1 xuquan xuquan 199M 2月 14 10:16 SRR1770413.aln.sorted.bam
C6. 安装GATK
由于下一步需要使用到GATK的部分功能,因此这里安装GATK:
(wgs) [xuquan@zhswlcserver:conda]$cd softwares/
(wgs) [xuquan@zhswlcserver:softwares]$ll
总用量 105772
drwxrwxr-x 7 xuquan xuquan 4096 2月 14 09:26 htslib-1.9
drwxrwxr-x 6 xuquan xuquan 72 2月 14 09:26 htslib-1.9-install
-rw-rw-r-- 1 xuquan xuquan 1178859 7月 18 2018 htslib-1.9.tar.bz2
drwxrwxr-x 5 xuquan xuquan 149 2月 6 00:33 sratoolkit.2.9.4-centos_linux64
-rw-rw-r-- 1 xuquan xuquan 107119239 2月 6 01:12 sratoolkit.2.9.4-centos_linux64.tar.gz
(wgs) [xuquan@zhswlcserver:softwares]$wget https://github.com/broadinstitute/gatk/releases/download/4.1.0.0/gatk-4.1.0.0.zip
(wgs) [xuquan@zhswlcserver:softwares]$ll
总用量 491960
-rw-rw-r-- 1 xuquan xuquan 395455082 1月 30 11:33 gatk-4.1.0.0.zip
drwxrwxr-x 7 xuquan xuquan 4096 2月 14 09:26 htslib-1.9
drwxrwxr-x 6 xuquan xuquan 72 2月 14 09:26 htslib-1.9-install
-rw-rw-r-- 1 xuquan xuquan 1178859 7月 18 2018 htslib-1.9.tar.bz2
drwxrwxr-x 5 xuquan xuquan 149 2月 6 00:33 sratoolkit.2.9.4-centos_linux64
-rw-rw-r-- 1 xuquan xuquan 107119239 2月 6 01:12 sratoolkit.2.9.4-centos_linux64.tar.gz
(wgs) [xuquan@zhswlcserver:softwares]$unzip gatk-4.1.0.0.zip
(wgs) [xuquan@zhswlcserver:softwares]$ll
总用量 491964
drwxr-xr-x 4 xuquan xuquan 4096 1月 29 22:23 gatk-4.1.0.0
-rw-rw-r-- 1 xuquan xuquan 395455082 1月 30 11:33 gatk-4.1.0.0.zip
drwxrwxr-x 7 xuquan xuquan 4096 2月 14 09:26 htslib-1.9
drwxrwxr-x 6 xuquan xuquan 72 2月 14 09:26 htslib-1.9-install
-rw-rw-r-- 1 xuquan xuquan 1178859 7月 18 2018 htslib-1.9.tar.bz2
drwxrwxr-x 5 xuquan xuquan 149 2月 6 00:33 sratoolkit.2.9.4-centos_linux64
-rw-rw-r-- 1 xuquan xuquan 107119239 2月 6 01:12 sratoolkit.2.9.4-centos_linux64.tar.gz
(wgs) [xuquan@zhswlcserver:softwares]$cd gatk-4.1.0.0/
(wgs) [xuquan@zhswlcserver:gatk-4.1.0.0]$ll
总用量 412892
-rwxr-xr-x 1 xuquan xuquan 19384 1月 29 22:23 gatk
-rw-r--r-- 1 xuquan xuquan 833099 1月 29 22:23 gatk-completion.sh
-rw-r--r-- 1 xuquan xuquan 935 1月 29 22:23 gatkcondaenv.yml
-rw-r--r-- 1 xuquan xuquan 3635 1月 29 22:23 GATKConfig.EXAMPLE.properties
drwxr-xr-x 2 xuquan xuquan 49152 1月 29 22:23 gatkdoc
-rw-r--r-- 1 xuquan xuquan 283655241 1月 29 22:23 gatk-package-4.1.0.0-local.jar
-rw-r--r-- 1 xuquan xuquan 138052762 1月 29 22:23 gatk-package-4.1.0.0-spark.jar
-rw-r--r-- 1 xuquan xuquan 114047 1月 29 22:23 gatkPythonPackageArchive.zip
-rw-r--r-- 1 xuquan xuquan 38055 1月 29 22:23 README.md
drwxr-xr-x 5 xuquan xuquan 93 1月 29 22:23 scripts
(wgs) [xuquan@zhswlcserver:gatk-4.1.0.0]$pwd
/home/xuquan/conda/softwares/gatk-4.1.0.0
(wgs) [xuquan@zhswlcserver:gatk-4.1.0.0]$echo "export PATH=/home/xuquan/conda/softwares/gatk-4.1.0.0:$PATH" >> ~/.bashrc
(wgs) [xuquan@zhswlcserver:gatk-4.1.0.0]$source ~/.bashrc
至此,GATK安装成功,可以在任何地方直接使用“gatk”命令调用该软件了!
C7. 标记PCR重复
(wgs) [xuquan@zhswlcserver:conda]$gatk MarkDuplicates -I /home/xuquan/conda/output/SRR1770413.aln.sorted.bam -O /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup.bam -M /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup_metrics.txt
Using GATK jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar MarkDuplicates -I /home/xuquan/conda/output/SRR1770413.aln.sorted.bam -O /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup.bam -M /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup_metrics.txt
10:37:29.631 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Feb 14 10:37:29 CST 2019] MarkDuplicates --INPUT /home/xuquan/conda/output/SRR1770413.aln.sorted.bam --OUTPUT /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup.bam --METRICS_FILE /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup_metrics.txt --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Feb 14 10:37:31 CST 2019] Executing as xuquan@zhswlcserver on Linux 3.10.0-862.9.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.0.0
INFO 2019-02-14 10:37:31 MarkDuplicates Start of doWork freeMemory: 1414574064; totalMemory: 1430781952; maxMemory: 28631367680
INFO 2019-02-14 10:37:31 MarkDuplicates Reading input file and constructing read end information.
INFO 2019-02-14 10:37:31 MarkDuplicates Will retain up to 103736839 data points before spilling to disk.
WARNING 2019-02-14 10:37:32 AbstractOpticalDuplicateFinderCommandLineProgram A field field parsed out of a read name was expected to contain an integer and did not. Read name: SRR1770413.21706. Cause: String 'SRR1770413.21706' did not start with a parsable number.
INFO 2019-02-14 10:37:53 MarkDuplicates Read 926112 records. 0 pairs never matched.
INFO 2019-02-14 10:37:54 MarkDuplicates After buildSortedReadEndLists freeMemory: 3470415240; totalMemory: 4468506624; maxMemory: 28631367680
INFO 2019-02-14 10:37:54 MarkDuplicates Will retain up to 894730240 duplicate indices before spilling to disk.
INFO 2019-02-14 10:38:02 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2019-02-14 10:38:02 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2019-02-14 10:38:03 MarkDuplicates Sorting list of duplicate records.
INFO 2019-02-14 10:38:09 MarkDuplicates After generateDuplicateIndexes freeMemory: 4405235528; totalMemory: 11635523584; maxMemory: 28631367680
INFO 2019-02-14 10:38:09 MarkDuplicates Marking 19497 records as duplicates.
INFO 2019-02-14 10:38:09 MarkDuplicates Found 0 optical duplicate clusters.
INFO 2019-02-14 10:38:09 MarkDuplicates Reads are assumed to be ordered by: coordinate
INFO 2019-02-14 10:38:24 MarkDuplicates Before output close freeMemory: 11565974312; totalMemory: 11643387904; maxMemory: 28631367680
INFO 2019-02-14 10:38:24 MarkDuplicates After output close freeMemory: 14491002760; totalMemory: 14612430848; maxMemory: 28631367680
[Thu Feb 14 10:38:24 CST 2019] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.92 minutes.
Runtime.totalMemory()=14612430848
Tool returned:
0
前面排序后产生的sorted.bam文件这时候也可以删除了:
[xuquan@zhswlcserver:output]$rm -rf /home/xuquan/conda/output/SRR1770413.aln.sorted.bam
[xuquan@zhswlcserver:output]$ll
总用量 293264
-rw-rw-r-- 1 xuquan xuquan 300295085 2月 14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan 3559 2月 14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt
[xuquan@zhswlcserver:output]$ll -lh
总用量 287M
-rw-rw-r-- 1 xuquan xuquan 287M 2月 14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan 3.5K 2月 14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt
C8. 创建比对索引文件
不论是否有后续分析,为BAM文件创建索引应该作为一个常规步骤,它可以让我们快速地访问基因组上任意位置的比对情况,这一点非常有助于我们随时了解数据!
(wgs) [xuquan@zhswlcserver:conda]$samtools index /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup.bam
(wgs) [xuquan@zhswlcserver:conda]$ll -lh /home/xuquan/conda/output
总用量 287M
-rw-rw-r-- 1 xuquan xuquan 287M 2月 14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan 15K 2月 14 10:42 SRR1770413.aln.sorted.markdup.bam.bai
-rw-rw-r-- 1 xuquan xuquan 3.5K 2月 14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt
D. 变异检测
接下来是用GATK完成变异检测。
D1. 生成dict文件
在开始变异检测之前我们还需要先为E.coli K12的参考序列生成一个.dict文件,这可以通过调用CreateSequenceDictonary模块来完成(这是原来picard的功能)。
(wgs) [xuquan@zhswlcserver:conda]$gatk CreateSequenceDictionary -R /home/xuquan/conda/db/E.coli_K12_MG1655.fa -O /home/xuquan/conda/db/E.coli_K12_MG1655.dict
Using GATK jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar CreateSequenceDictionary -R /home/xuquan/conda/db/E.coli_K12_MG1655.fa -O /home/xuquan/conda/db/E.coli_K12_MG1655.dict
10:56:00.213 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Feb 14 10:56:00 CST 2019] CreateSequenceDictionary --OUTPUT /home/xuquan/conda/db/E.coli_K12_MG1655.dict --REFERENCE /home/xuquan/conda/db/E.coli_K12_MG1655.fa --TRUNCATE_NAMES_AT_WHITESPACE true --NUM_SEQUENCES 2147483647 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Feb 14 10:56:02 CST 2019] Executing as xuquan@zhswlcserver on Linux 3.10.0-862.9.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.0.0
[Thu Feb 14 10:56:02 CST 2019] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=2094006272
Tool returned:
0
(wgs) [xuquan@zhswlcserver:conda]$cd db/
(wgs) [xuquan@zhswlcserver:db]$ll
总用量 13896
-rw-rw-r-- 1 xuquan xuquan 128 2月 14 10:56 E.coli_K12_MG1655.dict
-rw-rw-r-- 1 xuquan xuquan 4699745 2月 13 17:35 E.coli_K12_MG1655.fa
-rw-rw-r-- 1 xuquan xuquan 12 2月 14 09:50 E.coli_K12_MG1655.fa.amb
-rw-rw-r-- 1 xuquan xuquan 98 2月 14 09:50 E.coli_K12_MG1655.fa.ann
-rw-rw-r-- 1 xuquan xuquan 4641732 2月 14 09:50 E.coli_K12_MG1655.fa.bwt
-rw-rw-r-- 1 xuquan xuquan 29 2月 14 08:50 E.coli_K12_MG1655.fa.fai
-rw-rw-r-- 1 xuquan xuquan 1160415 2月 14 09:50 E.coli_K12_MG1655.fa.pac
-rw-rw-r-- 1 xuquan xuquan 2320880 2月 14 09:50 E.coli_K12_MG1655.fa.sa
-rw-rw-r-- 1 xuquan xuquan 1379902 2月 13 17:33 GCF_000005845.2_ASM584v2_genomic.fna.gz
唯一需要注意的是.dict文件的名字前缀需要和fasta的一样,并跟它在同一个路径下,这样GATK才能够找到。
D2. 生成中间文件gvcf
(wgs) [xuquan@zhswlcserver:output]$gatk HaplotypeCaller \
> -R /home/xuquan/conda/db/E.coli_K12_MG1655.fa \
> --emit-ref-confidence GVCF \
> -I /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup.bam \
> -O /home/xuquan/conda/output/SRR1770413.g.vcf
Using GATK jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar HaplotypeCaller -R /home/xuquan/conda/db/E.coli_K12_MG1655.fa --emit-ref-confidence GVCF -I /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup.bam -O /home/xuquan/conda/output/SRR1770413.g.vcf
11:01:36.742 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
11:01:38.551 INFO HaplotypeCaller - ------------------------------------------------------------
11:01:38.553 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.1.0.0
11:01:38.553 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
11:01:38.558 INFO HaplotypeCaller - Executing as xuquan@zhswlcserver on Linux v3.10.0-862.9.1.el7.x86_64 amd64
11:01:38.558 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_121-b15
11:01:38.559 INFO HaplotypeCaller - Start Date/Time: 2019年2月14日 上午11时01分36秒
11:01:38.559 INFO HaplotypeCaller - ------------------------------------------------------------
11:01:38.560 INFO HaplotypeCaller - ------------------------------------------------------------
11:01:38.561 INFO HaplotypeCaller - HTSJDK Version: 2.18.2
11:01:38.561 INFO HaplotypeCaller - Picard Version: 2.18.25
11:01:38.561 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
11:01:38.561 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:01:38.562 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:01:38.562 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:01:38.562 INFO HaplotypeCaller - Deflater: IntelDeflater
11:01:38.562 INFO HaplotypeCaller - Inflater: IntelInflater
11:01:38.562 INFO HaplotypeCaller - GCS max retries/reopens: 20
11:01:38.563 INFO HaplotypeCaller - Requester pays: disabled
11:01:38.563 INFO HaplotypeCaller - Initializing engine
11:01:39.032 INFO HaplotypeCaller - Done initializing engine
11:01:39.035 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
11:01:39.043 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
11:01:39.043 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
11:01:39.063 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
11:01:39.154 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
11:01:39.227 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
11:01:39.227 INFO IntelPairHmm - Available threads: 96
11:01:39.227 INFO IntelPairHmm - Requested threads: 4
11:01:39.228 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
11:01:39.304 INFO ProgressMeter - Starting traversal
11:01:39.305 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
11:01:49.539 INFO ProgressMeter - NC_000913.3:71452 0.2 240 1407.5
11:01:59.820 INFO ProgressMeter - NC_000913.3:173452 0.3 580 1696.4
11:02:09.959 INFO ProgressMeter - NC_000913.3:301627 0.5 1010 1977.0
11:02:19.968 INFO ProgressMeter - NC_000913.3:414855 0.7 1390 2051.0
11:02:30.260 INFO ProgressMeter - NC_000913.3:522855 0.8 1750 2060.7
11:02:40.405 INFO ProgressMeter - NC_000913.3:580475 1.0 1970 1934.5
11:02:50.708 INFO ProgressMeter - NC_000913.3:657006 1.2 2230 1873.9
11:03:00.741 INFO ProgressMeter - NC_000913.3:763305 1.4 2590 1908.3
11:03:10.920 INFO ProgressMeter - NC_000913.3:870993 1.5 2950 1932.0
11:03:21.062 INFO ProgressMeter - NC_000913.3:995835 1.7 3370 1987.1
11:03:31.155 INFO ProgressMeter - NC_000913.3:1118381 1.9 3780 2027.7
11:03:41.197 INFO ProgressMeter - NC_000913.3:1236969 2.0 4180 2057.6
11:03:51.283 INFO ProgressMeter - NC_000913.3:1365797 2.2 4620 2100.4
11:04:01.482 INFO ProgressMeter - NC_000913.3:1453689 2.4 4920 2076.3
11:04:11.601 INFO ProgressMeter - NC_000913.3:1573689 2.5 5320 2095.9
11:04:21.707 INFO ProgressMeter - NC_000913.3:1685369 2.7 5700 2105.9
11:04:31.750 INFO ProgressMeter - NC_000913.3:1814094 2.9 6130 2132.9
11:04:41.817 INFO ProgressMeter - NC_000913.3:1942577 3.0 6560 2156.6
11:04:51.823 INFO ProgressMeter - NC_000913.3:2069666 3.2 6990 2178.5
11:05:02.047 INFO ProgressMeter - NC_000913.3:2222980 3.4 7510 2222.5
11:05:12.188 INFO ProgressMeter - NC_000913.3:2357668 3.5 7960 2243.5
11:05:22.277 INFO ProgressMeter - NC_000913.3:2486668 3.7 8390 2257.7
11:05:32.309 INFO ProgressMeter - NC_000913.3:2622832 3.9 8850 2278.9
11:05:42.403 INFO ProgressMeter - NC_000913.3:2759594 4.1 9310 2297.8
11:05:52.468 INFO ProgressMeter - NC_000913.3:2884593 4.2 9730 2306.0
11:06:02.609 INFO ProgressMeter - NC_000913.3:3012443 4.4 10160 2315.2
11:06:12.640 INFO ProgressMeter - NC_000913.3:3143696 4.6 10600 2326.8
11:06:22.844 INFO ProgressMeter - NC_000913.3:3272395 4.7 11030 2334.1
11:06:32.947 INFO ProgressMeter - NC_000913.3:3400379 4.9 11460 2341.6
11:06:42.980 INFO ProgressMeter - NC_000913.3:3534218 5.1 11910 2353.2
11:06:53.080 INFO ProgressMeter - NC_000913.3:3670538 5.2 12370 2365.4
11:07:03.219 INFO ProgressMeter - NC_000913.3:3798008 5.4 12800 2371.0
11:07:13.327 INFO ProgressMeter - NC_000913.3:3926285 5.6 13230 2376.5
11:07:23.580 INFO ProgressMeter - NC_000913.3:4061835 5.7 13690 2385.9
11:07:33.741 INFO ProgressMeter - NC_000913.3:4193508 5.9 14130 2392.0
11:07:43.804 INFO ProgressMeter - NC_000913.3:4323124 6.1 14570 2398.4
11:07:53.867 INFO ProgressMeter - NC_000913.3:4454885 6.2 15010 2404.4
11:08:04.074 INFO ProgressMeter - NC_000913.3:4584445 6.4 15450 2409.2
11:08:08.479 INFO HaplotypeCaller - 32196 read(s) filtered by: ((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter) AND WellformedReadFilter)
32196 read(s) filtered by: (((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter)
32196 read(s) filtered by: ((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter)
32196 read(s) filtered by: (((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter)
32196 read(s) filtered by: ((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter)
12774 read(s) filtered by: (((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter)
12774 read(s) filtered by: ((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter)
12774 read(s) filtered by: (MappingQualityReadFilter AND MappingQualityAvailableReadFilter)
12774 read(s) filtered by: MappingQualityReadFilter
19422 read(s) filtered by: NotDuplicateReadFilter
11:08:08.479 INFO ProgressMeter - NC_000913.3:4641445 6.5 15641 2411.4
11:08:08.479 INFO ProgressMeter - Traversal complete. Processed 15641 total regions in 6.5 minutes.
11:08:08.504 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.013602151000000002
11:08:08.504 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 2.339890145
11:08:08.504 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 4.57 sec
11:08:08.504 INFO HaplotypeCaller - Shutting down engine
[2019年2月14日 上午11时08分08秒] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 6.53 minutes.
Runtime.totalMemory()=2526019584
(wgs) [xuquan@zhswlcserver:output]$ll
总用量 306932
-rw-rw-r-- 1 xuquan xuquan 300295085 2月 14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan 14576 2月 14 10:42 SRR1770413.aln.sorted.markdup.bam.bai
-rw-rw-r-- 1 xuquan xuquan 3559 2月 14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt
-rw-rw-r-- 1 xuquan xuquan 13935329 2月 14 11:08 SRR1770413.g.vcf
-rw-rw-r-- 1 xuquan xuquan 40656 2月 14 11:08 SRR1770413.g.vcf.idx
D3. 通过gvcf检测变异
(wgs) [xuquan@zhswlcserver:output]$gatk GenotypeGVCFs \
> -R /home/xuquan/conda/db/E.coli_K12_MG1655.fa \
> -V /home/xuquan/conda/output/SRR1770413.g.vcf \
> -O /home/xuquan/conda/output/SRR1770413.vcf
Using GATK jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar GenotypeGVCFs -R /home/xuquan/conda/db/E.coli_K12_MG1655.fa -V /home/xuquan/conda/output/SRR1770413.g.vcf -O /home/xuquan/conda/output/SRR1770413.vcf
11:12:15.631 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
11:12:17.502 INFO GenotypeGVCFs - ------------------------------------------------------------
11:12:17.504 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.1.0.0
11:12:17.504 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
11:12:17.509 INFO GenotypeGVCFs - Executing as xuquan@zhswlcserver on Linux v3.10.0-862.9.1.el7.x86_64 amd64
11:12:17.509 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_121-b15
11:12:17.510 INFO GenotypeGVCFs - Start Date/Time: 2019年2月14日 上午11时12分15秒
11:12:17.510 INFO GenotypeGVCFs - ------------------------------------------------------------
11:12:17.510 INFO GenotypeGVCFs - ------------------------------------------------------------
11:12:17.511 INFO GenotypeGVCFs - HTSJDK Version: 2.18.2
11:12:17.511 INFO GenotypeGVCFs - Picard Version: 2.18.25
11:12:17.512 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
11:12:17.512 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:12:17.512 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:12:17.512 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:12:17.512 INFO GenotypeGVCFs - Deflater: IntelDeflater
11:12:17.513 INFO GenotypeGVCFs - Inflater: IntelInflater
11:12:17.513 INFO GenotypeGVCFs - GCS max retries/reopens: 20
11:12:17.513 INFO GenotypeGVCFs - Requester pays: disabled
11:12:17.513 INFO GenotypeGVCFs - Initializing engine
11:12:17.935 INFO FeatureManager - Using codec VCFCodec to read file file:///home/xuquan/conda/output/SRR1770413.g.vcf
11:12:17.982 INFO GenotypeGVCFs - Done initializing engine
11:12:18.051 INFO ProgressMeter - Starting traversal
11:12:18.052 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
11:12:19.143 WARN ReferenceConfidenceVariantContextMerger - Detected invalid annotations: When trying to merge variant contexts at location NC_000913.3:378700 the annotation MLEAC=[2, 0] was not a numerical value and was ignored
11:12:23.758 INFO ProgressMeter - NC_000913.3:4628484 0.1 151422 1592518.8
11:12:23.758 INFO ProgressMeter - Traversal complete. Processed 151422 total variants in 0.1 minutes.
11:12:23.773 INFO GenotypeGVCFs - Shutting down engine
[2019年2月14日 上午11时12分23秒] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.14 minutes.
Runtime.totalMemory()=2040528896
(wgs) [xuquan@zhswlcserver:output]$ll
总用量 307040
-rw-rw-r-- 1 xuquan xuquan 300295085 2月 14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan 14576 2月 14 10:42 SRR1770413.aln.sorted.markdup.bam.bai
-rw-rw-r-- 1 xuquan xuquan 3559 2月 14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt
-rw-rw-r-- 1 xuquan xuquan 13935329 2月 14 11:08 SRR1770413.g.vcf
-rw-rw-r-- 1 xuquan xuquan 40656 2月 14 11:08 SRR1770413.g.vcf.idx
-rw-rw-r-- 1 xuquan xuquan 89446 2月 14 11:12 SRR1770413.vcf
-rw-rw-r-- 1 xuquan xuquan 18469 2月 14 11:12 SRR1770413.vcf.idx
(wgs) [xuquan@zhswlcserver:output]$ll -lh
总用量 300M
-rw-rw-r-- 1 xuquan xuquan 287M 2月 14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan 15K 2月 14 10:42 SRR1770413.aln.sorted.markdup.bam.bai
-rw-rw-r-- 1 xuquan xuquan 3.5K 2月 14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt
-rw-rw-r-- 1 xuquan xuquan 14M 2月 14 11:08 SRR1770413.g.vcf
-rw-rw-r-- 1 xuquan xuquan 40K 2月 14 11:08 SRR1770413.g.vcf.idx
-rw-rw-r-- 1 xuquan xuquan 88K 2月 14 11:12 SRR1770413.vcf
-rw-rw-r-- 1 xuquan xuquan 19K 2月 14 11:12 SRR1770413.vcf.idx
很快我们就获得了E.coli K12这个样本初步的变异结果:SRR1770413.vcf。之所以非要分成两个步骤,是因为“我”想借此告诉大家,变异检测不是一个样本的事情,有越多的同类样本放在一起 joint calling 结果将会越准确,而如果样本足够多的话,在低测序深度的情况下也同样可以获得完整并且准确的结果,而这样的分步方式是应对多样本的好方法。
D4. 压缩VCF并构建索引
最后,我们用bgzip对这个VCF进行压缩,并用tabix为它构建索引,方便以后的分析。
(wgs) [xuquan@zhswlcserver:output]$bgzip -f /home/xuquan/conda/output/SRR1770413.vcf
(wgs) [xuquan@zhswlcserver:output]$ll
总用量 306972
-rw-rw-r-- 1 xuquan xuquan 300295085 2月 14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan 14576 2月 14 10:42 SRR1770413.aln.sorted.markdup.bam.bai
-rw-rw-r-- 1 xuquan xuquan 3559 2月 14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt
-rw-rw-r-- 1 xuquan xuquan 13935329 2月 14 11:08 SRR1770413.g.vcf
-rw-rw-r-- 1 xuquan xuquan 40656 2月 14 11:08 SRR1770413.g.vcf.idx
-rw-rw-r-- 1 xuquan xuquan 17295 2月 14 11:16 SRR1770413.vcf.gz
-rw-rw-r-- 1 xuquan xuquan 18469 2月 14 11:12 SRR1770413.vcf.idx
(wgs) [xuquan@zhswlcserver:output]$tabix -p vcf /home/xuquan/conda/output/SRR1770413.vcf.gz
(wgs) [xuquan@zhswlcserver:output]$ll
总用量 306976
-rw-rw-r-- 1 xuquan xuquan 300295085 2月 14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan 14576 2月 14 10:42 SRR1770413.aln.sorted.markdup.bam.bai
-rw-rw-r-- 1 xuquan xuquan 3559 2月 14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt
-rw-rw-r-- 1 xuquan xuquan 13935329 2月 14 11:08 SRR1770413.g.vcf
-rw-rw-r-- 1 xuquan xuquan 40656 2月 14 11:08 SRR1770413.g.vcf.idx
-rw-rw-r-- 1 xuquan xuquan 17295 2月 14 11:16 SRR1770413.vcf.gz
-rw-rw-r-- 1 xuquan xuquan 651 2月 14 11:17 SRR1770413.vcf.gz.tbi
-rw-rw-r-- 1 xuquan xuquan 18469 2月 14 11:12 SRR1770413.vcf.idx
至此,整个WGS数据分析就差不多了。由于E.coli基因组与人的基因组数据分析略有不同,后面还要继续通过人的基因组数据分析继续进行学习,并着手开始搭建自己的流程!