WGS数据分析流程学习与开发过程全纪录(1)

A. 开发环境及数据准备工作

A1. 开发环境搭建

安装miniconda3,然后在其中创建WGS的开发环境:

# 1. Install miniconda3.
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh .
bash Miniconda3-latest-Linux-x86_64.sh  -b -p /home/xuquan/conda/miniconda3
echo "export PATH=/home/xuquan/conda/miniconda3/bin:$PATH" >> ~/.bashrc
source ~/.bashrc

# 2. Config.
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
conda config --set show_channel_urls yes

# 3. Create QxWgsEnv!
conda create -n wgs python=2
source activate wgs

至此,已经创建了WGS数据分析流程的conda开发环境

A2. 准备测试数据

下载参考 E.coli K12 参考基因组:

mkdir /home/xuquan/conda/db
cd /home/xuquan/conda/db
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
gzip -dc GCF_000005845.2_ASM584v2_genomic.fna.gz > E.coli_K12_MG1655.fa
less E.coli_K12_MG1655.fa # E.coli has only one chromesome!

下载测序数据:

(wgs) [xuquan@zhswlcserver:conda]$mkdir rawdata
(wgs) [xuquan@zhswlcserver:conda]$cd rawdata/
(wgs) [xuquan@zhswlcserver:rawdata]$ll
总用量 0
(wgs) [xuquan@zhswlcserver:rawdata]$wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR177/SRR1770413/SRR1770413.sra
--2019-02-14 09:00:47--  ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR177/SRR1770413/SRR1770413.sra
           => ‘SRR1770413.sra’
Resolving ftp-trace.ncbi.nlm.nih.gov... 130.14.250.11, 2607:f220:41e:250::12
Connecting to ftp-trace.ncbi.nlm.nih.gov|130.14.250.11|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /sra/sra-instant/reads/ByRun/sra/SRR/SRR177/SRR1770413 ... done.
==> SIZE SRR1770413.sra ... 199084896
==> PASV ... done.    ==> RETR SRR1770413.sra ... done.
Length: 199084896 (190M) (unauthoritative)

SRR1770413.sra                              100%[========================================================================================>] 189.86M  1.57MB/s    in 76s

2019-02-14 09:02:07 (2.51 MB/s) - ‘SRR1770413.sra’ saved [199084896]

(wgs) [xuquan@zhswlcserver:rawdata]$ll
总用量 194420
-rw-rw-r-- 1 xuquan xuquan 199084896 2月  14 09:02 SRR1770413.sra

A3. 安装samtools

(wgs) [xuquan@zhswlcserver:db]$conda install samtools

遗憾的是,在安装了samtools,并尝试使用其时发现,报错了:

samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory

意思很明显,libcrypto.so.1.0.0 这个包找不到了!用ldd命令看看samtools依赖了哪些包:

(wgs) [xuquan@zhswlcserver:db]$which samtools
~/conda/miniconda3/envs/wgs/bin/samtools
(wgs) [xuquan@zhswlcserver:db]$ldd ~/conda/miniconda3/envs/wgs/bin/samtools
        linux-vdso.so.1 =>  (0x00007ffc4fbf9000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5c59071000)
        libz.so.1 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/libz.so.1 (0x00007f5c59348000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f5c58d6f000)
        libbz2.so.1.0 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/libbz2.so.1.0 (0x00007f5c59334000)
        liblzma.so.5 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/liblzma.so.5 (0x00007f5c5930a000)
        libdeflate.so => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/libdeflate.so (0x00007f5c592fa000)
        libcurl.so.4 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/libcurl.so.4 (0x00007f5c58cf6000)
        libcrypto.so.1.0.0 => not found
        libncursesw.so.6 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/libncursesw.so.6 (0x00007f5c592bd000)
        libtinfow.so.6 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/libtinfow.so.6 (0x00007f5c58cb9000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f5c588ec000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f5c5928d000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f5c586e4000)
        libssl.so.1.1 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/./libssl.so.1.1 (0x00007f5c5864f000)
        libcrypto.so.1.1 => /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/./libcrypto.so.1.1 (0x00007f5c58358000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f5c58154000)

果然,libcrypto.so.1.0.0 包是 not found!

于是,尝试在系统中搜索该包:

(wgs) [xuquan@zhswlcserver:conda]$find / -name libcrypto.so.1*
find: ‘/sys/kernel/debug’: 权限不够
find: ‘/database/backup/root/DBI-1.633’: 权限不够
find: ‘/database/backup/root/VirtualBox VMs/qiime/Snapshots’: 权限不够
find: ‘/database/backup/root/VirtualBox VMs/qiime/Logs’: 权限不够
find: ‘/usr/libexec/initscripts/legacy-actions/auditd’: 权限不够
find: ‘/usr/share/rhel/secrets’: 权限不够
find: ‘/usr/share/Pegasus/scripts’: 权限不够
find: ‘/usr/share/polkit-1/rules.d’: 权限不够
/usr/lib64/libcrypto.so.1.0.2k
find: ‘/usr/lib64/Pegasus’: 权限不够
...

发现了 /usr/lib64/libcrypto.so.1.0.2k 这个包,接着使用软连接的形式连接到conda环境中去:

(wgs) [xuquan@zhswlcserver:conda]$ln -s /usr/lib64/libcrypto.so.1.0.2k /home/xuquan/conda/miniconda3/envs/wgs/bin/../lib/libcrypto.so.1.0.0
(wgs) [xuquan@zhswlcserver:conda]$samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 1.9 (using htslib 1.9)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file

samtools 已经运行成功了,说明问题已经得到解决了!

接着使用samtools对下载的参考基因组序列建一个索引,为方便其他数据分析工具(比如GATK)能够快速地获取fasta上的任何序列做准备。

(wgs) [xuquan@zhswlcserver:db]$cd ~/conda/db/
(wgs) [xuquan@zhswlcserver:db]$ll
总用量 5940
-rw-rw-r-- 1 xuquan xuquan 4699745 2月  13 17:35 E.coli_K12_MG1655.fa
-rw-rw-r-- 1 xuquan xuquan 1379902 2月  13 17:33 GCF_000005845.2_ASM584v2_genomic.fna.gz
(wgs) [xuquan@zhswlcserver:db]$samtools faidx E.coli_K12_MG1655.fa
(wgs) [xuquan@zhswlcserver:db]$ll
总用量 5944
-rw-rw-r-- 1 xuquan xuquan 4699745 2月  13 17:35 E.coli_K12_MG1655.fa
-rw-rw-r-- 1 xuquan xuquan      29 2月  14 08:50 E.coli_K12_MG1655.fa.fai
-rw-rw-r-- 1 xuquan xuquan 1379902 2月  13 17:33 GCF_000005845.2_ASM584v2_genomic.fna.gz

除了方便其他工具之外,我们可以通过这样的索引来获取fasta文件中任意位置的序列或者任意完整的染色体序列。可以很方便地完成对参考序列(或者任意fasta文件)特定区域序列的提取。举个例子:

(wgs) [xuquan@zhswlcserver:db]$samtools faidx E.coli_K12_MG1655.fa NC_000913.3:1000000-1000200

我们就获得了E.coli K12参考序列上的这一段序列:

>NC_000913.3:1000000-1000200
GTGTCAGCTTTCGTGGTGTGCAGCTGGCGTCAGATGACAACATGCTGCCAGACAGCCTGA
AAGGGTTTGCGCCTGTGGTGCGTGGTATCGCCAAAAGCAATGCCCAGATAACGATTAAGC
AAAATGGTTACACCATTTACCAAACTTATGTATCGCCTGGTGCTTTTGAAATTAGTGATC
TCTATTCCACGTCGTCGAGCG

A4. 安装 sratoolkit

从该网站中找到对应的版本进行下载:

(wgs) [xuquan@zhswlcserver:conda]$mkdir softwares
(wgs) [xuquan@zhswlcserver:conda]$cd softwares/
(wgs) [xuquan@zhswlcserver:softwares]$wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.4/sratoolkit.2.9.4-centos_linux64.tar.gz
--2019-02-14 09:07:48--  http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.4/sratoolkit.2.9.4-centos_linux64.tar.gz
Resolving ftp-trace.ncbi.nlm.nih.gov... 130.14.250.13, 2607:f220:41e:250::11
Connecting to ftp-trace.ncbi.nlm.nih.gov|130.14.250.13|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.4/sratoolkit.2.9.4-centos_linux64.tar.gz [following]
--2019-02-14 09:07:49--  https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.4/sratoolkit.2.9.4-centos_linux64.tar.gz
Connecting to ftp-trace.ncbi.nlm.nih.gov|130.14.250.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 107119239 (102M) [application/x-gzip]
Saving to: ‘sratoolkit.2.9.4-centos_linux64.tar.gz’

sratoolkit.2.9.4-centos_linux64.tar.gz      100%[========================================================================================>] 102.16M  1.72MB/s    in 44s

2019-02-14 09:08:35 (2.30 MB/s) - ‘sratoolkit.2.9.4-centos_linux64.tar.gz’ saved [107119239/107119239]

(wgs) [xuquan@zhswlcserver:softwares]$ll
总用量 104612
-rw-rw-r-- 1 xuquan xuquan 107119239 2月   6 01:12 sratoolkit.2.9.4-centos_linux64.tar.gz
(wgs) [xuquan@zhswlcserver:softwares]$tar zxvf sratoolkit.2.9.4-centos_linux64.tar.gz
(wgs) [xuquan@zhswlcserver:softwares]$ll
总用量 104612
drwxrwxr-x 5 xuquan xuquan       149 2月   6 00:33 sratoolkit.2.9.4-centos_linux64
-rw-rw-r-- 1 xuquan xuquan 107119239 2月   6 01:12 sratoolkit.2.9.4-centos_linux64.tar.gz

(wgs) [xuquan@zhswlcserver:bin]$echo "export PATH=/home/xuquan/conda/softwares/sratoolkit.2.9.4-centos_linux64/bin:$PATH" >> ~/.bashrc
(wgs) [xuquan@zhswlcserver:bin]$source ~/.bashrc

至此,sratoolkit 已经安装并配置好了,接下来对下载的 SRA 数据转化为 fastq 文件:

(wgs) [xuquan@zhswlcserver:rawdata]$cd ~/conda/rawdata
(wgs) [xuquan@zhswlcserver:rawdata]$fastq-dump --split-files SRR1770413.sra
Read 643253 spots for SRR1770413.sra
Written 643253 spots for SRR1770413.sra
(wgs) [xuquan@zhswlcserver:rawdata]$ll
总用量 1045364
-rw-rw-r-- 1 xuquan xuquan 435681114 2月  14 09:15 SRR1770413_1.fastq
-rw-rw-r-- 1 xuquan xuquan 435681114 2月  14 09:15 SRR1770413_2.fastq
-rw-rw-r-- 1 xuquan xuquan 199084896 2月  14 09:02 SRR1770413.sra

至此,得到这个E.coli K12数据的 read1 和 read2 了!

实际上,fastq-dump 除了可以将已经下载到本地的SRA格式的数据转化为fastq格式的数据外,还可以直接从NCBI数据库下载并转化数据:

fastq-dump --split-files SRR1770413

A5. 安装HTSlib的相关工具

安装该软件的目的主要是为了利用 bgzip,htsfile,tabix等工具:

(wgs) [xuquan@zhswlcserver:softwares]$wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2
(wgs) [xuquan@zhswlcserver:softwares]$tar jxvf htslib-1.9.tar.bz2
(wgs) [xuquan@zhswlcserver:softwares]$mkdir htslib-1.9-install
(wgs) [xuquan@zhswlcserver:htslib-1.9-install]$cd htslib-1.9
(wgs) [xuquan@zhswlcserver:htslib-1.9]$./configure --prefix=/home/xuquan/conda/softwares/htslib-1.9-install
(wgs) [xuquan@zhswlcserver:htslib-1.9]$make
(wgs) [xuquan@zhswlcserver:htslib-1.9]$make install
(wgs) [xuquan@zhswlcserver:htslib-1.9]$cd ../htslib-1.9-install/
(wgs) [xuquan@zhswlcserver:htslib-1.9-install]$ll
总用量 0
drwxr-xr-x 2 xuquan xuquan  60 2月  14 09:26 bin
drwxr-xr-x 3 xuquan xuquan  27 2月  14 09:26 include
drwxr-xr-x 3 xuquan xuquan 115 2月  14 09:26 lib
drwxrwxr-x 3 xuquan xuquan  24 2月  14 09:26 share
(wgs) [xuquan@zhswlcserver:htslib-1.9-install]$cd bin/
(wgs) [xuquan@zhswlcserver:bin]$ll
总用量 7924
-rwxr-xr-x 1 xuquan xuquan 2398296 2月  14 09:26 bgzip
-rwxr-xr-x 1 xuquan xuquan 2820880 2月  14 09:26 htsfile
-rwxr-xr-x 1 xuquan xuquan 2888072 2月  14 09:26 tabix
(wgs) [xuquan@zhswlcserver:bin]$pwd
/home/xuquan/conda/softwares/htslib-1.9-install/bin
(wgs) [xuquan@zhswlcserver:bin]$echo "export PATH=/home/xuquan/conda/softwares/htslib-1.9-install/bin:$PATH" >> ~/.bashrc
(wgs) [xuquan@zhswlcserver:bin]$source ~/.bashrc

至此,htslib-1.9 已经安装并配置完成了,可以直接通过 bgzip 等命令使用相应的工具了:

(wgs) [xuquan@zhswlcserver:rawdata]$cd /home/xuquan/conda/rawdata
(wgs) [xuquan@zhswlcserver:rawdata]$ll
总用量 1045364
-rw-rw-r-- 1 xuquan xuquan 435681114 2月  14 09:15 SRR1770413_1.fastq
-rw-rw-r-- 1 xuquan xuquan 435681114 2月  14 09:15 SRR1770413_2.fastq
-rw-rw-r-- 1 xuquan xuquan 199084896 2月  14 09:02 SRR1770413.sra
(wgs) [xuquan@zhswlcserver:rawdata]$bgzip -f SRR1770413_1.fastq
(wgs) [xuquan@zhswlcserver:rawdata]$bgzip -f SRR1770413_2.fastq
(wgs) [xuquan@zhswlcserver:rawdata]$ll
总用量 453048
-rw-rw-r-- 1 xuquan xuquan 125123648 2月  14 09:34 SRR1770413_1.fastq.gz
-rw-rw-r-- 1 xuquan xuquan 139709727 2月  14 09:36 SRR1770413_2.fastq.gz
-rw-rw-r-- 1 xuquan xuquan 199084896 2月  14 09:02 SRR1770413.sra

至此,利用 bgzip 完成了对两个fastq文件的压缩:

下载完成后,我们最好用bgzip(不推荐gzip)将其压缩为.gz文件,这样可以节省空间,而且也不会对接下来的数据分析产生影响。

B. 数据质控

暂时跳过,后面再补

C. 序列比对

所谓比对,就是把测序数据定位到参考基因组上,确定每一个read在基因组中的位置。这里,我们用目前使用最广泛的BWA来完成这个工作。所以,首先,安装BWA!

C1. 安装 BWA

(wgs) [xuquan@zhswlcserver:conda]$conda install bwa
(wgs) [xuquan@zhswlcserver:conda]$which bwa
~/conda/miniconda3/envs/wgs/bin/bwa
(wgs) [xuquan@zhswlcserver:conda]$bwa

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         shm           manage indices in shared memory
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are three alignment algorithms in BWA: `mem', `bwasw', and
      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
      first. Please `man ./bwa.1' for the manual.

至此,BWA已经安装成功了,接下来可以使用了!

C2. 构建比对索引

在正式比对之前,需要先为参考序列构建BWA比对所需的FM-index(比对索引):

(wgs) [xuquan@zhswlcserver:conda]$cd ~/conda/db/
(wgs) [xuquan@zhswlcserver:db]$ll
总用量 5944
-rw-rw-r-- 1 xuquan xuquan 4699745 2月  13 17:35 E.coli_K12_MG1655.fa
-rw-rw-r-- 1 xuquan xuquan      29 2月  14 08:50 E.coli_K12_MG1655.fa.fai
-rw-rw-r-- 1 xuquan xuquan 1379902 2月  13 17:33 GCF_000005845.2_ASM584v2_genomic.fna.gz

(wgs) [xuquan@zhswlcserver:db]$bwa index E.coli_K12_MG1655.fa
[bwa_index] Pack FASTA... 0.04 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 1.81 seconds elapse.
[bwa_index] Update BWT... 0.03 sec
[bwa_index] Pack forward-only FASTA... 0.04 sec
[bwa_index] Construct SA from BWT and Occ... 0.68 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index E.coli_K12_MG1655.fa
[main] Real time: 2.652 sec; CPU: 2.613 sec

(wgs) [xuquan@zhswlcserver:db]$ll
总用量 13892
-rw-rw-r-- 1 xuquan xuquan 4699745 2月  13 17:35 E.coli_K12_MG1655.fa
-rw-rw-r-- 1 xuquan xuquan      12 2月  14 09:50 E.coli_K12_MG1655.fa.amb
-rw-rw-r-- 1 xuquan xuquan      98 2月  14 09:50 E.coli_K12_MG1655.fa.ann
-rw-rw-r-- 1 xuquan xuquan 4641732 2月  14 09:50 E.coli_K12_MG1655.fa.bwt
-rw-rw-r-- 1 xuquan xuquan      29 2月  14 08:50 E.coli_K12_MG1655.fa.fai
-rw-rw-r-- 1 xuquan xuquan 1160415 2月  14 09:50 E.coli_K12_MG1655.fa.pac
-rw-rw-r-- 1 xuquan xuquan 2320880 2月  14 09:50 E.coli_K12_MG1655.fa.sa
-rw-rw-r-- 1 xuquan xuquan 1379902 2月  13 17:33 GCF_000005845.2_ASM584v2_genomic.fna.gz

由于这里使用的参考基因组序列较短,只需几秒钟就可以完成这个索引文件的构建(对于人类基因组则需要3个小时的时间)。创建完成之后,将多出5份以E.coli_K12_MG1655.fa为前缀的序列索引文件。

C3. 序列比对

(wgs) [xuquan@zhswlcserver:conda]$cd ~/conda
(wgs) [xuquan@zhswlcserver:conda]$mkdir output
(wgs) [xuquan@zhswlcserver:conda]$bwa mem -t 4 -R '@RG\tID:foo\tPL:illumina\tSM:E.coli_K12' /home/xuquan/conda/db/E.coli_K12_MG1655.fa /home/xuquan/conda/rawdata/SRR1770413_1.fastq.gz /home/xuquan/conda/rawdata/SRR1770413_2.fastq.gz > /home/xuquan/conda/output/SRR1770413.aln.sam
...
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 4 -R @RG\tID:foo\tPL:illumina\tSM:E.coli_K12 /home/xuquan/conda/db/E.coli_K12_MG1655.fa /home/xuquan/conda/rawdata/SRR1770413_1.fastq.gz /home/xuquan/conda/rawdata/SRR1770413_2.fastq.gz
[main] Real time: 78.611 sec; CPU: 312.660 sec

-R 设置Read Group信息,虽然“我”在以前的文章中已经反复强调过它的重要性,但这里还是再说一次,它是read数据的组别标识,并且其中的ID,PL和SM信息在正式的项目中是不能缺少的(如果样本包含多个测序文库的话,LB信息也不要省略),另外由于考虑到与GATK的兼容关系,PL(测序平台)信息不能随意指定,必须是:ILLUMINA,SLX,SOLEXA,SOLID,454,LS454,COMPLETE,PACBIO,IONTORRENT,CAPILLARY,HELICOS或UNKNOWN这12个中的一个。

至此,通过BWA mem完成了序列比对,生成了比对后的sam格式的数据文件:

[xuquan@zhswlcserver:output]$cd ~/conda/output/
[xuquan@zhswlcserver:output]$ll
总用量 892060
-rw-rw-r-- 1 xuquan xuquan 913468760 2月  14 10:02 SRR1770413.aln.sam
[xuquan@zhswlcserver:output]$ll -lh
总用量 872M
-rw-rw-r-- 1 xuquan xuquan 872M 2月  14 10:02 SRR1770413.aln.sam

接着使用samtools转化为BAM格式:

(wgs) [xuquan@zhswlcserver:conda]$samtools view -Sb /home/xuquan/conda/output/SRR1770413.aln.sam > /home/xuquan/conda/output/SRR1770413.aln.bam
(wgs) [xuquan@zhswlcserver:conda]$cd output/
(wgs) [xuquan@zhswlcserver:output]$ll
总用量 1170228
-rw-rw-r-- 1 xuquan xuquan 284843215 2月  14 10:08 SRR1770413.aln.bam
-rw-rw-r-- 1 xuquan xuquan 913468760 2月  14 10:02 SRR1770413.aln.sam
(wgs) [xuquan@zhswlcserver:output]$ll -lh
总用量 1.2G
-rw-rw-r-- 1 xuquan xuquan 272M 2月  14 10:08 SRR1770413.aln.bam
-rw-rw-r-- 1 xuquan xuquan 872M 2月  14 10:02 SRR1770413.aln.sam

实际上,生成的sam格式文件在后面不会被用到,而且由于其文件大小较大,因此可以将两步合并为一步:

(wgs) [xuquan@zhswlcserver:output]$bwa mem -t 4 -R '@RG\tID:foo\tPL:illumina\tSM:E.coli_K12' /home/xuquan/conda/db/E.coli_K12_MG1655.fa /home/xuquan/conda/rawdata/SRR1770413_1.fastq.gz /home/xuquan/conda/rawdata/SRR1770413_2.fastq.gz | samtools view -Sb - > /home/xuquan/conda/output/SRR1770413.aln.bam
...
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 4 -R @RG\tID:foo\tPL:illumina\tSM:E.coli_K12 /home/xuquan/conda/db/E.coli_K12_MG1655.fa /home/xuquan/conda/rawdata/SRR1770413_1.fastq.gz /home/xuquan/conda/rawdata/SRR1770413_2.fastq.gz
[main] Real time: 80.262 sec; CPU: 320.189 sec
(wgs) [xuquan@zhswlcserver:output]$ll
总用量 278168
-rw-rw-r-- 1 xuquan xuquan 284843215 2月  14 10:12 SRR1770413.aln.bam
(wgs) [xuquan@zhswlcserver:output]$ll -lh
总用量 272M
-rw-rw-r-- 1 xuquan xuquan 272M 2月  14 10:12 SRR1770413.aln.bam

这样,最后就只产生一个bam格式的比对结果文件了!

C5. 比对结果排序

使用samtools对原始的比对结果按照参考序列位置从小到大进行排序,设置为使用4个线程进行排序。只有这个步骤完成之后才可以继续往下进行!

(wgs) [xuquan@zhswlcserver:output]$samtools sort -@ 4 -m 4G -O bam -o /home/xuquan/conda/output/SRR1770413.aln.sorted.bam /home/xuquan/conda/output/SRR1770413.aln.bam
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
(wgs) [xuquan@zhswlcserver:output]$ll
总用量 480932
-rw-rw-r-- 1 xuquan xuquan 284843215 2月  14 10:12 SRR1770413.aln.bam
-rw-rw-r-- 1 xuquan xuquan 207629239 2月  14 10:16 SRR1770413.aln.sorted.bam
(wgs) [xuquan@zhswlcserver:output]$ll -lh
总用量 470M
-rw-rw-r-- 1 xuquan xuquan 272M 2月  14 10:12 SRR1770413.aln.bam
-rw-rw-r-- 1 xuquan xuquan 199M 2月  14 10:16 SRR1770413.aln.sorted.bam

前一步比对产生的bam文件这时候也可以删除了:

(wgs) [xuquan@zhswlcserver:output]$rm -rf /home/xuquan/conda/output/SRR1770413.aln.bam
(wgs) [xuquan@zhswlcserver:output]$ll -lh
总用量 199M
-rw-rw-r-- 1 xuquan xuquan 199M 2月  14 10:16 SRR1770413.aln.sorted.bam

C6. 安装GATK

由于下一步需要使用到GATK的部分功能,因此这里安装GATK:

(wgs) [xuquan@zhswlcserver:conda]$cd softwares/
(wgs) [xuquan@zhswlcserver:softwares]$ll
总用量 105772
drwxrwxr-x 7 xuquan xuquan      4096 2月  14 09:26 htslib-1.9
drwxrwxr-x 6 xuquan xuquan        72 2月  14 09:26 htslib-1.9-install
-rw-rw-r-- 1 xuquan xuquan   1178859 7月  18 2018 htslib-1.9.tar.bz2
drwxrwxr-x 5 xuquan xuquan       149 2月   6 00:33 sratoolkit.2.9.4-centos_linux64
-rw-rw-r-- 1 xuquan xuquan 107119239 2月   6 01:12 sratoolkit.2.9.4-centos_linux64.tar.gz
(wgs) [xuquan@zhswlcserver:softwares]$wget https://github.com/broadinstitute/gatk/releases/download/4.1.0.0/gatk-4.1.0.0.zip
(wgs) [xuquan@zhswlcserver:softwares]$ll
总用量 491960
-rw-rw-r-- 1 xuquan xuquan 395455082 1月  30 11:33 gatk-4.1.0.0.zip
drwxrwxr-x 7 xuquan xuquan      4096 2月  14 09:26 htslib-1.9
drwxrwxr-x 6 xuquan xuquan        72 2月  14 09:26 htslib-1.9-install
-rw-rw-r-- 1 xuquan xuquan   1178859 7月  18 2018 htslib-1.9.tar.bz2
drwxrwxr-x 5 xuquan xuquan       149 2月   6 00:33 sratoolkit.2.9.4-centos_linux64
-rw-rw-r-- 1 xuquan xuquan 107119239 2月   6 01:12 sratoolkit.2.9.4-centos_linux64.tar.gz
(wgs) [xuquan@zhswlcserver:softwares]$unzip gatk-4.1.0.0.zip
(wgs) [xuquan@zhswlcserver:softwares]$ll
总用量 491964
drwxr-xr-x 4 xuquan xuquan      4096 1月  29 22:23 gatk-4.1.0.0
-rw-rw-r-- 1 xuquan xuquan 395455082 1月  30 11:33 gatk-4.1.0.0.zip
drwxrwxr-x 7 xuquan xuquan      4096 2月  14 09:26 htslib-1.9
drwxrwxr-x 6 xuquan xuquan        72 2月  14 09:26 htslib-1.9-install
-rw-rw-r-- 1 xuquan xuquan   1178859 7月  18 2018 htslib-1.9.tar.bz2
drwxrwxr-x 5 xuquan xuquan       149 2月   6 00:33 sratoolkit.2.9.4-centos_linux64
-rw-rw-r-- 1 xuquan xuquan 107119239 2月   6 01:12 sratoolkit.2.9.4-centos_linux64.tar.gz
(wgs) [xuquan@zhswlcserver:softwares]$cd gatk-4.1.0.0/
(wgs) [xuquan@zhswlcserver:gatk-4.1.0.0]$ll
总用量 412892
-rwxr-xr-x 1 xuquan xuquan     19384 1月  29 22:23 gatk
-rw-r--r-- 1 xuquan xuquan    833099 1月  29 22:23 gatk-completion.sh
-rw-r--r-- 1 xuquan xuquan       935 1月  29 22:23 gatkcondaenv.yml
-rw-r--r-- 1 xuquan xuquan      3635 1月  29 22:23 GATKConfig.EXAMPLE.properties
drwxr-xr-x 2 xuquan xuquan     49152 1月  29 22:23 gatkdoc
-rw-r--r-- 1 xuquan xuquan 283655241 1月  29 22:23 gatk-package-4.1.0.0-local.jar
-rw-r--r-- 1 xuquan xuquan 138052762 1月  29 22:23 gatk-package-4.1.0.0-spark.jar
-rw-r--r-- 1 xuquan xuquan    114047 1月  29 22:23 gatkPythonPackageArchive.zip
-rw-r--r-- 1 xuquan xuquan     38055 1月  29 22:23 README.md
drwxr-xr-x 5 xuquan xuquan        93 1月  29 22:23 scripts
(wgs) [xuquan@zhswlcserver:gatk-4.1.0.0]$pwd
/home/xuquan/conda/softwares/gatk-4.1.0.0
(wgs) [xuquan@zhswlcserver:gatk-4.1.0.0]$echo "export PATH=/home/xuquan/conda/softwares/gatk-4.1.0.0:$PATH" >> ~/.bashrc
(wgs) [xuquan@zhswlcserver:gatk-4.1.0.0]$source ~/.bashrc

至此,GATK安装成功,可以在任何地方直接使用“gatk”命令调用该软件了!

C7. 标记PCR重复

(wgs) [xuquan@zhswlcserver:conda]$gatk MarkDuplicates -I /home/xuquan/conda/output/SRR1770413.aln.sorted.bam -O /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup.bam -M /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup_metrics.txt
Using GATK jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar MarkDuplicates -I /home/xuquan/conda/output/SRR1770413.aln.sorted.bam -O /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup.bam -M /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup_metrics.txt
10:37:29.631 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Feb 14 10:37:29 CST 2019] MarkDuplicates  --INPUT /home/xuquan/conda/output/SRR1770413.aln.sorted.bam --OUTPUT /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup.bam --METRICS_FILE /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup_metrics.txt  --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Feb 14 10:37:31 CST 2019] Executing as xuquan@zhswlcserver on Linux 3.10.0-862.9.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.0.0
INFO    2019-02-14 10:37:31     MarkDuplicates  Start of doWork freeMemory: 1414574064; totalMemory: 1430781952; maxMemory: 28631367680
INFO    2019-02-14 10:37:31     MarkDuplicates  Reading input file and constructing read end information.
INFO    2019-02-14 10:37:31     MarkDuplicates  Will retain up to 103736839 data points before spilling to disk.
WARNING 2019-02-14 10:37:32     AbstractOpticalDuplicateFinderCommandLineProgram        A field field parsed out of a read name was expected to contain an integer and did not. Read name: SRR1770413.21706. Cause: String 'SRR1770413.21706' did not start with a parsable number.
INFO    2019-02-14 10:37:53     MarkDuplicates  Read 926112 records. 0 pairs never matched.
INFO    2019-02-14 10:37:54     MarkDuplicates  After buildSortedReadEndLists freeMemory: 3470415240; totalMemory: 4468506624; maxMemory: 28631367680
INFO    2019-02-14 10:37:54     MarkDuplicates  Will retain up to 894730240 duplicate indices before spilling to disk.
INFO    2019-02-14 10:38:02     MarkDuplicates  Traversing read pair information and detecting duplicates.
INFO    2019-02-14 10:38:02     MarkDuplicates  Traversing fragment information and detecting duplicates.
INFO    2019-02-14 10:38:03     MarkDuplicates  Sorting list of duplicate records.
INFO    2019-02-14 10:38:09     MarkDuplicates  After generateDuplicateIndexes freeMemory: 4405235528; totalMemory: 11635523584; maxMemory: 28631367680
INFO    2019-02-14 10:38:09     MarkDuplicates  Marking 19497 records as duplicates.
INFO    2019-02-14 10:38:09     MarkDuplicates  Found 0 optical duplicate clusters.
INFO    2019-02-14 10:38:09     MarkDuplicates  Reads are assumed to be ordered by: coordinate
INFO    2019-02-14 10:38:24     MarkDuplicates  Before output close freeMemory: 11565974312; totalMemory: 11643387904; maxMemory: 28631367680
INFO    2019-02-14 10:38:24     MarkDuplicates  After output close freeMemory: 14491002760; totalMemory: 14612430848; maxMemory: 28631367680
[Thu Feb 14 10:38:24 CST 2019] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.92 minutes.
Runtime.totalMemory()=14612430848
Tool returned:
0

前面排序后产生的sorted.bam文件这时候也可以删除了:

[xuquan@zhswlcserver:output]$rm -rf /home/xuquan/conda/output/SRR1770413.aln.sorted.bam
[xuquan@zhswlcserver:output]$ll
总用量 293264
-rw-rw-r-- 1 xuquan xuquan 300295085 2月  14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan      3559 2月  14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt
[xuquan@zhswlcserver:output]$ll -lh
总用量 287M
-rw-rw-r-- 1 xuquan xuquan 287M 2月  14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan 3.5K 2月  14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt

C8. 创建比对索引文件

不论是否有后续分析,为BAM文件创建索引应该作为一个常规步骤,它可以让我们快速地访问基因组上任意位置的比对情况,这一点非常有助于我们随时了解数据!

(wgs) [xuquan@zhswlcserver:conda]$samtools index /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup.bam
(wgs) [xuquan@zhswlcserver:conda]$ll -lh /home/xuquan/conda/output
总用量 287M
-rw-rw-r-- 1 xuquan xuquan 287M 2月  14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan  15K 2月  14 10:42 SRR1770413.aln.sorted.markdup.bam.bai
-rw-rw-r-- 1 xuquan xuquan 3.5K 2月  14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt

D. 变异检测

接下来是用GATK完成变异检测。

D1. 生成dict文件

在开始变异检测之前我们还需要先为E.coli K12的参考序列生成一个.dict文件,这可以通过调用CreateSequenceDictonary模块来完成(这是原来picard的功能)。

(wgs) [xuquan@zhswlcserver:conda]$gatk CreateSequenceDictionary -R /home/xuquan/conda/db/E.coli_K12_MG1655.fa -O /home/xuquan/conda/db/E.coli_K12_MG1655.dict
Using GATK jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar CreateSequenceDictionary -R /home/xuquan/conda/db/E.coli_K12_MG1655.fa -O /home/xuquan/conda/db/E.coli_K12_MG1655.dict
10:56:00.213 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Feb 14 10:56:00 CST 2019] CreateSequenceDictionary  --OUTPUT /home/xuquan/conda/db/E.coli_K12_MG1655.dict --REFERENCE /home/xuquan/conda/db/E.coli_K12_MG1655.fa  --TRUNCATE_NAMES_AT_WHITESPACE true --NUM_SEQUENCES 2147483647 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Feb 14 10:56:02 CST 2019] Executing as xuquan@zhswlcserver on Linux 3.10.0-862.9.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.0.0
[Thu Feb 14 10:56:02 CST 2019] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=2094006272
Tool returned:
0
(wgs) [xuquan@zhswlcserver:conda]$cd db/
(wgs) [xuquan@zhswlcserver:db]$ll
总用量 13896
-rw-rw-r-- 1 xuquan xuquan     128 2月  14 10:56 E.coli_K12_MG1655.dict
-rw-rw-r-- 1 xuquan xuquan 4699745 2月  13 17:35 E.coli_K12_MG1655.fa
-rw-rw-r-- 1 xuquan xuquan      12 2月  14 09:50 E.coli_K12_MG1655.fa.amb
-rw-rw-r-- 1 xuquan xuquan      98 2月  14 09:50 E.coli_K12_MG1655.fa.ann
-rw-rw-r-- 1 xuquan xuquan 4641732 2月  14 09:50 E.coli_K12_MG1655.fa.bwt
-rw-rw-r-- 1 xuquan xuquan      29 2月  14 08:50 E.coli_K12_MG1655.fa.fai
-rw-rw-r-- 1 xuquan xuquan 1160415 2月  14 09:50 E.coli_K12_MG1655.fa.pac
-rw-rw-r-- 1 xuquan xuquan 2320880 2月  14 09:50 E.coli_K12_MG1655.fa.sa
-rw-rw-r-- 1 xuquan xuquan 1379902 2月  13 17:33 GCF_000005845.2_ASM584v2_genomic.fna.gz

唯一需要注意的是.dict文件的名字前缀需要和fasta的一样,并跟它在同一个路径下,这样GATK才能够找到。

D2. 生成中间文件gvcf

(wgs) [xuquan@zhswlcserver:output]$gatk HaplotypeCaller \
> -R /home/xuquan/conda/db/E.coli_K12_MG1655.fa \
> --emit-ref-confidence GVCF \
> -I /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup.bam \
> -O /home/xuquan/conda/output/SRR1770413.g.vcf
Using GATK jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar HaplotypeCaller -R /home/xuquan/conda/db/E.coli_K12_MG1655.fa --emit-ref-confidence GVCF -I /home/xuquan/conda/output/SRR1770413.aln.sorted.markdup.bam -O /home/xuquan/conda/output/SRR1770413.g.vcf
11:01:36.742 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
11:01:38.551 INFO  HaplotypeCaller - ------------------------------------------------------------
11:01:38.553 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.1.0.0
11:01:38.553 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
11:01:38.558 INFO  HaplotypeCaller - Executing as xuquan@zhswlcserver on Linux v3.10.0-862.9.1.el7.x86_64 amd64
11:01:38.558 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_121-b15
11:01:38.559 INFO  HaplotypeCaller - Start Date/Time: 2019年2月14日 上午11时01分36秒
11:01:38.559 INFO  HaplotypeCaller - ------------------------------------------------------------
11:01:38.560 INFO  HaplotypeCaller - ------------------------------------------------------------
11:01:38.561 INFO  HaplotypeCaller - HTSJDK Version: 2.18.2
11:01:38.561 INFO  HaplotypeCaller - Picard Version: 2.18.25
11:01:38.561 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
11:01:38.561 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:01:38.562 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:01:38.562 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:01:38.562 INFO  HaplotypeCaller - Deflater: IntelDeflater
11:01:38.562 INFO  HaplotypeCaller - Inflater: IntelInflater
11:01:38.562 INFO  HaplotypeCaller - GCS max retries/reopens: 20
11:01:38.563 INFO  HaplotypeCaller - Requester pays: disabled
11:01:38.563 INFO  HaplotypeCaller - Initializing engine
11:01:39.032 INFO  HaplotypeCaller - Done initializing engine
11:01:39.035 INFO  HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
11:01:39.043 INFO  HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
11:01:39.043 INFO  HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
11:01:39.063 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
11:01:39.154 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
11:01:39.227 INFO  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
11:01:39.227 INFO  IntelPairHmm - Available threads: 96
11:01:39.227 INFO  IntelPairHmm - Requested threads: 4
11:01:39.228 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
11:01:39.304 INFO  ProgressMeter - Starting traversal
11:01:39.305 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
11:01:49.539 INFO  ProgressMeter -    NC_000913.3:71452              0.2                   240           1407.5
11:01:59.820 INFO  ProgressMeter -   NC_000913.3:173452              0.3                   580           1696.4
11:02:09.959 INFO  ProgressMeter -   NC_000913.3:301627              0.5                  1010           1977.0
11:02:19.968 INFO  ProgressMeter -   NC_000913.3:414855              0.7                  1390           2051.0
11:02:30.260 INFO  ProgressMeter -   NC_000913.3:522855              0.8                  1750           2060.7
11:02:40.405 INFO  ProgressMeter -   NC_000913.3:580475              1.0                  1970           1934.5
11:02:50.708 INFO  ProgressMeter -   NC_000913.3:657006              1.2                  2230           1873.9
11:03:00.741 INFO  ProgressMeter -   NC_000913.3:763305              1.4                  2590           1908.3
11:03:10.920 INFO  ProgressMeter -   NC_000913.3:870993              1.5                  2950           1932.0
11:03:21.062 INFO  ProgressMeter -   NC_000913.3:995835              1.7                  3370           1987.1
11:03:31.155 INFO  ProgressMeter -  NC_000913.3:1118381              1.9                  3780           2027.7
11:03:41.197 INFO  ProgressMeter -  NC_000913.3:1236969              2.0                  4180           2057.6
11:03:51.283 INFO  ProgressMeter -  NC_000913.3:1365797              2.2                  4620           2100.4
11:04:01.482 INFO  ProgressMeter -  NC_000913.3:1453689              2.4                  4920           2076.3
11:04:11.601 INFO  ProgressMeter -  NC_000913.3:1573689              2.5                  5320           2095.9
11:04:21.707 INFO  ProgressMeter -  NC_000913.3:1685369              2.7                  5700           2105.9
11:04:31.750 INFO  ProgressMeter -  NC_000913.3:1814094              2.9                  6130           2132.9
11:04:41.817 INFO  ProgressMeter -  NC_000913.3:1942577              3.0                  6560           2156.6
11:04:51.823 INFO  ProgressMeter -  NC_000913.3:2069666              3.2                  6990           2178.5
11:05:02.047 INFO  ProgressMeter -  NC_000913.3:2222980              3.4                  7510           2222.5
11:05:12.188 INFO  ProgressMeter -  NC_000913.3:2357668              3.5                  7960           2243.5
11:05:22.277 INFO  ProgressMeter -  NC_000913.3:2486668              3.7                  8390           2257.7
11:05:32.309 INFO  ProgressMeter -  NC_000913.3:2622832              3.9                  8850           2278.9
11:05:42.403 INFO  ProgressMeter -  NC_000913.3:2759594              4.1                  9310           2297.8
11:05:52.468 INFO  ProgressMeter -  NC_000913.3:2884593              4.2                  9730           2306.0
11:06:02.609 INFO  ProgressMeter -  NC_000913.3:3012443              4.4                 10160           2315.2
11:06:12.640 INFO  ProgressMeter -  NC_000913.3:3143696              4.6                 10600           2326.8
11:06:22.844 INFO  ProgressMeter -  NC_000913.3:3272395              4.7                 11030           2334.1
11:06:32.947 INFO  ProgressMeter -  NC_000913.3:3400379              4.9                 11460           2341.6
11:06:42.980 INFO  ProgressMeter -  NC_000913.3:3534218              5.1                 11910           2353.2
11:06:53.080 INFO  ProgressMeter -  NC_000913.3:3670538              5.2                 12370           2365.4
11:07:03.219 INFO  ProgressMeter -  NC_000913.3:3798008              5.4                 12800           2371.0
11:07:13.327 INFO  ProgressMeter -  NC_000913.3:3926285              5.6                 13230           2376.5
11:07:23.580 INFO  ProgressMeter -  NC_000913.3:4061835              5.7                 13690           2385.9
11:07:33.741 INFO  ProgressMeter -  NC_000913.3:4193508              5.9                 14130           2392.0
11:07:43.804 INFO  ProgressMeter -  NC_000913.3:4323124              6.1                 14570           2398.4
11:07:53.867 INFO  ProgressMeter -  NC_000913.3:4454885              6.2                 15010           2404.4
11:08:04.074 INFO  ProgressMeter -  NC_000913.3:4584445              6.4                 15450           2409.2
11:08:08.479 INFO  HaplotypeCaller - 32196 read(s) filtered by: ((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter) AND WellformedReadFilter)
  32196 read(s) filtered by: (((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter)
      32196 read(s) filtered by: ((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter)
          32196 read(s) filtered by: (((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter)
              32196 read(s) filtered by: ((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter)
                  12774 read(s) filtered by: (((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter)
                      12774 read(s) filtered by: ((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter)
                          12774 read(s) filtered by: (MappingQualityReadFilter AND MappingQualityAvailableReadFilter)
                              12774 read(s) filtered by: MappingQualityReadFilter
                  19422 read(s) filtered by: NotDuplicateReadFilter

11:08:08.479 INFO  ProgressMeter -  NC_000913.3:4641445              6.5                 15641           2411.4
11:08:08.479 INFO  ProgressMeter - Traversal complete. Processed 15641 total regions in 6.5 minutes.
11:08:08.504 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.013602151000000002
11:08:08.504 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 2.339890145
11:08:08.504 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 4.57 sec
11:08:08.504 INFO  HaplotypeCaller - Shutting down engine
[2019年2月14日 上午11时08分08秒] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 6.53 minutes.
Runtime.totalMemory()=2526019584

(wgs) [xuquan@zhswlcserver:output]$ll
总用量 306932
-rw-rw-r-- 1 xuquan xuquan 300295085 2月  14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan     14576 2月  14 10:42 SRR1770413.aln.sorted.markdup.bam.bai
-rw-rw-r-- 1 xuquan xuquan      3559 2月  14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt
-rw-rw-r-- 1 xuquan xuquan  13935329 2月  14 11:08 SRR1770413.g.vcf
-rw-rw-r-- 1 xuquan xuquan     40656 2月  14 11:08 SRR1770413.g.vcf.idx

D3. 通过gvcf检测变异

(wgs) [xuquan@zhswlcserver:output]$gatk GenotypeGVCFs \

> -R /home/xuquan/conda/db/E.coli_K12_MG1655.fa \
> -V /home/xuquan/conda/output/SRR1770413.g.vcf \
> -O /home/xuquan/conda/output/SRR1770413.vcf
Using GATK jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar GenotypeGVCFs -R /home/xuquan/conda/db/E.coli_K12_MG1655.fa -V /home/xuquan/conda/output/SRR1770413.g.vcf -O /home/xuquan/conda/output/SRR1770413.vcf
11:12:15.631 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/xuquan/conda/softwares/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
11:12:17.502 INFO  GenotypeGVCFs - ------------------------------------------------------------
11:12:17.504 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.1.0.0
11:12:17.504 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
11:12:17.509 INFO  GenotypeGVCFs - Executing as xuquan@zhswlcserver on Linux v3.10.0-862.9.1.el7.x86_64 amd64
11:12:17.509 INFO  GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_121-b15
11:12:17.510 INFO  GenotypeGVCFs - Start Date/Time: 2019年2月14日 上午11时12分15秒
11:12:17.510 INFO  GenotypeGVCFs - ------------------------------------------------------------
11:12:17.510 INFO  GenotypeGVCFs - ------------------------------------------------------------
11:12:17.511 INFO  GenotypeGVCFs - HTSJDK Version: 2.18.2
11:12:17.511 INFO  GenotypeGVCFs - Picard Version: 2.18.25
11:12:17.512 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
11:12:17.512 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:12:17.512 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:12:17.512 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:12:17.512 INFO  GenotypeGVCFs - Deflater: IntelDeflater
11:12:17.513 INFO  GenotypeGVCFs - Inflater: IntelInflater
11:12:17.513 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
11:12:17.513 INFO  GenotypeGVCFs - Requester pays: disabled
11:12:17.513 INFO  GenotypeGVCFs - Initializing engine
11:12:17.935 INFO  FeatureManager - Using codec VCFCodec to read file file:///home/xuquan/conda/output/SRR1770413.g.vcf
11:12:17.982 INFO  GenotypeGVCFs - Done initializing engine
11:12:18.051 INFO  ProgressMeter - Starting traversal
11:12:18.052 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
11:12:19.143 WARN  ReferenceConfidenceVariantContextMerger - Detected invalid annotations: When trying to merge variant contexts at location NC_000913.3:378700 the annotation MLEAC=[2, 0] was not a numerical value and was ignored
11:12:23.758 INFO  ProgressMeter -  NC_000913.3:4628484              0.1                151422        1592518.8
11:12:23.758 INFO  ProgressMeter - Traversal complete. Processed 151422 total variants in 0.1 minutes.
11:12:23.773 INFO  GenotypeGVCFs - Shutting down engine
[2019年2月14日 上午11时12分23秒] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.14 minutes.
Runtime.totalMemory()=2040528896
(wgs) [xuquan@zhswlcserver:output]$ll
总用量 307040
-rw-rw-r-- 1 xuquan xuquan 300295085 2月  14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan     14576 2月  14 10:42 SRR1770413.aln.sorted.markdup.bam.bai
-rw-rw-r-- 1 xuquan xuquan      3559 2月  14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt
-rw-rw-r-- 1 xuquan xuquan  13935329 2月  14 11:08 SRR1770413.g.vcf
-rw-rw-r-- 1 xuquan xuquan     40656 2月  14 11:08 SRR1770413.g.vcf.idx
-rw-rw-r-- 1 xuquan xuquan     89446 2月  14 11:12 SRR1770413.vcf
-rw-rw-r-- 1 xuquan xuquan     18469 2月  14 11:12 SRR1770413.vcf.idx
(wgs) [xuquan@zhswlcserver:output]$ll -lh
总用量 300M
-rw-rw-r-- 1 xuquan xuquan 287M 2月  14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan  15K 2月  14 10:42 SRR1770413.aln.sorted.markdup.bam.bai
-rw-rw-r-- 1 xuquan xuquan 3.5K 2月  14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt
-rw-rw-r-- 1 xuquan xuquan  14M 2月  14 11:08 SRR1770413.g.vcf
-rw-rw-r-- 1 xuquan xuquan  40K 2月  14 11:08 SRR1770413.g.vcf.idx
-rw-rw-r-- 1 xuquan xuquan  88K 2月  14 11:12 SRR1770413.vcf
-rw-rw-r-- 1 xuquan xuquan  19K 2月  14 11:12 SRR1770413.vcf.idx

很快我们就获得了E.coli K12这个样本初步的变异结果:SRR1770413.vcf。之所以非要分成两个步骤,是因为“我”想借此告诉大家,变异检测不是一个样本的事情,有越多的同类样本放在一起 joint calling 结果将会越准确,而如果样本足够多的话,在低测序深度的情况下也同样可以获得完整并且准确的结果,而这样的分步方式是应对多样本的好方法。

D4. 压缩VCF并构建索引

最后,我们用bgzip对这个VCF进行压缩,并用tabix为它构建索引,方便以后的分析。

(wgs) [xuquan@zhswlcserver:output]$bgzip -f /home/xuquan/conda/output/SRR1770413.vcf
(wgs) [xuquan@zhswlcserver:output]$ll
总用量 306972
-rw-rw-r-- 1 xuquan xuquan 300295085 2月  14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan     14576 2月  14 10:42 SRR1770413.aln.sorted.markdup.bam.bai
-rw-rw-r-- 1 xuquan xuquan      3559 2月  14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt
-rw-rw-r-- 1 xuquan xuquan  13935329 2月  14 11:08 SRR1770413.g.vcf
-rw-rw-r-- 1 xuquan xuquan     40656 2月  14 11:08 SRR1770413.g.vcf.idx
-rw-rw-r-- 1 xuquan xuquan     17295 2月  14 11:16 SRR1770413.vcf.gz
-rw-rw-r-- 1 xuquan xuquan     18469 2月  14 11:12 SRR1770413.vcf.idx
(wgs) [xuquan@zhswlcserver:output]$tabix -p vcf /home/xuquan/conda/output/SRR1770413.vcf.gz
(wgs) [xuquan@zhswlcserver:output]$ll
总用量 306976
-rw-rw-r-- 1 xuquan xuquan 300295085 2月  14 10:38 SRR1770413.aln.sorted.markdup.bam
-rw-rw-r-- 1 xuquan xuquan     14576 2月  14 10:42 SRR1770413.aln.sorted.markdup.bam.bai
-rw-rw-r-- 1 xuquan xuquan      3559 2月  14 10:38 SRR1770413.aln.sorted.markdup_metrics.txt
-rw-rw-r-- 1 xuquan xuquan  13935329 2月  14 11:08 SRR1770413.g.vcf
-rw-rw-r-- 1 xuquan xuquan     40656 2月  14 11:08 SRR1770413.g.vcf.idx
-rw-rw-r-- 1 xuquan xuquan     17295 2月  14 11:16 SRR1770413.vcf.gz
-rw-rw-r-- 1 xuquan xuquan       651 2月  14 11:17 SRR1770413.vcf.gz.tbi
-rw-rw-r-- 1 xuquan xuquan     18469 2月  14 11:12 SRR1770413.vcf.idx

至此,整个WGS数据分析就差不多了。由于E.coli基因组与人的基因组数据分析略有不同,后面还要继续通过人的基因组数据分析继续进行学习,并着手开始搭建自己的流程!


参考资料:GATK4.0和全基因组数据分析实践(上)

  • 8
    点赞
  • 29
    收藏
    觉得还不错? 一键收藏
  • 6
    评论
评论 6
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值