写在前面
以下内容均是在课堂中,根据老师讲解与帮助完成
部分文件和内容由老师提供和准备
多种安装方式中,老师已经帮我们提前安装好,这里就不再上机演示了,只演示bam to vcf 【bcftools】的步骤
1. bcftools的安装和使用
- bcftools 是一个用于操作和处理 VCF/BCF 文件的软件工具集
- 之前,是作为samtools的一部分;现在已经需要独立安装了
- 官方manual:
http://samtools.github.io/bcftools/
http://www.htslib.org/doc/bcftools.html
1.1. apt安装
- ubuntu系统和sudo权限
- sudo安装
sudo apt install bcftools
- 如果已经安装了
bcftools
,可以直接运行bcftools call
命令来执行您的操作。
bcftools call
命令用于对VCF格式的文件进行变异位点的调用。可以将需要进行调用的VCF文件作为输入,并将结果输出到标准输出或指定的输出文件中。
例如:
这将对bcftools call -O v -o output.vcf input.vcf
input.vcf
文件进行变异位点的调用,并将结果保存到output.vcf
文件中。
确保在运行bcftools call
命令之前,已经正确安装了bcftools
并且已经设置好了正确的环境变量。
1.2. 源代码编译安装
- 官网给出的编译方式
git clone --recurse-submodules https://github.com/samtools/htslib.git
git clone https://github.com/samtools/bcftools.git
cd bcftools
# The following is optional:
# autoheader && autoconf && ./configure --enable-libgsl --enable-perl-filters
make
-
克隆
htslib
和bcftools
的代码仓库:git clone --recurse-submodules https://github.com/samtools/htslib.git git clone https://github.com/samtools/bcftools.git
这将分别克隆
htslib
和bcftools
的代码仓库到当前目录。 -
进入
bcftools
目录:cd bcftools
在这个目录下,您可以选择执行以下可选步骤:
autoheader && autoconf && ./configure --enable-libgsl --enable-perl-filters
该步骤将生成配置文件,并根据需要启用
libgsl
和perl-filters
功能。如果不需要这些功能,可以跳过该步骤。 -
使用
make
命令编译bcftools
:make
编译成功后,您可以在
bcftools
目录中找到生成的可执行文件。
2. bcftools的使用
2.1. 基础使用
Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
Version: 1.10.2 (using htslib 1.10.2-3ubuntu0.1)
Usage: bcftools [--version|--version-only] [--help] <command> <argument>
Commands:
-- Indexing
index index VCF/BCF files
-- VCF/BCF manipulation
annotate annotate and edit VCF/BCF files
concat concatenate VCF/BCF files from the same set of samples
convert convert VCF/BCF files to different formats and back
isec intersections of VCF/BCF files
merge merge VCF/BCF files files from non-overlapping sample sets
norm left-align and normalize indels
plugin user-defined plugins
query transform VCF/BCF into user-defined formats
reheader modify VCF/BCF header, change sample names
sort sort VCF/BCF file
view VCF/BCF conversion, view, subset and filter VCF/BCF files
-- VCF/BCF analysis
call SNP/indel calling
consensus create consensus sequence by applying VCF variants
cnv HMM CNV calling
csq call variation consequences
filter filter VCF/BCF files using fixed thresholds
gtcheck check sample concordance, detect sample swaps and contamination
mpileup multi-way pileup producing genotype likelihoods
roh identify runs of autozygosity (HMM)
stats produce VCF/BCF stats
Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
automatically even when streaming from a pipe. Indexed VCF and BCF will work
in all situations. Un-indexed VCF and BCF and streams will work in most but
not all situations.
2.2. call、filter、view、stats
2.2.1. call:SNP/indel calling
- 用于对接bcftools mpileup命令
- 替代了以前的“bcftools view"
About: SNP/indel variant calling from VCF/BCF. To be used in conjunction with bcftools mpileup.
This command replaces the former "bcftools view" caller. Some of the original
functionality has been temporarily lost in the process of transition to htslib,
but will be added back on popular demand. The original calling model can be
invoked with the -c option.
Usage: bcftools call [options] <in.vcf.gz>
File format options:
--no-version do not append version and command line to the header
-o, --output <file> write output to a file [standard output]
-O, --output-type <b|u|z|v> output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
--ploidy <assembly>[?] predefined ploidy, 'list' to print available settings, append '?' for details
--ploidy-file <file> space/tab-delimited list of CHROM,FROM,TO,SEX,PLOIDY
-r, --regions <region> restrict to comma-separated list of regions
-R, --regions-file <file> restrict to regions listed in a file
-s, --samples <list> list of samples to include [all samples]
-S, --samples-file <file> PED file or a file with an optional column with sex (see man page for details) [all samples]
-t, --targets <region> similar to -r but streams rather than index-jumps
-T, --targets-file <file> similar to -R but streams rather than index-jumps
--threads <int> use multithreading with <int> worker threads [0]
Input/output options:
-A, --keep-alts keep all possible alternate alleles at variant sites
-f, --format-fields <list> output format fields: GQ,GP (lowercase allowed) []
-F, --prior-freqs <AN,AC> use prior allele frequencies
-G, --group-samples <file|-> group samples by population (file with "sample\tgroup") or "-" for single-sample calling
-g, --gvcf <int>,[...] group non-variant sites into gVCF blocks by minimum per-sample DP
-i, --insert-missed output also sites missed by mpileup but present in -T
-M, --keep-masked-ref keep sites with masked reference allele (REF=N)
-V, --skip-variants <type> skip indels/snps
-v, --variants-only output variant sites only
Consensus/variant calling options:
-c, --consensus-caller the original calling method (conflicts with -m)
-C, --constrain <str> one of: alleles, trio (see manual)
-m, --multiallelic-caller alternative model for multiallelic and rare-variant calling (conflicts with -c)
-n, --novel-rate <float>,[...] likelihood of novel mutation for constrained trio calling, see man page for details [1e-8,1e-9,1e-9]
-p, --pval-threshold <float> variant if P(ref|D)<FLOAT with -c [0.5]
-P, --prior <float> mutation rate (use bigger for greater sensitivity), use with -m [1.1e-3]
Example:
# See also http://samtools.github.io/bcftools/howtos/variant-calling.html
bcftools mpileup -f reference.fa alignments.bam | bcftools call -mv -Ob -o calls.bcf
2.2.2. filter: filter VCF/BCF files using fixed thresholds
- bcftools filter 命令可以用于过滤 SNP 和 Indel
- -i参数指定的条件表达式,参考http://www.htslib.org/doc/bcftools.html#expressions
About: Apply fixed-threshold filters.
Usage: bcftools filter [options] <in.vcf.gz>
Options:
-e, --exclude <expr> exclude sites for which the expression is true (see man page for details)
-g, --SnpGap <int> filter SNPs within <int> base pairs of an indel
-G, --IndelGap <int> filter clusters of indels separated by <int> or fewer base pairs allowing only one to pass
-i, --include <expr> include only sites for which the expression is true (see man page for details
-m, --mode [+x] "+": do not replace but add to existing FILTER; "x": reset filters at sites which pass
--no-version do not append version and command line to the header
-o, --output <file> write output to a file [standard output]
-O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
-r, --regions <region> restrict to comma-separated list of regions
-R, --regions-file <file> restrict to regions listed in a file
-s, --soft-filter <string> annotate FILTER column with <string> or unique filter name ("Filter%d") made up by the program ("+")
-S, --set-GTs <.|0> set genotypes of failed samples to missing (.) or ref (0)
-t, --targets <region> similar to -r but streams rather than index-jumps
-T, --targets-file <file> similar to -R but streams rather than index-jumps
--threads <int> use multithreading with <int> worker threads [0]
2.2.3. view:
- 用于查看VCF文件的内容,包括样本信息、位点信息、注释信息等
About: VCF/BCF conversion, view, subset and filter VCF/BCF files.
Usage: bcftools view [options] <in.vcf.gz> [region1 [...]]
Output options:
-G, --drop-genotypes drop individual genotype information (after subsetting if -s option set)
-h/H, --header-only/--no-header print the header only/suppress the header in VCF output
-l, --compression-level [0-9] compression level: 0 uncompressed, 1 best speed, 9 best compression [-1]
--no-version do not append version and command line to the header
-o, --output-file <file> output file name [stdout]
-O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
-r, --regions <region> restrict to comma-separated list of regions
-R, --regions-file <file> restrict to regions listed in a file
-t, --targets [^]<region> similar to -r but streams rather than index-jumps. Exclude regions with "^" prefix
-T, --targets-file [^]<file> similar to -R but streams rather than index-jumps. Exclude regions with "^" prefix
--threads <int> use multithreading with <int> worker threads [0]
Subset options:
-a, --trim-alt-alleles trim ALT alleles not seen in the genotype fields (or their subset with -s/-S)
-I, --no-update do not (re)calculate INFO fields for the subset (currently INFO/AC and INFO/AN)
-s, --samples [^]<list> comma separated list of samples to include (or exclude with "^" prefix)
-S, --samples-file [^]<file> file of samples to include (or exclude with "^" prefix)
--force-samples only warn about unknown subset samples
Filter options:
-c/C, --min-ac/--max-ac <int>[:<type>] minimum/maximum count for non-reference (nref), 1st alternate (alt1), least frequent
(minor), most frequent (major) or sum of all but most frequent (nonmajor) alleles [nref]
-f, --apply-filters <list> require at least one of the listed FILTER strings (e.g. "PASS,.")
-g, --genotype [^]<hom|het|miss> require one or more hom/het/missing genotype or, if prefixed with "^", exclude sites with hom/het/missing genotypes
-i/e, --include/--exclude <expr> select/exclude sites for which the expression is true (see man page for details)
-k/n, --known/--novel select known/novel sites only (ID is not/is '.')
-m/M, --min-alleles/--max-alleles <int> minimum/maximum number of alleles listed in REF and ALT (e.g. -m2 -M2 for biallelic sites)
-p/P, --phased/--exclude-phased select/exclude sites where all samples are phased
-q/Q, --min-af/--max-af <float>[:<type>] minimum/maximum frequency for non-reference (nref), 1st alternate (alt1), least frequent
(minor), most frequent (major) or sum of all but most frequent (nonmajor) alleles [nref]
-u/U, --uncalled/--exclude-uncalled select/exclude sites without a called genotype
-v/V, --types/--exclude-types <list> select/exclude comma-separated list of variant types: snps,indels,mnps,ref,bnd,other [null]
-x/X, --private/--exclude-private select/exclude sites where the non-reference alleles are exclusive (private) to the subset samples
2.2.4. stats: produce VCF/BCF stats
- bcftools stats 命令可以用于统计 SNP 和 Indel 的信息,如统计文件中 SNP 和 Indel 的数量:
About: Parses VCF or BCF and produces stats which can be plotted using plot-vcfstats.
When two files are given, the program generates separate stats for intersection
and the complements. By default only sites are compared, -s/-S must given to include
also sample columns.
Usage: bcftools stats [options] <A.vcf.gz> [<B.vcf.gz>]
Options:
--af-bins <list> allele frequency bins, a list (0.1,0.5,1) or a file (0.1\n0.5\n1)
--af-tag <string> allele frequency tag to use, by default estimated from AN,AC or GT
-1, --1st-allele-only include only 1st allele at multiallelic sites
-c, --collapse <string> treat as identical records with <snps|indels|both|all|some|none>, see man page for details [none]
-d, --depth <int,int,int> depth distribution: min,max,bin size [0,500,1]
-e, --exclude <expr> exclude sites for which the expression is true (see man page for details)
-E, --exons <file.gz> tab-delimited file with exons for indel frameshifts (chr,from,to; 1-based, inclusive, bgzip compressed)
-f, --apply-filters <list> require at least one of the listed FILTER strings (e.g. "PASS,.")
-F, --fasta-ref <file> faidx indexed reference sequence file to determine INDEL context
-i, --include <expr> select sites for which the expression is true (see man page for details)
-I, --split-by-ID collect stats for sites with ID separately (known vs novel)
-r, --regions <region> restrict to comma-separated list of regions
-R, --regions-file <file> restrict to regions listed in a file
-s, --samples <list> list of samples for sample stats, "-" to include all samples
-S, --samples-file <file> file of samples to include
-t, --targets <region> similar to -r but streams rather than index-jumps
-T, --targets-file <file> similar to -R but streams rather than index-jumps
-u, --user-tstv <TAG[:min:max:n]> collect Ts/Tv stats for any tag using the given binning [0:1:100]
--threads <int> use multithreading with <int> worker threads [0]
-v, --verbose produce verbose per-site and per-sample output
3. bam to vcf 【bcftools】
3.1. Bam2Bcf
3.1.1. 命令示例
- 进入
~/bwa_test
目录cd ~/bwa_test
- 使用
bcftools mpileup
命令生成PCC7942_bwa.bcf文件bcftools mpileup -f GCA_000012525.1_ASM1252v1_genomic.fna PCC7942_bwa.bam.sorted > PCC7942_bwa.bcf
- 使用
less
命令查看该文件的内容less PCC7942_bwa.bcf
上述命令将基于参考基因组文件GCA_000012525.1_ASM1252v1_genomic.fna
和对应的BAM文件PCC7942_bwa.bam.sorted
生成PCC7942_bwa.bcf
文件,并使用less
命令查看该文件的内容。
生成BCF文件的过程可能需要一些时间,具体时间取决于输入文件大小和计算资源。
3.1.2. 上机演示
3.2. bcftolls检测变异位点
3.2.1. 命令示例
- 使用
bcftools
对PCC7942_bwa.bcf
文件进行变异位点的调用,并将结果保存到PCC7942_bwa.variant.bcf
文件中。bcftools call -vm PCC7942_bwa.bcf -o PCC7942_bwa.variant.bcf
- 使用
bcftools view
命令将PCC7942_bwa.variant.bcf
文件中的SNPs和indels过滤出来,并将结果保存到PCC7942_bwa.snps.vcf
文件中。bcftools view -v snps,indels PCC7942_bwa.variant.bcf > PCC7942_bwa.snps.vcf
- 使用
less
命令查看PCC7942_bwa.snps.vcf
文件的内容。less PCC7942_bwa.snps.vcf
生成和处理BCF/VCF文件的过程可能需要一些时间,具体时间取决于输入文件的大小和计算资源的情况。
3.2.2. 上机演示
3.3. 变异位点过滤
- 用filter命令
3.3.1. 命令示例
- 使用
bcftools filter
命令对PCC7942_bwa.snps.vcf
文件进行过滤操作,将满足QUAL大于20且DP大于5的变异位点筛选出来,并将结果保存到PCC7942_bwa.snps.filtered.vcf
文件中。bcftools filter -o PCC7942_bwa.snps.filtered.vcf -i 'QUAL>20 && DP>5' PCC7942_bwa.snps.vcf
- 使用
less
命令查看PCC7942_bwa.snps.filtered.vcf
文件的内容。less PCC7942_bwa.snps.filtered.vcf
过滤操作的时间取决于输入文件的大小和计算资源的情况。
3.3.2. 上机演示
上机演示内容后面补上,需要裁剪图片