WES分析7-VCF

VCF

方式一:samtools mpileup 和bcftools call 流程

ref=/home/lzn/WES/genome/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
wkd=/home/lzn/WES
mkdir $wkd/vcf
for sample in `cat $wkd/sample.list`
do
samtools mpileup -ugf $ref  $wkd/align/$sample.sorted.dedup.recal.bam | bcftools call -vmO v -o $wkd/vcf/$sample.bcf.vcf
done

方式二:GATK的HaplotypeCaller流程

ref=/home/lzn/WES/genome/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
dbsnp=/home/lzn/WES/genome/resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf
wkd=/home/lzn/WES

for sample in `cat $wkd/sample.list`
do
gatk HaplotypeCaller \ #此处不能用HaplotypeCallerSpark
-R $ref \
-I $wkd/align/$sample.sorted.dedup.recal.bam \
#--dbsnp $dbsnp \  java.lang.IllegalArgumentException: HaplotypeCallerSpark does not yet support -D or --dbsnp arguments
-O $wkd/vcf/$sample.gatk.vcf \
1>$sample.log 2>&1
done
********************************************************************************
#10:42:25.938 INFO  HaplotypeCallerSpark - The output of this tool DOES NOT match the output of HaplotypeCaller. 
#10:42:25.938 INFO  HaplotypeCallerSpark - It is under development and should not be used for production work. 
#10:42:25.938 INFO  HaplotypeCallerSpark - For evaluation only.
#10:42:25.938 INFO  HaplotypeCallerSpark - Use the non-spark HaplotypeCaller if you care about the results. 
********************************************************************************

方式三:Mutect2

#GSM2653854	HCC1-Tissue		GSM2653855	HCC3-Tissue
#GSM2746362	Healthy1-Tissue

wkd=/home/lzn/WES
ref=/home/lzn/WES/genome/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
mkdir $wkd/mutect


gatk --java-options "-Xmx16G -Djava.io.tmpdir=./" \
	Mutect2 \
	 -R $ref \
     -I $wkd/align/GSM2653854.sorted.dedup.recal.bam \
     -I $wkd/align/GSM2746362.sorted.dedup.recal.bam\
     -normal GSM2746362 \
     #--panel-of-normals $wkd/mutect/GSM2746362.vcf \  已经分析得到的vcf文件
     --af-of-alleles-not-in-resource 1e-6 \
     -O $wkd/mutect/GSM2653854.vcf \
     1>$wkd/mutect/log.txt 2>&1

#Starting with v4.0.4.0, GATK recommends the default setting of --af-of-alleles-not-in-resource, which the tool dynamically adjusts for different modes. tumor-only calling sets the default to 5e-8, tumor-normal calling sets it to 1e-6 and mitochondrial mode sets it to 4e-3. For previous versions, the default was 0.001, the average heterozygosity of humans. For other organisms, change --af-of-alleles-not-in-resource to 1/(ploidy*samples in resource

gatk GetPileupSummaries \
-I $wkd/align/GSM2653854.sorted.dedup.recal.bam \
-V $wkd/mutect/GSM2653854.vcf \
-L $wkd/genome/resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-O $wkd/mutect/GSM2653854.pileups.table

#A USER ERROR has occurred: Bad input: Population vcf does not have an allele frequency (AF) info field in its header.
# #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
 chr6	29942512	.	G	C	2974860	VQSRTrancheSNP99.80to99.90	AF=0.063
 chr6	29942517	.	C	A	2975860	VQSRTrancheSNP99.80to99.90	AF=0.062
 chr6	29942525	.	G	C	2975600	VQSRTrancheSNP99.60to99.80	AF=0.063
 chr6	29942547	rs114945359	G	C	15667700	PASS	AF=0.077
#生成符合格式的vcf文件
less -Sn resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf.gz |grep -v ^## | sed 's/;/\t/g' | awk '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7 "\t" $9}' > resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.biallelic.vcf

gatk GetPileupSummaries \
-I $wkd/align/GSM2653854.sorted.dedup.recal.bam \
-V $wkd/mutect/GSM2653854.vcf \
-L $wkd/genome/resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.biallelic.vcf \
-O $wkd/mutect/GSM2653854.pileups.table
#A USER ERROR has occurred: Couldn't read file /home/lzn/WES/genome/resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.biallelic.vcf. Error was: The file /home/lzn/WES/genome/resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.biallelic.vcf exists, but does not contain Features (ie., is not in a supported Feature file format such as vcf, bcf, bed, or interval_list), and does not have one of the supported interval file extensions 
head resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.biallelic.vcf |less -Sn
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  
chr1    51479   rs116400033     T       A       11726.81        PASS    AF=0.3253
chr1    55367   .       G       A       207.20  PASS    AF=0.00117
chr1    55388   .       C       T       95.61   PASS    AF=0.00056
#header缺少INFO
#修改后依然报错

zcat resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf.gz|grep ^\#\# >tmp.txt
cat tmp.txt resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.biallelic.vcf >>tmp2.vcf

#TribbleException$MalformedFeatureFile: Error parsing LineIteratorImpl(SynchronousLineReader) just after record at: chr1:173904011-173904011, for input source: /home/lzn/WES/genome/tmp2.vcf

#先跳过这部分:Calculate Contamination
#Tools involved: GetPileupSummaries, CalculateContamination

HaplotypeCaller is designed to call germline variants, while Mutect2 is designed to call somatic variants.Neither is appropriate for the other use case.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
### 回答1: 首先,我们需要用Python代码来实现WES生物信息分析的流程: 1. 从原始的WES数据中提取基因序列:import pandas as pd# Read in WES data data = pd.read_csv("wes_data.csv")# Extract gene sequences gene_seqs = data['gene_sequences']2. 用基因序列进行特征提取:# Extract features from gene sequences features = extract_features(gene_seqs)3. 对特征进行聚类分析:# Perform clustering on extracted features clusters = cluster(features)4. 对聚类结果进行可视化:# Visualize clustering results visualize(clusters) ### 回答2: 搭建一个 WES(全外显子组测序)生物信息分析流程主要包括以下几个步骤:数据预处理、比对、变异检测与注释。 1. 数据预处理: 在这一步骤中,需要对原始的 WES 测序数据进行质量控制和去除低质量的序列。可以使用软件如FastQC进行质量评估,然后使用Trimmomatic、Cutadapt等工具进行质量修剪和去除接头序列。 2. 比对: 将预处理后的测序数据与参考基因组进行比对。常用的比对工具有Bowtie2、BWA、STAR等。可以通过构建索引、对测序数据进行比对,得到比对结果的sam/bam文件。 3. 变异检测: 基于比对结果,进行变异检测。可以使用GATK、FreeBayes等工具进行单核苷酸变异(SNV)和小片段插入/缺失(indels)的检测。该过程需要考虑到测序数据的深度、去除假阳性突变等技术细节。 4. 注释: 对检测到的变异进行进一步的注释。常用的工具有VariantEffectPredictor(VEP)、ANNOVAR等。这些工具可以提供SNV和indels的功能注释,包括寻找变异的功能和表达对它们可能的影响。 下面给出一个简单的示例代码(使用Python和一些开源工具): ```python # 数据预处理 QualityControl(input_fastq, output_fastq) # 比对 alignment_tool = "bowtie2" # 比对工具 reference_genome = "hg38.fa" # 参考基因组 Alignment(input_fastq, alignment_tool, reference_genome, output_sam) # 变异检测 variant_caller = "GATK" # 变异检测工具 VariantCalling(output_sam, variant_caller, output_vcf) # 注释 annotation_tool = "VEP" # 注释工具 VEP_Annotation(output_vcf, annotation_tool, output_annotated_vcf) ``` 需要根据具体的需求和数据类型进行参数配置和工具选择,并且需要确保正确安装和配置相应的软件和依赖。以上示例仅供参考,实际过程中还需要考虑其他因素,如样本质量控制、批次效应等。 ### 回答3: 要搭建一个WES(全外显子测序)生物信息分析流程,可以采用以下步骤,并提供相应代码示例: 1. 数据质量控制(Quality Control, QC):使用FastQC对原始测序数据进行质量评估,检查测序质量、碱基分布、测序错误等。 ```bash fastqc input.fq.gz -o output_directory ``` 2. 数据预处理(Preprocessing):使用Trimmomatic去除低质量的测序数据和接头序列。 ```bash java -jar trimmomatic.jar PE -phred33 input_1.fq.gz input_2.fq.gz output_1.fq.gz output_2.fq.gz ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 ``` 3. 序列比对(Alignment):使用Burrows-Wheeler Aligner (BWA)对预处理后的测序数据进行比对。 ```bash bwa mem reference.fa input_1.fq.gz input_2.fq.gz > aligned.sam ``` 4. 去除PCR重复序列(PCR Duplicate Removal):使用Picard tools去除PCR产生的重复序列。 ```bash java -jar picard.jar MarkDuplicates I=aligned.sam O=deduplicated.bam M=deduplication_metrics.txt ``` 5. 变异位点检测(Variant Calling):使用GATK (Genome Analysis Toolkit)进行变异位点的检测。 ```bash java -jar gatk.jar HaplotypeCaller -R reference.fa -I deduplicated.bam -O variants.vcf ``` 6. 变异位点筛选(Variant Filtering):使用bcftools对变异位点进行过滤。 ```bash bcftools filter -i 'QUAL > 30' variants.vcf > filtered_variants.vcf ``` 以上是一个简单的WES生物信息分析流程的步骤和代码示例,实际分析流程可能还涉及到其他步骤和工具,具体根据研究目的和数据情况进行调整。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值