生物信息linux编程,通过简单数据熟悉Linux下生物信息学各种操作4

原地址

几点说明

1.非简单翻译,所有代码均可运行,为了辅助理解,基本每步代码都有结果,需要比较的进行了整合

2.原文中的软件都下载最新版本

3.原文中有少量代码是错误的,这里进行了修正

4.对于需要的一些知识背景,在这里进行了注释或链接到他人博客

20 Pileup和Coverage

上面已经得到的mutation files

需要用到前面的内容

20.1 安装bcftools suite以便处理vcf文件(variant call formats)

cd src

curl -OL http://sourceforge.net/projects/samtools/files/samtools/1.9/bcftools-1.9.tar.bz2

tar jxvf bcftools-1.9.tar.bz2

cd bcftools-1.9

make

ln -s ~/src/bcftools-1.9/bcftools ~/bin

每个run的平均coverage是什么

默认,samtools depth只输出非0 coverage的区域

samtools depth bwa.bam | awk '{ sum += $3 } END { print sum/NR } '

73.8133

以下内容部分和前面重复

20.2 计算genome大小

samtools view -h bow.bam |head| samtools view -h bow.bam |head|grep @SQ

@SQ SN:NC_002549 LN:18959

samtools depth bwa.bam | awk '{ sum += $3 } END { print sum/18959 } '

结果仍然是

73.7938

20.3 分别看有何无参考基因组的pileups

前已述及,不再展示图

samtools mpileup -Q 0 bwa.bam | more

samtools mpileup -f ~/refs/852/NC.fa -Q 0 bwa.bam | more

20.4samtools tview(samtools文本基因组浏览器)

samtools tview bwa.bam

1 11 21 31 41 51 61 71

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

CACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC

CACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTG

CACAAAAAGAAAGAAGAATTTTTAGGATCTTTAGTGTGCGAATAACTATGAGGAATATTAATAATTTACCTCTC

AAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC

AAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC

AAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC

GAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAAGTATGAGGAAGATTAATAATTTTCCTCTC

AGAAGAATTTTTAGGATCTTTTGTGTTCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC

AAGAATTTTTAGGATCTTTTGTGTGGGAATAACTATGAGGAAGATTCAAAATTTTCCTCTC

AGAATTTTTAGTATCTTTTGAGTGCGACTAACTATGAGGAAGATTAATAATTTTCCTCTC

AATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCAC

AATTTTTAGGATCTTTTGTGTGCGAATCACTATGAGGAAGATTAATAATTTTCCTCTC

ATTATTAGGATCTTTTGTGTGCGAATAACTATGAGGACGATTAATAATTTTCCTCTC

TTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATACTTTTCCTCTC

TTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC

TTAGGATCTTTTGTGTGCGAATAACTATGATGAAGATTAATAATTTTCCTCTC

AGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC

TATTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC

TTTTGTGTGCGAATAAGTATGAGGAAGATTAATAATTTTCCTCTC

TTTGTGTGCGAATAACTATGAGGAAGATTAATCATTTTCCTCTC

ATGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC

GTGCGAATAAGTATGCGGCAGATTAATAATTTTCCTCTC

TAACTATGAGGAAGATTAATAATTTTCCTCTC

TAACTATGAGGAAGATTAATAATTTTCCTCTC

CTATGAGGAAGATTAATAATTTTGCTCTC

ATGAGGAAGATTAATAATTTTCCTCTC

TGCGGAAGATTAATAATTTTCCTCTC

AGGAAGATAAATAATTTTCCTCTC

AGATTAATAATTTTCCTCTC

AGATTAATAATTTTCCTCTC

TTAATAATTTTCCTCTC

ATTTTCCTCTC

TTTTCCTCTC

TTTTCCTCTC

TTTCCTCTC

TCTC

CTG

C

20.5 对整个genome生成vcf

samtools mpileup -Q 0 -f ~/refs/852/NC.fa -uv bwa.bam | more

部分结果如下

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT liver

NC_002549 5 . C 0 . DP=1;I16=1,0,0,0,17,289,0,0,60,3600,0,0,0,0,0,0;QS=1,0;MQ0F=0 PL 0,3,17

NC_002549 6 . A 0 . DP=1;I16=1,0,0,0,17,289,0,0,60,3600,0,0,1,1,0,0;QS=1,0;MQ0F=0 PL 0,3,17

NC_002549 7 . C 0 . DP=2;I16=2,0,0,0,34,578,0,0,120,7200,0,0,2,4,0,0;QS=1,0;MQ0F=0 PL 0,6,31

NC_002549 8 . A 0 . DP=2;I16=2,0,0,0,34,578,0,0,120,7200,0,0,4,10,0,0;QS=1,0;MQ0F=0 PL 0,6,31

NC_002549 9 . C 0 . DP=2;I16=2,0,0,0,34,578,0,0,120,7200,0,0,6,20,0,0;QS=1,0;MQ0F=0 PL 0,6,31

NC_002549 10 . A 0 . DP=2;I16=2,0,0,0,34,578,0,0,120,7200,0,0,8,34,0,0;QS=1,0;MQ0F=0 PL 0,6,31

NC_002549 11 . A 0 . DP=3;I16=3,0,0,0,51,867,0,0,180,10800,0,0,10,52,0,0;QS=1,0;MQ0F=0 PL 0,9,42

NC_002549 12 . A 0 . DP=4;I16=4,0,0,0,68,1156,0,0,240,14400,0,0,13,75,0,0;QS=1,0;MQ0F=0 PL 0,12,50

NC_002549 13 . A 0 . DP=5;I16=5,0,0,0,85,1445,0,0,300,18000,0,0,17,105,0,0;QS=1,0;MQ0F=0 PL 0,15,57

NC_002549 14 . A 0 . DP=5;I16=5,0,0,0,85,1445,0,0,300,18000,0,0,22,144,0,0;QS=1,0;MQ0F=0 PL 0,15,57

NC_002549 15 . G 0 . DP=6;I16=6,0,0,0,102,1734,0,0,360,21600,0,0,27,193,0,0;QS=1,0;MQ0F=0 PL 0,18,62

NC_002549 16 . A 0 . DP=6;I16=6,0,0,0,102,1734,0,0,360,21600,0,0,33,253,0,0;QS=1,0;MQ0F=0 PL 0,18,62

NC_002549 17 . A 0 . DP=6;I16=6,0,0,0,102,1734,0,0,360,21600,0,0,39,325,0,0;QS=1,0;MQ0F=0 PL 0,18,62

NC_002549 18 . A 0 . DP=7;I16=7,0,0,0,119,2023,0,0,420,25200,0,0,45,409,0,0;QS=1,0;MQ0F=0 PL 0,21,66

NC_002549 19 . G 0 . DP=7;I16=7,0,0,0,119,2023,0,0,420,25200,0,0,52,506,0,0;QS=1,0;MQ0F=0 PL 0,21,66

NC_002549 20 . A 0 . DP=8;I16=8,0,0,0,136,2312,0,0,480,28800,0,0,59,617,0,0;QS=1,0;MQ0F=0 PL 0,24,69

NC_002549 21 . A 0 . DP=9;I16=9,0,0,0,153,2601,0,0,540,32400,0,0,67,743,0,0;QS=1,0;MQ0F=0 PL 0,27,72

20.6 生成genotypes然后call variants

samtools mpileup -Q 0 -ugf ~/refs/852/NC.fa bwa.bam | bcftools call -vc > samtools.vcf

查看snp calls,知道snp calling的原理

这里原作者写了一个python脚本名字为snpcaller.py.但我还没找到

代码先放这里,但不影响后续分析。做个

记号

cat bwa.pileup | python snpcaller.py > diy.vcf

cat samtools.vcf |grep -v "##"|cut -f 1-5|head

#CHROM POS ID REF ALT

NC_002549 425 . G T,A

NC_002549 507 . G T

NC_002549 2846 . a aT

NC_002549 2847 . g gT

NC_002549 7894 . A C

NC_002549 9569 . G A

NC_002549 12101 . T A

NC_002549 12957 . G A

NC_002549 14086 . A G

20.7安装freebays以call variants

首先安装Freebays,mac需要另外一个cmake才可以对freebays进行编译,所以先安装cmake

brew install cmakebrew install cmake

cd ~/src

git clone --recursive git://github.com/ekg/freebayes.git

cd freebayes

make

注意原文中说,这个地方有个bug,make第一次会报错,然后再make一次就OK。我这里还是显示如下错误

fatal error: 'lzma.h' file not found

通过如下方式解决

brew install xz

做链接

ln -sf ~/src/freebayes/bin/freebayes ~/bin

20.8用安装的FreeBayes call snps

freebayes -f ~/refs/852/NC.fa bwa.bam > freebayes.vcf

对结果进行可视化。关于vcf格式请参考这篇文章VCF格式

Lecture 22变异效应预测

安装bioawk

cd ~/src

git clone https://github.com/lh3/bioawk.git

cd bioawk

make

ln -sf ~/src/bioawk/bioawk ~/bincd ~/src

git clone https://github.com/lh3/bioawk.git

cd bioawk

make

ln -sf ~/src/bioawk/bioawk ~/bin

bioawk的manual在这里

man ~/src/bioawk/awk.1

bioawk能辨识生物信息学类型。

cat r1.fq | bioawk -c fastx ' { print $seq } ' | head -1

cat mutations.gff | bioawk -c gff ' { print $seqname, $start } ' | head -1

cat mutations.gff | bioawk -c gff ' { print $seqname, $end-$start+1 } ' | head -1

下载snpEff,注意原文中下载地址有错

curl -OL https://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip/download

可以向trimmomatic和readseq一样设置snpEff,现在用alias的方式设置

alias snpEff='java -jar ~/src/snpEff/snpEff.jar'

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值