实验记录 | somatic.pl运行1

最新推荐文章于 2024-04-20 12:06:02 发布

今天也是个妖精头子呀

最新推荐文章于 2024-04-20 12:06:02 发布

阅读量885

点赞数

分类专栏：谱系追踪

本文链接：https://blog.csdn.net/weixin_40640700/article/details/116739545

版权

谱系追踪专栏收录该内容

82 篇文章 11 订阅

订阅专栏

1。运行somatic.pl

参考链接：
https://github.com/Somatic-pipeline/QBRC-Somatic-Pipeline/tree/master/example/example_dataset/sequencing

输入数据
流程输入文件的格式：fastq文件或者bam文件，或者两者的混合。

测试数据：
https://github.com/Somatic-pipeline/QBRC-Somatic-Pipeline/tree/master/example/example_dataset/sequencing.

输出内容
对于提供的一对正常/肿瘤样本，识别其中的体细胞突变与生殖细胞突变（ somatic and germline mutation ）。
somatic.pl

perl /Path/to/somatic.pl <normal_fastq1> <normal_fastq2/NA> <tumor_fastq1> <tumor_fastq2/NA> </Path/to/output> <disambiguate_pipeline>

这是官网给出的代码实例。

其中perl是系统中的perl指令，用于运行.pl代码文件。因此，其后的somatic.pl就是我们本次要执行的文件。
/path/to/somatic.pl具体指的就是，我们本次要运行的代码所在的位置，然后尝试去调用。
normal_fastq1，tumor_fastq1等输入文件，指的是要输入的正常样本的fastq1与肿瘤样本的fastq（如果是双端测序的话，需要提交fastq2文件，如果是单端测序，第2,4位以NA表示）,也就是说我们如何找突变呢？就是拿突变的样本与正常的样本来比较，从而在其中找到突变位点，没有参照，就没有对与错之分。
注意，正常样本与肿瘤样本的提交位置不要搞混。
thread推荐使用的是32。
build可以是hg19,hg38或者mm10（索引的名称）
index指的是参考基因组的存放路径。
java17指的是java1.7的路径。
output指的是输出文件夹的路径。
pdx物种的名称，可以是 “PDX”，“human” 或 “mouse”
keep_coverage指的是，是否需要保留每一个碱基的覆盖率信息。默认值0是不保留，设置为1则为保留。
disambiguate_pipeline: the directory to disambiguate_pipeline which is for distinguishing human and mouse reads in PDX data.（我不是很理解PDX是什么意思）

可以先用示例数据跑一下这个流程，看看得到的是什么样的结果。

给出的这个示例数据的ID为SRR7246238。
这个数据是一个双端的线粒体DNA数据（由于数据量比较小，所以本次实验就以此为测试集来运行）。

Study: Human lineage tracing enabled by mitochondrial mutations and single cell genomics [TF1_clones_scMito]
PRJNA474190 • SRP149537 • All experiments • All runs
show Abstract
Sample: TF1_G10_29_S285 [TF_clones_scMito]
SAMN09293216 • SRS3364632 • All experiments • All runs
Organism: Homo sapiens
Library:
Instrument: NextSeq 500
Strategy: OTHER
Source: GENOMIC
Selection: other
Layout: PAIRED
Construction protocol: Qiagen TCL lysis buffer Qiagen Mito-RepliG Single cell mitochondrial DNA-seq, Nextera
Experiment attributes:
GEO Accession: GSM3169705
Links:
Runs: 1 run, 210,234 spots, 16M bases, 6.3Mb
Run # of Spots # of Bases Size Published
SRR7246238 210,234 16M 6.3Mb 2019-02-28

不知道这个数据所属的样本是健康人群还是癌症人群？总觉得很难对应上去。我就先把它当作是癌症样本来进行。
所以目前确定好了前7个参数。
perl somatic.pl NA NA example/example_dataset/sequencing/SRR7246238_1.fastq.gz example/example_dataset/sequencing/SRR7246238_2.fastq.gz 32

下面两个参数是关于索引的名称与索引的路径的问题。看到示例的代码，索引的路径下的文件是fa格式。所以应该使用的并不是samtools所建立的索引文件。因此，现在考虑去网站上下载参考基因组。
由于所用的实验数据采样的物种为“人类”，因此，首选hg19作为参考基因组。
（一定要踏踏实实，认认真真的，就当它是自己生命的一种记录。）
mkdir genome/hg19
cd genome/hg19
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
tar zvfx chromFa.tar.gz

注释掉下面两行，暂时不用。
#cat *.fa > hg19.fa
#rm chr*.fa

这个参考基因组还是挺大的，有905MB。
因为我们的数据是线粒体基因组，所以我想尝试使用chrM.fa作为参照，是不是会快一点，这是我的一个有趣的尝试，之前没有这样操作过。
那么参数继续书写。
perl somatic.pl NA NA example/example_dataset/sequencing/SRR7246238_1.fastq.gz example/example_dataset/sequencing/SRR7246238_2.fastq.gz 32 mtDNA /home/zxx/QBRC/geneome/hg19/chrM.fa

下面要写的java1.7的路径。但是我的系统中的java是1.8的，难道不行吗（觉得这种要求很吐血）?
使用whereis指令,查找java的位置。
whereis java

/usr/lib/jvm/jdk1.8.0_181/bin/java

但是我的java版本是1.8.0不知道符不符合要求，我决定先试一试，如果报错再作打算。

继续填写参数。
perl somatic.pl NA NA example/example_dataset/sequencing/SRR7246238_1.fastq.gz example/example_dataset/sequencing/SRR7246238_2.fastq.gz 32 mtDNA /home/zxx/QBRC/geneome/hg19/chrM.fa /usr/lib/jvm/jdk1.8.0_181/bin/java

接着要填写的是结果文件的输出路径。这个比较好办，我创建一个output文件夹。

/home/zxx/QBRC/example/example_dataset/output

接着就是，PDX 为human,keep_coverage的参数为1（保留），最后的disambiguate_pipeline没怎么弄懂，好像也并不是特别需要，所以就默认与示例代码一致,填写disambiguate_pipeline文件夹在我的系统中的绝对路径。

/home/zxx/QBRC/disambiguate_pipeline

所以，综上，我的本次操作的完整代码为：
perl somatic.pl NA NA example/example_dataset/sequencing/SRR7246238_1.fastq.gz example/example_dataset/sequencing/SRR7246238_2.fastq.gz 32 mtDNA /home/zxx/QBRC/geneome/hg19/chrM.fa /usr/lib/jvm/jdk1.8.0_181/bin/java human 1 /home/zxx/QBRC/disambiguate_pipeline

最后，在somatic.pl这个文件所在的路径下运行，上述代码，查看运行的结果是否顺利（肯定不顺利）。

出现了第一个错误：

Can’t locate Parallel/ForkManager.pm in @INC (you may need to install the Parallel::ForkManager module)

提示可能是需要安装Parallel::ForkManager这个模块。
之前咱们安装过模块。有过类似的经验。
现在着手安装模块。

Parallel::ForkManager是perl中的一个模块。

使用cpan尝试进行安装。
参考链接：https://www.codenong.com/16162539/
cpan Parallel::ForkManager

然后，屏幕弹出代码若干，执行下去之后，即可以比较顺利的安装完成该模块。

继续执行原始代码，又再次出现了错误。
我们来认真的梳理以下，这个过程中遇到的错误。

bwa mem -v 1 -t 32 -a -M /home/zxx/QBRC/geneome/hg19/chrM.fa human/tumor/fastq1.fastq human/tumor/fastq2.fastq > human/tumor/alignment.sam
[E::bwa_idx_load_from_disk] fail to locate the index files

出现两次。

Error: Unable to access jarfile /home/zxx/QBRC//somatic_script/picard.jar

mv: cannot stat ‘human/tumor/dupmark.bai’: No such file or directory

human/tumor/rgAdded.bam doesn’t exist!
Error: At least one bam file doesn’t exist!

遇到出错不要着急，要一点点的来解决。

第一个问题，可能就反映到我之前的操作步骤中去了。我做了一个有趣的尝试，但是这个代码并不是特别的灵活（就像是一个规规矩矩的孩子，换一个环境条件，就出现了这样那样的问题）。

因为我们的数据是线粒体基因组，所以我想尝试使用chrM.fa作为参照，是不是会快一点，这是我的一个有趣的尝试，之前没有这样操作过。

因此，我需要继续把所有的.fa格式的文件合并，最后命名为一个总的索引文件。于是，在前面步骤的基础上，执行代码：
cat *.fa > hg19.fa
rm chr*.fa

现在索引文件夹中只剩下了hg19.fa的文件。
修改代码(*将索引部分修改了)，再次运行。

perl somatic.pl NA NA example/example_dataset/sequencing/SRR7246238_1.fastq.gz example/example_dataset/sequencing/SRR7246238_2.fastq.gz 32 hg19 /home/zxx/QBRC/geneome/hg19/hg19.fa /usr/lib/jvm/jdk1.8.0_181/bin/java human 1 /home/zxx/QBRC/disambiguate_pipeline

由于刚刚的报错里面，还存在报错：

Error: Unable to access jarfile /home/zxx/QBRC//somatic_script/picard.jar

我之前又安装过picard，所以猜测直接从官网下载的这个文件夹，picard并未解压运行。
现在有两种解压的办法：
（1）修改作者自己的源代码。
（2）将自己已经安装好的picard放入到该路径下。

目前，采取第二种方法。
查找之前的实验记录：

java -jar build/libs/picard.jar

找到了picard.jar的路径，将其直接复制到somatic_script文件夹下（不知道如此鲁莽的复制，是否需要其他的依赖性的文件）。

上述操作完成之后，重新执行修改后的代码。

嗯，这次代码执行之后，时间长了一会儿，但是还是无情的报错。
没关系，我继续修正。

GATK

ERROR MESSAGE: Could not read file /home/zxx/QBRC/geneome/hg19/hg19.fa_resource/Mills_and_1000G_gold_standard.indels.hg19.vcf because file ‘/home/zxx/QBRC/geneome/hg19/hg19.fa_resource/Mills_and_1000G_gold_standard.indels.hg19.vcf’ does not exist

ERROR MESSAGE: Could not read file /home/zxx/QBRC/geneome/hg19/hg19.fa_resource/dbsnp.hg19.vcf because file ‘/home/zxx/QBRC/geneome/hg19/hg19.fa_resource/dbsnp.hg19.vcf’ does not exist

关于索引的问题

ERROR MESSAGE: Fasta index file /home/zxx/QBRC/geneome/hg19/hg19.fa.fai for reference /home/zxx/QBRC/geneome/hg19/hg19.fa does not exist.
Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.

我觉得自己可以先入手解决索引的问题。
检索网站：http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference

还是要使用samtools构建索引。
samtools faidx hg19.fa

建立索引的过程会非常的长，这次建好索引之后，就把索引文件上传到网盘上。
或者我现在可以省点力气，直接去看是否有可以直接下载的索引文件。
我好想记错了，好像并没有用很长时间。
现在fa.fai结尾的索引文件已经在文件目录下了。
ls

hg19.fa hg19.fa.fai

重新运行代码。

这次虽然也报了其他的错误，关于索引的，又出现了新的问题(.dict结尾的文件是我没见过的)。

ERROR MESSAGE: Fasta dict file /home/zxx/QBRC/geneome/hg19/hg19.dict for reference /home/zxx/QBRC/geneome/hg19/hg19.fa does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.

查找官网，发现.dict结尾的文件，是由gatk-launch这个指令产生的。
具体如下：

gatk-launch CreateSequenceDictionary -R hg19.fa

但是，我的gatk好像运行不了这个指令。究竟是为什么呢？

参考链接：https://www.biostars.org/p/378793/
总结：就是gatk这个指令在新版的gatk（我下载的是4.2.0.0）已经被gatk更换。而原始的教程并没有在这块更进。估计作者也早就忘记有这么一回事儿。所以，轮到我们使用的时候，就存在这样一个信息的断层。这是最抓狂的地方。连更新日志都没有。使用者需要考虑到版本更换的问题。

换用gatk试试看。

gatk CreateSequenceDictionary -R hg19.fa

运行成功了，在文件夹下发现了.dict结尾的文件。
好，现在索引的问题，解决好了。
重新运行代码。
perl somatic.pl NA NA example/example_dataset/sequencing/SRR7246238_1.fastq.gz example/example_dataset/sequencing/SRR7246238_2.fastq.gz 32 hg19 /home/zxx/QBRC/geneome/hg19/hg19.fa /usr/lib/jvm/jdk1.8.0_181/bin/java human 1 /home/zxx/QBRC/disambiguate_pipeline

意料之中，又出了一堆的错误。继续整理（我心态很好的）。

主要出错的地方在于gatk这部分，我就想将我的gatk的内容移动到这里面。

没有任何改变。

那就认真的分析出错的内容吧。

2021年 05月 13日星期四 19:06:47 CST

ERROR MESSAGE: Could not read file /home/zxx/QBRC/geneome/hg19/hg19.fa_resource/Mills_and_1000G_gold_standard.indels.hg19.vcf because file ‘/home/zxx/QBRC/geneome/hg19/hg19.fa_resource/Mills_and_1000G_gold_standard.indels.hg19.vcf’ does not exist

ERROR MESSAGE: Could not read file /home/zxx/QBRC/geneome/hg19/hg19.fa_resource/dbsnp.hg19.vcf because file ‘/home/zxx/QBRC/geneome/hg19/hg19.fa_resource/dbsnp.hg19.vcf’ does not exist

ERROR MESSAGE: Could not read file /home/zxx/QBRC/human/tumor/tumor_bqsr because it does not exist

human/tumor/realigned.bam doesn’t exist!
Error: At least one bam file doesn’t exist!

总结一下，就是在这个过程中，一些过程性的文件缺乏。于是报错。

但是，我们如何去追踪这个过程呢？我好像卡住了。

继续更新。
用关键词“Mills_and_1000G_gold_standard.indels.hg19.vcf”，“dbsnp.hg19.vcf”在搜索引擎中检索。
找到链接：http://www.bio-info-trainee.com/838.html等，
得知这部分的文件在：ftp:broadinstitute.org/bundle/ 这个链接中（大多数教程都是如此）。
但是，使用这个ftp地址打不开。
打不开的主要操作是：无论是使用哪一种浏览器，均显示为"连接超时"的字样。
而在“我的电脑”中打开这个地址，则是类如“您是否有权限打开”的字样。
另外，转换了思路，以“broadinstitute.org/bundle/”为关键词进行检索。
发现链接：https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle
通过这篇文章的介绍，我明白了，正取的做法应该是：

在浏览器/我的电脑中输入以下地址，以匿名用户的方式打开。

ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/

在这里插入图片描述
于是，在以上的地址中下载我们报错的过程中出现的文件即可。

我们需要文件：

文件名	存放地址
Mills_and_1000G_gold_standard.indels.hg19.vcf	/home/zxx/QBRC/geneome/hg19/hg19.fa_resource/
dbsnp.hg19.vcf	/home/zxx/QBRC/geneome/hg19/hg19.fa_resource/

名字有点不太一样。
尝试下载文件
（1）Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
（2）dbsnp_138.hg19.vcf.gz

下载的过程超级慢。目前的下载速度为0……又是什么情况？
下载这个文件的过程并不怎么顺利。
我实在是受不了window上，用链接下载的方式了。
切换用Linux的wget指令在处理。
wget ftp.broadinstitute.org/bundle/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz \ --ftp-user "gsapubftp-anonymous" \ --ftp-password ""

wget ftp.broadinstitute.org/bundle/dbsnp_138.hg19.vcf.gz \ --ftp-user "gsapubftp-anonymous" \ --ftp-password ""

参考链接：https://www.jianshu.com/p/155882b3005a
https://blog.csdn.net/xietansheng/article/details/84669662

Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz.md5
Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx.gz
Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx.gz.md5

dbsnp_138.hg19.vcf.gz
dbsnp_138.hg19.vcf.gz.md5
dbsnp_138.hg19.vcf.idx.gz
dbsnp_138.hg19.vcf.idx.gz.md5

–2021-05-13 22:28:21-- http://ftp.broadinstitute.org/bundle/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
Resolving ftp.broadinstitute.org (ftp.broadinstitute.org)… 69.173.70.223
Connecting to ftp.broadinstitute.org (ftp.broadinstitute.org)|69.173.70.223|:80… failed: Connection timed out.
Retrying.
–2021-05-13 22:30:32-- (try: 2) http://ftp.broadinstitute.org/bundle/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
Connecting to ftp.broadinstitute.org (ftp.broadinstitute.org)|69.173.70.223|:80… failed: Connection timed out.
Retrying.
–2021-05-13 22:32:44-- (try: 3) http://ftp.broadinstitute.org/bundle/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
Connecting to ftp.broadinstitute.org (ftp.broadinstitute.org)|69.173.70.223|:80…

今天暂时停一下。

继续更新，来解决这件事（周五，9:53）。

在知乎上看到一个介绍，说使用一个工具可以下载ftp地址上的文件，最常用的是在window平台。
https://zhuanlan.zhihu.com/p/61377034
https://wiki.filezilla-project.org/Client_Installation
https://themebetter.com/ftp-filezilla.html

于是，我现在打算去window平台上安装它，再次尝试是否能够下载好。这个文件是非常必须的。所以我务必解决。

刚刚一上午的代码丢失了，现在心情崩的不行。
总结一下，我的尝试。

首先下载并使用了FileZilla。
尝试连接服务器，无果。
显示服务器连接错误。
其次，使用我的电脑打开ftp地址。很幸运能够打开。
但是，当我试图将文件复制到本地时，报错。
有一些文件能够下载下来，但是最终的两个挺大的文件，一直下载不了。
而且下载了两次，往往是在接近下载完成的情况下，显示文件错误。

心态崩的不行。

后来，又找到了一个网站，觉得或许会有h19的这个文件，但是后来发现只有hg38的，而我还需要再次创建索引，很是麻烦。

现在心态崩的不行，路在何方。

今天也是个妖精头子呀

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
实验记录 | somatic.pl运行1

1。ORBC mutation pipeline参考链接：https://github.com/Somatic-pipeline/QBRC-Somatic-Pipeline/tree/master/example/example_dataset/sequencing.输入数据流程输入文件的格式：fastq文件或者bam文件，或者两者的混合。测试数据：https://github.com/Somatic-pipeline/QBRC-Somatic-Pipeline/tree/master/
复制链接

扫一扫