实验记录 | 6/30-7/4

今天也是个妖精头子呀

于 2021-07-05 09:31:18 发布

阅读量554

点赞数

分类专栏：谱系追踪

本文链接：https://blog.csdn.net/weixin_40640700/article/details/118356351

版权

谱系追踪专栏收录该内容

82 篇文章 12 订阅

订阅专栏

参考链接：https://gatk.broadinstitute.org/hc/en-us/articles/360056969692-Mutect2
参照官网得mutect2的示例代码，运行（随意）：

 ./gatk Mutect2 -R /home/xxzhang/workplace/QBRC/geneome/hg19/hg19.fa -I /home/xxzhang/workplace/output/tumor.bam -O mutect.vcf

出现错误：

Error: Invalid or corrupt jarfile /home/xxzhang/workplace/QBRC/somatic_script/gatk-package-4.2.0.0-local.jar

后来，通过conda安装完成。
现在尝试将mutect1转换为mutect2。在代码文件中的指令为：

/home/xxzhang/workplace/software/java/jdk1.7.0_80/bin/java -Djava.io.tmpdir=./output_RNA/mutmp -Xmx31g -jar /home/xxzhang/workplace/QBRC//somatic_script/mutect-1.1.7.jar --analysis_type MuTect --reference_sequence ./geneome/hg19/hg19.fa --dbsnp ./geneome/hg19/hg19.fa_resource/dbsnp.hg19.vcf --cosmic ./geneome/hg19/hg19.fa_resource/CosmicCodingMuts.hg19.vcf  --input_file:tumor /home/xxzhang/workplace/QBRC/output_RNA/tumor/tumor.bam --input_file:normal /home/xxzhang/workplace/QBRC/output_RNA/normal/normal.bam --vcf ./output_RNA/mutect.vcf --out ./output_RNA/mutect.out

而在代码文件中，相对应的指令为，

system_call($java17." -Djava.io.tmpdir=".$mutect_tmp." -Xmx31g -jar ".$mutect." --analysis_type MuTect --reference_sequence ".$index.
  " --dbsnp ".$resource_dbsnp.$known." --input_file:tumor ".$tumor_bam." --input_file:normal ".$normal_bam.
  " --vcf ".$output."/mutect.vcf --out ".$output."/mutect.out");

对应的代码文件是这个样子的，看来要大改。
我们看一下gatk关于mutect2的指令有哪些？
参考链接：https://gatk.broadinstitute.org/hc/en-us/articles/360056969692-Mutect2

gatk Mutect2 -R ./geneome/hg19/hg19.fa -I /home/xxzhang/workplace/QBRC/output_RNA/tumor/tumor.bam -I  /home/xxzhang/workplace/QBRC/output_RNA/normal/normal.bam 
-O ./output_RNA/mutect.vcf

所以，对应到我们的perl代码应该是：

system_call("gatk Mutect2  -Djava.io.tmpdir=".$mutect_tmp."-R	".$index.
" -I	".$tumor_bam." -I	".$normal_bam.
"-O	".$output."/mutect.vcf ");

如上修改完成之后，看一下我们这个轮子是否可以正常的运转。
找一下之前调用的那个代码。

今天一天的效率特别高，主要解决了这几天一直觉得窝心的（觉得搞不定的）两件事！
（1）将mutect2成功移植到了somatic.pl这个指令当中。
（2）修改作者的批处理的指令，换用自己的理解方式，将其写入（虽然批处理的命令的提交系统依旧没怎么弄明白）。

#!/usr/bin/perl
use strict;
use warnings;
use Cwd 'abs_path';
my ($jobs,$example,$thread,$build,$index,$java17,$keep_coverage,$n,$disambiguate)=@ARGV;  #导入参数，注意对应关系
my ($line,$line1,@linelist,@items,$i,$job); #在这个perl文档中自定义的一些参数
my $path=abs_path($0); #获取当前命令行程序所在的路径，并将其作为$path的值
$path=~s/job_somatic\.pl//; #去除job_somatic.pl，得到当前工作路径的文件夹
print($path);
open(JOB,$jobs) or die "Cannot find the design file!\n"; #$jobs应该与job_somatic.pl这个文件在同一个路径下
@linelist=<JOB>;
$job="somatic_calling_".$n.".sh"; #作者开头自定义的一个变量$job;
#system("mkdir ".$job);
system_call("touch ".$job);
open(SCRIPT,">",$job) or die "Cannot write to the shell submission script!\n"; #打开并写入文件，只是为什么创建不了呢？
open(HEADER,$example) or die "Cannot find the example shell script!\n"; #打开我们的批处理的指令文件

while ($line1=<HEADER>) #整行的读取 #这个好像是全部读取
{
          if ($line1=~/JOBSTART/)
          {last;} #不是很理解这个last的意思 last的意思是退出循环语句块
          #print $line1;
          print SCRIPT $line1; #将他的每一行都输入到somatic_calling这个文件中
}
close(HEADER); #关闭这个文件，这个文件的革命任务完成
$i=1;
foreach $line(@linelist){

        @items=split("\t",$line); #回到之前，我们$jobs打开还没有，把他关上，这里继续处理它
        print SCRIPT "perl\t".$path."/somatic.pl\t".$items[0]."\t".$items[1]."\t".$items[2]."\t".$items[3]."\t".$thread."\t".$build."\t".$index."\t".$java17."\t".$items[4]."".$i."/\t".$items[5]."\t".$keep_coverage."\t".$disambiguate."\n";
        $i=$i+1;
                        }

close(SCRIPT);

#print("the next");
#system_call("qsub ".$job);
#unlink($job);
close(JOB);

#perl /home2/twang6/software/cancer/somatic/job_somatic.pl \
#design.txt \
#/project/bioinformatics/Xiao_lab/shared/neoantigen/code/somatic/example/example.sh \
#32 hg38 \
#/project/shared/xiao_wang/data/hg38/hs38d1.fa \
#/cm/shared/apps/java/oracle/jdk1.7.0_51/bin/java \
#0 2
#/project/shared/xiao_wang/software/disambiguate_pipeline

sub system_call
{
  my $command=$_[0];
  print "\n\n".$command."\n\n";
  system($command);
}

现在程序在运转了，只不过出现了莫名原因的错误。但是我觉得问题不在于我的批处理的指令，而是我的somatic.ol文件，或者是我最近有更改了系统的配置（鼓捣安装mutect）。假如各方面都成功的，接下来要配置hg38的环境，不知道hg38环境的配置。除了索引文件之外。还需要什么（我想今晚就正确的把他挂上去，但是好像还是有些不太现实）？
主要出现的错误是：

Exception in thread “main” java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.(String.java:207)
at java.io.BufferedReader.readLine(BufferedReader.java:356)
at java.io.LineNumberReader.readLine(LineNumberReader.java:201)
at htsjdk.samtools.util.BufferedLineReader.readLine(BufferedLineReader.java:101)
at htsjdk.samtools.SAMTextReader.advanceLine(SAMTextReader.java:221)
at htsjdk.samtools.SAMTextReader.access $800 (S A M T e x t R e a d e r . j a v a : 37) a t h t s j d k . s a m t o o l s . S A M T e x t R e a d e r$ RecordIterator.next(SAMTextReader.java:257)
at htsjdk.samtools.SAMTextReader $R e c o r d I t e r a t o r . n e x t (S A M T e x t R e a d e r . j a v a : 228) a t h t s j d k . s a m t o o l s . S a m R e a d e r$ AssertingIterator.next(SamReader.java:591)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:570)
at picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:182)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

由于Picard这一步的异常，因此无法产生下游的文件，导致整个程序运行中断。
查看这个问题的解决方法：
参考链接：https://blog.csdn.net/jxzxm1_2/article/details/2499751?utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-7.control&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-7.control

如果你用unix/linux：
/tomcat/bin/catalina.sh 加上下面的命令:
JAVA_OPTS="-Xms32m -Xmx256m"

我终于明白之前在运行mutect的时候，我总是不明白类如-Xmx1g是什么意思，原来是提前声明内存的意思。

今天早上到实验室的时候，发现自己昨天的批处理的指令，全部的跑出了结果。这是最值得自己欣慰的一件事情了。虽然，有时候会觉得很郁闷，很无力，但是看着它每天都能够有一点点的进步。虽然很慢，但是往前走，有在前进，对于我来说就足够了，就是最大的鼓舞了。
今天的任务就是把那个数据挂到服务器上。没有其他的要求。在挂之前，要把环境配置成hg38的环境，确保其能够稳定的运行。
刚刚，解决了网路的问题，主要表现为：
（1）能够登得上微信，但是不能上浏览器。
后来，追溯，发现是我最近有使用一些软件。所以，软件在配置的时候，修改了浏览器的代理服务器。修改浏览器的代理设置即可。

现在接着继续解决问题。
使用hg38的参考基因组，查看还缺什么？

perl somatic.pl  NA  NA  RNA:./data/example_set/SRR5297065.fastq.gz  NA  32 hg38  ./geneome/hg38/hg38.fa  /home/xxzhang/workplace/software/java/jdk1.7.0_80/bin/java  ./output_job0  human 1 ./disambiguate_pipeline >pipeline_job_0.txt

这行代码提交的时候，使用的是，tumor_only的这种模式，如果这个过程中，出现什么问题，就可以及时的检测到。关于缺少什么文件，或者存在什么错误，我都可以及时的修改出来。

出现了一个原则性的问题：

EXITING because of INPUT ERROR: could not open genomeFastaFile: ./geneome/hg38/hg38.fa

确实如此，我的基因组的文件是,hg38.fasta。再次修改。

perl somatic.pl  NA  NA  RNA:./data/example_set/SRR5297065.fastq.gz  NA  32 hg38  ./geneome/hg38/hg38.fasta  /home/xxzhang/workplace/software/java/jdk1.7.0_80/bin/java  ./output_job0  human 1 ./disambiguate_pipeline >pipeline_job_0.txt

查看运行的结果。
出现错误:

pass1SJ.out.tab 这个文件不存在，或者程序对其无使用权限。

这个问题，我们之前有遇到过。主要有两条解决的策略，其一是查看是否有这个文件，其二就是修改文件的权限。
查看结果是没有，也就是说，索引生成的并不正确。所以，为什么呢？后来，看了一下，alignment.bam文件生成就不正确。pass1SJ.out.tab这个文件本身就没有生成。
后来，运行发现主要原因是在运行STAR的时候，程序被中途killed。

STAR version: 2.7.9a compiled: 2021-05-04T09:43:56-0400 vega:/home/dobin/data/STAR/STARcode/STAR.master/source
Jul 02 14:57:35 … started STAR run
Jul 02 14:57:35 … loading genome
Jul 02 15:01:04 … started mapping
Killed

现在，遇到的问题就是说，这个程序运行着运行着，会被服务器给killed。但是呢，我之前用这个都是没有问题的，具体的原因是什么呢？
（1）监控这个程序的运行情况，看是否会出现，内存超出限制的情况（用刘兴义的方法）
（2）去其他free节点上看，是否可以继续运行。
（3）检查自己的参数是否用对。

今天（12：07），又去试了一下，能够接着往下运行。猜想可能是，昨天大家都在处理，所以，整体上节点会有被占满。
现在的想法，就是今天，一定要把K562的这个数据挂上去。
（1）首先，需要检查hg38的这个环境是否配置成功。
（2）其次，需要检查单端的tumor only的运行模式，是否能够运行成功。

===> 如果，这两部的测试都是正确的，那么接下来，我会将这个数据完整的挂上去。

而且，我这个是单端的数据，为什么依旧拆分为了两个fastq文件呢？对于单端的数据是这样转换的。

# create R2 if single end sequencing data are provided
if (${$fastq{$type}}[1] eq "NA")
{
  system_call("perl ".$path."somatic_script/reverse_complement.pl ".${$fastq{$type}}[0]." ".
    $type_output."/R2.fastq.gz");
  ${$fastq{$type}}[1]=$type_output."/R2.fastq.gz";
}

在这里，我们要做的事情就很简单，把那个design.txt文件，设置好即可。
由于这是一个大批量的重复劳作，打算用R来直接的生成这个表格，主要使用到的方法就是字符串的paste，已经生成文件。
首先读取，ls.log这个文件。我将其写成了一个函数：

generate_txt<-function(arg1,arg2,arg3,arg4,arg5){
  
  data<-as.matrix(read.table(arg1)) #输入文件
  row<-dim(data)[1]
  col<-dim(data)[2]
  stat<-arg2 #"/home/xxzhang/workplace/QBRC/data/CML/K562/"
  ax<-arg3 #"_1.fastq.gz"
  dataline<-paste(data,stat,sep = "")
  result<-c()
  #还是需要通过函数的形式来解决这件事
  for (i in 1:dim(data)[1]){
    dataline<-c()
    dataline<-paste(stat,data[i][1],ax,sep = "")
    result<-append(result,dataline)
    
  }
  #result的结果就是我们所需要的，最终还是应该使用循环来处理这件事情。
  #这个最终可以写成函数
  NA_NA<-rep("NA",row)
  work<-rep(arg4,row) #要保存的地方
  human<-rep("human",row)
  
  output<-cbind(NA_NA,NA_NA,result,NA_NA,work,human)
  
  write.table(output,arg5,quote = F,col.names = F,row.names = F,sep="\t") #保存的文件名
  
}

经过检验是可以比较顺利的执行，产生文本文件。

最后，我们将数据批量的放到服务器中。

正式的开始运行批处理的指令：
（1）首先生成一个somatic_calling_n.sh的文件。
（2）然后执行bash somatic_calling_n.sh这个文件。

perl job_somatic.pl K562.txt job_snv.pbs 32 hg38 /home/xxzhang/workplace/QBRC/geneome/hg38/hg38.fasta  /home/xxzhang/miniconda3/bin/java  1  91  ./disambiguate_pipeline

很快速的生成了文件：somatic_calling_91.sh。
这个文件的主要内容是：

perl /home/xxzhang/workplace/QBRC//somatic.pl NA NA /home/xxzhang/workplace/QBRC/data/CML/K562/SRR5289948_1.fastq.gz NA 32 hg38 /home/xxzhang/workplace/QBRC/geneome/hg38/hg38.fasta /home/xxzhang/miniconda3/bin/java /home/xxzhang/workplace/QBRC/output_job/89/ human
1 ./disambiguate_pipeline
perl /home/xxzhang/workplace/QBRC//somatic.pl NA NA /home/xxzhang/workplace/QBRC/data/CML/K562/SRR5289949_1.fastq.gz NA 32 hg38 /home/xxzhang/workplace/QBRC/geneome/hg38/hg38.fasta /home/xxzhang/miniconda3/bin/java /home/xxzhang/workplace/QBRC/output_job/90/ human
1 ./disambiguate_pipeline
perl /home/xxzhang/workplace/QBRC//somatic.pl NA NA /home/xxzhang/workplace/QBRC/data/CML/K562/SRR5289950_1.fastq.gz NA 32 hg38 /home/xxzhang/workplace/QBRC/geneome/hg38/hg38.fasta /home/xxzhang/miniconda3/bin/java /home/xxzhang/workplace/QBRC/output_job/91/ human
1 ./disambiguate_pipeline

最后运行bash指令，批量的运行命令。

bash somatic_calling_91.sh

还是出现了问题的，结果文档中找不到coverage.txt这个文件。

somatic_calling_91.sh: line 13: 1: command not found
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR…
[M::mem_pestat] (25, 50, 75) percentile: (50, 50, 50)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (50, 50)
[M::mem_pestat] mean and std.dev: (50.00, 0.00)
[M::mem_pestat] low and high boundaries for proper pairs: (50, 50)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs

重新手动的删去\n符号，想检查一下job_somatic.pl文档是不是打错了。
后来，又发现出错了，我们的数据是RNA数据，所以在比对的时候又比对错了，没有加RNA:的标识。

今天早上看了作者的文献，发现了我在这个过程中的其他问题：
（1）使用的caller不对。

Lofreq∗(Wilm et al., 2012) was used for the ST and Slide-Seq datasets and Strelka2 Germline mode (Saunders et al., 2012) was used for other datasets.

（2）筛选的阈值不对。

All SNPs and Indels were combined and only kept if there were at least 5 total (wild-type and variant) reads and at least 3 variant reads.

再加上昨天的RNA：的标识以及\n的问题，这就是今天接下来，我要解决的问题。

首先，修改somatic.pl文件，将其tumor-only模式下的caller变为strelka（有点不太能够理解，两者选择的考虑）。 ==>修改完毕
其次，修改filter_vcf.R文件，将其筛选的标准变为5和3。 ==>修改完毕
最后，修改design.txt的生成的R代码，添加RNA:,并思考原因，为什么会出现奇怪的\n,要记住人可能会出错，但是计算机只会严格的执行你安排给他的任务。所以，反思是不是自己哪里写错了。

重新生成somatic_calling_91.sh。我们再来看一下，是否最终的文件，依旧是有奇怪出现的\n。

==>结果文件中，还是有出现了\n符号。
我们来研究一下job_somatic.pl这个文件。突然灵光乍现，只有一种可能就是human后面，有我们看不见的换行符。
看来位置是找对了，但是与此同时遇到了新的问题。
使用chmop()虽然能够去除换行符，但是改变了字符串的数值使其变为了1！这并不是我们想要的结果。于是，我们来看一下其他的结果。

最后，索性就直接填human,即可。

现在开始，着手解决唐老师的数据的问题。
我们先假设这个数据是合理的，然后开始一点点的分崩离析的去理标准化的每一个步骤，从而对这个数据有更好的理解和把握。

完成（22：00）。

今天也是个妖精头子呀

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
实验记录 | 6/30-7/4

参考链接：https://gatk.broadinstitute.org/hc/en-us/articles/360056969692-Mutect2参照官网得mutect2的示例代码，运行（随意）： ./gatk Mutect2 -R /home/xxzhang/workplace/QBRC/geneome/hg19/hg19.fa -I /home/xxzhang/workplace/output/tumor.bam -O mutect.vcf出现错误：Error: Invalid or c
复制链接

扫一扫

专栏目录