实验记录 | 6/9

(8:50)

sub fun_mutect{
  if ($pdx=~/mouse/) {$known=" ";} else {$known=" --cosmic ".$resource_cosmic." ";}
  system_call($java17." -Djava.io.tmpdir=".$mutect_tmp." -Xmx31g -jar ".$mutect." --analysis_type MuTect --reference_sequence ".$index.
    " --dbsnp ".$resource_dbsnp.$known." --input_file:tumor ".$tumor_bam." --input_file:normal ".$normal_bam.
    " --vcf ".$output."/mutect.vcf --out ".$output."/mutect.out");
}

我们在使用mutect进行mutation calling的过程中,需要使用到这样的一个文件。
早上到实验室的第一件事,注册登录网站:https://cancer.sanger.ac.uk/cosmic/download,下载数据集CosmicCodingMuts.vcf.gz。
不过下载失败很多次,整个文件的大小为738MB。等待下载中(9:30)。
在这里插入图片描述(9:51)由于下载经常性的中断,于是拒绝使用文件的直接浏览器下载,使用代码进行下载。
方法步骤:

  • 输入账号和密码:
    echo "email@example.com:mycosmicpassword" | base64
    然后,就会得到一段base64编码的东西:

ZW1haWxAZXhhbXBsZS5jb206bXljb3NtaWNwYXNzd29yZAo=

  • 使用curl获取url
    curl -H "Authorization: Basic ZW1haWxAZXhhbXBsZS5jb206bXljb3NtaWNwYXNzd29yZAo=" https://cancer.sanger.ac.uk/cosmic/file_download/GRCh38/cosmic/v94/VCF/CosmicCodingMuts.vcf.gz
    得到一段url的编码:

{
“url” : “https://cog.sanger.ac.uk/cosmic/GRCh38/cosmic/v94/VCF/CosmicCodingMuts.vcf.gz?AWSAccessKeyId=KFGH85D9KLWKC34GSl88&Expires=1521726406&Signature=Jf834Ck0%8GSkwd87S7xkvqkdfUV8%3D”
}

  • 使用curl&url,进行再次的下载
    curl "https://cog.sanger.ac.uk/cosmic/GRCh38/cosmic/v94/VCF/CosmicCodingMuts.vcf.gz?AWSAccessKeyId=KFGH85D9KLWKC34GSl88&Expires=1521726406&Signature=Jf834Ck0%8GSkwd87S7xkvqkdfUV8%3D"

这个过程,正常的下载时间是一个小时(等一下,我去试一下)。

在下载的过程中,我们继续看昨天的错误信息,修改原始的代码文件,并另存为somatic_RNA.pl。

关于比对的时候,使用bwa,却没有使用star进行比对的错误的解释。

$rna=0; # exome-seq
    if (${$fastq{$type}}[0]=~s/RNA\://) {$rna=1;} # rna-seq
    if (${$fastq{$type}}[0]=~s/Deep\://) {$rna=2;} # deep exome-seq

对出错原因的解释:程序是不会自主的判定数据是rna还是dna类型的数据的,因此,需要我们人为的声明,而我在书写代码的过程中,就遗漏了这一处的细节(所以,程序语言要比人类敏锐精确的多)。
所以,按照要求:

if RNA-Seq data are used, use “RNA:fastq1” or “RNA:bam” at the first or third slot
If only single end fastq data are available, put the fastq file(s) at the first and/or the third slots,
then put NA in the second and/or fourth slot.

我将代码修改为:

perl somatic_RNA.pl  RNA:./data/SRR3052083.fastq.gz  NA  RNA:./data/SRR5297065.fastq.gz NA  32 hg19 ./geneome/hg19/hg19.fa /home/xxzhang/workplace/software/java/jdk1.7.0_80/bin/java  ./output_RNA  human 1 ./disambiguate_pipeline>pipeline_RNA_2.txt

还有一处错误即是,需要将speedseq指令,修改为speedseq.config所在的路径。

/home/xxzhang/workplace/software/speedseq/bin/speedseq.config

在somatic.pl中,进行修改。然后另存为somatic_RNA.pl。

好的,现在我切换到Linux界面下,主要进行两件事的操作。
(1)将sra转换为fastq.gz的格式。
(2)下载文件CosmicCodingMuts.vcf.gz。


(10:15)

(1)将fastq文件转换为fastq.gz文件。
gzip SRR3052083.fastq
gzip SRR5297065.fastq
得到两个文件:
ls

SRR3052083.fastq.gz
SRR3052083.sra
SRR5297065.fastq.gz
SRR5297065.sra

所以,这个问题,解决完毕。

(2)下载CosmicCodingMuts.vcf.gz文件。

  • 第一步:
    echo "1722051071@stmail.ntu.edu.cn:72513210Zhang*" | base64

MTcyMjA1MTA3MUBzdG1haWwubnR1LmVkdS5jbjo3MjUxMzIxMFpoYW5nKgo=

  • 第二步:
    curl -H "Authorization: Basic MTcyMjA1MTA3MUBzdG1haWwubnR1LmVkdS5jbjo3MjUxMzIxMFpoYW5nKgo=" https://cancer.sanger.ac.uk/cosmic/file_download/GRCh37/cosmic/v94/VCF/CosmicCodingMuts.vcf.gz

{
“url”:“https://cog.sanger.ac.uk/cosmic/GRCh37/cosmic/v94/VCF/CosmicCodingMuts.vcf.gz?AWSAccessKeyId=KRV7P7QR9DL41J9EWGA2&Expires=1623207937&Signature=2IR%2FeTTEriSLXmfyImHrYOEpOCU%3D”
}

  • 第三步:
    curl "https://cog.sanger.ac.uk/cosmic/GRCh37/cosmic/v94/VCF/CosmicCodingMuts.vcf.gz?AWSAccessKeyId=KRV7P7QR9DL41J9EWGA2&Expires=1623207937&Signature=2IR%2FeTTEriSLXmfyImHrYOEpOCU%3D" --output ./

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:–:-- 0:00:01 --:–:-- 0Warning: Failed to create the file ./: Is a directory
0 737M 0 15010 0 0 9009 0 23:51:34 0:00:01 23:51:33 9004
curl: (23) Failure writing output to destination

显示下载失败。什么原因呢?
是不是权限的原因,我们计划输出的结果文件夹不可写。
后来仔细查看出错记录,发现问题的关键在这里:

Failed to create the file ./: Is a directory

所以,整个问题的症结在于,我的输出是一个文件夹,而不是一个确切的文件名。所以出错的地方在这里。
继续修改(调整output的输出):
curl "https://cog.sanger.ac.uk/cosmic/GRCh37/cosmic/v94/VCF/CosmicCodingMuts.vcf.gz?AWSAccessKeyId=KRV7P7QR9DL41J9EWGA2&Expires=1623207937&Signature=2IR%2FeTTEriSLXmfyImHrYOEpOCU%3D" --output ./data/CosmicCodingMuts_hg19.vcf.gz

(11:07)现在文件可以正常的输出了(预计用时2个小时)。
接下来,就静静的等待收获了。


距离我吃饭(12:30)应该还有1.5小时。在这段时间里,我或许可以看一下文献。


(13:05)下载完成。

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 739M 100 739M 0 0 122k 0 1:43:18 1:43:18 --:–:-- 78349

现在,问题,基本上解决了。

将文件上传至文件夹中,并重命名。

/home/xxzhang/workplace/QBRC/./geneome/hg19/hg19.fa_resource/CosmicCodingMuts.hg19.vcf

切换至window界面下。
传输完成。
(13:20)


接着我们回到服务器上,重新运行指令。
出现了一对的错误,首先第一个问题,就是我前面“随意”改了文件的名字,所造成的“雪崩式”的结局。找不到文件夹中的文件。

糟糕,刚刚误删了一个文件。

重新回到window,修改指令文件。

perl somatic.pl  RNA:./data/SRR3052083.fastq.gz  NA  RNA:./data/SRR5297065.fastq.gz NA  32 hg19 ./geneome/hg19/hg19.fa /home/xxzhang/workplace/software/java/jdk1.7.0_80/bin/java  ./output_RNA  human 1 ./disambiguate_pipeline


修改完成。接下来,出现的这个错误,也是比较常见的,我忽略的一个错误:我忘记建立STAR过程要用的索引了。
/home/xxzhang/workplace/software/STAR/bin/Linux_x86_64/STAR --genomeDir ./geneome/hg19/STAR --readFilesIn ./output_RNA/normal/fastq1.fastq ./output_RNA/normal/fastq2.fastq --runThreadN 32 --outFileNamePrefix ./output_RNA/normal/tmp/pass1 --outFilterMatchNminOverLread 0.4 --outFilterScoreMinOverLread 0.4

STAR version: 2.7.9a compiled: 2021-05-04T09:43:56-0400 vega:/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 09 13:48:14 … started STAR run
Jun 09 13:48:14 … loading genome
EXITING because of FATAL ERROR: could not open genome file ./geneome/hg19/STAR//genomeParameters.txt
SOLUTION: check that the path to genome files, specified in --genomeDir is correct and the files are present, and have user read permsissions
Jun 09 13:48:14 … FATAL ERROR, exiting

意思就是说,在./geneome/hg19/STAR路径下,没有genomeParameters.txt这个文件。因此,运行报错。
所以,我现在要先建立hg19的索引文件(之前,虽然有从学姐那里获取到hg38的文件,但是我不清楚其所使用的fa以及gff文件,所以,还是希望自己亲自建立。如果这个过程出错,更方便追踪)

我先睡一会儿,稍等片刻后,建立STAR比对的索引。

看一下,STAR指令,建立索引需要什么?(之前,自己尝试过)

STAR --runThreadN 6 --runMode genomeGenerate --genomeDir index_dir --genomeFastaFiles genome.fasta --sjdbGTFfile genome.gtf --sjdbOverhang 149

还需要下载一个hg19的注释文档。

看一下,STAR指令的内容。

genomeDir ./GenomeDir/
string: path to the directory where genome files are stored (for --runMode alignReads) or will be generated (for --runMode generateGenome)
genomeFastaFiles -
string(s): path(s) to the fasta files with the genome sequences, separated by spaces. These files should be plain text FASTA files, they cannot be zipped.
Required for the genome generation (–runMode genomeGenerate). Can also be used in the mapping (–runMode alignReads) to add extra (new) sequences to the genome (e.g. spike-ins).

在下载基因组注释文件的时候,就遇到了一个问题(基因组注释文档及其所对应的关系)。
参考链接:https://www.plob.org/article/9946.html?wpzmaction=add&postid=9946

基因组注释文档的下载有多种方式,我们此次使用ENSEMBL来建立操作。

GRCh36 (hg18): ENSEMBL release_52.
GRCh37 (hg19): ENSEMBL release_59/61/64/68/69/75.
GRCh38 (hg38): ENSEMBL release_76/77/78/80/81/82.

因为我们的参考基因组,选用的是hg19(GRCh37)版本,所以,注释文档也选用hg19(release_75)。
下载,注释文档。
wget ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz

–2021-06-09 14:54:06-- ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
=> ‘Homo_sapiens.GRCh37.75.gtf.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)… 193.62.193.139
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.139|:21… connected.
Logging in as anonymous … Logged in!
> SYST … done. > PWD … done.
> TYPE I … done. > CWD (1) /pub/release-75/gtf/homo_sapiens … done.
> SIZE Homo_sapiens.GRCh37.75.gtf.gz … 39344043
> PASV … done. > RETR Homo_sapiens.GRCh37.75.gtf.gz … done.
Length: 39344043 (38M) (unauthoritative)
100%[
====================================================================================================================================================>] 39,344,043 325KB/s in 68s
2021-06-09 14:55:19 (563 KB/s) - ‘Homo_sapiens.GRCh37.75.gtf.gz’ saved [39344043]

下载完成。

cp /home/xxzhang/workplace/QBRC/geneome/hg19/hg19.fa /home/xxzhang/workplace/QBRC/geneome/hg19/STAR
gunzip -d Homo_sapiens.GRCh37.75.gtf.gz 
mv Homo_sapiens.GRCh37.75.gtf hg19.gtf

注释文件,准备好之后,就可以建立索引了。

/home/xxzhang/workplace/software/STAR/bin/Linux_x86_64/STAR --runThreadN 32 --runMode genomeGenerate --genomeDir /home/xxzhang/workplace/QBRC/geneome/hg19/STAR  --genomeFastaFiles hg19.fa  --sjdbGTFfile hg19.gtf --sjdbOverhang 149

#修改了线程数为32。

开始进行建索引。
报出错误:

Jun 09 15:04:00 … started STAR run
Jun 09 15:04:00 … starting to generate Genome files
Jun 09 15:05:09 … processing annotations GTF
Fatal INPUT FILE error, no valid exon lines in the GTF file: hg19.gtf
Solution: check the formatting of the GTF file. One likely cause is the difference in chromosome naming between GTF and FASTA file.
Jun 09 15:05:19 … FATAL ERROR, exiting

这个问题,好像我之前也遇到过。
head -n 10 hg19.fa

chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

head -n 10 hg19.gtf

#!genome-build GRCh37.p13
#!genome-version GRCh37
#!genome-date 2009-02
#!genome-build-accession NCBI:GCA_000001405.14
#!genebuild-last-updated 2013-09
1 pseudogene gene 11869 14412 . + . gene_id “ENSG00000223972”; gene_name “DDX11L1”; gene_source “ensembl_havana”; gene_biotype “pseudogene”;
1 processed_transcript transcript 11869 14409 . + . gene_id “ENSG00000223972”; transcript_id “ENST00000456328”; gene_name “DDX11L1”; gene_source “ensembl_havana”; gene_biotype “pseudogene”; transcript_name “DDX11L1-002”; transcript_source “havana”;

就是对染色体的命名方式不太一样。一个是chr1,另一个是1
我们该如何解决呢?
使用perl,将gtf文件进行转换,将开头的1,替换为chr1(不得不说,perl在文本处理这块的能力还是挺强的)。

#!/usr/bin/perl
#first chr numble must add chr.
open(FR,"hg19.gtf")||die;
open(FW,">hg19_modified.gtf")||die;     
while(TRUE){
chomp;
if($_=~/^\d/ or $_=~/^X/ or $_=~/^MT/ or $_=~/^Y/) 
#这个判断语句考虑的不够周全,我又添加了 $_=~/^Y/    
{
print FW "chr$_\n"}
else {print FW "\n"}
}
close FR;
close FW;

写这个代码之前,需要先弄清楚,这个大文件中,所有包含染色体编号的行都找到。
(我还是对Linux下,文本处理指令不够熟悉,生气)
好啊,用catgrep指令试一试。
(虽然我想更加精确一点,但是我的能力,让我不行。。我比较期待的结果是,匹配到所有的染色体的表示,但是呢?但是,每次输出,都是输出染色体号码和后面的一长串的字符)
功夫不负有心人,我找到了。
grep -o -E "^\w+([-+.]\w+)*" hg19.gtf|sort|uniq
输出结果:

1
10
11
12
13
14
15
16
17
18
19
2
20
21
22
3
4
5
6
7
8
9
GL000191.1
GL000192.1
GL000193.1
GL000194.1
GL000195.1
GL000196.1
GL000199.1
GL000201.1
GL000204.1
GL000205.1
GL000209.1
GL000211.1
GL000212.1
GL000213.1
GL000215.1
GL000216.1
GL000218.1
GL000219.1
GL000220.1
GL000221.1
GL000222.1
GL000223.1
GL000224.1
GL000225.1
GL000228.1
GL000229.1
GL000230.1
GL000231.1
GL000233.1
GL000236.1
GL000237.1
GL000240.1
GL000241.1
GL000242.1
GL000243.1
GL000247.1
HG1007_PATCH
HG1032_PATCH
HG104_HG975_PATCH
HG1063_PATCH
HG1074_PATCH
HG1079_PATCH
HG1082_HG167_PATCH
HG1091_PATCH
HG1133_PATCH
HG1146_PATCH
HG115_PATCH
HG1208_PATCH
HG1211_PATCH
HG122_PATCH
HG1257_PATCH
HG1287_PATCH
HG1292_PATCH
HG1293_PATCH
HG1304_PATCH
HG1308_PATCH
HG1322_PATCH
HG1350_HG959_PATCH
HG1423_PATCH
HG1424_PATCH
HG1425_PATCH
HG1426_PATCH
HG142_HG150_NOVEL_TEST
HG1433_PATCH
HG1434_PATCH
HG1435_PATCH
HG1436_HG1432_PATCH
HG1437_PATCH
HG1438_PATCH
HG1439_PATCH
HG1440_PATCH
HG1441_PATCH
HG1442_PATCH
HG1443_HG1444_PATCH
HG144_PATCH
HG1453_PATCH
HG1458_PATCH
HG1459_PATCH
HG1462_PATCH
HG1463_PATCH
HG1472_PATCH
HG1479_PATCH
HG1486_PATCH
HG1487_PATCH
HG1488_PATCH
HG1490_PATCH
HG1497_PATCH
HG14_PATCH
HG1500_PATCH
HG1501_PATCH
HG1502_PATCH
HG151_NOVEL_TEST
HG1591_PATCH
HG1592_PATCH
HG1595_PATCH
HG1699_PATCH
HG174_HG254_PATCH
HG183_PATCH
HG185_PATCH
HG186_PATCH
HG193_PATCH
HG19_PATCH
HG237_PATCH
HG243_PATCH
HG256_PATCH
HG271_PATCH
HG27_PATCH
HG280_PATCH
HG281_PATCH
HG299_PATCH
HG29_PATCH
HG305_PATCH
HG306_PATCH
HG311_PATCH
HG325_PATCH
HG329_PATCH
HG339_PATCH
HG344_PATCH
HG348_PATCH
HG357_PATCH
HG375_PATCH
HG385_PATCH
HG388_HG400_PATCH
HG414_PATCH
HG417_PATCH
HG418_PATCH
HG444_PATCH
HG480_HG481_PATCH
HG497_PATCH
HG506_HG507_HG1000_PATCH
HG50_PATCH
HG531_PATCH
HG536_PATCH
HG544_PATCH
HG686_PATCH
HG706_PATCH
HG729_PATCH
HG730_PATCH
HG736_PATCH
HG745_PATCH
HG747_PATCH
HG748_PATCH
HG75_PATCH
HG79_PATCH
HG7_PATCH
HG858_PATCH
HG865_PATCH
HG871_PATCH
HG873_PATCH
HG883_PATCH
HG905_PATCH
HG944_PATCH
HG946_PATCH
HG953_PATCH
HG957_PATCH
HG962_PATCH
HG971_PATCH
HG979_PATCH
HG987_PATCH
HG989_PATCH
HG990_PATCH
HG991_PATCH
HG996_PATCH
HG998_1_PATCH
HG998_2_PATCH
HG999_1_PATCH
HG999_2_PATCH
HSCHR10_1_CTG2
HSCHR10_1_CTG5
HSCHR1_1_CTG31
HSCHR12_1_CTG1
HSCHR12_1_CTG2_1
HSCHR12_1_CTG5
HSCHR12_2_CTG2
HSCHR12_2_CTG2_1
HSCHR12_3_CTG2_1
HSCHR1_2_CTG31
HSCHR1_3_CTG31
HSCHR15_1_CTG4
HSCHR15_1_CTG8
HSCHR16_1_CTG3_1
HSCHR16_2_CTG3_1
HSCHR17_1
HSCHR17_1_CTG1
HSCHR17_1_CTG4
HSCHR17_2_CTG4
HSCHR17_3_CTG4
HSCHR17_4_CTG4
HSCHR17_5_CTG4
HSCHR17_6_CTG4
HSCHR18_1_CTG1_1
HSCHR18_1_CTG2_1
HSCHR18_2_CTG2
HSCHR18_2_CTG2_1
HSCHR19_1_CTG3
HSCHR19_1_CTG3_1
HSCHR19_2_CTG3
HSCHR19_3_CTG3
HSCHR19LRC_COX1_CTG1
HSCHR19LRC_COX2_CTG1
HSCHR19LRC_LRC_I_CTG1
HSCHR19LRC_LRC_J_CTG1
HSCHR19LRC_LRC_S_CTG1
HSCHR19LRC_LRC_T_CTG1
HSCHR19LRC_PGF1_CTG1
HSCHR19LRC_PGF2_CTG1
HSCHR20_1_CTG1
HSCHR21_2_CTG1_1
HSCHR21_3_CTG1_1
HSCHR21_4_CTG1_1
HSCHR2_1_CTG1
HSCHR2_1_CTG12
HSCHR22_1_CTG1
HSCHR22_1_CTG2
HSCHR22_2_CTG1
HSCHR2_2_CTG12
HSCHR3_1_CTG1
HSCHR3_1_CTG2_1
HSCHR4_1
HSCHR4_1_CTG12
HSCHR4_1_CTG6
HSCHR4_2_CTG9
HSCHR5_1_CTG1
HSCHR5_1_CTG2
HSCHR5_1_CTG5
HSCHR5_2_CTG1
HSCHR5_3_CTG1
HSCHR6_1_CTG5
HSCHR6_MHC_APD
HSCHR6_MHC_COX
HSCHR6_MHC_DBB
HSCHR6_MHC_MANN
HSCHR6_MHC_MCF
HSCHR6_MHC_QBL
HSCHR6_MHC_SSTO
HSCHR7_1_CTG6
HSCHR9_1_CTG1
HSCHR9_1_CTG35
HSCHR9_2_CTG35
HSCHR9_3_CTG35
MT
X
Y

难道这个显示的就是这个gtf文件中的顺序吗?这并不是我们希望的结果,以后也许会报错。因为这个顺序并非是按照染色体的数目排列的。

1
2
3
4
5
6
7
X
8
9
10
11
12
13
14
15
16
17
18
20
19
Y
22
21
HG1287_PATCH
HG1459_PATCH
HSCHR6_MHC_SSTO
HSCHR6_MHC_MCF
HSCHR6_MHC_COX
HSCHR6_MHC_MANN
HSCHR6_MHC_APD
HSCHR6_MHC_QBL
HSCHR6_MHC_DBB
HG1433_PATCH
HG1257_PATCH
HG1497_PATCH
HG1436_HG1432_PATCH
HG1211_PATCH
HSCHR17_1
HG1292_PATCH
HSCHR5_1_CTG1
HG1592_PATCH
HG1425_PATCH
HG1462_PATCH
HG7_PATCH
HG1458_PATCH
HSCHR19LRC_LRC_J_CTG1
HSCHR19LRC_LRC_S_CTG1
HSCHR19LRC_LRC_I_CTG1
HG1079_PATCH
HG1443_HG1444_PATCH
HG979_PATCH
HSCHR19LRC_LRC_T_CTG1
HSCHR19LRC_COX1_CTG1
HSCHR19LRC_PGF1_CTG1
HG1453_PATCH
HG1423_PATCH
HG1437_PATCH
HG865_PATCH
HG1434_PATCH
HG1438_PATCH
HSCHR19LRC_PGF2_CTG1
HG1293_PATCH
HG1426_PATCH
HSCHR19LRC_COX2_CTG1
HG1308_PATCH
HG1463_PATCH
HG987_PATCH
HG19_PATCH
HG953_PATCH
HSCHR4_1
HG730_PATCH
HG1350_HG959_PATCH
GL000192.1
HG306_PATCH
HG1082_HG167_PATCH
HG75_PATCH
HG1424_PATCH
HG174_HG254_PATCH
HG357_PATCH
HG29_PATCH
HG256_PATCH
HG183_PATCH
HG1146_PATCH
HG1441_PATCH
HG871_PATCH
HG104_HG975_PATCH
HG1442_PATCH
HG185_PATCH
HG122_PATCH
HSCHR12_2_CTG2
HG1091_PATCH
HG544_PATCH
HG1440_PATCH
HG686_PATCH
HG1439_PATCH
HSCHR15_1_CTG8
HG747_PATCH
HSCHR19_1_CTG3
HG748_PATCH
HSCHR4_1_CTG6
HG1133_PATCH
HSCHR1_3_CTG31
HG115_PATCH
HG736_PATCH
HG417_PATCH
HG1591_PATCH
HG745_PATCH
HG1595_PATCH
HG79_PATCH
HG946_PATCH
HG388_HG400_PATCH
HG281_PATCH
HSCHR10_1_CTG5
HG237_PATCH
HG1074_PATCH
HG1032_PATCH
HSCHR15_1_CTG4
HG962_PATCH
HSCHR18_1_CTG1_1
HG480_HG481_PATCH
HG944_PATCH
HSCHR17_4_CTG4
HG536_PATCH
HG444_PATCH
HG27_PATCH
HG344_PATCH
HG1435_PATCH
HSCHR17_1_CTG1
HG1063_PATCH
HG957_PATCH
HG1304_PATCH
HG14_PATCH
HG706_PATCH
HG729_PATCH
HG348_PATCH
HG1699_PATCH
HSCHR5_3_CTG1
HSCHR17_2_CTG4
HG329_PATCH
HG385_PATCH
HG1502_PATCH
GL000225.1
HG243_PATCH
HSCHR21_2_CTG1_1
HG142_HG150_NOVEL_TEST
HG1322_PATCH
HSCHR18_2_CTG2
HG311_PATCH
HG883_PATCH
HSCHR16_1_CTG3_1
GL000194.1
HG151_NOVEL_TEST
HG414_PATCH
GL000193.1
HSCHR19_3_CTG3
GL000222.1
GL000212.1
HG271_PATCH
HSCHR12_1_CTG5
HG1490_PATCH
GL000195.1
HSCHR1_1_CTG31
HSCHR3_1_CTG2_1
GL000223.1
HG506_HG507_HG1000_PATCH
GL000224.1
HSCHR10_1_CTG2
GL000219.1
HG339_PATCH
GL000205.1
HSCHR5_1_CTG5
HSCHR3_1_CTG1
GL000215.1
GL000216.1
HG186_PATCH
HSCHR9_2_CTG35
HSCHR19_2_CTG3
HG971_PATCH
GL000199.1
HG1501_PATCH
HSCHR12_1_CTG2_1
HG905_PATCH
HSCHR18_1_CTG2_1
HG873_PATCH
HSCHR12_1_CTG1
HG1486_PATCH
GL000211.1
HG996_PATCH
HSCHR4_1_CTG12
GL000213.1
HG858_PATCH
HSCHR9_1_CTG1
HSCHR22_1_CTG1
GL000220.1
GL000218.1
HSCHR18_2_CTG2_1
GL000209.1
HG50_PATCH
HSCHR19_1_CTG3_1
GL000221.1
HSCHR12_3_CTG2_1
HG989_PATCH
HG193_PATCH
HSCHR2_1_CTG1
HSCHR12_2_CTG2_1
HSCHR17_1_CTG4
GL000228.1
HSCHR20_1_CTG1
HSCHR6_1_CTG5
HSCHR2_1_CTG12
HSCHR4_2_CTG9
HG305_PATCH
HSCHR7_1_CTG6
HSCHR21_4_CTG1_1
HSCHR1_2_CTG31
HG299_PATCH
GL000191.1
HG418_PATCH
HSCHR5_1_CTG2
HG325_PATCH
HG1208_PATCH
HSCHR22_1_CTG2
HSCHR2_2_CTG12
HG1479_PATCH
HSCHR17_3_CTG4
HSCHR16_2_CTG3_1
HSCHR17_6_CTG4
HSCHR5_2_CTG1
GL000204.1
HG280_PATCH
HSCHR21_3_CTG1_1
HSCHR22_2_CTG1
HG1488_PATCH
HSCHR9_1_CTG35
HG1487_PATCH
HG144_PATCH
HSCHR17_5_CTG4
HG1007_PATCH
HG991_PATCH
HG375_PATCH
HG998_1_PATCH
HSCHR9_3_CTG35
HG999_2_PATCH
HG999_1_PATCH
HG998_2_PATCH
GL000233.1
GL000237.1
HG990_PATCH
GL000230.1
HG497_PATCH
GL000242.1
GL000243.1
GL000241.1
GL000236.1
GL000240.1
GL000196.1
GL000247.1
GL000201.1
HG531_PATCH
GL000231.1
HG1472_PATCH
HG1500_PATCH
GL000229.1
MT

这个结果是我们想要的,但是我们可以看到它的顺序还是不对的(我不知道这样接下来会不会报错)。
我现在先解决最棘手的问题,就是在染色体号之前,添加chr
我们在此首要解决这个问题。
修改以上代码,程序运行中(18:45)。
(不知道什么时候运行结束)
希望,今天将STAR比对运行结束。不知道这周能不能把我的目标完成。
这个比对的运行好慢啊。
我有点怀疑是不是,我的方法错了,真的好慢,快一个小时了。

(19:37)或者我可以直接使用GRCh38的数据来做。因为我之前所搭建的一些数据都是基于hg19的,所以,成本更高一点。

我在等的过程中,不应该浪费时间。
我觉得自己今天听了学姐讲“幼态发育”的课题,我还挺感兴趣的。涉及到,能够将人的发育与其他近缘物种进行比较。就挺有意思的。

或者,我们可以尝试下载其他类型的gtf文档。

ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
预估时间:3个小时

无论是比对,还是下载(转换)注释文件,都非常的耗费时间。


称着这漫长的时间,我可以研究一下,那个job_somatic.pl怎样运行。

perl job_somatic.pl ./job/somatic_design.txt ./somatic/example/example.sh 32 hg19 ./geneome/hg19/hg19.fa /apps/java/home/xxzhang/workplace/software/java/jdk1.7.0_80/bin/java 1 4 ./disambiguate_pipeline

各个参数的具体意义:

design.txt: the batch job design file. It has 6 columns separated by ‘\t’, the first four slots are fastq files or bam files for normal control sample and sample of interest. The fifth is the output folder, and the last is “PDX” or “human”.
example_file : the demo job submission shell script. A default one is in example/.
thread : number of threads to use. Recommended: 32
build : genome build, hg19 or hg38.
index : path (including file names) to the reference genome in the reference bundle.
java17 : path (including the executable file name) to java 1.7 (needed only for MuTect).
keep_coverage: whether to keep per-base coverage information. Default is 0. Set to 1 to keep coverage information.
n : bundle $n somatic calling job into one submission.
disambiguate_pipeline: the directory to disambiguate_pipeline

我们先用来编辑somatic_design.txt文件:

NA NA ./example/example_dataset/sequencing/SRR7246238_1.fastq.gz ./example/example_dataset/sequencing/SRR7246238_2.fastq.gz ./out_job human
NA NA ./example/example_dataset/sequencing/SRR7246238_1.fastq.gz ./example/example_dataset/sequencing/SRR7246238_2.fastq.gz ./out_job human
NA NA ./example/example_dataset/sequencing/SRR7246238_1.fastq.gz ./example/example_dataset/sequencing/SRR7246238_2.fastq.gz ./out_job human
NA NA ./example/example_dataset/sequencing/SRR7246238_1.fastq.gz ./example/example_dataset/sequencing/SRR7246238_2.fastq.gz ./out_job human
(复制多行)

那么,example.sh是什么样的一种结构呢

#!/bin/bash
#SBATCH --job-name=c
#SBATCH --partition=256GB
#SBATCH --nodes=1
#SBATCH --time=60-00:00:00
#SBATCH --output=./sbatch_output_%j
#SBATCH --error=./sbatch_error_%j
source ~/.bash_profile
##########JOBSTART###########################
perl job_somatic.pl /project/job_somatic.txt example.sh 32 hg38 /project/data/hg38/hs38d1.fa /cm/shared/apps/java/oracle/jdk1.7.0_51/bin/java 3

明天再来研究为什么?(21:23)

  • 2
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 3
    评论
评论 3
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值