1。重新核实咱们感兴趣的序列属于哪一个家族?
首先需要建库。
bowtie2-build humrep.fasta HumRep
bowtie2 -f --local --very-sensitive -x HumRep -U aluY.fasta -S aluY.sam
samtools view -b aluY.sam >aluY.bam
#这次就没有报错了。但是仍然存在一些问题。
(base) [xxzhang@mu02 RepeatAnnoation]$ grep "ALU" aluY.sam
@SQ SN:ALU LN:312
hg38_rmsk_AluY 0 ALU 1 2 61M2D65M12I136M2D24M * 0 0 GGCCAGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCTGATCGTGAGGTCAGGAGATTGAGACCATCCTGGCTAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAAAAAAAAAAAAAATTAGCCGGGCATGGTGGTGGGCACCTTTAGTCCCAGCTACTCGGGAGGCTGAGATAGGAGAATGGCGTGAACCCGGGAGGCGGAGGTTGCAGTAAGCCGAGATCGCGCTGCTGCACTCCAGCCTGGGCGACAGCGAGACTCCATCTCAAAAAAAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:189 XS:i:171 XN:i:0 XM:i:40 XO:i:3 XG:i:16 NM:i:56 MD:Z:4G51A0G3^TG1T4C0C6T1C7G5G0C5A1C2G13A23G6C1C2G3G26G0C5G2C2T3G3A4T0T0C4C6G3T0A0T8C0A21^AG8C1T0G12 YT:Z:UU
(base) [xxzhang@mu02 RepeatAnnoation]$ grep "ALU" humrep.fasta
>ALU SINE1/7SL Primates
这里虽然能够比对到这个数据库中,但是比对的质量并不是很高。且这边物种显示的是 Primates(灵长类动物)我觉得范围还是有点大的,并不是人类所特有的。这块连大猩猩都没有的。所以个人觉得这个reference,依旧并不是特别的准确。
检索目前已有的重复序列的数据库:
- Repbase 数据库 感觉信息并不是特别的准确?
- CONS-Dfam_3.0数据库
- dc20170127-rb20170127数据库
cp humsub.ref ..
mv humsub.ref humsub.fasta
bowtie2-build humsub.fasta HumSub
2。拆分单个细胞的数据
#! perl
#这个代码的主要的功能:
#读取barcode文件的@行,其他行都一样其实没啥意义
#得到这个list之后
#然后分别到文件R1和R3中查找。
#如果能够找到?则提取其余下的4行,保存到一个新的文件中。以此遍历,直到读取到该文件的最后一行为止。
#总的来说,需求是非常简单的,现在不行也要上了。
my $barcode=$ARGV[0];
my $R1=$ARGV[1];
my $R3=$ARGV[2];
my $newR1=$ARGV[3];
my $newR3=$ARGV[4];
print '$ARGV[0]==>',$ARGV[0],"\n",
'$ARGV[1]==>',$ARGV[1],"\n",
'$ARGV[2]==>',$ARGV[2],"\n",
'$ARGV[3]==>',$ARGV[3],"\n",
'$ARGV[4]==>',$ARGV[4],"\n";
my @ID=();
my @linelist;
my @strArr;
my @strR1;
my @strBrr;
my @strR3;
#读取barcode文件的@行
open (MARK, "< $barcode") or die "can not open it!";
while ($line = <MARK>){
if ($line =~m/^@/){
push(@ID,$line);
}
}
#print @ID;
close(MARK);
#print $ID[0]; #这个需求到这里就算是完成了,我们的数据保存在了数组@ID中
#@strArr=split(/ /,$ID[0]);
#print "$strArr[0]\n";
open (R1,"+< $R1") or die "can not open R1 !";
open (NEWR1,"> $newR1")or die "can not open new file!";
$/ ="\n";
@linelist=<R1>;
#print $linelist[1];
#while ($line=<R1>){
# print $line;
#
#}
#my $i =0;
for ($i=0;$i<$#linelist+1;$i=$i+4){
#print $linelist[$i];
@strR1=split(/ /,$linelist[$i]); #这个R1就是序列1
my $m =$strR1[0];
#print $m;
#print "$strR1[0]\n";
for ($j=0;$j<$#ID+1;$j=$j+1){
@strArr=split(/ /,$ID[$j]);
my $n = $strArr[0];
#print $n;
if ($n eq $m){
#应该仅存在一一对应的关系
#print "yes! \n";
#到这里算是成功了,接下来就是把那四行写入一个新的文件中
print NEWR1 "$linelist[$i]$linelist[$i+1]$linelist[$i+2]$linelist[$i+3]";
#print "\n";
}
}
}
close(R1);
close(newR1);
open (R3,"+< $R3") or die "can not open R3 !";
open (NEWR3,"> $newR3")or die "can not open new file!";
$/ ="\n";
@linelist=<R3>;
#print $linelist[1];
#while ($line=<R1>){
# print $line;
#
#}
#my $i =0;
for ($i=0;$i<$#linelist+1;$i=$i+4){
#print $linelist[$i];
@strR3=split(/ /,$linelist[$i]); #这个R1就是序列1
my $m =$strR3[0];
#print $m;
#print "$strR1[0]\n";
for ($j=0;$j<$#ID+1;$j=$j+1){
@strBrr=split(/ /,$ID[$j]);
my $n = $strBrr[0];
#print $n;
#print "$j\n\n";
if ($n eq $m){
print NEWR3 "$linelist[$i]$linelist[$i+1]$linelist[$i+2]$linelist[$i+3]";
print "yes!aligned!";
last;
}
#print "$j\n\n";
}
}
close(R3);
close(newR3);
使用浑身的解数,写了一个perl代码。由于我自己超喜欢用循环,所以目测这个运行起来会超级慢。不知道今天一晚上是否能够跑完。与此同时还想继续践行方案一,因为我们今天换了一个更加精确的reference(上次用的那个reference简直太荒唐了)。
bowtie2 -q -p 32 --local --no-unal -x HumSub -1 ./fastq/V1C/ATAC_S1_20210413NA_S2_L002_R1_001.fastq -2 ./fastq/V1C/ATAC_S1_20210413NA_S2_L002_R3_001.fastq -S result_HumSub.sam
今天就到这里了,先挂在这里。明天过来收结果。
好想学Python啊,好想优化一下代码的运行速率啊。我感觉我的自我教育并不充分,现在都是在捡之前学的东西,没有一点是自我教育的成果。
计算机基础的感觉并不难,但是每次进阶版的代码我就常常看不懂了。迷茫,惆怅。
刚刚写perl,居然还写了100行的代码,真是厉害了。所以,还是要多学多总结,多积累项目的经验,在需求的过程中不断的打磨自己。
我心态崩了,辛辛苦苦写出来的代码,被系统kill了,也许真的太大了这个reads,逐行的遍历要命。
也是我现在对于算法的优化这块学习的不够,嗯,也是要多多学习。当时数据结构应该好好学习的。
我现在可咋办?
3。筛选数据中的reads。
这个是昨天全部比对的结果。
691467535 reads; of these:
691467535 (100.00%) were paired; of these:
670759925 (97.01%) aligned concordantly 0 times
1972294 (0.29%) aligned concordantly exactly 1 time
18735316 (2.71%) aligned concordantly >1 times
----
670759925 pairs aligned concordantly 0 times; of these:
384356 (0.06%) aligned discordantly 1 time
----
670375569 pairs aligned 0 times concordantly or discordantly; of these:
1340751138 mates make up the pairs; of these:
1325809172 (98.89%) aligned 0 times
1347166 (0.10%) aligned exactly 1 time
13594800 (1.01%) aligned >1 times
4.13% overall alignment rate
bowtie的比对的结果的解读:
grep "AluYe2" result_HumSub.sam
samtools view -bF 12 result_HumSub.sorted.bam >result_HumSub.sorted.F12.bam
samtools flagstat result_HumSub.sorted.bam
samtools flagstat result_HumSub.sorted.F12.bam
samtools sort -n -o namesorted.bam result_HumSub.sam
samtools fixmate -m namesorted.bam fixmate.bam
samtools sort -o positionsort.bam fixmate.bam
samtools markdup -r positionsort.bam redup.bam
samtools view -h -o redup.sam redup.bam
samtools flagstat redup.sam
7564002 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
7564002 + 0 mapped (100.00% : N/A)
7563632 + 0 paired in sequencing
3781816 + 0 read1
3781816 + 0 read2
4898198 + 0 properly paired (64.76% : N/A)
**7563632 + 0 with itself and mate mapped**
0 + 0 singletons (0.00% : N/A)
2227594 + 0 with mate mapped to a different chr
346280 + 0 with mate mapped to a different chr (mapQ>=5)
samtools view -bF 12 redup.bam >redup.F12.bam
samtools flagstat redup.F12.bam
7563632 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
7563632 + 0 mapped (100.00% : N/A)
7563632 + 0 paired in sequencing
3781816 + 0 read1
3781816 + 0 read2
4898198 + 0 properly paired (64.76% : N/A)
7563632 + 0 with itself and mate mapped #现在想要去掉这部分
0 + 0 singletons (0.00% : N/A)
2227594 + 0 with mate mapped to a different chr
346280 + 0 with mate mapped to a different chr (mapQ>=5)
参考链接:https://www.jianshu.com/p/34e3e3853fd5
https://www.cnblogs.com/xudongliang/p/5437850.html
https://pzweuj.github.io/2019/02/27/samtools.html
(base) [xxzhang@mu02 RepeatAnnoation]$ samtools view
Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region …]
Options:
-b output BAM
-C output CRAM (requires -T)
-1 use fast BAM compression (implies -b)
-u uncompressed BAM output (implies -b)
-h include header in SAM output
-H print SAM header only (no alignments)
-c print only the count of matching records
-o FILE output file name [stdout]
-U FILE output reads not selected by filters to FILE [null]
-t FILE FILE listing reference names and lengths (see long help) [null]
-X include customized index file
-L FILE only include reads overlapping this BED FILE [null]
-r STR only include reads in read group STR [null]
-R FILE only include reads with read group listed in FILE [null]
-d STR:STR
only include reads with tag STR and associated value STR [null]
-D STR:FILE
only include reads with tag STR and associated values listed in
FILE [null]
-q INT only include reads with mapping quality >= INT [0]
-l STR only include reads in library STR [null]
-m INT only include reads with number of CIGAR operations consuming
query sequence >= INT [0]
-f INT only include reads with all of the FLAGs in INT present [0]
-F INT only include reads with none of the FLAGS in INT present [0]
-G INT only EXCLUDE reads with all of the FLAGs in INT present [0]
-s FLOAT subsample reads (given INT.FRAC option value, 0.FRAC is the
fraction of templates/read pairs to keep; INT part sets seed)
-M use the multi-region iterator (increases the speed, removes
duplicates and outputs the reads as they are ordered in the file)
-x STR read tag to strip (repeatable) [null]
-B collapse the backward CIGAR operation
-? print long help, including note about region specification
-S ignored (input format is auto-detected)
–no-PG do not add a PG line
–input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]…
Specify output format (SAM, BAM, CRAM)
–output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
-T, --reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
–write-index
Automatically index the output files [off]
–verbosity INT
Set level of verbosity
理解了这里的flag的意思之后,很顺利的达成了目标。
(base) [xxzhang@mu02 RepeatAnnoation]$ samtools view redup.F12.bam -f 2 -bSh -o redup.F2.bam
(base) [xxzhang@mu02 RepeatAnnoation]$ samtools flagstat redup.F2.bam
4898198 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
4898198 + 0 mapped (100.00% : N/A)
4898198 + 0 paired in sequencing
2449099 + 0 read1
2449099 + 0 read2
4898198 + 0 properly paired (100.00% : N/A)
4898198 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
4。提取我们感兴趣的reads
samtools view redup.F2.bam | grep "AluYe2" >AluYe2.sam
最后计算下来还要除以2,因为是双端的序列。
@SQ SN:FLA LN:134
@SQ SN:FLAM_A LN:142
@SQ SN:FLAM_C LN:143
@SQ SN:AluJo LN:283
@SQ SN:AluJb LN:283
@SQ SN:AluSc LN:280
@SQ SN:AluSg LN:281
@SQ SN:AluSp LN:284
@SQ SN:AluSq LN:284
@SQ SN:AluSx LN:283
@SQ SN:AluSz LN:283
@SQ SN:AluY LN:282
@SQ SN:AluYa5 LN:282
@SQ SN:AluYa8 LN:281
@SQ SN:AluYb8 LN:289
@SQ SN:AluYb9 LN:289
@SQ SN:AluYc1 LN:282
@SQ SN:AluYc2 LN:282
@SQ SN:AluYd2 LN:270
@SQ SN:AluYd3 LN:270
@SQ SN:AluYd3a1 LN:271
@SQ SN:AluYd8 LN:270
@SQ SN:AluYe5 LN:281
@SQ SN:LTR26B LN:531
@SQ SN:LTR26C LN:539
@SQ SN:LTR26D LN:604
@SQ SN:AluYa1 LN:282
@SQ SN:AluYa4 LN:282
@SQ SN:AluYb3a1 LN:282
@SQ SN:AluYb3a2 LN:282
@SQ SN:AluYf1 LN:282
@SQ SN:AluYg6 LN:282
@SQ SN:AluYh9 LN:282
@SQ SN:AluYi6 LN:283
@SQ SN:AluYbc3a LN:283
**@SQ SN:AluYe2 LN:280**
@SQ SN:AluYf2 LN:284
@SQ SN:AluJr LN:285
@SQ SN:AluJr4 LN:285
@SQ SN:AluMacYa3 LN:285
@SQ SN:AluMacYb2 LN:272
@SQ SN:AluMacYb4 LN:272
@SQ SN:AluSc5 LN:280
@SQ SN:AluSc8 LN:282
@SQ SN:AluSg1 LN:280
@SQ SN:AluSg4 LN:283
@SQ SN:AluSg7 LN:282
@SQ SN:AluSq10 LN:286
@SQ SN:AluSq2 LN:286
@SQ SN:AluSq4 LN:285
@SQ SN:AluSx1 LN:283
@SQ SN:AluSx3 LN:282
@SQ SN:AluSx4 LN:282
@SQ SN:AluSz6 LN:283
@SQ SN:AluYc5 LN:270
@SQ SN:AluYd3a1_gib LN:272
@SQ SN:AluYf5 LN:281
@SQ SN:AluYk11 LN:282
@SQ SN:AluYk12 LN:282
@SQ SN:AluYk13 LN:282
@SQ SN:FAM LN:168
@SQ SN:FLAM LN:124
@SQ SN:FRAM LN:156
@SQ SN:RPS2 LN:1051
@SQ SN:SVA_A LN:1640
@SQ SN:SVA_B LN:1383
@SQ SN:SVA_C LN:1384
@SQ SN:SVA_D LN:1386
@SQ SN:SVA_E LN:1382
@SQ SN:SVA_F LN:1375
@SQ SN:AluYb11 LN:289
@SQ SN:AluYb10 LN:288
@SQ SN:AluYb8a1 LN:287
接下来的事情就非常简单了。
samtools view redup.F2.bam | grep "AluYe2" >AluYe2.sam
cat AluYe2.sam | wc -l
#获取屏幕的输出的结果,格式为:AluYe2\tnumber(到时候直接画成histgram)
#甚至觉得有了那个list之后,可以直接用代码搞定。