实验记录 | 提高运算时间的策略（1）

最新推荐文章于 2025-02-19 23:12:47 发布

今天也是个妖精头子呀

最新推荐文章于 2025-02-19 23:12:47 发布

阅读量1k

点赞数 3

分类专栏：谱系追踪文章标签： perl

本文链接：https://blog.csdn.net/weixin_40640700/article/details/120239544

版权

谱系追踪专栏收录该内容

82 篇文章

订阅专栏

问题描述：最近遇到的问题就是，要遍历的文本很大，处理起来非常的慢，现在想要优化这个算法，缩短程序运行的时间在我们可以接受的范围内。
目前可以考虑到的优化策略是：
（1）将数据文件拆分成若干小的文件，我们使用程序对其进行并行的运算，最终将运算的结果进行输出整合和计数。
（2）使用多核进行处理（我现在还不太明白老师说的核指的是什么意思？）。
（3）老师介绍的，数据结构上的优化，使用哈希，先大概的定位到一个位置，然后再对其进行细分（这个策略需要好好设计一下算法）

0。程序设计代码修改

版本1：修改了n>20nt
(base) [xxzhang@cu08 coordinate]$ perl chr_count_vertify_v3.pl interest.txt
$ARGV[0]==>interest.txt

awk '(($4<80029677) && (($4+$9)-80029677>20) )||($4>= 80029677&& (800299                                                                                                                                                             76-$4)>20 && $9>0 )' chr5_pair_nos.sam >./all_chr/chr5_80029677_AluY_SIN                                                                                                                                                             E_Alu.sam

chr5_80029677_AluY_SINE_Alu:

cat ./all_chr/chr5_80029677_AluY_SINE_Alu.sam | wc -l

57

=====================================================================================
版本2：修改了mapping quality>30
(base) [xxzhang@cu08 coordinate]$ perl chr_count_vertify_v4.pl interest.txt
$ARGV[0]==>interest.txt

awk '(($4<80029677) && (($4+$9)-80029677>20)&& $5>30 )||($4>= 80029677&& (800299                                                                                                                                                             76-$4)>20 && $9>0 && $5>30)' chr5_pair_nos.sam >./all_chr/chr5_80029677_AluY_SIN                                                                                                                                                             E_Alu.sam

chr5_80029677_AluY_SINE_Alu:

cat ./all_chr/chr5_80029677_AluY_SINE_Alu.sam | wc -l

52

版本3：尝试并行运算


cat chr5_pair_nos.sam | parallel --pipe awk '(($4<80029677) && (($4+$9)-80029677>20)&& $5>30 )||($4>= 80029677&& (800299                                                                                                                                                             76-$4)>20 && $9>0 && $5>30)' >./all_chr/chr5_80029677_AluY_SINE_Alu.sam
chr5_80029677_AluY_SINE_Alu:

cat ./all_chr/chr5_80029677_AluY_SINE_Alu.sam | wc -l

1。准备数据文件（对原始的数据文件进行二次的筛选）

这次我打算从头开始，整个文件的运行（不将其拆分为正负链和染色体了），在运行之前，势必要进行比较细致的设计。
另外，意识到我之前的算法可能有些毛病，现在想要重新调一下参数，看看结果会有怎样的变化。
重新开始。

(base) [xxzhang@cu08 coordinate]$ samtools flagstat result.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read_block] Failed to read BGZF block data at offset 18588100309 expect                                                                                                                                                             ed 12979 bytes; hread returned 6425
[E::bgzf_read] Read block operation failed with error 4 after 0 of 4 bytes
[bam_flagstat_core] Truncated file? Continue anyway.
371879542 + 0 in total (QC-passed reads + QC-failed reads)
200392800 + 0 secondary
8187102 + 0 supplementary
0 + 0 duplicates
371418561 + 0 mapped (99.88% : N/A)
163299640 + 0 paired in sequencing
81649820 + 0 read1
81649820 + 0 read2
153088770 + 0 properly paired (93.75% : N/A)
162483616 + 0 with itself and mate mapped
355043 + 0 singletons (0.22% : N/A)
5162418 + 0 with mate mapped to a different chr
1951473 + 0 with mate mapped to a different chr (mapQ>=5)

#我不懂这边为什么文件的比对会出错，明明这个bam文件是按照原先的sam文件按照常规的流程转换得到的？
出错：

(base) [xxzhang@cu08 coordinate]$ samtools sort result.bam >result.sort.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read_block] Failed to read BGZF block data at offset 18588100309 expected 12979 bytes; hread returned 6425
[E::bgzf_read] Read block operation failed with error 4 after 0 of 4 bytes
samtools sort: truncated file. Aborting

2。linux的parallel并行处理的下载

安装。

 wget http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
 tar xjf parallel-latest.tar.bz2
 cd parallel-20210822/
./configure && make
cd src/
./parallel
pwd
 vi ~/.bashrc
 source ~/.bashrc
 #安装成功

time date
#想通过指令time计算某一命令运行的时间
#想通过这个时间比较不同的运算的方式，对于时间的消耗
#可以先拿五号染色体的再次练练手

常规用法：
cat rands20M.txt | awk ‘{s+=$1} END {print s}’
现在这样：
cat rands20M.txt | parallel --pipe awk ‘{s+=$1} END {print s}’ | awk ‘{s+=$1} END {print s}’
这个有点复杂：parallel命令中的–pipe参数将cat输出分成多个块分派给awk调用，形成了很多子计算操作。这些子计算经过第二个管道进入了同一个awk命令，从而输出最终结果。第一个awk有三个反斜杠，这是GNU parallel调用awk的需要。

(base) [xxzhang@cu08 coordinate]$ time cat result.sam |wc -l
3113192762

real    17m0.461s
user    1m19.506s
sys     12m18.517s

(base) [xxzhang@cu08 coordinate]$ time cat result.sam |parallel --pipe wc -l
3515
4600
3784
3895
4402
4509
3996
4851
3715
4458

real    40m19.973s
user    31m40.558s
sys     94m38.501s

#虽然从某种程度上说，第二种方式计算有多少行慢了很多，但是我们可以看到它在执行这个指令的过程中，实际上是把这个文件拆分成了很多小文件来处理的。

(base) [xxzhang@cu08 coordinate]$ ls |grep ^v |awk '{print "perl " $1}' |paralle                                                                                                                                                             l -j 2 {}

Academic tradition requires you to cite works you base your article on.
If you use programs that use GNU Parallel to process data for an article in a
scientific publication, please cite:

  Tange, O. (2021, August 22). GNU Parallel 20210822 ('Kabul').
  Zenodo. https://doi.org/10.5281/zenodo.5233953

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

More about funding GNU Parallel and the citation notice:
https://www.gnu.org/software/parallel/parallel_design.html#Citation-notice

To silence this citation notice: run 'parallel --citation' once.

Come on: You have run parallel 11 times. Isn't it about time
you run 'parallel --citation' once to silence the citation notice?



awk '(($4<6069263) && (($4+$9)-6069263>20))||($4>= 6069263&& ($4+sqrt($9^2)-6069                                                                                                                                                             712)<20 && $9>0)' chr5_pair_nos.sam >./test1/chr5_6069263_MLT1C_LTR_ERVL-MaLR.sa                                                                                                                                                             m

chr5_6069263_MLT1C_LTR_ERVL-MaLR:

cat ./test1/chr5_6069263_MLT1C_LTR_ERVL-MaLR.sam | wc -l

75


awk '(($4<6072833) && (($4+$9)-6072833>20))||($4>= 6072833&& ($4+sqrt($9^2)-6073                                                                                                                                                             062)<20 && $9>0)' chr5_pair_nos.sam >./test1/chr5_6072833_L1PA7_LINE_L1.sam

chr5_6072833_L1PA7_LINE_L1:

cat ./test1/chr5_6072833_L1PA7_LINE_L1.sam | wc -l

18


awk '(($4<6075713) && (($4+$9)-6075713>20))||($4>= 6075713&& ($4+sqrt($9^2)-6075                                                                                                                                                             798)<20 && $9>0)' chr5_pair_nos.sam >./test1/chr5_6075713_zkhCTTCCCTykhn_Simple_                                                                                                                                                             repeat_Simple_repeat.sam

chr5_6075713_zkhCTTCCCTykhn_Simple_repeat_Simple_repeat:

cat ./test1/chr5_6075713_zkhCTTCCCTykhn_Simple_repeat_Simple_repeat.sam | wc -l

13


awk '(($4<6067174) && (($4+$9)-6067174>20))||($4>= 6067174&& ($4+sqrt($9^2)-6067                                                                                                                                                             567)<20 && $9>0)' chr5_pair_nos.sam >./test1/chr5_6067174_HSMAR2_DNA_TcMar-Marin                                                                                                                                                             er.sam

chr5_6067174_HSMAR2_DNA_TcMar-Mariner:

cat ./test1/chr5_6067174_HSMAR2_DNA_TcMar-Mariner.sam | wc -l

32


awk '(($4<6067605) && (($4+$9)-6067605>20))||($4>= 6067605&& ($4+sqrt($9^2)-6068                                                                                                                                                             268)<20 && $9>0)' chr5_pair_nos.sam >./test1/chr5_6067605_HSMAR2_DNA_TcMar-Marin                                                                                                                                                             er.sam

chr5_6067605_HSMAR2_DNA_TcMar-Mariner:

cat ./test1/chr5_6067605_HSMAR2_DNA_TcMar-Mariner.sam | wc -l

39


awk '(($4<6068564) && (($4+$9)-6068564>20))||($4>= 6068564&& ($4+sqrt($9^2)-6068                                                                                                                                                             722)<20 && $9>0)' chr5_pair_nos.sam >./test1/chr5_6068564_HSMAR2_DNA_TcMar-Marin                                                                                                                                                             er.sam

chr5_6068564_HSMAR2_DNA_TcMar-Mariner:

cat ./test1/chr5_6068564_HSMAR2_DNA_TcMar-Mariner.sam | wc -l

15

比较了一下，相当于是节省了一半的时间，我想把文件拆分成更小份，所需的时间就会更少。我觉得这条路是一个不错的尝试，明天打算更加深入的研究这个指令。很有希望的尝试。
明天把这个好好理一下！
早上吃了一个月饼，现在还不饿，所以现在集中心力好好干活。

3。修改perl代码，将文件的输入从直接读取文件到屏幕输入

按照我的理解，应该是先把输入的文件拆分成很多，然后再逐个的进行并行的遍历。

split -l 10000 -d filterfamily.txt chunk_
ls |grep 'chunk' | awk '{print "perl chr_count_vertify_v4.pl " $1}' |parallel {}

可以一下子处理32个文件。

尝试切换到fat节点：

(base) [xxzhang@mu02 ~]$ vi test.pbs

#PBS -N myjob
#PBS -q fat #切换到fat的关键步骤
#PBS -I nodes=cu02:ppn=32+cu04:ppn=32+cu05:ppn=32+cu06:ppn=32+cu08:ppn=32+fat02:ppn=40
cd /home/xxzhang/workplace/RepeatAnnoation/coordinate/family
ls |grep 'chunk' | awk '{print "perl chr_count_vertify_v4.pl " $1}' |parallel {}
cd /home/xxzhang/workplace/RepeatAnnoation/coordinate/family
ls |grep 'chunk' | awk '{print "perl chr_count_vertify_v4.pl " $1}' |parallel {}

(base) [xxzhang@mu02 ~]$ qsub test.pbs
#之后就进入到了fat这边了
(base) [xxzhang@fat02 ~]$ cd /home/xxzhang/workplace/RepeatAnnoation/coordinate/family/all_chr
(base) [xxzhang@fat02 all_chr]$ rm chr5*
(base) [xxzhang@fat02 all_chr]$ ls
(base) [xxzhang@fat02 all_chr]$ cd ..
(base) [xxzhang@fat02 family]$ ls |grep 'chunk' | awk '{print "perl chr_count_vertify_v4.pl " $1}' |parallel {}

这个结果是很惊人的，一下子就可以产出100个文件。

#PBS -N myjob
#PBS -q fat
#PBS -I nodes=6:ppn=32
#PBS -I nodes=cu02:ppn=32+cu04:ppn=32+cu05:ppn=32+cu06:ppn=32+cu08:ppn=32+fat02:ppn=40
cd /home/xxzhang/workplace/RepeatAnnoation/coordinate/family
ls |grep 'chunk' | awk '{print "perl chr_count_vertify_v4.pl " $1}' |parallel {}

72

运用这个策略，我用一晚上产生了5000个文件，而原先单线程的跑了14天，才跑了7000多个文件。效率有了实质的提高，但是似乎还远远不够。
因为我们一共有5607738行数据。我要跑三年。。。如果按照这个效率来计算的话，感觉这个还远远不够，且我们没有用所有的数据来做，而是用了五号染色体的数据，所以实际上要比这会更多。

现在想要尝试更多其他的方法。