linux 统计字符串个数字,Linux统计某一列中特定字符串的个数

最新推荐文章于 2021-12-14 22:45:07 发布

治治吱吱芝芝之

最新推荐文章于 2021-12-14 22:45:07 发布

阅读量439

点赞数

文章标签： linux 统计字符串个数字

问题描述：想统计一下vcf文件中FILTER这一列中，是PASS的个数

我们知道，vcf文件包括的是前面很长的注释，以及后面的正文部分。注释都是以#开头的，所以我的思路是，

把正文提取出来；

然后提取其中的FILTER这一列；

最后把这一列的PASS提取出来，最后是统计个数。

因为所有的注释行都是以#开头的，所以用grep -v提取非注释行，就是正文，还有一个就是“^”的用法，这个正则表达式，也是linux 中一个比较难的部分。主要是需要记住的部分很多。可以参考http://www.runoob.com/regexp/regexp-metachar.html 里面总结了各种字符。

[qiany@gm112-2 T2B]$ less -S 57_T2B_somatic.vcf.gz |grep -v '^#' |head

chrUn_JTFH01001680v1_decoy 528 . G A . clustered_events;contamination;t_lod DP=336;ECNT=14;NLOD=3.01;N_ART_LOD=-1.041e+00;POP_AF=2.500e-06;P_CONTAM=0.908;P_GERMLINE=-9.168e+01;TLOD=4.25 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:PGT:PID:SA_MAP_AF:SA_POST_PROB 0/0:10,0:8.505e-05:6,0:4,0:35,0:252,0:0:0:0|1:528_G_A 0/1:319,3:0.015:147,2:172,1:35,36:234,234:40:26:0|1:528_G_A:0.010,0.00,9.317e-03:8.225e-03,2.072e-03,0.990

chrUn_JTFH01001680v1_decoy 532 . T A . clustered_events;contamination;t_lod DP=322;ECNT=14;NLOD=2.71;N_ART_LOD=-7.404e-01;POP_AF=2.500e-06;P_CONTAM=0.898;P_GERMLINE=-8.888e+01;TLOD=4.60 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:PGT:PID:SA_MAP_AF:SA_POST_PROB 0/0:9,0:4.700e-05:5,0:4,0:34,0:254,0:0:0:0|1:528_G_A 0/1:310,3:9.632e-03:145,2:165,1:33,35:236,234:40:22:0|1:528_G_A:0.010,0.00,9.585e-03:8.677e-03,2.116e-03,0.989

chrUn_JTFH01001862v1_decoy 32 . C A . clustered_events DP=477;ECNT=4;NLOD=3.59;N_ART_LOD=-8.451e-01;POP_AF=2.500e-06;P_CONTAM=2.127e-03;P_GERMLINE=-1.145e+02;TLOD=9.79 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/0:12,0:4.012e-03:7,0:5,0:36,0:204,0:0:0 0/1:426,7:0.051:183,5:243,2:34,35:199,176:40:9:0.010,0.020,0.016:1.705e-03,0.013,0.985

chrUn_JTFH01001862v1_decoy 36 . A T . clustered_events;panel_of_normals DP=534;ECNT=4;IN_PON;NLOD=3.61;N_ART_LOD=-1.114e+00;POP_AF=2.500e-06;P_CONTAM=0.00;P_GERMLINE=-1.061e+02;TLOD=38.63 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/0:12,0:1.442e-04:7,0:5,0:38,0:204,0:0:0 0/1:457,22:0.083:204,8:253,14:35,33:195,191:40:22:0.030,0.040,0.046:4.988e-03,6.120e-03,0.989

查看FILTER 这一列是第几行，可以把注释信息的最后一列提出来，就可以知道正文都包含哪些列了。数了一下发现是第7列

[qiany@gm112-2 T2B]$ less -S 57_T2B_somatic.vcf.gz |grep '^#' |tail -1

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 57_B 57_T

然后可以用awk把第七列提取出来，其实想的是直接按照FILTER提取，但是找了一圈没有找到。

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |head

contamination;mapping_quality;t_lod

clustered_events;contamination;t_lod

artifact_in_normal;contamination;t_lod

artifact_in_normal;str_contraction

contamination;str_contraction;t_lod

base_quality;clustered_events;t_lod

clustered_events

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep -o "PASS" |head

PASS

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep -o "PASS" |wc -l

2136

有一个问题是grep 和grep -o有什么区别呢？官方讲的是grep -o是-o, --only-matching show only the part of a line matching PATTERN，也就是说只会把字符串部分输出来。但是如果是grep的话，会把含有字符串的这一行都输出来。以下是例子

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep "clu" |head

clustered_events;contamination;t_lod

base_quality;clustered_events;t_lod

clustered_events

base_quality;clustered_events;multiallelic;panel_of_normals

clustered_events

base_quality;clustered_events;contamination;read_position

clustered_events;contamination

并且的话grep -o会把字符串单独输出到一行。

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep -o "clu" |head

clu

下面的例子提现了两种方式输出的行数不一样

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep "_" |wc -l

54211

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep -o "_" |wc -l

107908

但是我的目的是统计这一列是PASS的情况，也就是说这一列中除了PASS再也没有其他的情况。所以应该是用grep，然后以PASS开始，以PASS结束

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep "^PASS$" |wc -l

2136

我要统计很多个文件的话，就可以写成shell脚本了。

for i in {10,20,23,24,29,3,35,39,44,5,6,61,65,8,52,55,57,58,59}

echo -n "${i}_T2B_somatic.vcf.gz " #不换行的方式1

echo -e "the number is \c" #不换行的方式2

less -S ${i}_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep "^PASS$" |wc -l

done

[qiany@gm112-2 T2B]$ sh countpsaanum.sh

10_T2B_somatic.vcf.gz the number is 470

20_T2B_somatic.vcf.gz the number is 294

23_T2B_somatic.vcf.gz the number is 467

24_T2B_somatic.vcf.gz the number is 487

29_T2B_somatic.vcf.gz the number is 306

3_T2B_somatic.vcf.gz the number is 258

35_T2B_somatic.vcf.gz the number is 249

39_T2B_somatic.vcf.gz the number is 427

44_T2B_somatic.vcf.gz the number is 366

5_T2B_somatic.vcf.gz the number is 729

6_T2B_somatic.vcf.gz the number is 432

61_T2B_somatic.vcf.gz the number is 2136

65_T2B_somatic.vcf.gz the number is 308

8_T2B_somatic.vcf.gz the number is 1776

52_T2B_somatic.vcf.gz the number is 371

55_T2B_somatic.vcf.gz the number is 447

57_T2B_somatic.vcf.gz the number is 281

58_T2B_somatic.vcf.gz the number is 253

59_T2B_somatic.vcf.gz the number is 262

但是个人还是比较习惯Python的方式。

#!/usr/bin/env python

import re

for i in {10,20,23,24,29,3,35,39,44,5,6,61,65,8,52,55,57,58,59}:

num = 0

infile = open( "10_T2B_somatic.vcf.gz", "r")

for line in infile:

if re.match("^#", line) is None:

if line.split()[6] == "PASS":

num = num + 1

print("10_T2B_somatic.vcf.gz is" + num)

infile.close()

[qiany@gm112-2 T2B]$ python counttest.py

Traceback (most recent call last):

File "counttest.py", line 6, in

for line in infile:

File "/thinker/globe/udata/big/software/anaconda3/lib/python3.6/codecs.py", line 321, in decode

(result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

import re

import os

import os.path

import gzip

def read_gz_file(path):

if os.path.exists(path):

with gzip.open(path, 'r') as infile:

for line in infile:

yield line

else:

print('the path [{}] is not exist!'.format(path))

num = 0

con = read_gz_file("/thinker/globe/udata/big/qiany/gliomas/wes/gatkvcf/mutect2/T2B/10_T2B_somatic.vcf.gz")

if getattr(con, '__iter__', None):

for line in con:

if re.match("^#", line) is None:

if line.split()[6] == "PASS":

num = num + 1

print("10_T2B_somatic.vcf.gz is " + str(num))

[qiany@gm112-2 T2B]$ python2 counttest.py

10_T2B_somatic.vcf.gz is 470

但是还是有问题，当我用Python2的时候，是能够运行的，但是用python3却提示错误。

[qiany@gm112-2 T2B]$ python3 counttest.py

Traceback (most recent call last):

File "counttest.py", line 17, in

if re.match("^#", line) is None:

File "/thinker/globe/udata/big/software/anaconda3/lib/python3.6/re.py", line 172, in match

return _compile(pattern, flags).match(string)

TypeError: cannot use a string pattern on a bytes-like object

查了一下原因说是py3的urlopen返回的不是string是bytes所以需要在正则表达式之前加一个编码

#coding=utf-8

import re

import os

import os.path

import gzip

def read_gz_file(path):

if os.path.exists(path):

with gzip.open(path, 'r') as infile:

for line in infile:

yield line

else:

print('the path [{}] is not exist!'.format(path))

num = 0

con = read_gz_file("/thinker/globe/udata/big/qiany/gliomas/wes/gatkvcf/mutect2/T2B/10_T2B_somatic.vcf.gz")

if getattr(con, '__iter__', None):

for line in con:

line = line.decode('utf-8') #这里py3的urlopen返回的不是string是bytes，所以需要加入decode('utf-8') 编码

if re.match("^#", line) is None:

if line.split()[6] == "PASS":

num = num + 1

print("10_T2B_somatic.vcf.gz is " + str(num))

[qiany@gm112-2 T2B]$ python3 counttest.py

10_T2B_somatic.vcf.gz is 470

再把原来的循环加上

#coding=utf-8

import re

import os

import os.path

import gzip

def read_gz_file(path):

if os.path.exists(path):

with gzip.open(path, 'r') as infile:

for line in infile:

yield line

else:

print('the path [{}] is not exist!'.format(path))

for i in {10,20,23,24,29,3,35,39,44,5,6,61,65,8,52,55,57,58,59}:

num = 0

con = read_gz_file("/thinker/globe/udata/big/qiany/gliomas/wes/gatkvcf/mutect2/T2B/" + str(i) + "_T2B_somatic.vcf.gz")

if getattr(con, '__iter__', None):

for line in con:

line = line.decode('utf-8')

if re.match("^#", line) is None:

if line.split()[6] == "PASS":

num = num + 1

print(str(i) + "_T2B_somatic.vcf.gz is " + str(num))

运行一下结果

[qiany@gm112-2 T2B]$ python3 counttest.py

65_T2B_somatic.vcf.gz is 308

3_T2B_somatic.vcf.gz is 258

35_T2B_somatic.vcf.gz is 249

5_T2B_somatic.vcf.gz is 729

6_T2B_somatic.vcf.gz is 432

39_T2B_somatic.vcf.gz is 427

8_T2B_somatic.vcf.gz is 1776

10_T2B_somatic.vcf.gz is 470

44_T2B_somatic.vcf.gz is 366

61_T2B_somatic.vcf.gz is 2136

20_T2B_somatic.vcf.gz is 294

52_T2B_somatic.vcf.gz is 371

55_T2B_somatic.vcf.gz is 447

23_T2B_somatic.vcf.gz is 467

24_T2B_somatic.vcf.gz is 487

57_T2B_somatic.vcf.gz is 281

58_T2B_somatic.vcf.gz is 253

59_T2B_somatic.vcf.gz is 262

29_T2B_somatic.vcf.gz is 306

这个结果与shell是一致的。

治治吱吱芝芝之

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
linux 统计字符串个数字,Linux统计某一列中特定字符串的个数

问题描述：想统计一下vcf文件中FILTER这一列中，是PASS的个数我们知道，vcf文件包括的是前面很长的注释，以及后面的正文部分。注释都是以#开头的，所以我的思路是，把正文提取出来；然后提取其中的FILTER这一列；最后把这一列的PASS提取出来，最后是统计个数。因为所有的注释行都是以#开头的，所以用grep -v提取非注释行，就是正文，还有一个就是“^”的用法，这个正则表达式，也是linux...
复制链接

扫一扫