linux 统计 字符串个数字,Linux统计某一列中特定字符串的个数

问题描述:想统计一下vcf文件中FILTER这一列中,是PASS的个数

我们知道,vcf文件包括的是前面很长的注释,以及后面的正文部分。注释都是以#开头的,所以我的思路是,

把正文提取出来;

然后提取其中的FILTER这一列;

最后把这一列的PASS提取出来,最后是统计个数。

因为所有的注释行都是以#开头的,所以用grep -v提取非注释行,就是正文,还有一个就是“^”的用法,这个正则表达式,也是linux 中一个比较难的部分。主要是需要记住的部分很多。可以参考http://www.runoob.com/regexp/regexp-metachar.html 里面总结了各种字符。

[qiany@gm112-2 T2B]$ less -S 57_T2B_somatic.vcf.gz |grep -v '^#' |head

chrUn_JTFH01001680v1_decoy 528 . G A . clustered_events;contamination;t_lod DP=336;ECNT=14;NLOD=3.01;N_ART_LOD=-1.041e+00;POP_AF=2.500e-06;P_CONTAM=0.908;P_GERMLINE=-9.168e+01;TLOD=4.25 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:PGT:PID:SA_MAP_AF:SA_POST_PROB 0/0:10,0:8.505e-05:6,0:4,0:35,0:252,0:0:0:0|1:528_G_A 0/1:319,3:0.015:147,2:172,1:35,36:234,234:40:26:0|1:528_G_A:0.010,0.00,9.317e-03:8.225e-03,2.072e-03,0.990

chrUn_JTFH01001680v1_decoy 532 . T A . clustered_events;contamination;t_lod DP=322;ECNT=14;NLOD=2.71;N_ART_LOD=-7.404e-01;POP_AF=2.500e-06;P_CONTAM=0.898;P_GERMLINE=-8.888e+01;TLOD=4.60 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:PGT:PID:SA_MAP_AF:SA_POST_PROB 0/0:9,0:4.700e-05:5,0:4,0:34,0:254,0:0:0:0|1:528_G_A 0/1:310,3:9.632e-03:145,2:165,1:33,35:236,234:40:22:0|1:528_G_A:0.010,0.00,9.585e-03:8.677e-03,2.116e-03,0.989

chrUn_JTFH01001862v1_decoy 32 . C A . clustered_events DP=477;ECNT=4;NLOD=3.59;N_ART_LOD=-8.451e-01;POP_AF=2.500e-06;P_CONTAM=2.127e-03;P_GERMLINE=-1.145e+02;TLOD=9.79 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/0:12,0:4.012e-03:7,0:5,0:36,0:204,0:0:0 0/1:426,7:0.051:183,5:243,2:34,35:199,176:40:9:0.010,0.020,0.016:1.705e-03,0.013,0.985

chrUn_JTFH01001862v1_decoy 36 . A T . clustered_events;panel_of_normals DP=534;ECNT=4;IN_PON;NLOD=3.61;N_ART_LOD=-1.114e+00;POP_AF=2.500e-06;P_CONTAM=0.00;P_GERMLINE=-1.061e+02;TLOD=38.63 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/0:12,0:1.442e-04:7,0:5,0:38,0:204,0:0:0 0/1:457,22:0.083:204,8:253,14:35,33:195,191:40:22:0.030,0.040,0.046:4.988e-03,6.120e-03,0.989

查看FILTER 这一列是第几行,可以把注释信息的最后一列提出来,就可以知道正文都包含哪些列了。数了一下发现是第7列

[qiany@gm112-2 T2B]$ less -S 57_T2B_somatic.vcf.gz |grep '^#' |tail -1

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 57_B 57_T

然后可以用awk把第七列提取出来,其实想的是直接按照FILTER提取,但是找了一圈没有找到。

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |head

contamination;mapping_quality;t_lod

clustered_events;contamination;t_lod

clustered_events;contamination;t_lod

clustered_events;contamination;t_lod

artifact_in_normal;contamination;t_lod

artifact_in_normal;str_contraction

contamination;str_contraction;t_lod

base_quality;clustered_events;t_lod

clustered_events

clustered_events

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep -o "PASS" |head

PASS

PASS

PASS

PASS

PASS

PASS

PASS

PASS

PASS

PASS

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep -o "PASS" |wc -l

2136

有一个问题是grep 和grep -o有什么区别呢?官方讲的是grep -o是-o, --only-matching show only the part of a line matching PATTERN,也就是说只会把字符串部分输出来。但是如果是grep的话,会把含有字符串的这一行都输出来。以下是例子

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep "clu" |head

clustered_events;contamination;t_lod

clustered_events;contamination;t_lod

clustered_events;contamination;t_lod

base_quality;clustered_events;t_lod

clustered_events

clustered_events

base_quality;clustered_events;multiallelic;panel_of_normals

clustered_events

base_quality;clustered_events;contamination;read_position

clustered_events;contamination

并且的话grep -o会把字符串单独输出到一行。

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep -o "clu" |head

clu

clu

clu

clu

clu

clu

clu

clu

clu

clu

下面的例子提现了两种方式输出的行数不一样

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep "_" |wc -l

54211

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep -o "_" |wc -l

107908

但是我的目的是统计这一列是PASS的情况,也就是说这一列中除了PASS再也没有其他的情况。所以应该是用grep,然后以PASS开始,以PASS结束

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep "^PASS$" |wc -l

2136

我要统计很多个文件的话,就可以写成shell脚本了。

for i in {10,20,23,24,29,3,35,39,44,5,6,61,65,8,52,55,57,58,59}

do

echo -n "${i}_T2B_somatic.vcf.gz " #不换行的方式1

echo -e "the number is \c" #不换行的方式2

less -S ${i}_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep "^PASS$" |wc -l

done

[qiany@gm112-2 T2B]$ sh countpsaanum.sh

10_T2B_somatic.vcf.gz the number is 470

20_T2B_somatic.vcf.gz the number is 294

23_T2B_somatic.vcf.gz the number is 467

24_T2B_somatic.vcf.gz the number is 487

29_T2B_somatic.vcf.gz the number is 306

3_T2B_somatic.vcf.gz the number is 258

35_T2B_somatic.vcf.gz the number is 249

39_T2B_somatic.vcf.gz the number is 427

44_T2B_somatic.vcf.gz the number is 366

5_T2B_somatic.vcf.gz the number is 729

6_T2B_somatic.vcf.gz the number is 432

61_T2B_somatic.vcf.gz the number is 2136

65_T2B_somatic.vcf.gz the number is 308

8_T2B_somatic.vcf.gz the number is 1776

52_T2B_somatic.vcf.gz the number is 371

55_T2B_somatic.vcf.gz the number is 447

57_T2B_somatic.vcf.gz the number is 281

58_T2B_somatic.vcf.gz the number is 253

59_T2B_somatic.vcf.gz the number is 262

但是个人还是比较习惯Python的方式。

#!/usr/bin/env python

import re

for i in {10,20,23,24,29,3,35,39,44,5,6,61,65,8,52,55,57,58,59}:

num = 0

infile = open( "10_T2B_somatic.vcf.gz", "r")

for line in infile:

if re.match("^#", line) is None:

if line.split()[6] == "PASS":

num = num + 1

print("10_T2B_somatic.vcf.gz is" + num)

infile.close()

[qiany@gm112-2 T2B]$ python counttest.py

Traceback (most recent call last):

File "counttest.py", line 6, in

for line in infile:

File "/thinker/globe/udata/big/software/anaconda3/lib/python3.6/codecs.py", line 321, in decode

(result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

import re

import os

import os.path

import gzip

def read_gz_file(path):

if os.path.exists(path):

with gzip.open(path, 'r') as infile:

for line in infile:

yield line

else:

print('the path [{}] is not exist!'.format(path))

num = 0

con = read_gz_file("/thinker/globe/udata/big/qiany/gliomas/wes/gatkvcf/mutect2/T2B/10_T2B_somatic.vcf.gz")

if getattr(con, '__iter__', None):

for line in con:

if re.match("^#", line) is None:

if line.split()[6] == "PASS":

num = num + 1

print("10_T2B_somatic.vcf.gz is " + str(num))

[qiany@gm112-2 T2B]$ python2 counttest.py

10_T2B_somatic.vcf.gz is 470

但是还是有问题,当我用Python2的时候,是能够运行的,但是用python3却提示错误。

[qiany@gm112-2 T2B]$ python3 counttest.py

Traceback (most recent call last):

File "counttest.py", line 17, in

if re.match("^#", line) is None:

File "/thinker/globe/udata/big/software/anaconda3/lib/python3.6/re.py", line 172, in match

return _compile(pattern, flags).match(string)

TypeError: cannot use a string pattern on a bytes-like object

查了一下原因说是py3的urlopen返回的不是string是bytes所以需要在正则表达式之前加一个编码

#coding=utf-8

import re

import os

import os.path

import gzip

def read_gz_file(path):

if os.path.exists(path):

with gzip.open(path, 'r') as infile:

for line in infile:

yield line

else:

print('the path [{}] is not exist!'.format(path))

num = 0

con = read_gz_file("/thinker/globe/udata/big/qiany/gliomas/wes/gatkvcf/mutect2/T2B/10_T2B_somatic.vcf.gz")

if getattr(con, '__iter__', None):

for line in con:

line = line.decode('utf-8') #这里py3的urlopen返回的不是string是bytes,所以需要加入decode('utf-8') 编码

if re.match("^#", line) is None:

if line.split()[6] == "PASS":

num = num + 1

print("10_T2B_somatic.vcf.gz is " + str(num))

[qiany@gm112-2 T2B]$ python3 counttest.py

10_T2B_somatic.vcf.gz is 470

再把原来的循环加上

#coding=utf-8

import re

import os

import os.path

import gzip

def read_gz_file(path):

if os.path.exists(path):

with gzip.open(path, 'r') as infile:

for line in infile:

yield line

else:

print('the path [{}] is not exist!'.format(path))

for i in {10,20,23,24,29,3,35,39,44,5,6,61,65,8,52,55,57,58,59}:

num = 0

con = read_gz_file("/thinker/globe/udata/big/qiany/gliomas/wes/gatkvcf/mutect2/T2B/" + str(i) + "_T2B_somatic.vcf.gz")

if getattr(con, '__iter__', None):

for line in con:

line = line.decode('utf-8')

if re.match("^#", line) is None:

if line.split()[6] == "PASS":

num = num + 1

print(str(i) + "_T2B_somatic.vcf.gz is " + str(num))

运行一下结果

[qiany@gm112-2 T2B]$ python3 counttest.py

65_T2B_somatic.vcf.gz is 308

3_T2B_somatic.vcf.gz is 258

35_T2B_somatic.vcf.gz is 249

5_T2B_somatic.vcf.gz is 729

6_T2B_somatic.vcf.gz is 432

39_T2B_somatic.vcf.gz is 427

8_T2B_somatic.vcf.gz is 1776

10_T2B_somatic.vcf.gz is 470

44_T2B_somatic.vcf.gz is 366

61_T2B_somatic.vcf.gz is 2136

20_T2B_somatic.vcf.gz is 294

52_T2B_somatic.vcf.gz is 371

55_T2B_somatic.vcf.gz is 447

23_T2B_somatic.vcf.gz is 467

24_T2B_somatic.vcf.gz is 487

57_T2B_somatic.vcf.gz is 281

58_T2B_somatic.vcf.gz is 253

59_T2B_somatic.vcf.gz is 262

29_T2B_somatic.vcf.gz is 306

这个结果与shell是一致的。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值