Show only duplicate values in Python: counting rows with the same value in a column

I'm trying to reproduce the R aggregate() function in Python, but without concatenating. For each line, I just want to count the number of lines that have the same value in a given column.

The modifications I made are marked with ###. My current problem is that the first column [0] contains character strings, and the code seems to work only with floats.

import numpy as np
import scipy as sp

def MSD(vec):
    return [np.mean(vec),np.std(vec)]

def aggregate(df,by=0,to=1,func=np.sum):
    Dat = []
    # ColBy = df.T[by]
    ColBy = int(df.T[by][3:]) ### my attempt to read only the numbers in the first column's character strings
    ColTo = df.T[to]
    UniqueBy = np.sort(np.unique(ColBy))
    for ub in UniqueBy:
        uTo = ColTo[ColBy==ub]
        Out = func(uTo)
        # Dat.append(np.concatenate(([ub],Out)))
        Dat.append([ub],Out) ### because I do not want to concatenate
    return Dat

test_df = np.loadtxt('in_test.txt')
Agr = aggregate(test_df,0,3,MSD)
sp.savetxt("out_test.txt", Agr)

This is the error message:

Traceback (most recent call last):
  File "count_same_reads.py", line 30, in <module>
    test_df = np.loadtxt('in_test.txt')
  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 796, in loadtxt
    items = [conv(val) for (conv, val) in zip(converters, vals)]
ValueError: could not convert string to float: Tag19184

My data is tab-delimited, containing mostly strings, except for column 3 in which I want to write the number of occurrences of lines.

Here is the test data:

Tag19184 CTAAC hffef 1 a 36 - chr1 10006 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10012 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10018 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10024 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10030 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10036 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10042 0 36M 36
Tag20198 CTAAC hffef 1 a 36 - chr1 10048 0 36M 36
Tag20198 CTAAC hffef 1 a 36 - chr1 10054 0 36M 36
Tag45093 CTAAC hffef 1 a 36 - chr1 10060 0 36M 36

The result should look like this:

Tag19184 CTAAC hffef 7 a 36 - chr1 10006 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10012 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10018 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10024 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10030 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10036 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10042 0 36M 36
Tag20198 CTAAC hffef 2 a 36 - chr1 10048 0 36M 36
Tag20198 CTAAC hffef 2 a 36 - chr1 10054 0 36M 36
Tag45093 CTAAC hffef 1 a 36 - chr1 10060 0 36M 36

As you can probably tell, I'm not so good at python yet. Any advice would be welcome.

[EDIT] PS. The data is already sorted by column [0].

Solution

I would suggest pandas, especially since genomic data like yours can be quite large:

In [44]:

# you can read your data with pandas.read_csv()
import pandas as pd
print(df)

         v0     v1     v2  v3  v4  v5  v6    v7     v8  v9  v10  v11
0  Tag19184  CTAAC  hffef   1   a  36   -  chr1  10006   0  36M   36
1  Tag19184  CTAAC  hffef   1   a  36   -  chr1  10012   0  36M   36
2  Tag19184  CTAAC  hffef   1   a  36   -  chr1  10018   0  36M   36
3  Tag19184  CTAAC  hffef   1   a  36   -  chr1  10024   0  36M   36
4  Tag19184  CTAAC  hffef   1   a  36   -  chr1  10030   0  36M   36
5  Tag19184  CTAAC  hffef   1   a  36   -  chr1  10036   0  36M   36
6  Tag19184  CTAAC  hffef   1   a  36   -  chr1  10042   0  36M   36
7  Tag20198  CTAAC  hffef   1   a  36   -  chr1  10048   0  36M   36
8  Tag20198  CTAAC  hffef   1   a  36   -  chr1  10054   0  36M   36
9  Tag45093  CTAAC  hffef   1   a  36   -  chr1  10060   0  36M   36
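The read_csv() step mentioned in the comment could look roughly like the sketch below. It assumes the file is tab-delimited with no header row, as described in the question. The column names v0 to v11 are only labels chosen to match the printout, and the in-memory buffer stands in for the real file so the example runs on its own.

```python
import io
import pandas as pd

# Rebuild the ten test rows from the question as tab-delimited text.
tags = ["Tag19184"] * 7 + ["Tag20198"] * 2 + ["Tag45093"]
rows = [
    "\t".join([tag, "CTAAC", "hffef", "1", "a", "36", "-", "chr1",
               str(10006 + 6 * i), "0", "36M", "36"])
    for i, tag in enumerate(tags)
]
sample = "\n".join(rows) + "\n"

# Labels v0..v11 are an assumption; the file itself has no header line.
names = ["v{}".format(i) for i in range(12)]

# sep="\t" for tab-delimited input, header=None because there is no header row.
# For the real data, pass the file name (e.g. 'in_test.txt') instead of the buffer.
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=names)
print(df.dtypes)
```

Unlike np.loadtxt, read_csv infers a type per column, so the string columns no longer trigger the "could not convert string to float" error.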

In [45]:

# if we want to group by the first 3 fields
# (summing v3 gives the row count here only because v3 is 1 in every row)
df.groupby(['v0','v1','v2'])['v3'].transform('sum')

Out[45]:

0    7
1    7
2    7
3    7
4    7
5    7
6    7
7    2
8    2
9    1
Name: v3, dtype: int64
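Summing v3 yields the number of occurrences only because every input row carries a 1 in that column. If that cannot be relied on, one possible variant is to count the rows of each group directly. The sketch below groups by v0 alone, as the question asks, and uses a cut-down two-column frame just to keep the example short.

```python
import pandas as pd

# Cut-down stand-in for the real data: only the grouping column and v3.
df = pd.DataFrame({
    "v0": ["Tag19184"] * 7 + ["Tag20198"] * 2 + ["Tag45093"],
    "v3": [1] * 10,
})

# transform("size") broadcasts each group's row count back to every row,
# so the result lines up with the original index.
counts = df.groupby("v0")["v0"].transform("size")
print(counts.tolist())  # [7, 7, 7, 7, 7, 7, 7, 2, 2, 1]
```

Because the count does not depend on the contents of v3, this also works when that column holds placeholder values other than 1.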

In [46]:

# all it takes is just one line
df['v3'] = df.groupby(['v0','v1','v2'])['v3'].transform('sum')
print(df)

         v0     v1     v2  v3  v4  v5  v6    v7     v8  v9  v10  v11
0  Tag19184  CTAAC  hffef   7   a  36   -  chr1  10006   0  36M   36
1  Tag19184  CTAAC  hffef   7   a  36   -  chr1  10012   0  36M   36
2  Tag19184  CTAAC  hffef   7   a  36   -  chr1  10018   0  36M   36
3  Tag19184  CTAAC  hffef   7   a  36   -  chr1  10024   0  36M   36
4  Tag19184  CTAAC  hffef   7   a  36   -  chr1  10030   0  36M   36
5  Tag19184  CTAAC  hffef   7   a  36   -  chr1  10036   0  36M   36
6  Tag19184  CTAAC  hffef   7   a  36   -  chr1  10042   0  36M   36
7  Tag20198  CTAAC  hffef   2   a  36   -  chr1  10048   0  36M   36
8  Tag20198  CTAAC  hffef   2   a  36   -  chr1  10054   0  36M   36
9  Tag45093  CTAAC  hffef   1   a  36   -  chr1  10060   0  36M   36
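To get the result back into the tab-delimited layout shown in the question, DataFrame.to_csv() can write the frame without the index or the column labels. The following is a minimal end-to-end sketch under the same assumptions as before: no header row, made-up labels v0 to v11, and in-memory buffers standing in for the in_test.txt and out_test.txt files from the question.

```python
import io
import pandas as pd

# Rebuild the ten test rows from the question as tab-delimited text.
tags = ["Tag19184"] * 7 + ["Tag20198"] * 2 + ["Tag45093"]
rows = [
    "\t".join([tag, "CTAAC", "hffef", "1", "a", "36", "-", "chr1",
               str(10006 + 6 * i), "0", "36M", "36"])
    for i, tag in enumerate(tags)
]

# Read the data; v0..v11 are illustrative labels, not part of the file.
df = pd.read_csv(io.StringIO("\n".join(rows) + "\n"), sep="\t", header=None,
                 names=["v{}".format(i) for i in range(12)])

# Overwrite column 3 with the number of rows sharing the same v0 value.
df["v3"] = df.groupby("v0")["v0"].transform("size")

# Write tab-delimited output with no header line and no index column.
# For a real file, pass a path such as 'out_test.txt' instead of the buffer.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", header=False, index=False)
output_lines = buffer.getvalue().splitlines()
print(output_lines[0])
```

Passing header=False and index=False keeps the output in the same twelve-column shape as the input, so downstream tools that expect the original layout should still be able to read it.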
