Python: show only duplicate values / counting rows that share the same value in a column

Latest recommended article published 2022-08-25 11:23:15

weixin_39579726

I'm trying to reproduce the R aggregate() function in Python, but without concatenating. For each line, I just want to count how many lines have the same value in a given column.

The modifications I implemented are indicated by ###. The problem I am currently having is that the first column [0] contains character strings and the code seems to work only with floats.

import numpy as np
import scipy as sp

def MSD(vec):
    return [np.mean(vec),np.std(vec)]

def aggregate(df,by=0,to=1,func=np.sum):
    Dat = []
    # ColBy = df.T[by]
    ColBy = int(df.T[by][3:]) ### my attempt to read only the numbers in the first column's character strings
    ColTo = df.T[to]
    UniqueBy = np.sort(np.unique(ColBy))
    for ub in UniqueBy:
        uTo = ColTo[ColBy==ub]
        Out = func(uTo)
        # Dat.append(np.concatenate(([ub],Out)))
        Dat.append([ub],Out) ### because I do not want to concatenate
    return Dat

test_df = np.loadtxt('in_test.txt')
Agr = aggregate(test_df,0,3,MSD)
sp.savetxt("out_test.txt", Agr)

This is the error message:

Traceback (most recent call last):
  File "count_same_reads.py", line 30, in <module>
    test_df = np.loadtxt('in_test.txt')
  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 796, in loadtxt
    items = [conv(val) for (conv, val) in zip(converters, vals)]
ValueError: could not convert string to float: Tag19184

My data is tab-delimited, containing mostly strings, except for column 3 in which I want to write the number of occurrences of lines.

Here is the test data:

Tag19184 CTAAC hffef 1 a 36 - chr1 10006 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10012 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10018 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10024 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10030 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10036 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10042 0 36M 36
Tag20198 CTAAC hffef 1 a 36 - chr1 10048 0 36M 36
Tag20198 CTAAC hffef 1 a 36 - chr1 10054 0 36M 36
Tag45093 CTAAC hffef 1 a 36 - chr1 10060 0 36M 36

The result should look like this:

Tag19184 CTAAC hffef 7 a 36 - chr1 10006 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10012 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10018 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10024 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10030 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10036 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10042 0 36M 36
Tag20198 CTAAC hffef 2 a 36 - chr1 10048 0 36M 36
Tag20198 CTAAC hffef 2 a 36 - chr1 10054 0 36M 36
Tag45093 CTAAC hffef 1 a 36 - chr1 10060 0 36M 36

As you can probably tell, I'm not so good at python yet. Any advice would be welcome.

[EDIT] PS. The data is already sorted by column [0].

Solution

I would suggest pandas, especially since genomic data like yours can be quite large:

In [44]:
#you can read your data with pandas.read_csv()
import pandas as pd
print(df)
         v0     v1     v2  v3  v4  v5  v6    v7     v8  v9  v10  v11
0  Tag19184  CTAAC  hffef   1   a  36   -  chr1  10006   0  36M   36
1  Tag19184  CTAAC  hffef   1   a  36   -  chr1  10012   0  36M   36
2  Tag19184  CTAAC  hffef   1   a  36   -  chr1  10018   0  36M   36
3  Tag19184  CTAAC  hffef   1   a  36   -  chr1  10024   0  36M   36
4  Tag19184  CTAAC  hffef   1   a  36   -  chr1  10030   0  36M   36
5  Tag19184  CTAAC  hffef   1   a  36   -  chr1  10036   0  36M   36
6  Tag19184  CTAAC  hffef   1   a  36   -  chr1  10042   0  36M   36
7  Tag20198  CTAAC  hffef   1   a  36   -  chr1  10048   0  36M   36
8  Tag20198  CTAAC  hffef   1   a  36   -  chr1  10054   0  36M   36
9  Tag45093  CTAAC  hffef   1   a  36   -  chr1  10060   0  36M   36
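The session above uses `df` without showing how it was built. A minimal sketch of one way to load it, assuming the file is tab-delimited with no header row (as the question states) and that the column names `v0` to `v11` are simply assigned by hand to match the printout. The in-memory sample below stands in for `in_test.txt` and contains only three of the rows:

```python
import io
import pandas as pd

# Three sample rows standing in for in_test.txt (tab-delimited, no header).
sample = (
    "Tag19184\tCTAAC\thffef\t1\ta\t36\t-\tchr1\t10006\t0\t36M\t36\n"
    "Tag19184\tCTAAC\thffef\t1\ta\t36\t-\tchr1\t10012\t0\t36M\t36\n"
    "Tag20198\tCTAAC\thffef\t1\ta\t36\t-\tchr1\t10048\t0\t36M\t36\n"
)

# Assumed column names, chosen only to match the printout in the answer.
names = ["v%d" % i for i in range(12)]

# For the real file, pass the path 'in_test.txt' instead of the StringIO object.
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=names)
print(df)
```

Unlike `np.loadtxt`, which tries to convert every field to float by default, `read_csv` infers a type per column, so the string columns do not raise an error.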

In [45]:
#if we want to group by the first 3 fields
#select v3 before transforming so that only that column is summed
df.groupby(['v0','v1','v2'])['v3'].transform('sum')
Out[45]:
0    7
1    7
2    7
3    7
4    7
5    7
6    7
7    2
8    2
9    1
Name: v3, dtype: int64
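Summing `v3` gives the group size only because `v3` is 1 in every input row. If that cannot be guaranteed, counting the rows directly may be safer. A small sketch using `transform('size')`, on a reduced frame with only the columns needed; grouping on `v0` alone is an assumption here, based on the question asking about a single column:

```python
import pandas as pd

# Reduced stand-in for the real data: just the tag column and the count column.
df = pd.DataFrame({
    "v0": ["Tag19184"] * 7 + ["Tag20198"] * 2 + ["Tag45093"],
    "v3": [1] * 10,
})

# 'size' counts the rows in each group and broadcasts the count back
# to every row of that group, regardless of what v3 contained before.
counts = df.groupby("v0")["v0"].transform("size")
df["v3"] = counts
print(df)
```

This does not depend on the rows being sorted by `v0`, although the question notes that they are.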

In [46]:
#all it takes is just one line
df['v3']=df.groupby(['v0','v1','v2'])['v3'].transform('sum')
print(df)
         v0     v1     v2  v3  v4  v5  v6    v7     v8  v9  v10  v11
0  Tag19184  CTAAC  hffef   7   a  36   -  chr1  10006   0  36M   36
1  Tag19184  CTAAC  hffef   7   a  36   -  chr1  10012   0  36M   36
2  Tag19184  CTAAC  hffef   7   a  36   -  chr1  10018   0  36M   36
3  Tag19184  CTAAC  hffef   7   a  36   -  chr1  10024   0  36M   36
4  Tag19184  CTAAC  hffef   7   a  36   -  chr1  10030   0  36M   36
5  Tag19184  CTAAC  hffef   7   a  36   -  chr1  10036   0  36M   36
6  Tag19184  CTAAC  hffef   7   a  36   -  chr1  10042   0  36M   36
7  Tag20198  CTAAC  hffef   2   a  36   -  chr1  10048   0  36M   36
8  Tag20198  CTAAC  hffef   2   a  36   -  chr1  10054   0  36M   36
9  Tag45093  CTAAC  hffef   1   a  36   -  chr1  10060   0  36M   36
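To get the output file the question asks for, the updated frame can be written back as tab-delimited text without the index or header row. A sketch on a reduced, made-up frame; it writes to an in-memory buffer so the example is self-contained, and the commented line shows the equivalent call for the question's `out_test.txt`:

```python
import io
import pandas as pd

# Reduced stand-in for the frame after the counts have been filled in.
df = pd.DataFrame({
    "v0": ["Tag19184", "Tag19184", "Tag45093"],
    "v1": ["CTAAC"] * 3,
    "v3": [2, 2, 1],
})

buf = io.StringIO()
# For a real file: df.to_csv("out_test.txt", sep="\t", header=False, index=False)
df.to_csv(buf, sep="\t", header=False, index=False)
text = buf.getvalue()
print(text)
```

Dropping the header and index keeps the output in the same layout as the input file, so it can be fed to the same downstream tools.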
