Exporting a CSV file in Python and counting occurrences: counting specific occurrences in a CSV file with Python...


I have a CSV file with 4 columns {Tag, User, Quality, Cluster_id}. Using Python, I would like to do the following: for every cluster_id (from 1 to 500), I want to see, for each user, the number of good and bad tags (obtained from the Quality column). There are more than 6000 users. I can only read the CSV file row by row, so I am not sure how this can be done.

For example:

Columns of csv = [Tag User Quality Cluster]

Row1 = [bag u1 good 1]

Row2 = [ground u2 bad 2]

Row3 = [xxx u1 bad 1]

Row4 = [bbb u2 good 3]

I have just managed to get each row of the csv file.

I can only access one row at a time, so I cannot use two nested for loops. The pseudocode of the algorithm I want to implement is:

for cluster in clusters:
    for user in users:
        if eval == good:
            good_num = good_num + 1
        else:
            bad_num = bad_num + 1

I hope this is clear.

Solution

Since someone's already posted a defaultdict solution, I'm going to give a pandas one, just for variety. pandas is a very handy library for data processing. Among other nice features, it can handle this counting problem in one line, depending on what kind of output is required. Really:

import pandas as pd

df = pd.read_csv("cluster.csv")
counted = df.groupby(["Cluster_id", "User", "Quality"]).size()
# Write the counts, not the original frame, to the output file.
counted.to_csv("counted.csv")
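To make the shape of that result concrete, here is a minimal sketch that runs the same groupby on the four example rows from the question. The in-memory CSV, the comma separator, and the column name `count` are assumptions made for illustration; the real file may be laid out differently.

```python
import io

import pandas as pd

# The four example rows from the question, as an in-memory CSV.
# Assumption: the real file is comma-separated and has this header row.
sample_csv = io.StringIO(
    "Tag,User,Quality,Cluster_id\n"
    "bag,u1,good,1\n"
    "ground,u2,bad,2\n"
    "xxx,u1,bad,1\n"
    "bbb,u2,good,3\n"
)
df = pd.read_csv(sample_csv)

# size() counts the rows in each (Cluster_id, User, Quality) combination.
counted = df.groupby(["Cluster_id", "User", "Quality"]).size()

# reset_index(name=...) flattens the result into a table with a named count column.
counted_table = counted.reset_index(name="count")
print(counted_table)
```

Each row of `counted_table` is one (cluster, user, quality) combination that actually occurs in the data; combinations that never occur are simply absent.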

--

Just to give a trailer for what pandas makes easy, we can load the file -- the main data storage object in pandas is called a "DataFrame":

>>> import pandas as pd
>>> df = pd.read_csv("cluster.csv")
>>> df
Int64Index: 500000 entries, 0 to 499999
Data columns:
Tag           500000  non-null values
User          500000  non-null values
Quality       500000  non-null values
Cluster_id    500000  non-null values
dtypes: int64(1), object(3)

We can check that the first few rows look okay:

>>> df[:5]
   Tag  User Quality  Cluster_id
0  bbb  u001     bad          39
1  bbb  u002     bad          36
2  bag  u003    good          11
3  bag  u004    good           9
4  bag  u005     bad          26

and then we can group by Cluster_id and User, and do work on each group:

>>> for name, group in df.groupby(["Cluster_id", "User"]):
...     print('group name:', name)
...     print('group rows:')
...     print(group)
...     print('counts of Quality values:')
...     print(group["Quality"].value_counts())
...     input()  # pause until Enter is pressed
...

group name: (1, 'u003')
group rows:
        Tag  User Quality  Cluster_id
372002  xxx  u003     bad           1
counts of Quality values:
bad    1

group name: (1, 'u004')
group rows:
           Tag  User Quality  Cluster_id
126003  ground  u004     bad           1
348003  ground  u004    good           1
counts of Quality values:
good    1
bad     1

group name: (1, 'u005')
group rows:
           Tag  User Quality  Cluster_id
42004   ground  u005     bad           1
258004  ground  u005     bad           1
390004  ground  u005     bad           1
counts of Quality values:
bad    3

[etc.]
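The per-group counts printed above can also be collected into a single table with one good column and one bad column per (Cluster_id, User) pair. The following is a sketch under the assumption that this layout is the output wanted; the data frame holds only the six cluster-1 rows shown in the session above, and the name `summary` is illustrative.

```python
import pandas as pd

# The six rows from the three groups printed above (cluster 1 only).
df = pd.DataFrame(
    {
        "Tag": ["xxx", "ground", "ground", "ground", "ground", "ground"],
        "User": ["u003", "u004", "u004", "u005", "u005", "u005"],
        "Quality": ["bad", "bad", "good", "bad", "bad", "bad"],
        "Cluster_id": [1, 1, 1, 1, 1, 1],
    }
)

# Count per (Cluster_id, User, Quality), then move Quality into the columns.
# fill_value=0 records a missing combination as 0 instead of NaN.
summary = (
    df.groupby(["Cluster_id", "User", "Quality"])
    .size()
    .unstack("Quality", fill_value=0)
)
print(summary)
```

With `fill_value=0`, a user who has no good tags in a cluster still gets a row, with 0 in the good column.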

If you're going to be doing a lot of processing of CSV files, pandas is definitely worth a look.
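For comparison, the row-by-row constraint from the question can also be met with the standard library alone, in a single pass and without nested loops. This is only a rough sketch in the spirit of the dictionary-based approach mentioned at the start of this answer, not a copy of it. It assumes a comma-separated file with a header row, and `count_quality` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def count_quality(rows):
    """Return a Counter keyed by (cluster_id, user, quality), built in one pass."""
    counts = Counter()
    for row in rows:
        key = (int(row["Cluster_id"]), row["User"], row["Quality"])
        counts[key] += 1
    return counts


# Illustrative in-memory data (the four rows from the question).
# For a real file, pass csv.DictReader(open_file) instead.
sample_csv = io.StringIO(
    "Tag,User,Quality,Cluster_id\n"
    "bag,u1,good,1\n"
    "ground,u2,bad,2\n"
    "xxx,u1,bad,1\n"
    "bbb,u2,good,3\n"
)
counts = count_quality(csv.DictReader(sample_csv))

for (cluster_id, user, quality), n in sorted(counts.items()):
    print(cluster_id, user, quality, n)
```

Because a `Counter` returns 0 for keys it has never seen, looking up a combination that does not occur in the file is safe.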
