Exporting a CSV file in Python and counting the occurrences of each value_Counting specific occurrences in a CSV file with Python...

Latest recommended article published on 2023-09-21 19:48:05

weixin_39575775

Article tags: exporting a CSV file in Python, counting the occurrences of each value

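This post shows how to use Python's Pandas library to process a CSV file, specifically to count the good and bad tags for each cluster ID and user. The Pandas groupby function handles the grouping and counting efficiently, which simplifies the analysis of data covering many users and many cluster IDs.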

Summary generated automatically by CSDN

I have a CSV file with 4 columns {Tag, User, Quality, Cluster_id}. Using Python, I would like to do the following: for every cluster_id (from 1 to 500), I want to see, for each user, the number of good and bad tags (obtained from the Quality column). There are more than 6000 users. I can only read the CSV file row by row, so I am not sure how this can be done.

For example:

Columns of csv = [Tag User Quality Cluster]

Row1= [bag u1 good 1]

Row2 = [ground u2 bad 2]

Row3 = [xxx u1 bad 1]

Row4 = [bbb u2 good 3]

I have just managed to get each row of the csv file.

I can only access one row at a time, so I cannot use two nested for loops. The pseudocode of the algorithm I want to implement is:

for cluster in clusters:
    for user in users:
        if eval == good:
            good_num = good_num + 1
        else:
            bad_num = bad_num + 1

Hope I am clear
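The row-by-row constraint does not actually require the nested loops: a single pass that keys a counter on (cluster, user, quality) gives the same totals. What follows is only a minimal sketch under assumptions not stated in the question: the file is comma-separated, it has a header row with the four column names, and the helper name count_quality and the in-memory sample are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_quality(rows):
    """Count (cluster, user, quality) triples in one pass over the rows."""
    counts = Counter()
    for row in rows:
        # Each row is a dict keyed by the (assumed) header names.
        counts[(row["Cluster_id"], row["User"], row["Quality"])] += 1
    return counts


# Hypothetical sample mirroring the four example rows; a real run would
# pass csv.DictReader(open("your_file.csv", newline="")) instead.
sample = io.StringIO(
    "Tag,User,Quality,Cluster_id\n"
    "bag,u1,good,1\n"
    "ground,u2,bad,2\n"
    "xxx,u1,bad,1\n"
    "bbb,u2,good,3\n"
)

counts = count_quality(csv.DictReader(sample))
for (cluster, user, quality), n in sorted(counts.items()):
    print(cluster, user, quality, n)
```

Because csv.DictReader yields one row at a time, only the counter is held in memory, not the whole file. Note that the keys are strings here, since the csv module does not convert types.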

Solution

Since someone's already posted a defaultdict solution, I'm going to give a pandas one, just for variety. pandas is a very handy library for data processing. Among other nice features, it can handle this counting problem in one line, depending on what kind of output is required. Really:

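import pandas as pd

# Load the CSV, count the rows in each (cluster, user, quality) group,
# and write the counts (not the original frame) back out.
df = pd.read_csv("cluster.csv")
counted = df.groupby(["Cluster_id", "User", "Quality"]).size()
counted.to_csv("counted.csv")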
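If separate good and bad columns per cluster and user are wanted, the grouped counts can be pivoted with unstack. This is a rough sketch rather than the only way to do it; the in-memory sample below is hypothetical and stands in for the real file, and it assumes comma-separated data with a header row.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV file.
sample = io.StringIO(
    "Tag,User,Quality,Cluster_id\n"
    "bag,u1,good,1\n"
    "ground,u2,bad,2\n"
    "xxx,u1,bad,1\n"
    "bbb,u2,good,3\n"
)

df = pd.read_csv(sample)

# Number of rows in each (cluster, user, quality) group.
counted = df.groupby(["Cluster_id", "User", "Quality"]).size()

# Move the Quality level into columns; combinations that never occur get 0.
table = counted.unstack("Quality", fill_value=0)
print(table)
```

The result has one row per (Cluster_id, User) pair and one column per distinct Quality value, so table.loc[(1, "u1"), "good"] looks up a single count.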

Just to give a trailer for what pandas makes easy, we can load the file -- the main data storage object in pandas is called a "DataFrame":

>>> import pandas as pd

>>> df = pd.read_csv("cluster.csv")

>>> df

Int64Index: 500000 entries, 0 to 499999

Data columns:

Tag 500000 non-null values

User 500000 non-null values

Quality 500000 non-null values