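Summary
I have recently been working on the Tencent advertising competition and used Pandas as my data-processing tool. This post records the methods I used, in the hope that they are useful to others.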
References: http://jingyan.baidu.com/season/43456?pn=0
R language tutorial: https://edu.aliyun.com/course/27/lesson/list
Installation
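Pandas is usually installed with pip (pip install pandas) or through a conda distribution. As a quick sanity check, the short sketch below only confirms that the import works and shows which version is in use; it assumes nothing beyond pandas being installed.

```python
import pandas as pd

# Print the installed pandas version to confirm the import works.
print(pd.__version__)
```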
Features
Reading a CSV file
import pandas as pd
df = pd.read_csv('./original_data/user_installedapps.csv', index_col='userID')
index_col specifies which column of the file becomes the row index of the DataFrame.
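To make the effect of index_col concrete, here is a minimal sketch that reads a small in-memory CSV instead of the contest file. The sample rows and the appID column are made up for illustration and are not taken from the real data.

```python
import io

import pandas as pd

# Made-up sample standing in for user_installedapps.csv (an assumption).
csv_text = "userID,appID\n1,357\n1,360\n2,357\n"

# Without index_col, pandas creates a default 0..n-1 integer index
# and userID stays an ordinary column.
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col, the userID column becomes the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col='userID')

print(list(df_default.columns))
print(df_indexed.index.name)
print(list(df_indexed.columns))
```

Note that index values do not have to be unique: in this sample userID 1 appears twice in the index.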
groupby
The following lines do roughly what a GROUP BY with a count does in a database. The equivalent SQL statement is:
select userID, count(*) from user_installedapps group by userID;
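! Note: with groupby(level='userID'), userID must be an index level, which is why the file is read with index_col='userID'. An ordinary column can also be grouped by passing its name, as in df.groupby('userID').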
df = pd.read_csv('./original_data/user_installedapps.csv', index_col='userID')
grouped_userid = df.groupby(level='userID')
print(grouped_userid.count())
# To compute totals instead of counts, switch to sum()
# print(grouped_userid.sum())
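One detail worth knowing when comparing with SQL: count() counts non-null values per column, while size() counts rows per group, which is the closer match to count(*). The sketch below shows the difference on made-up rows (the values are assumptions, with one appID deliberately missing), and also shows that grouping by a column and grouping by an index level produce the same groups.

```python
import pandas as pd

# Made-up rows (an assumption); one appID is missing on purpose.
df = pd.DataFrame({
    'userID': [1, 1, 2, 2, 2],
    'appID': [357.0, 360.0, 357.0, None, 361.0],
})

# Grouping by an ordinary column: no index is required.
by_column = df.groupby('userID')

# Grouping by an index level yields the same groups.
by_level = df.set_index('userID').groupby(level='userID')

# size() counts rows per group, like SQL count(*).
rows_per_user = by_column.size()

# count() counts non-null values, like SQL count(appID).
non_null_per_user = by_column['appID'].count()

print(rows_per_user)
print(non_null_per_user)
```

If the file has no missing values, the two results agree and either call can be used.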
Below is the script I wrote around this; adapt it to your needs.
import pandas as pd

conf = {
    # Path of the input CSV file
    'input_file_path': './original_data/user_installedapps.csv',
    # Index column(s) of the DataFrame; groupby(level=...) below relies on them
    'index_col': ['userID'],
    # Column(s) to group by; these must be among the index columns
    'group_by_col': ['userID'],
    # Path of the output CSV file
    'output_file_path': r'./processed_data/user_installedapps_count.csv'
}

# <1> File Import and Preparation *****************************
print('Importing File')
df = pd.read_csv(conf['input_file_path'], index_col=conf['index_col'])
grouped_df = df.groupby(level=conf['group_by_col'])
print('Importing File Finished')

# <2> Data Processing Period *********************************
# Change the aggregation function as required
print('Data Processing')
grouped_df_count = grouped_df.count()
# grouped_df_sum = grouped_df.sum()
print('Data Processing Finished')

# <3> File Export ********************************************
print('Exporting File')
grouped_df_count.to_csv(conf['output_file_path'], encoding='gbk')
# grouped_df_sum.to_csv(conf['output_file_path'], encoding='gbk')
print('Exporting File Finished')
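For trying the same import, group, export steps without the contest data, here is a minimal sketch that wraps them in a function and runs it against a temporary file. The function name count_by_index, the sample rows, and the file names are all hypothetical; only the pandas calls mirror the script above.

```python
import os
import tempfile

import pandas as pd


def count_by_index(input_path, output_path, index_col, encoding='gbk'):
    """Group a CSV by its index column and write per-group counts."""
    df = pd.read_csv(input_path, index_col=index_col)
    counts = df.groupby(level=index_col).count()
    counts.to_csv(output_path, encoding=encoding)
    return counts


# Made-up input written to a temporary directory (an assumption).
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'installed.csv')
out_path = os.path.join(tmp_dir, 'installed_count.csv')
with open(in_path, 'w') as f:
    f.write('userID,appID\n1,357\n1,360\n2,357\n')

counts = count_by_index(in_path, out_path, 'userID')

# Read the exported file back to inspect what was written.
result = pd.read_csv(out_path, encoding='gbk')
print(result)
```

One thing to watch for: to_csv does not create missing directories, so the output folder has to exist before exporting.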
Merging CSV files
# coding: utf-8
import pandas as pd

conf = {
    # Directory that holds the input CSV files and receives the merged file
    'base_path': './original_data'
}

# <1> File Import and Preparation *****************************
print('Importing File')
train = pd.read_csv(conf['base_path'] + '/train.csv')
ad = pd.read_csv(conf['base_path'] + '/ad.csv')
app_categories = pd.read_csv(conf['base_path'] + '/app_categories.csv')
position = pd.read_csv(conf['base_path'] + '/position.csv')
test = pd.read_csv(conf['base_path'] + '/test.csv')
user = pd.read_csv(conf['base_path'] + '/user.csv')
# user_app_actions = pd.read_csv('./data/csv/user_app_actions.csv')
# user_installedapps = pd.read_csv('./data/csv/user_installedapps.csv')
print('Importing File Finished')

# <2> Data Processing Period *********************************
# Left-join each lookup table onto train, one key at a time
print('Data Processing')
data = pd.merge(train, user, on=['userID'], how='left')
data = pd.merge(data, ad, on=['creativeID'], how='left')
data = pd.merge(data, position, on=['positionID'], how='left')
data = pd.merge(data, app_categories, on=['appID'], how='left')
print('Data Processing Finished')

# <3> File Export ********************************************
print('Exporting File')
data.to_csv(conf['base_path'] + '/merge1.csv', encoding='gbk')
print('Exporting File Finished')
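To show what how='left' does in these merges, here is a minimal sketch on two tiny made-up tables. The column values and the age column are assumptions for illustration; the real contest files have more columns. The validate='m:1' argument is an optional addition that is not part of the script above.

```python
import pandas as pd

# Made-up miniature tables (assumptions, not the contest data).
train = pd.DataFrame({'label': [0, 1, 0], 'userID': [1, 2, 3]})
user = pd.DataFrame({'userID': [1, 2], 'age': [25, 31]})

# how='left' keeps every row of train, matched or not.
# validate='m:1' raises an error if userID is duplicated in user,
# which would otherwise silently multiply rows.
data = pd.merge(train, user, on=['userID'], how='left', validate='m:1')

# The row count is unchanged, and the unmatched user gets NaN.
print(len(data) == len(train))
print(data['age'].isna().tolist())
```

Because unmatched keys produce NaN, integer columns from the right-hand table may come back as floats after a left merge.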