Pandas Study Notes

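Summary

I have recently been working on the Tencent advertising competition and used Pandas as the data-processing tool. These notes record the methods I used, to share with others.

Reference: http://jingyan.baidu.com/season/43456?pn=0

R language tutorial: https://edu.aliyun.com/course/27/lesson/list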


Installation


Features

Reading a CSV file

import pandas as pd

df = pd.read_csv('./original_data/user_installedapps.csv', index_col='userID')

index_col specifies which column is used as the row index of the DataFrame.
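As a small illustration of what index_col does, the sketch below reads the same in-memory CSV twice, once without and once with index_col. The sample rows and the appID column are made up for the example; only the userID column name comes from the notes above.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for user_installedapps.csv
csv_text = "userID,appID\n1,357\n1,360\n2,357\n"

# Without index_col, pandas creates a default integer index (0, 1, 2, ...)
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col, the userID column becomes the row index
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col='userID')

print(list(df_default.columns))   # userID is an ordinary column here
print(df_indexed.index.name)      # userID is the index here
print(list(df_indexed.columns))   # userID is no longer among the columns
```

Whether to set index_col at read time is mostly a matter of convenience: it saves a later set_index() call when the index is needed, as in the groupby(level=...) usage in these notes.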

groupby

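The following lines do roughly what a database GROUP BY followed by a count does. The equivalent SQL statement is:

select userID, count(*) from user_installedapps group by userID;

! Note: groupby(level=...) groups by an index level, so userID must be the index here (set through index_col). To group by an ordinary column instead, pass the column name, e.g. df.groupby('userID').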

df = pd.read_csv('./original_data/user_installedapps.csv',index_col='userID')

grouped_userid = df.groupby(level='userID')

print(grouped_userid.count())

# To compute sums instead of counts, use sum()
# print(grouped_userid.sum())
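The sketch below compares grouping on the index level with grouping on an ordinary column, and also shows size(). It uses made-up sample data (the appID column and its values are assumptions). One detail worth knowing: count() counts non-missing values per column, while size() counts rows per group, so size() is the closer match to SQL count(*).

```python
import io
import pandas as pd

# Hypothetical sample data standing in for user_installedapps.csv
csv_text = "userID,appID\n1,357\n1,360\n2,357\n"
df = pd.read_csv(io.StringIO(csv_text), index_col='userID')

# Group on the index level, as in the notes above
by_level = df.groupby(level='userID').count()

# Group on an ordinary column after moving userID out of the index
by_column = df.reset_index().groupby('userID').count()

# size() counts rows per group and returns a Series
rows_per_user = df.groupby(level='userID').size()

print(by_level)
print(by_column)
print(rows_per_user)
```

With no missing values in the data, all three give the same numbers per userID; they only diverge when some cells are NaN.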

Below is the configurable script I wrote; adapt it as needed.

import pandas as pd

conf = {
    # Path of the input CSV file
    'input_file_path': './original_data/user_installedapps.csv',

    # Column(s) used as the DataFrame index; groupby(level=...) below relies on this
    'index_col': ['userID'],

    # Index level(s) to group by; must be listed in 'index_col'
    'group_by_col': ['userID'],

    # Path of the output CSV file
    'output_file_path': r'./processed_data/user_installedapps_count.csv'
}

# <1> File Import and Preparation *****************************
print('Importing File')

df = pd.read_csv(conf['input_file_path'], index_col=conf['index_col'])

grouped_df = df.groupby(level=conf['group_by_col'])

print('Importing File Finished')


# <2> Data Processing Period  *********************************
# Swap in a different aggregation here as required
print('Data Processing')

grouped_df_count = grouped_df.count()
# grouped_df_sum = grouped_df.sum()

print('Data Processing Finished')


# <3> File Export  ********************************************
print('Exporting File')

grouped_df_count.to_csv(conf['output_file_path'], encoding='gbk')
# grouped_df_sum.to_csv(conf['output_file_path'], encoding='gbk')

print('Exporting File Finished')
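If the same steps are needed for several files, one option is to wrap them in a function. The following is only a sketch: count_by_index is a hypothetical helper name, not part of pandas, and the sample data is invented so that the example runs without the competition files.

```python
import io
import pandas as pd


def count_by_index(source, index_col):
    """Read a CSV, group rows by the given index column, and count them.

    Hypothetical helper for illustration; `source` may be a path or a
    file-like object, as accepted by pd.read_csv.
    """
    df = pd.read_csv(source, index_col=index_col)
    return df.groupby(level=index_col).count()


# Made-up sample input in place of user_installedapps.csv
sample = io.StringIO("userID,appID\n1,357\n1,360\n2,357\n")
counts = count_by_index(sample, 'userID')
print(counts)

# The result can then be exported as before, for example:
# counts.to_csv('./processed_data/user_installedapps_count.csv', encoding='gbk')
```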

Merging CSV files

# coding: utf-8
import pandas as pd

conf = {
    # Directory that holds the competition CSV files
    'base_path': './original_data'
}

# <1> File Import and Preparation *****************************
print('Importing File')

train          = pd.read_csv(conf['base_path'] + '/train.csv')
ad             = pd.read_csv(conf['base_path'] + '/ad.csv')
app_categories = pd.read_csv(conf['base_path'] + '/app_categories.csv')
position       = pd.read_csv(conf['base_path'] + '/position.csv')
test           = pd.read_csv(conf['base_path'] + '/test.csv')
user           = pd.read_csv(conf['base_path'] + '/user.csv')
# user_app_actions = pd.read_csv('./data/csv/user_app_actions.csv')
# user_installedapps = pd.read_csv('./data/csv/user_installedapps.csv')

print('Importing File Finished')


# <2> Data Processing Period  *********************************
# Left joins keep every row of train and attach matching columns from each table
print('Data Processing')

data = pd.merge(train, user, on=['userID'], how='left')
data = pd.merge(data, ad, on=['creativeID'], how='left')
data = pd.merge(data, position, on=['positionID'], how='left')
data = pd.merge(data, app_categories, on=['appID'], how='left')

print('Data Processing Finished')


# <3> File Export  ********************************************
print('Exporting File')

data.to_csv(conf['base_path'] + '/merge1.csv', encoding='gbk')

print('Exporting File Finished')
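To see what how='left' does in the merges above, here is a minimal sketch on two tiny DataFrames. The label and age columns and all values are invented for the example; only the userID key and the merge call mirror the script.

```python
import pandas as pd

# Hypothetical miniature versions of train.csv and user.csv
train = pd.DataFrame({'userID': [1, 2, 3], 'label': [0, 1, 0]})
user = pd.DataFrame({'userID': [1, 2], 'age': [25, 31]})

# how='left' keeps all three train rows; userID 3 has no match in user,
# so its age is filled with NaN
data = pd.merge(train, user, on=['userID'], how='left')

print(data)
print(len(data) == len(train))
```

The row count stays equal to that of train here because each userID appears at most once in user. If the right-hand table had duplicate keys, a left merge would repeat the matching train rows.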