Main contents
- DataFrame.groupby().sum()
- DataFrame.groupby().agg()
- pandas.concat([DataFrame1, DataFrame2])
- pandas.merge(DataFrame1, DataFrame2, parameters...)
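- DataFrame1.join(DataFrame2, lsuffix='suffix for overlapping column names from DataFrame1', rsuffix='suffix for overlapping column names from DataFrame2')
- How to get the help documentation
Examples
Constructing the DataFrame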
import pandas as pd
from numpy import random
from numpy.random import rand
import numpy as np

random.seed(42)  # fix the seed so the random columns are reproducible
df = pd.DataFrame({
    'user_id': random.randint(0, 6, size=10),
    'food_id': random.randint(1, 10, size=10),
    'weather': ['cold', 'hot', 'cold', 'hot', 'cold', 'cold', 'cold', 'hot', 'hot', 'hot'],
    'food': ['soup', 'soup', 'iceream', 'chocolate', 'iceream', 'iceream', 'iceream', 'soup', 'soup', 'chocolate'],
    'price': 10 * rand(10),
    'number': random.randint(1, 9, size=10),
})
print(df)
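Using numpy.random.seed():
seed() sets the integer used to initialize the pseudo-random number generator.
- With the same seed() value, the same random numbers are generated on every run;
- Without a seed, NumPy chooses one itself (from operating-system entropy, falling back to the clock), so the numbers differ from run to run.
- A seed() call takes effect once: it resets the generator state, and later draws continue the sequence from there. To get the same values again, call seed() again before drawing.
Output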
user_id food_id weather food price number
0 3 4 cold soup 1.818250 6
1 4 8 hot soup 1.834045 6
2 2 8 cold iceream 3.042422 7
3 4 3 hot chocolate 5.247564 6
4 4 6 cold iceream 4.319450 3
5 1 5 cold iceream 2.912291 4
6 2 2 cold iceream 6.118529 7
7 2 8 hot soup 1.394939 4
8 2 6 hot soup 2.921446 8
9 4 2 hot chocolate 3.663618 1
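The seed() notes above can be checked with a short experiment. This is a minimal sketch; the variable names first, second and repeat are illustrative and do not come from the original example.

```python
import numpy as np

np.random.seed(42)
first = np.random.randint(0, 6, size=5)
# No re-seed here: this draw continues the sequence started above.
second = np.random.randint(0, 6, size=5)

# Re-seeding with the same value restarts the sequence from the beginning.
np.random.seed(42)
repeat = np.random.randint(0, 6, size=5)

print(first, second, repeat)
same_after_reseed = bool((first == repeat).all())
print(same_after_reseed)
```

Only the draw made right after re-seeding is guaranteed to match first, which is what "effective once" means in practice.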
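Using the groupby() function
# A scalar key gives scalar group labels; in pandas 2.x a one-element list gives 1-tuples such as (1,)
groupby1 = df.groupby('user_id')
for user_id, group in groupby1:
    print('group', user_id)
    print(group)
Output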
group 1
user_id food_id weather food price number
5 1 5 cold iceream 2.912291 4
group 2
user_id food_id weather food price number
2 2 8 cold iceream 3.042422 7
6 2 2 cold iceream 6.118529 7
7 2 8 hot soup 1.394939 4
8 2 6 hot soup 2.921446 8
group 3
user_id food_id weather food price number
0 3 4 cold soup 1.81825 6
group 4
user_id food_id weather food price number
1 4 8 hot soup 1.834045 6
3 4 3 hot chocolate 5.247564 6
4 4 6 cold iceream 4.319450 3
9 4 2 hot chocolate 3.663618 1
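Besides looping over the groups, a grouped object can be inspected directly. The following is a minimal sketch, assuming the DataFrame is rebuilt from the literal values printed above (only three columns are kept for brevity); the names grouped, sizes and user2 are illustrative.

```python
import pandas as pd

# Values copied from the printed DataFrame, so no random state is needed.
df = pd.DataFrame({
    'user_id': [3, 4, 2, 4, 4, 1, 2, 2, 2, 4],
    'food': ['soup', 'soup', 'iceream', 'chocolate', 'iceream',
             'iceream', 'iceream', 'soup', 'soup', 'chocolate'],
    'number': [6, 6, 7, 6, 3, 4, 7, 4, 8, 1],
})

grouped = df.groupby('user_id')

# Number of rows in each group.
sizes = grouped.size()
print(sizes)

# Pull out a single group by its key without iterating.
user2 = grouped.get_group(2)
print(user2)
```

The row labels of user2 should match the "group 2" block in the output above.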
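Combining groupby with sum and other functions
print(groupby1.sum(numeric_only=True))  # sum every numeric column except the groupby key
print(groupby1[['food_id', 'number']].sum())  # sum only the selected numeric columns
print(df.groupby(['user_id'], as_index=False).sum(numeric_only=True))  # as_index defaults to True; False keeps user_id as a regular column
Besides sum, the same pattern works for mean, min, max, median and std. mode has no direct GroupBy method and mad was removed in pandas 2.0; both can still be computed through agg or apply.
The axis parameter: axis=0 (the default) splits the rows into groups by the distinct values of the specified column(s); axis=1 groups the columns instead. The axis argument of groupby is deprecated as of pandas 2.1.
Output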
food_id price number
user_id
1 5 2.912291 4
2 24 13.477336 26
3 4 1.818250 6
4 19 15.064678 16
food_id number
user_id
1 5 4
2 24 26
3 4 6
4 19 16
user_id food_id price number
0 1 5 2.912291 4
1 2 24 13.477336 26
2 3 4 1.818250 6
3 4 19 15.064678 16
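The other reductions listed above follow the same pattern as sum. This is a minimal sketch, assuming the DataFrame is rebuilt from the literal values printed earlier (three columns only). The mode and the mean absolute deviation are computed through agg with small lambdas, which is one possible approach rather than a built-in GroupBy method.

```python
import pandas as pd

# Values copied from the printed DataFrame.
df = pd.DataFrame({
    'user_id': [3, 4, 2, 4, 4, 1, 2, 2, 2, 4],
    'food': ['soup', 'soup', 'iceream', 'chocolate', 'iceream',
             'iceream', 'iceream', 'soup', 'soup', 'chocolate'],
    'number': [6, 6, 7, 6, 3, 4, 7, 4, 8, 1],
})

grouped = df.groupby('user_id')

mean_number = grouped['number'].mean()
max_number = grouped['number'].max()

# Most frequent food per user; on ties this keeps the first of the sorted modes.
top_food = grouped['food'].agg(lambda s: s.mode().iloc[0])

# Mean absolute deviation per user, written out by hand.
mad_number = grouped['number'].agg(lambda s: (s - s.mean()).abs().mean())

print(mean_number)
print(max_number)
print(top_food)
print(mad_number)
```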
Using the agg function
print(df.groupby(['weather', 'food']).agg(['mean', 'median']))  # string names avoid the deprecation warning raised for np.mean / np.median
Output
user_id food_id ... price number
mean median mean ... median mean median
weather food ...
cold iceream 2.250000 2 5.250000 ... 3.680936 5.25 5.5
soup 3.000000 3 4.000000 ... 1.818250 6.00 6.0
hot chocolate 4.000000 4 2.500000 ... 4.455591 3.50 3.5
soup 2.666667 2 7.333333 ... 1.834045 6.00 6.0
[4 rows x 8 columns]
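When only a few statistics are needed, agg also accepts named aggregations, which give flat, readable column names instead of the two-level header shown above. This is a minimal sketch, assuming the DataFrame is rebuilt from the values printed earlier; the output column names mean_price, median_price and total_number are illustrative choices.

```python
import pandas as pd

# Values copied from the printed DataFrame.
df = pd.DataFrame({
    'weather': ['cold', 'hot', 'cold', 'hot', 'cold',
                'cold', 'cold', 'hot', 'hot', 'hot'],
    'food': ['soup', 'soup', 'iceream', 'chocolate', 'iceream',
             'iceream', 'iceream', 'soup', 'soup', 'chocolate'],
    'price': [1.818250, 1.834045, 3.042422, 5.247564, 4.319450,
              2.912291, 6.118529, 1.394939, 2.921446, 3.663618],
    'number': [6, 6, 7, 6, 3, 4, 7, 4, 8, 1],
})

# Each keyword is an output column: (input column, aggregation name).
summary = df.groupby(['weather', 'food']).agg(
    mean_price=('price', 'mean'),
    median_price=('price', 'median'),
    total_number=('number', 'sum'),
)
print(summary)
```

The median_price values should agree with the price/median column of the output above.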