第01章 Pandas基础
第02章 DataFrame运算
第03章 数据分析入门
第04章 选取数据子集
第05章 布尔索引
第06章 索引对齐
第07章 分组聚合、过滤、转换
第08章 数据清理
第09章 合并Pandas对象
第10章 时间序列分析
第11章 用Matplotlib、Pandas、Seaborn进行可视化
In[1]: import pandas as pd
import numpy as np
1. 用stack清理变量值作为列名
# 加载state_fruit数据集
In[2]: state_fruit = pd.read_csv('data/state_fruit.csv', index_col=0)
state_fruit
out[2]:
![img_2d3f19fa1fad844566737f0f0ca19065.png](https://i-blog.csdnimg.cn/blog_migrate/c6c93e188b9fc323ab1aa21b7cbe4202.png)
# stack方法可以将所有列名,转变为垂直的一级行索引
In[3]: state_fruit.stack()
out[3]: Texas Apple 12
Orange 10
Banana 40
Arizona Apple 9
Orange 7
Banana 12
Florida Apple 0
Orange 14
Banana 190
dtype: int64
# 使用reset_index(),将结果变为DataFrame
In[4]: state_fruit_tidy = state_fruit.stack().reset_index()
state_fruit_tidy
out[4]:
![img_7ba529ab303b17338991cb3b67aae75b.png](https://i-blog.csdnimg.cn/blog_migrate/529fffb5fd86b461d3d2a26b78b1c8b6.png)
# 重命名列名
In[5]: state_fruit_tidy.columns = ['state', 'fruit', 'weight']
state_fruit_tidy
out[5]:
![img_7cb59d2aff2e074673dbdfea37bf33c9.png](https://i-blog.csdnimg.cn/blog_migrate/572f8939c50c4bddc40ec490bca1242d.png)
# 也可以使用rename_axis给不同的行索引层级命名
In[6]: state_fruit.stack()\
.rename_axis(['state', 'fruit'])\
out[6]: state fruit
Texas Apple 12
Orange 10
Banana 40
Arizona Apple 9
Orange 7
Banana 12
Florida Apple 0
Orange 14
Banana 190
dtype: int64
# 再次使用reset_index方法
In[7]: state_fruit.stack()\
.rename_axis(['state', 'fruit'])\
.reset_index(name='weight')
out[7]:
![img_8651151c83ecff8a6e673e2125d0a8a6.png](https://i-blog.csdnimg.cn/blog_migrate/f332d031a0eea6eeae8fcb941dd5c2c6.png)
更多
# 读取state_fruit2数据集
In[8]: state_fruit2 = pd.read_csv('data/state_fruit2.csv')
state_fruit2
out[8]:
![img_0e40fff35ec99bdb90c32c8f194a2590.png](https://i-blog.csdnimg.cn/blog_migrate/49d37b05ea637a744bc603b758c26a44.png)
# 州名不在行索引的位置上,使用stack将所有列名变为一个长Series
In[9]: state_fruit2.stack()
out[9]: 0 State Texas
Apple 12
Orange 10
Banana 40
1 State Arizona
Apple 9
Orange 7
Banana 12
2 State Florida
Apple 0
Orange 14
Banana 190
dtype: object
# 先设定state作为行索引名,再stack,可以得到和前面相似的结果
In[10]: state_fruit2.set_index('State').stack()
out[10]: 0 State Texas
Apple 12
Orange 10
Banana 40
1 State Arizona
Apple 9
Orange 7
Banana 12
2 State Florida
Apple 0
Orange 14
Banana 190
dtype: object
2. 用melt清理变量值作为列名
# 读取state_fruit2数据集
In[11]: state_fruit2 = pd.read_csv('data/state_fruit2.csv')
state_fruit2
out[11]:
![img_6db8941a099dd3091c10d39057d8134a.png](https://i-blog.csdnimg.cn/blog_migrate/144b7622f4eee48408093ce53705d245.png)
# 使用melt方法,将列传给id_vars和value_vars。melt可以将原先的列名作为变量,原先的值作为值。
In[12]: state_fruit2.melt(id_vars=['State'],
value_vars=['Apple', 'Orange', 'Banana'])
out[12]:
![img_00d23ff2c6a8fef8dc6d1ddd4ab33638.png](https://i-blog.csdnimg.cn/blog_migrate/26e924e31b97ecf21e044f28d81b3882.png)
# 随意设定一个行索引
In[13]: state_fruit2.index=list('abc')
state_fruit2.index.name = 'letter'
In[14]: state_fruit2
out[14]:
![img_2ad059a62773ae9facf503cae78e9c4a.png](https://i-blog.csdnimg.cn/blog_migrate/a79431dfb8057e3f49a8c4a631680244.png)
# var_name和value_name可以用来重命名新生成的变量列和值的列
In[15]: state_fruit2.melt(id_vars=['State'],
value_vars=['Apple', 'Orange', 'Banana'],
var_name='Fruit',
value_name='Weight')
out[15]:
![img_922b3930de70c888c58f47d4288774ea.png](https://i-blog.csdnimg.cn/blog_migrate/c1eff8b511adf755fe54d98df737941a.png)
# 如果你想让所有值都位于一列,旧的列标签位于另一列,可以直接使用melt
In[16]: state_fruit2.melt()
out[16]:
![img_2ce3adf33fc25f7e830ba8286338df28.png](https://i-blog.csdnimg.cn/blog_migrate/4213d706e7a0fcb680a12692a47bfc22.png)
# 要指明id变量,只需使用id_vars参数
In[17]: state_fruit2.melt(id_vars='State')
out[17]:
![img_8548df6c6fcb6bcf756dafde319ba57e.png](https://i-blog.csdnimg.cn/blog_migrate/f2a4e800d92d3de789130da95fafa1e7.png)
3. 同时stack多组变量
# 读取movie数据集,选取所有演员名和其Facebook likes
In[18]: movie = pd.read_csv('data/movie.csv')
actor = movie[['movie_title', 'actor_1_name', 'actor_2_name', 'actor_3_name',
'actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes']]
actor.head()
out[18]:
![img_8bf454529a80707d1a5e7bbc069ea209.png](https://i-blog.csdnimg.cn/blog_migrate/bb6ade7e5b12660a4bfa9865247aae91.png)
# 创建一个自定义函数,用来改变列名。wide_to_long要求分组的变量要有相同的数字结尾:
In[19]: def change_col_name(col_name):
col_name = col_name.replace('_name', '')
if 'facebook' in col_name:
fb_idx = col_name.find('facebook')
col_name = col_name[:5] + col_name[fb_idx - 1:] + col_name[5:fb_idx-1]
return col_name
In[20]: actor2 = actor.rename(columns=change_col_name)
actor2.head()
out[20]:
![img_064e8cde7c7733a40ed99b9db861e3f6.png](https://i-blog.csdnimg.cn/blog_migrate/c1358c059d31bb78f16dc82e0d0c7d8e.png)
# 使用wide_to_long函数,同时stack两列actor和Facebook
In[21]: stubs = ['actor', 'actor_facebook_likes']
actor2_tidy = pd.wide_to_long(actor2,
stubnames=stubs,
i=['movie_title'],
j='actor_num',
sep='_').reset_index()
actor2_tidy.head()
out[21]:
![img_11b986e4116caf0331bb4247728129f1.png](https://i-blog.csdnimg.cn/blog_migrate/b22d485080e8fd0a87a5818e53465efe.png)
更多
# 加载数据
In[22]: df = pd.read_csv('data/stackme.csv')
df
out[22]:
![img_6729997e781e5d68da35fad7e578f624.png](https://i-blog.csdnimg.cn/blog_migrate/e261ec1770855f1a523b06cf51cf79c9.png)
# 对列重命名
In[23]: df2 = df.rename(columns = {'a1':'group1_a1', 'b2':'group1_b2',
'd':'group2_a1', 'e':'group2_b2'})
df2
out[23]:
![img_9e814d3e3e885caec88a811eff3afa49.png](https://i-blog.csdnimg.cn/blog_migrate/caa3fbf6270c039c9ecccbb8481e0802.png)
# 设定stubnames=['group1', 'group2'],对任何数字都起作用
In[24]: pd.wide_to_long(df2,
stubnames=['group1', 'group2'],
i=['State', 'Country', 'Test'],
j='Label',
suffix='.+',
sep='_')
out[24]:
![img_08c542d6ecefb0a27b7d6332802b558e.png](https://i-blog.csdnimg.cn/blog_migrate/699fd2484b4a8bd91d36fe0390ce5c63.png)
4. 反转stacked数据
# 读取college数据集,学校名作为行索引,,只选取本科生的列
In[25]: usecol_func = lambda x: 'UGDS_' in x or x == 'INSTNM'
college = pd.read_csv('data/college.csv',
index_col='INSTNM',
usecols=usecol_func)
college.head()
out[25]:
![img_0b44794b8eb62b4f49c01a4ecc23772c.png](https://i-blog.csdnimg.cn/blog_migrate/1f4acd9ea588c7cfe228d874b0cf5281.png)
# 用stack方法,将所有水平列名,转化为垂直的行索引
In[26]: college_stacked = college.stack()
college_stacked.head(18)
out[26]: INSTNM
Alabama A & M University UGDS_WHITE 0.0333
UGDS_BLACK 0.9353
UGDS_HISP 0.0055
UGDS_ASIAN 0.0019
UGDS_AIAN 0.0024
UGDS_NHPI 0.0019
UGDS_2MOR 0.0000
UGDS_NRA 0.0059
UGDS_UNKN 0.0138
University of Alabama at Birmingham UGDS_WHITE 0.5922
UGDS_BLACK 0.2600
UGDS_HISP 0.0283
UGDS_ASIAN 0.0518
UGDS_AIAN 0.0022
UGDS_NHPI 0.0007
UGDS_2MOR 0.0368
UGDS_NRA 0.0179
UGDS_UNKN 0.0100
dtype: float64
# unstack方法可以将其还原
In[27]: college_stacked.unstack().head()
out[27]:
![img_9ca60dd71b3f8a2a14ef24bd408705bb.png](https://i-blog.csdnimg.cn/blog_migrate/fd1d1cc0fd38459c7214db2bff8d8b6b.png)
# 另一种方式是先用melt,再用pivot。先加载数据,不指定行索引名
In[28]: college2 = pd.read_csv('data/college.csv',
usecols=usecol_func)
college2.head()
out[28]:
![img_43b2858d8b59fb834e50ade942e1333d.png](https://i-blog.csdnimg.cn/blog_migrate/65c6ff2516ab88bc2e448d776c05e81a.png)
# 使用melt,将所有race列变为一列
In[29]: college_melted = college2.melt(id_vars='INSTNM',
var_name='Race',
value_name='Percentage')
college_melted.head()
out[29]:
![img_41131f74d48d9a3112ade43499814ef5.png](https://i-blog.csdnimg.cn/blog_migrate/7ed8b9a13dfc1056e6e21a79a150cd0a.png)
# 用pivot还原
In[30]: melted_inv = college_melted.pivot(index='INSTNM',
columns='Race',
values='Percentage')
melted_inv.head()
out[30]:
![img_22e9a3eba50b22cf8dd7ba03d702441a.png](https://i-blog.csdnimg.cn/blog_migrate/58d629826aa5ad4d037df0ab1a857aea.png)
# 用loc同时选取行和列,然后重置索引,可以获得和原先索引顺序一样的DataFrame
In[31]: college2_replication = melted_inv.loc[college2['INSTNM'],
college2.columns[1:]]\
.reset_index()
college2.equals(college2_replication)
out[31]: True
更多
# 使用最外层的行索引做unstack
In[32]: college.stack().unstack(0)
out[32]:
![img_4f5e372ffef93072496222e733221299.png](https://i-blog.csdnimg.cn/blog_migrate/f9f1e5c4a52a5d1c5e462ed117eb93aa.png)
# 转置DataFrame更简单的方法是transpose()或T
In[33]: college.T
out[33]:
![img_1331447af978d00252372edbc4da20b7.png](https://i-blog.csdnimg.cn/blog_migrate/1ab1aba7e87c0f71fb124cf5f841119d.png)
5. 分组聚合后unstacking
# 读取employee数据集,求出每个种族的平均工资
In[34]: employee = pd.read_csv('data/employee.csv')
In[35]: employee.groupby('RACE')['BASE_SALARY'].mean().astype(int)
out[35]: RACE
American Indian or Alaskan Native 60272
Asian/Pacific Islander 61660
Black or African American 50137
Hispanic/Latino 52345
Others 51278
White 64419
Name: BASE_SALARY, dtype: int64
# 对种族和性别分组,求平均工资
In[36]: agg = employee.groupby(['RACE', 'GENDER'])['BASE_SALARY'].mean().astype(int)
agg
out[36]: RACE GENDER
American Indian or Alaskan Native Female 60238
Male 60305
Asian/Pacific Islander Female 63226
Male 61033
Black or African American Female 48915
Male 51082
Hispanic/Latino Female 46503
Male 54782
Others Female 63785
Male 38771
White Female 66793
Male 63940
Name: BASE_SALARY, dtype: int64
# 对索引层GENDER做unstack
In[37]: agg.unstack('GENDER')
out[37]:
![img_18430579f6836593b336ebd425bbd1f2.png](https://i-blog.csdnimg.cn/blog_migrate/dd54bdd5b2603a6aa664eb54e08c63b6.png)
# 对索引层RACE做unstack
In[38]: agg.unstack('RACE')
out[38]:
![img_671934ec3c782798f4fda44a7d86b354.png](https://i-blog.csdnimg.cn/blog_migrate/61f6799b5cdcc62893fcd006fbdb1c8a.png)
更多
# 按RACE和GENDER分组,求工资的平均值、最大值和最小值
In[39]: agg2 = employee.groupby(['RACE', 'GENDER'])['BASE_SALARY'].agg(['mean', 'max', 'min']).astype(int)
agg2
out[39]:
![img_1646b3d3955b0427578ea983216de8a7.png](https://i-blog.csdnimg.cn/blog_migrate/61a78ca8362d8d7a461044100efa17d5.png)
# 此时unstack('GENDER')会生成多级列索引,可以用stack和unstack调整结构
agg2.unstack('GENDER')
![img_a220ae236fd495a8b4a957f7c011ed60.png](https://i-blog.csdnimg.cn/blog_migrate/fdfabfbcde2713d052d773db84f6c0fe.png)
6. 用分组聚合实现透视表
# 读取flights数据集
In[40]: flights = pd.read_csv('data/flights.csv')
flights.head()
out[40]:
![img_296493bccc4bafe56b5634f469a8bbdd.png](https://i-blog.csdnimg.cn/blog_migrate/edaa0613627c0ed838e63dfc39244fa4.png)
# 用pivot_table方法求出每条航线每个始发地的被取消的航班总数
In[41]: fp = flights.pivot_table(index='AIRLINE',
columns='ORG_AIR',
values='CANCELLED',
aggfunc='sum',
fill_value=0).round(2)
fp.head()
out[41]:
![img_1de5d72715b388f7cbf0da8520bfa454.png](https://i-blog.csdnimg.cn/blog_migrate/c4d454b776da9cffee3c4c367a68530e.png)
# groupby聚合不能直接复现这张表。需要先按所有index和columns的列聚合
In[42]: fg = flights.groupby(['AIRLINE', 'ORG_AIR'])['CANCELLED'].sum()
fg.head()
out[42]: AIRLINE ORG_AIR
AA ATL 3
DEN 4
DFW 86
IAH 3
LAS 3
Name: CANCELLED, dtype: int64
# 再使用unstack,将ORG_AIR这层索引作为列名
In[43]: fg_unstack = fg.unstack('ORG_AIR', fill_value=0)
fg_unstack.head()
out[43]:
![img_e5dcc001b5c7be1d5d98d77f19680171.png](https://i-blog.csdnimg.cn/blog_migrate/9e26fddc395c84cde716387d3d351298.png)
# 判断两个方式是否等价
In[44]: fg_unstack = fg.unstack('ORG_AIR', fill_value=0)
fp.equals(fg_unstack)
out[44]: True
更多
# 先实现一个稍微复杂的透视表
In[45]: fp2 = flights.pivot_table(index=['AIRLINE', 'MONTH'],
columns=['ORG_AIR', 'CANCELLED'],
values=['DEP_DELAY', 'DIST'],
aggfunc=[np.mean, np.sum],
fill_value=0)
fp2.head()
out[45]:
![img_24717b5a18b558c4466bb8c0de393704.png](https://i-blog.csdnimg.cn/blog_migrate/ebe362188638052244e9e06cb9ab4a99.png)
# 用groupby和unstack复现上面的方法
In[46]: flights.groupby(['AIRLINE', 'MONTH', 'ORG_AIR', 'CANCELLED'])['DEP_DELAY', 'DIST'] \
.agg(['mean', 'sum']) \
.unstack(['ORG_AIR', 'CANCELLED'], fill_value=0) \
.swaplevel(0, 1, axis='columns') \
.head()
out[46]:
![img_e96b0f7c6020040db3a3338380d20106.png](https://i-blog.csdnimg.cn/blog_migrate/585404b4cf6691185c875bfe24d99e0b.png)
7. 为了更容易reshaping,重新命名索引层
# 读取college数据集,分组后,统计本科生的SAT数学成绩信息
In[47]: college = pd.read_csv('data/college.csv')
In[48]: cg = college.groupby(['STABBR', 'RELAFFIL'])['UGDS', 'SATMTMID'] \
.agg(['count', 'min', 'max']).head(6)
In[49]: cg
out[49]:
![img_7c16093b3337aba617fce9fc39f16e7a.png](https://i-blog.csdnimg.cn/blog_migrate/4c16a330651b9232ac67cd50a9c4dc90.png)
# 行索引的两级都有名字,而列索引没有名字。用rename_axis给列索引的两级命名
In[50]:cg = cg.rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
cg
out[50]:
![img_ef31aec4c4643b4d6b4056c82ea15f52.png](https://i-blog.csdnimg.cn/blog_migrate/ee524fcadb0a25a74c40f175c047d8c2.png)
# 将AGG_FUNCS列移到行索引
In[51]:cg.stack('AGG_FUNCS').head()
out[51]:
![img_f21ec1ce9c3b51c54ab5b7e259b96bd6.png](https://i-blog.csdnimg.cn/blog_migrate/a8c531aa68670d682cb0a0e7902db156.png)
# stack默认是将列放到行索引的最内层,可以使用swaplevel改变层级
In[52]:cg.stack('AGG_FUNCS').swaplevel('AGG_FUNCS', 'STABBR', axis='index').head()
out[52]:
![img_d40429b6e55a951a02e55969f11e2a4e.png](https://i-blog.csdnimg.cn/blog_migrate/ec135b01c0bccd8b49645c798d28946b.png)
# 在此前的基础上再做sort_index
In[53]:cg.stack('AGG_FUNCS') \
.swaplevel('AGG_FUNCS', 'STABBR', axis='index') \
.sort_index(level='RELAFFIL', axis='index') \
.sort_index(level='AGG_COLS', axis='columns').head(6)
out[53]:
![img_127bad7037d16fa21f14c61ff14bb9ce.png](https://i-blog.csdnimg.cn/blog_migrate/a2b5e76648fddd1fdf24e6c290bbc5f8.png)
# 对一些列做stack,对其它列做unstack
In[54]:cg.stack('AGG_FUNCS').unstack(['RELAFFIL', 'STABBR'])
out[54]:
![img_6c107b3dc60a2a6895d03d132c399b18.png](https://i-blog.csdnimg.cn/blog_migrate/d01958859f8869ad978752c635ec7011.png)
# 对所有列做stack,会返回一个Series
In[55]:cg.stack(['AGG_FUNCS', 'AGG_COLS']).head(12)
out[55]:
![img_bfbee4a739fcfead844df6e76795d99c.png](https://i-blog.csdnimg.cn/blog_migrate/2981522b9a09c9eb6052cb66b3ec2d5b.png)
更多
# 删除行和列索引所有层级的名称
In[56]:cg.rename_axis([None, None], axis='index').rename_axis([None, None], axis='columns')
out[56]:
![img_a6136532a212cd33a15400508be7d815.png](https://i-blog.csdnimg.cn/blog_migrate/f8da0d2e73ecb94bd67146d38c57783c.png)
8. 当多个变量被存储为列名时进行清理
# 读取weightlifting数据集
In[57]:weightlifting = pd.read_csv('data/weightlifting_men.csv')
weightlifting
out[57]:
![img_d993714941845282cd6809e898141ba7.png](https://i-blog.csdnimg.cn/blog_migrate/413480080972d5bdd2b696c861f1f72d.png)
# 用melt方法,将sex_age放入一个单独的列
In[58]:wl_melt = weightlifting.melt(id_vars='Weight Category',
var_name='sex_age',
value_name='Qual Total')
wl_melt.head()
out[58]:
![img_5187951e5c0812bcce79de78dd1dcc7e.png](https://i-blog.csdnimg.cn/blog_migrate/398f3d8b5e81881e4f80fe1e415db17a.png)
# 用split方法将sex_age列分为两列
In[59]:sex_age = wl_melt['sex_age'].str.split(expand=True)
sex_age.head()
out[59]: 0 1
0 M35 35-39
1 M35 35-39
2 M35 35-39
3 M35 35-39
4 M35 35-39
# 给列起名
In[60]:sex_age.columns = ['Sex', 'Age Group']
sex_age.head()
out[60]:
![img_656de127844aa1595c7af6ffc5e6e047.png](https://i-blog.csdnimg.cn/blog_migrate/162789ddd0cb0efacd55a222edb2ce6e.png)
# 只取出字符串中的M
In[61]:sex_age['Sex'] = sex_age['Sex'].str[0]
sex_age.head()
out[61]:
![img_e59686f91277a56c1aa806ccf06ab105.png](https://i-blog.csdnimg.cn/blog_migrate/f7a04633f68d44713a7305277efa08c1.png)
# 用concat方法,将sex_age,与wl_cat_total连接起来
In[62]:wl_cat_total = wl_melt[['Weight Category', 'Qual Total']]
wl_tidy = pd.concat([sex_age, wl_cat_total], axis='columns')
wl_tidy.head()
out[62]:
![img_cdf5def4fa80bb37cd93420032fd4cd9.png](https://i-blog.csdnimg.cn/blog_migrate/4421f78e5bdd05b841a02a2f0132be11.png)
# 上面的结果也可以如下实现
In[63]:cols = ['Weight Category', 'Qual Total']
sex_age[cols] = wl_melt[cols]
更多
# 也可以通过assign的方法,动态加载新的列
In[64]: age_group = wl_melt.sex_age.str.extract('(\d{2}[-+](?:\d{2})?)', expand=False)
sex = wl_melt.sex_age.str[0]
new_cols = {'Sex':sex,
'Age Group': age_group}
In[65]: wl_tidy2 = wl_melt.assign(**new_cols).drop('sex_age', axis='columns')
wl_tidy2.head()
out[65]:
![img_f649670bd79964295d6a62a9f38f41d4.png](https://i-blog.csdnimg.cn/blog_migrate/36b723b5f0721bfdb7fb078b76a71ff7.png)
# 判断两种方法是否等效
In[66]: wl_tidy2.sort_index(axis=1).equals(wl_tidy.sort_index(axis=1))
out[66]: True
9. 当多个变量被存储为列的值时进行清理
# 读取restaurant_inspections数据集,将Date列的数据类型变为datetime64
In[67]: inspections = pd.read_csv('data/restaurant_inspections.csv', parse_dates=['Date'])
inspections.head(10)
out[67]:
![img_cf128dd4be6626799c3744ea24562fb3.png](https://i-blog.csdnimg.cn/blog_migrate/8c9fbb3069598508125c650ccf1e8292.png)
# 用info列的所有值造一个新列。但是,Pandas不支持这种功能
In[68]: inspections.pivot(index=['Name', 'Date'], columns='Info', values='Value')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in __init__(self, values, categories, ordered, fastpath)
297 try:
--> 298 codes, categories = factorize(values, sort=True)
299 except TypeError:
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
559 check_nulls = not is_integer_dtype(original)
--> 560 labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
561
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_labels (pandas/_libs/hashtable.c:21922)()
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
During handling of the above exception, another exception occurred:
NotImplementedError Traceback (most recent call last)
<ipython-input-68-754f69d68d6c> in <module>()
----> 1 inspections.pivot(index=['Name', 'Date'], columns='Info', values='Value')
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in pivot(self, index, columns, values)
3851 """
3852 from pandas.core.reshape.reshape import pivot
-> 3853 return pivot(self, index=index, columns=columns, values=values)
3854
3855 def stack(self, level=-1, dropna=True):
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/reshape/reshape.py in pivot(self, index, columns, values)
375 index = self[index]
376 indexed = Series(self[values].values,
--> 377 index=MultiIndex.from_arrays([index, self[columns]]))
378 return indexed.unstack(columns)
379
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/multi.py in from_arrays(cls, arrays, sortorder, names)
1098 from pandas.core.categorical import _factorize_from_iterables
1099
-> 1100 labels, levels = _factorize_from_iterables(arrays)
1101 if names is None:
1102 names = [getattr(arr, "name", None) for arr in arrays]
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in _factorize_from_iterables(iterables)
2191 # For consistency, it should return a list of 2 lists.
2192 return [[], []]
-> 2193 return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in <listcomp>(.0)
2191 # For consistency, it should return a list of 2 lists.
2192 return [[], []]
-> 2193 return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in _factorize_from_iterable(values)
2163 codes = values.codes
2164 else:
-> 2165 cat = Categorical(values, ordered=True)
2166 categories = cat.categories
2167 codes = cat.codes
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in __init__(self, values, categories, ordered, fastpath)
308
309 # FIXME
--> 310 raise NotImplementedError("> 1 ndim Categorical are not "
311 "supported at this time")
312
NotImplementedError: > 1 ndim Categorical are not supported at this time
# 将'Name','Date', 'Info'作为航所索引
In[69]: inspections.set_index(['Name','Date', 'Info']).head(10)
out[69]:
![img_349eae336626fe62ef42f3efe514d933.png](https://i-blog.csdnimg.cn/blog_migrate/a1342626c1593419fd2c7035b5eef513.png)
# 用pivot,将info列中的值变为新的列
In[70]: inspections.set_index(['Name','Date', 'Info']).unstack('Info').head()
out[70]:
![img_3eeed924bfcbd5d77fdc48b605ff58e4.png](https://i-blog.csdnimg.cn/blog_migrate/3acb9b1e900c438a0e838268078a2ddb.png)
# 用reset_index方法,使行索引层级与列索引相同
In[71]: insp_tidy = inspections.set_index(['Name','Date', 'Info']) \
.unstack('Info') \
.reset_index(col_level=-1)
insp_tidy.head()
out[71]:
![img_563fe5bd25e95f35d66d0a76669c7cdc.png](https://i-blog.csdnimg.cn/blog_migrate/a36e92f118f89db5fcf64202c1ffac36.png)
# 除掉列索引的最外层,重命名行索引的层为None
In[72]: insp_tidy.columns = insp_tidy.columns.droplevel(0).rename(None)
insp_tidy.head()
out[72]:
![img_50d4d2cb0c3ea71d7135a88f28113aa3.png](https://i-blog.csdnimg.cn/blog_migrate/74df9fa5beb31bb8cf2be86b2974dde4.png)
# 使用squeeze方法,可以避免前面的多级索引
In[73]: inspections.set_index(['Name','Date', 'Info']) \
.squeeze() \
.unstack('Info') \
.reset_index() \
.rename_axis(None, axis='columns')
out[73]:
![img_f16f9edc08e2ff11a16d44c1f0c2c626.png](https://i-blog.csdnimg.cn/blog_migrate/0fc5cc630aec58aae0d2ac0855652a84.png)
更多
# pivot_table需要传入聚合函数,才能产生一个单一值
In[74]: inspections.pivot_table(index=['Name', 'Date'],
columns='Info',
values='Value',
aggfunc='first') \
.reset_index()\
.rename_axis(None, axis='columns')
out[74]:
![img_b0adf6afb14d7983b6ec95bd9b55f7bf.png](https://i-blog.csdnimg.cn/blog_migrate/656af68f7e73edec5f890e7b8e75fffd.png)
10. 当两个或多个值存储于一个单元格时进行清理
# 读取texas_cities数据集
In[75]: cities = pd.read_csv('data/texas_cities.csv')
cities
out[75]:
![img_fc3c5bc0232329c1202fcb48530e89d3.png](https://i-blog.csdnimg.cn/blog_migrate/164791a5feeedc8a41d3e944749df96b.png)
# 将Geolocation分解为四个单独的列
In[76]: geolocations = cities.Geolocation.str.split(pat='. ', expand=True)
geolocations.columns = ['latitude', 'latitude direction', 'longitude', 'longitude direction']
geolocations
out[76]:
![img_cd57baf36c50b2e4c05e91ac0c8f99ab.png](https://i-blog.csdnimg.cn/blog_migrate/0f0b9649589b5d801a5e2d50b3172ac1.png)
# 转变数据类型
In[77]: geolocations = geolocations.astype({'latitude':'float', 'longitude':'float'})
geolocations.dtypes
out[77]: latitude float64
latitude direction object
longitude float64
longitude direction object
dtype: object
# 将新列与原先的city列连起来
In[78]: cities_tidy = pd.concat([cities['City'], geolocations], axis='columns')
cities_tidy
out[78]:
![img_d2c04a464a195d5379fe9bbd6a2f1d82.png](https://i-blog.csdnimg.cn/blog_migrate/00841a8865e8783d32053ff8d24091d2.png)
# 忽略,作者这里是写重复了
In[79]: pd.concat([cities['City'], geolocations], axis='columns')
out[79]:
![img_d49a91df7ce4677988629d8472f56f18.png](https://i-blog.csdnimg.cn/blog_migrate/c1e188f66d70ae189519ada5c4f9259c.png)
原理
# 函数to_numeric可以将每列自动变为整数或浮点数
In[80]: temp = geolocations.apply(pd.to_numeric, errors='ignore')
temp
out[80]:
![img_400593fef91d2a7b5f232577ceb1286c.png](https://i-blog.csdnimg.cn/blog_migrate/b3a64e077b56b542ec5a28825d2746d7.png)
# 再查看数据类型
In[81]: temp.dtypes
out[81]: latitude float64
latitude direction object
longitude float64
longitude direction object
dtype: object
更多
# |符,可以对多个标记进行分割
In[82]: cities.Geolocation.str.split(pat='° |, ', expand=True)
out[82]:
![img_459548a2e53417b5501fbfa93250785a.png](https://i-blog.csdnimg.cn/blog_migrate/d87efc3a0528fea9182fcfd67c35e1ce.png)
# 更复杂的提取方式
In[83]: cities.Geolocation.str.extract('([0-9.]+). (N|S), ([0-9.]+). (E|W)', expand=True)
out[83]:
![img_27114c0e735d1e8b12a3682e74031568.png](https://i-blog.csdnimg.cn/blog_migrate/869603be5a34613a7ff48e445873abfe.png)
11. 当多个变量被存储为列名和列值时进行清理
# 读取sensors数据集
In[84]: sensors = pd.read_csv('data/sensors.csv')
sensors
out[84]:
![img_dea4c547928b51aa150c2df33e8933d3.png](https://i-blog.csdnimg.cn/blog_migrate/8378b5487ada8bf31d4c4afa7e058568.png)
# 用melt清理数据
In[85]: sensors.melt(id_vars=['Group', 'Property'], var_name='Year').head(6)
out[85]:
![img_7f7c1234d6aff0d89362a3124753363e.png](https://i-blog.csdnimg.cn/blog_migrate/fa9afb8af397d0808dda262a2480929a.png)
# 用pivot_table,将Property列转化为新的列名
In[86]: sensors.melt(id_vars=['Group', 'Property'], var_name='Year') \
.pivot_table(index=['Group', 'Year'], columns='Property', values='value') \
.reset_index() \
.rename_axis(None, axis='columns')
out[86]:
![img_99fe0a44c1459bec4faca935b8f10dfd.png](https://i-blog.csdnimg.cn/blog_migrate/310160db556999f9108db46121b3933c.png)
更多
# 用stack和unstack实现上述方法
In[87]: sensors.set_index(['Group', 'Property']) \
.stack() \
.unstack('Property') \
.rename_axis(['Group', 'Year'], axis='index') \
.rename_axis(None, axis='columns') \
.reset_index()
out[87]:
![img_8693d0e1d0cbdf0a25650324c683139b.png](https://i-blog.csdnimg.cn/blog_migrate/3caffb11a70297fd9d791e4a877a0429.png)
12. 当多个观察单位被存储于同一张表时进行清理
# 读取movie_altered数据集
In[88]: movie = pd.read_csv('data/movie_altered.csv')
movie.head()
out[88]:
![img_89a13de31564c2e520cbc6693cdcb33c.png](https://i-blog.csdnimg.cn/blog_migrate/3f59bcbf64c94db0eaf6c687afbdf71a.png)
# 插入新的列,用来标识每一部电影
In[89]: movie.insert(0, 'id', np.arange(len(movie)))
movie.head()
out[89]:
![img_a6ac3f5422fd31f75cfc9e71b293e063.png](https://i-blog.csdnimg.cn/blog_migrate/55c315f63be889c1fe5a25778fdd5f6a.png)
# 用wide_to_long,将所有演员放到一列,将所有Facebook likes放到一列
In[90]: stubnames = ['director', 'director_fb_likes', 'actor', 'actor_fb_likes']
movie_long = pd.wide_to_long(movie,
stubnames=stubnames,
i='id',
j='num',
sep='_').reset_index()
movie_long['num'] = movie_long['num'].astype(int)
movie_long.head(9)
out[90]:
![img_067ca53e39c994225801a9bd840e55d7.png](https://i-blog.csdnimg.cn/blog_migrate/ab51b5282b301baf9b29cc42ed2c90cf.png)
# 将这个数据分解成多个小表
In[91]: movie_table = movie_long[['id','title', 'year', 'duration', 'rating']]
director_table = movie_long[['id', 'director', 'num', 'director_fb_likes']]
actor_table = movie_long[['id', 'actor', 'num', 'actor_fb_likes']]
In[92]: movie_table.head(9)
out[90]:
![img_3c308a87d20cd193715510e34f09fd8e.png](https://i-blog.csdnimg.cn/blog_migrate/a1e0cce9d5d89e039732c30fcfc75e46.png)
In[93]: director_table.head(9)
out[93]:
![img_4765b51d51efa8e1541d249d5a874cd2.png](https://i-blog.csdnimg.cn/blog_migrate/aaeb6df296335ee1864777885c21b402.png)
In[94]: actor_table.head(9)
out[94]:
![img_cb6c1449635a5edf6d9bd951a85fea06.png](https://i-blog.csdnimg.cn/blog_migrate/d81c8a0c4027082a10cfe6ec20eb1872.png)
# 做一些去重和去除缺失值的工作
In[95]: movie_table = movie_table.drop_duplicates().reset_index(drop=True)
director_table = director_table.dropna().reset_index(drop=True)
actor_table = actor_table.dropna().reset_index(drop=True)
In[96]: movie_table.head()
out[96]:
![img_226df587a1d92ac669a04f2eb2f7e053.png](https://i-blog.csdnimg.cn/blog_migrate/957bd28652131c86b32a47b0ac36eb63.png)
In[97]: director_table.head()
out[97]:
![img_4518638ab5ea4bfe5c351c49bc7a811d.png](https://i-blog.csdnimg.cn/blog_migrate/2bc57ea0706fa4d1219a682372af6f56.png)
# 比较内存的使用量
In[98]: movie.memory_usage(deep=True).sum()
out[98]: 2318234
In[99]: movie_table.memory_usage(deep=True).sum() + \
director_table.memory_usage(deep=True).sum() + \
actor_table.memory_usage(deep=True).sum()
out[99]: 2624898
# 创建演员和导演的id列
In[100]: director_cat = pd.Categorical(director_table['director'])
director_table.insert(1, 'director_id', director_cat.codes)
actor_cat = pd.Categorical(actor_table['actor'])
actor_table.insert(1, 'actor_id', actor_cat.codes)
director_table.head()
out[100]:
![img_82ba1521420e1bb830277dc59dcf943c.png](https://i-blog.csdnimg.cn/blog_migrate/f4c7ae4a500212671a66ec7e607d1f20.png)
In[101]: actor_table.head()
out[101]:
![img_33fcfcb9abb3b1489a1c6a22be3a5fde.png](https://i-blog.csdnimg.cn/blog_migrate/62bf033a7c3c9cff4b9d5d2a58822c7a.png)
# 可以用这两张表生成要用的中间表。先来做director表
In[102]: director_associative = director_table[['id', 'director_id', 'num']]
dcols = ['director_id', 'director', 'director_fb_likes']
director_unique = director_table[dcols].drop_duplicates().reset_index(drop=True)
director_associative.head()
out[102]:
![img_2e10466c0b4d84a1ce8a3fc111e25d71.png](https://i-blog.csdnimg.cn/blog_migrate/bfdb880f24c19131f983b0b69074d6bb.png)
In[103]: director_unique.head()
out[103]:
![img_d8bc7002b0537250a6c6ea3600f6ee1b.png](https://i-blog.csdnimg.cn/blog_migrate/b37b4bd7d76af1fcdbe07e8a1edd6ac2.png)
# 再来做actor表
In[104]: actor_associative = actor_table[['id', 'actor_id', 'num']]
acols = ['actor_id', 'actor', 'actor_fb_likes']
actor_unique = actor_table[acols].drop_duplicates().reset_index(drop=True)
actor_associative.head()
out[104]:
![img_a87ee10562958b84cb90ee8dc4dd15d2.png](https://i-blog.csdnimg.cn/blog_migrate/6d863f62edb063b489d8974b35e4e051.png)
In[105]: actor_unique.head()
out[105]:
![img_3f9b268c20c50d05f16d5d31e913ccd1.png](https://i-blog.csdnimg.cn/blog_migrate/d46bce916c4fcf705d062907f8984dff.png)
# 查看新的表所使用的内存量
In[106]: movie_table.memory_usage(deep=True).sum() + \
director_associative.memory_usage(deep=True).sum() + \
director_unique.memory_usage(deep=True).sum() + \
actor_associative.memory_usage(deep=True).sum() + \
actor_unique.memory_usage(deep=True).sum()
out[106]: 1833402
In[107]: movie_table.head()
out[107]:
![img_022c5c526cc19e924e571fc9ebc3511c.png](https://i-blog.csdnimg.cn/blog_migrate/f1b89978dd0acb17f1c3b8ba10f44c9f.png)
# 可以通过将左右表组合起来形成movie表。首先将附表与actor/director表结合,然后将num列pivot,再加上列的前缀
In[108]: actors = actor_associative.merge(actor_unique, on='actor_id') \
.drop('actor_id', 1) \
.pivot_table(index='id', columns='num', aggfunc='first')
actors.columns = actors.columns.get_level_values(0) + '_' + \
actors.columns.get_level_values(1).astype(str)
directors = director_associative.merge(director_unique, on='director_id') \
.drop('director_id', 1) \
.pivot_table(index='id', columns='num', aggfunc='first')
directors.columns = directors.columns.get_level_values(0) + '_' + \
directors.columns.get_level_values(1).astype(str)
In[109]: actors.head()
out[109]:
![img_35e9403bd0497badeb227daa98b00bcf.png](https://i-blog.csdnimg.cn/blog_migrate/8c3372fd7f7c532a91836c1f62ae5cb5.png)
In[110]: directors.head()
out[110]:
![img_ec9b421908225c45edba443839e920c5.png](https://i-blog.csdnimg.cn/blog_migrate/8800bd0124d9affa7cfbe89c835d3b75.png)
In[111]: movie2 = movie_table.merge(directors.reset_index(), on='id', how='left') \
.merge(actors.reset_index(), on='id', how='left')
In[112]: movie2.head()
out[112]:
![img_96b0724a5564d5f4977ab7c221448052.png](https://i-blog.csdnimg.cn/blog_migrate/a72f3c15d78ff2c275b0bab63a4043b1.png)
In[113]: movie.equals(movie2[movie.columns])
out[113]: True
第01章 Pandas基础
第02章 DataFrame运算
第03章 数据分析入门
第04章 选取数据子集
第05章 布尔索引
第06章 索引对齐
第07章 分组聚合、过滤、转换
第08章 数据清理
第09章 合并Pandas对象
第10章 时间序列分析
第11章 用Matplotlib、Pandas、Seaborn进行可视化