Code experiments for Chapter 7 of *Python for Data Analysis* (Wes McKinney): Data Wrangling: Clean, Transform, Merge, Reshape. The book uses Python 2.7, while I ran the code under Python 3.7 in PyCharm, so the book's source code was modified where necessary. Every printed result (except the "random"-related ones) was checked against the book and matches (outputs are in the comments; trivial outputs are omitted). Everything was typed by hand, for reference.
For the PDF and datasets, see: *Python for Data Analysis*, Chapter 2: the introductory analysis code (with download links for the PDF and datasets).
Data wrangling: clean, transform, merge, reshape:
Because the full listing is long, it has been split into segments; the imports below may be needed in every segment.
# -*- coding:utf-8 -*-
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
1. Merging datasets
1.1 Database-style DataFrame merges
df1 = DataFrame({'key':['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1':range(7)})
df2 = DataFrame({'key': ['a', 'b', 'd'],
'data2': range(3)})
print(df1)
'''
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
'''
print(df2)
'''
key data2
0 a 0
1 b 1
2 d 2
'''
# This is a many-to-one merge: df1 has multiple rows labeled a and b, while df2's key column has only one row per value
df_merge= pd.merge(df1, df2)
print(df_merge)
'''
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
'''
# Not naming the join column explicitly also works, though it is best to be explicit
df_merge= pd.merge(df1, df2, on='key')
print(df_merge)
'''
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
'''
# If the column names differ in the two objects, they can be specified separately
df3 = DataFrame({'lkey':['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1':range(7)})
df4 = DataFrame({'rkey': ['a', 'b', 'd'],
'data2': range(3)})
df_merge = pd.merge(df3, df4, left_on = 'lkey', right_on = 'rkey')
print(df_merge)
'''
lkey data1 rkey data2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0
'''
# merge does an "inner" join by default; the other options are "left", "right", and "outer"
df_outer = pd.merge(df1, df2, how='outer')
print(df_outer)
'''
key data1 data2
0 b 0.0 1.0
1 b 1.0 1.0
2 b 6.0 1.0
3 a 2.0 0.0
4 a 4.0 0.0
5 a 5.0 0.0
6 c 3.0 NaN
7 d NaN 2.0
'''
df1 = DataFrame({'key':['b', 'b', 'a', 'c', 'a', 'b'],
'data1':range(6)})
df2 = DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
'data2': range(5)})
print(df1)
'''
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
'''
print(df2)
'''
key data2
0 a 0
1 b 1
2 a 2
3 b 3
4 d 4
'''
# A many-to-many join produces the Cartesian product of the matching rows
df_merge = pd.merge(df1, df2, on='key', how='left')
print(df_merge)
'''
key data1 data2
0 b 0 1.0
1 b 0 3.0
2 b 1 1.0
3 b 1 3.0
4 a 2 0.0
5 a 2 2.0
6 c 3 NaN
7 a 4 0.0
8 a 4 2.0
9 b 5 1.0
10 b 5 3.0
'''
df_inner = pd.merge(df1, df2, how = 'inner')
print(df_inner)
'''
key data1 data2
0 b 0 1
1 b 0 3
2 b 1 1
3 b 1 3
4 b 5 1
5 b 5 3
6 a 2 0
7 a 2 2
8 a 4 0
9 a 4 2
'''
# To merge on multiple keys, pass a list of column names
left = DataFrame({'key1':['foo','foo','bar'],
'key2': ['one','two','one'],
'lval':[1, 2, 3]})
right = DataFrame({'key1':['foo','foo','bar','bar'],
'key2': ['one','one','one','two'],
'rval':[4,5,6,7]})
print(left)
'''
key1 key2 lval
0 foo one 1
1 foo two 2
2 bar one 3
'''
print(right)
'''
key1 key2 rval
0 foo one 4
1 foo one 5
2 bar one 6
3 bar two 7
'''
# The multiple keys form a series of tuples that are in effect treated as a single join key (though that is not literally what happens)
df_merge = pd.merge(left, right, on=['key1', 'key2'], how='outer')
print(df_merge)
'''
key1 key2 lval rval
0 foo one 1.0 4.0
1 foo one 1.0 5.0
2 foo two 2.0 NaN
3 bar one 3.0 6.0
4 bar two NaN 7.0
'''
# Handling overlapping column names: the suffixes option specifies the strings to append to the overlapping names in the left and right DataFrame objects
df_merge = pd.merge(left, right, on='key1')
print(df_merge)
'''
key1 key2_x lval key2_y rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
'''
df_merge = pd.merge(left, right, on='key1',suffixes=('_left', '_right'))
print(df_merge)
'''
key1 key2_left lval key2_right rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
'''
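As a side note not covered in the book, recent pandas versions also accept an `indicator` option that records where each merged row came from; a small sketch, assuming pandas >= 0.17:

```python
import pandas as pd
from pandas import DataFrame

df1 = DataFrame({'key': ['b', 'b', 'a', 'c'], 'data1': range(4)})
df2 = DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
# indicator=True adds a _merge column with the values
# 'left_only', 'right_only', or 'both' for each row
merged = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(merged)
```

This is handy for auditing an outer join: the rows tagged 'left_only' or 'right_only' are exactly the ones that picked up NaN values.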
1.2 Merging on index
# Sometimes the merge key(s) of a DataFrame live in its index; pass left_index=True or right_index=True to indicate that the index should be used as the merge key
left1 = DataFrame({'key':['a', 'b', 'a', 'a', 'b', 'c'],
'value':range(6)})
right1 = DataFrame({'group_val': [3.5, 7]}, index = ['a', 'b'])
print(left1)
'''
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
'''
print(right1)
'''
group_val
a 3.5
b 7.0
'''
df_merge = pd.merge(left1, right1, left_on = 'key', right_index=True)
print(df_merge)
'''
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
'''
df_merge = pd.merge(left1, right1, left_on = 'key', right_index=True, how='outer')
print(df_merge)
'''
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
5 c 5 NaN
'''
lefth = DataFrame({'key1':['Ohio','Ohio','Ohio','Nevada', 'Nevada'],
'key2':[2000,2001,2002,2001,2002],
'data':np.arange(5.)})
righth = DataFrame(np.arange(12).reshape((6,2)),
index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
[2001,2000,2000,2000,2001,2002]],
columns=['event1','event2'])
print(lefth)
'''
key1 key2 data
0 Ohio 2000 0.0
1 Ohio 2001 1.0
2 Ohio 2002 2.0
3 Nevada 2001 3.0
4 Nevada 2002 4.0
'''
print(righth)
'''
event1 event2
Nevada 2001 0 1
2000 2 3
Ohio 2000 4 5
2000 6 7
2001 8 9
2002 10 11
'''
# For hierarchically indexed data, the columns to merge on must be given as a list (note how duplicate index values are handled)
df_merge=pd.merge(lefth,righth,left_on = ['key1','key2'],right_index=True)
print(df_merge)
'''
key1 key2 data event1 event2
0 Ohio 2000 0.0 4 5
0 Ohio 2000 0.0 6 7
1 Ohio 2001 1.0 8 9
2 Ohio 2002 2.0 10 11
3 Nevada 2001 3.0 0 1
'''
df_merge=pd.merge(lefth,righth,left_on = ['key1','key2'],right_index=True,how = 'outer')
print(df_merge)
'''
key1 key2 data event1 event2
0 Ohio 2000 0.0 4.0 5.0
0 Ohio 2000 0.0 6.0 7.0
1 Ohio 2001 1.0 8.0 9.0
2 Ohio 2002 2.0 10.0 11.0
3 Nevada 2001 3.0 0.0 1.0
4 Nevada 2002 4.0 NaN NaN
4 Nevada 2000 NaN 2.0 3.0
'''
left2 = DataFrame([[1.,2.],[3.,4.],[5.,6.]], index=['a','c','e'],
columns=['Ohio','Nevada'])
right2 = DataFrame([[7.,8.],[9.,10.],[11.,12.],[13.,14.]],
index=['b','c','d','e'],columns=['Missouri','Alabama'])
print(left2)
'''
Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0
'''
print(right2)
'''
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0
'''
# Using the indexes of both sides works too
df_merge = pd.merge(left2,right2,how='outer',left_index=True,right_index=True)
print(df_merge)
'''
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
'''
# DataFrame also has a join instance method that conveniently merges by index; the example above can be written as
df_merge=left2.join(right2, how='outer')
print(df_merge)
'''
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
'''
left1 = DataFrame({'key':['a', 'b', 'a', 'a', 'b', 'c'],
'value':range(6)})
right1 = DataFrame({'group_val': [3.5, 7]}, index = ['a', 'b'])
df_merge=left1.join(right1, on='key')
print(df_merge)
'''
key value group_val
0 a 0 3.5
1 b 1 7.0
2 a 2 3.5
3 a 3 3.5
4 b 4 7.0
5 c 5 NaN
'''
# For simple index-on-index merges, a list of DataFrames can be passed to join
another = DataFrame([[7.,8.], [9.,10.], [11.,12.],[16.,17.]],
index=['a', 'c', 'e', 'f'], columns=['New York', 'Oregon'])
df_merge = left2.join([right2, another])
print(another)
'''
New York Oregon
a 7.0 8.0
c 9.0 10.0
e 11.0 12.0
f 16.0 17.0
'''
print(df_merge)
'''
Ohio Nevada Missouri Alabama New York Oregon
a 1.0 2.0 NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0 9.0 10.0
e 5.0 6.0 13.0 14.0 11.0 12.0
'''
df_merge = left2.join([right2, another], how='outer')
print(df_merge)
'''
Ohio Nevada Missouri Alabama New York Oregon
a 1.0 2.0 NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0 9.0 10.0
e 5.0 6.0 13.0 14.0 11.0 12.0
b NaN NaN 7.0 8.0 NaN NaN
d NaN NaN 11.0 12.0 NaN NaN
f NaN NaN NaN NaN 16.0 17.0
'''
1.3 Concatenating along an axis
# NumPy has a concatenate function for combining raw NumPy arrays
arr = np.arange(12).reshape(3,4)
print(arr)
'''
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
'''
arr_cont=np.concatenate([arr,arr],axis=1)
print(arr_cont)
'''
[[ 0 1 2 3 0 1 2 3]
[ 4 5 6 7 4 5 6 7]
[ 8 9 10 11 8 9 10 11]]
'''
s1 = Series([0,1], index=['a','b'])
s2 = Series([2,3,4], index=['c','d','e'])
s3 = Series([5,6], index=['f','g'])
df_concat = pd.concat([s1,s2,s3])
print(df_concat)
'''
a 0
b 1
c 2
d 3
e 4
f 5
g 6
'''
# By default concat works along axis=0, producing a new Series; passing axis=1 turns the result into a DataFrame
df_concat = pd.concat([s1,s2,s3],axis = 1)
print(df_concat)
'''
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
'''
s4 = pd.concat([s1 * 5, s3])
df_concat = pd.concat([s1,s4], axis = 1)
df_concat2 = pd.concat([s1,s4],axis=1,join='inner')
print(s4)
'''
a 0
b 5
f 5
g 6
'''
print(df_concat)
'''
0 1
a 0.0 0
b 1.0 5
f NaN 5
g NaN 6
'''
print(df_concat2)
'''
0 1
a 0 0
b 1 5
'''
# The keys argument makes it possible to tell the concatenated pieces apart in the result
result = pd.concat([s1,s2,s3],keys=['one','two','three'])
print(result)
'''
one a 0
b 1
two c 2
d 3
e 4
three f 5
g 6
'''
print(result.unstack())
'''
a b c d e f g
one 0.0 1.0 NaN NaN NaN NaN NaN
two NaN NaN 2.0 3.0 4.0 NaN NaN
three NaN NaN NaN NaN NaN 5.0 6.0
'''
# When combining Series along axis=1, the keys become the DataFrame's column headers
result = pd.concat([s1,s2,s3],axis = 1, keys=['one','two','three'])
print(result)
'''
one two three
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
'''
# The same logic applies to DataFrames
df1 = DataFrame(np.arange(6).reshape(3,2), index=['a','b','c'],
columns=['one','two'])
df2 = DataFrame(5+ np.arange(4).reshape(2,2), index=['a','c'],
columns=['three','four'])
df_concat = pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
print(df_concat)
'''
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
'''
# There are also arguments governing how the hierarchical index is created
df_concat = pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
names = ['upper', 'lower'])
print(df_concat)
'''
upper level1 level2
lower one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
'''
# Ignore the row indexes of the DataFrames when concatenating
df1 = DataFrame(np.random.randn(3,4), columns=['a','b','c','d'])
df2 = DataFrame(np.random.randn(2,3), columns=['b','d','a'])
df_concat = pd.concat([df1,df2], ignore_index=True)
print(df1)
'''
a b c d
0 1.836997 -1.612684 -0.563662 0.903015
1 -0.492099 -2.545862 -0.872793 -0.967434
2 -2.250779 -1.773729 -0.670727 -0.884063
'''
print(df2)
'''
b d a
0 1.323027 2.041850 0.091022
1 -0.018126 -1.705464 0.679508
'''
print(df_concat)
'''
a b c d
0 1.836997 -1.612684 -0.563662 0.903015
1 -0.492099 -2.545862 -0.872793 -0.967434
2 -2.250779 -1.773729 -0.670727 -0.884063
3 0.091022 1.323027 NaN 2.041850
4 0.679508 -0.018126 NaN -1.705464
'''
1.4 Combining data with overlap
# Combining two datasets whose indexes overlap in full or in part
a = Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
index = ['f', 'e', 'd', 'c', 'b', 'a'])
b = Series(np.arange(len(a),dtype=np.float64),
index = ['f', 'e', 'd', 'c', 'b', 'a'])
print(a)
'''
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
'''
print(b)
'''
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a 5.0
'''
df_concat = DataFrame(np.where(pd.isnull(a), b, a), index = b.index)
print(df_concat)  # my own result; the output in the original document here appears to be wrong
'''
0
f 0.0
e 2.5
d 2.0
c 3.5
b 4.5
a 5.0
'''
# Series has a combine_first method that does the same thing and also aligns the data
df_combine = b[:-2].combine_first(a[2:])
print(b[:-2])
print(a[2:])
print(df_combine)
'''
a NaN
b 4.5
c 3.0
d 2.0
e 1.0
f 0.0
'''
# For DataFrames, combine_first naturally does the same thing column by column
df1 = DataFrame({'a':[1., np.nan, 5., np.nan],
'b':[np.nan, 2., np.nan, 6.],
'c':range(2, 18, 4)})
df2 = DataFrame({'a': [5., 4., np.nan, 3., 7.],
'b': [np.nan, 3., 4., 6., 8.,]})
df_combine = df1.combine_first(df2)
print(df1)
'''
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
'''
print(df2)
'''
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0
'''
print(df_combine)
'''
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN
'''
2. Reshaping and pivoting
2.1 Reshaping with hierarchical indexing
# Hierarchical indexing provides a consistent way to rearrange data in a DataFrame
# stack "rotates" the columns into rows; unstack "rotates" the rows into columns
data = DataFrame(np.arange(6).reshape((2,3)),
index = pd.Index(['Ohio', 'Colorado'], name = 'state'),
columns = pd.Index(['one','two','three'], name = 'number'))
print(data)
'''
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
'''
# The stack method pivots the columns into the rows, yielding a Series
result = data.stack()
print(result)
'''
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
'''
# For a hierarchically indexed Series, unstack rearranges the data back into a DataFrame
result_un = result.unstack()
print(result_un)
'''
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
'''
# By default, unstack (and stack too) operates on the innermost level; pass a level number or name to unstack a different level
# My understanding: whichever level you unstack becomes the column axis
result0 = result.unstack(0)
print(result0)
'''
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
'''
result_state = result.unstack('state')
print(result_state)
'''
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
'''
# unstack may introduce missing data if not all of the level's values are found in each subgroup
s1 = Series([0,1,2,3], index = ['a', 'b', 'c', 'd'])
s2 = Series([4,5,6],index = ['c','d','e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
print(data2)
'''
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
'''
print(data2.unstack())
'''
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
'''
# stack filters out missing data by default, so the operation is invertible
print(data2.unstack().stack())
'''
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
'''
print(data2.unstack().stack(dropna=False))
'''
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
'''
# When unstacking a DataFrame, the level being unstacked becomes the lowest level in the result
df = DataFrame({'left':result,'right':result + 5},
columns = pd.Index(['left','right'], name = 'side'))
print(df)
'''
side left right
state number
Ohio one 0 5
two 1 6
three 2 7
Colorado one 3 8
two 4 9
three 5 10
'''
print(df.unstack())
'''
side left right
number one two three one two three
state
Ohio 0 1 2 5 6 7
Colorado 3 4 5 8 9 10
'''
print(df.unstack('state').stack('side'))
'''
state Colorado Ohio
number side
one left 3 0
right 8 5
two left 4 1
right 9 6
three left 5 2
right 10 7
'''
2.2 Pivoting "long" to "wide" format
# The ldata here is generated by hand and differs from the book; only the method is being exercised
ldata = DataFrame({'date':['1959-03-31','1959-03-31','1959-03-31',
'1959-06-30','1959-06-30','1959-06-30',
'1959-09-30', '1959-09-30','1959-09-30'],
'item':['realgdp', 'infl', 'unemp']*3,
'value':[2710.349, 0.0, 5.8]*3})
# ldata['date'] = pd.to_datetime(ldata['date'], format='%Y-%m-%d')  # convert the date column to datetime
print(ldata)
'''
date item value
0 1959-03-31 realgdp 2710.349
1 1959-03-31 infl 0.000
2 1959-03-31 unemp 5.800
3 1959-06-30 realgdp 2710.349
4 1959-06-30 infl 0.000
5 1959-06-30 unemp 5.800
6 1959-09-30 realgdp 2710.349
7 1959-09-30 infl 0.000
8 1959-09-30 unemp 5.800
'''
# After pivoting, the values in the date column become the index
pivoted = ldata.pivot('date','item','value')
print(pivoted)
'''
item infl realgdp unemp
date
1959-03-31 0.0 2710.349 5.8
1959-06-30 0.0 2710.349 5.8
1959-09-30 0.0 2710.349 5.8
'''
# If there are two data columns to reshape at once
ldata['values2'] = np.random.randn(len(ldata))
print(ldata)
'''
date item value values2
0 1959-03-31 realgdp 2710.349 -1.031220
1 1959-03-31 infl 0.000 0.051459
2 1959-03-31 unemp 5.800 1.968213
3 1959-06-30 realgdp 2710.349 -0.784778
4 1959-06-30 infl 0.000 1.401131
5 1959-06-30 unemp 5.800 0.169799
6 1959-09-30 realgdp 2710.349 0.822280
7 1959-09-30 infl 0.000 0.203341
8 1959-09-30 unemp 5.800 -0.764332
'''
pivoted = ldata.pivot('date', 'item')
print(pivoted)
# pivot is really just a shortcut: creating a hierarchical index with set_index and then reshaping with unstack gives the same result
unstacked = ldata.set_index(['date','item']).unstack('item')
print(unstacked)
'''
value values2
item infl realgdp unemp infl realgdp unemp
date
1959-03-31 0.0 2710.349 5.8 0.062236 0.272687 1.597471
1959-06-30 0.0 2710.349 5.8 0.817548 0.181683 0.677791
1959-09-30 0.0 2710.349 5.8 0.239968 0.832097 -0.530899
'''
3. Data transformation
3.1 Removing duplicates
print(['one'] * 3 + ['two'] * 4)  # ['one', 'one', 'one', 'two', 'two', 'two', 'two']
data = DataFrame({'k1':['one']*3 + ['two']*4,
'k2': [1,1,2,3,3,4,4]})
print(data)
'''
k1 k2
0 one 1
1 one 1
2 one 2
3 two 3
4 two 3
5 two 4
6 two 4
'''
# The DataFrame duplicated() method returns a boolean Series indicating whether each row is a duplicate
print(data.duplicated())
'''
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
'''
# The drop_duplicates method returns a DataFrame with the duplicate rows removed
df_no_duplicate = data.drop_duplicates()
print(df_no_duplicate)
'''
k1 k2
0 one 1
2 one 2
3 two 3
5 two 4
'''
# duplicated and drop_duplicates keep the first observed value combination by default; pass keep='last' to keep the last one
df_no_duplicate = data.drop_duplicates(['k1', 'k2'], keep='last')
print(df_no_duplicate)
'''
k1 k2
1 one 1
2 one 2
4 two 3
6 two 4
'''
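Besides keeping the first or last occurrence, current pandas also accepts keep=False, which drops every member of each duplicated group; a small aside beyond the book's example:

```python
import pandas as pd
from pandas import DataFrame

data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                  'k2': [1, 1, 2, 3, 3, 4, 4]})
# keep=False discards all rows that have any duplicate,
# leaving only the combinations that occur exactly once
print(data.drop_duplicates(keep=False))
#     k1  k2
# 2  one   2
```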
3.2 Transforming data using a function or mapping
data = DataFrame({'food':['bacon', 'pulled pork', 'bacon', 'Pastrami',
'corned beef', 'bacon', 'pastrami', 'honey ham',
'nova lox'],
'ounces':[4,3,12,6,7.5,8,3,5,6]})
print(data)
'''
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
'''
# Add a column indicating the animal each meat came from; first write a mapping of meats to animals
meat_to_animal={
'bacon':'pig',
'pulled pork':'pig',
'pastrami':'cow',
'corned beef':'cow',
'honey ham':'pig',
'nova lox':'salmon'
}
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
print(data)
'''
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
'''
# A lambda function can also be passed, giving the same result
data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
3.3 Replacing values
data = Series([1.,-999.,-2,-999.,-1000.,3.])
print(data)
# Replace -999 with NA
data_rep = data.replace(-999, np.nan)
print(data_rep)
'''
0 1.0
1 NaN
2 -2.0
3 NaN
4 -1000.0
5 3.0
'''
# To replace several values at once, pass a list of values to replace and a single replacement value
data_rep = data.replace([-999, -1000], np.nan)
print(data_rep)
'''
0 1.0
1 NaN
2 -2.0
3 NaN
4 NaN
5 3.0
'''
# To use a different replacement for each value, pass a list of substitute values
data_rep = data.replace([-999, -1000], [np.nan,0])
print(data_rep)
'''
0 1.0
1 NaN
2 -2.0
3 NaN
4 0.0
5 3.0
'''
# The argument can also be a dict
data_rep = data.replace({-999:np.nan, -1000: 0})
print(data_rep)
'''
0 1.0
1 NaN
2 -2.0
3 NaN
4 0.0
5 3.0
'''
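As an aside not in the book's walkthrough, replace can also take regular expressions when regex=True is passed, performing re.sub-style substring substitution within each value; a sketch on a string Series:

```python
import pandas as pd
from pandas import Series

s = Series(['foo', 'fuz', 'bar'])
# with regex=True the pattern 'f.' is treated as a regular
# expression, and matching substrings are replaced per value
print(s.replace('f.', 'ba', regex=True))
```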
3.4 Renaming axis indexes
# As with a Series, axis labels can be transformed by a function or mapping, yielding a new object
data = DataFrame(np.arange(12).reshape((3,4)),
index = ['Ohio', 'Colorada', 'New York'],
columns = ['one','two', 'three', 'four'])
print(data)
# The axis labels also have a map method
print(data.index.map(str.upper)) # Index(['OHIO', 'COLORADA', 'NEW YORK'], dtype='object')
# Assigning the map result back to index modifies the DataFrame in place
data.index = data.index.map(str.upper)
print(data)
'''
one two three four
OHIO 0 1 2 3
COLORADA 4 5 6 7
NEW YORK 8 9 10 11
'''
# To transform the labels without modifying the original data, use the rename method
# Restore the data above to its initial state
data.index = data.index.map(str.lower)
data_rename = data.rename(index=str.title, columns=str.upper)
print(data_rename)
'''
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colorada 4 5 6 7
New York 8 9 10 11
'''
# rename can be used with a dict-like object to update a subset of the axis labels
data.index = data.index.map(str.upper)  # uppercase the index again to match the book
data_rename = data.rename(index={'OHIO':'INDIANA'},
columns={'three':'peekaboo'})
print(data_rename)
'''
one two peekaboo four
INDIANA 0 1 2 3
COLORADA 4 5 6 7
NEW YORK 8 9 10 11
'''
# To modify a dataset in place, pass inplace=True
_ = data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
print(data)
'''
          one  two  three  four
INDIANA     0    1      2     3
COLORADA    4    5      6     7
NEW YORK    8    9     10    11
'''
3.5 Discretization and binning
# For analysis, continuous data is often discretized or otherwise split into "bins"
ages = [20,22,25,27,21,23,37,31,61,45,41,32]
bins = [18,25,35,60,100]
cats = pd.cut(ages, bins)
print(cats)
'''
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
'''
# pandas returns a Categorical object, which can be treated as an array of strings naming the bins
# print(cats.codes)       # the bin codes (cats.labels in older pandas)
# print(cats.categories)  # the bins themselves (cats.levels in older pandas)
print(pd.value_counts(cats))
'''
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
'''
# right=False changes which side of each interval is open or closed
print(pd.cut(ages, [18,26,36,61,100],right = False))
'''
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
'''
# You can set your own bin names by passing a list or array to the labels option
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
print(pd.cut(ages, bins, labels= group_names))
'''
['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']
'''
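If only the integer bin codes are needed rather than a Categorical of labels, labels=False can be passed; a small sketch beyond the book's example:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
# labels=False returns the zero-based index of the bin each value falls into
print(pd.cut(ages, bins, labels=False))
# [0 0 0 1 0 0 2 1 3 2 2 1]
```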
# If cut is given a number of bins instead of explicit edges, it computes equal-length bins from the minimum and maximum of the data
data = np.random.randn(20)
print(pd.cut(data,4,precision=2))
'''
[(0.99, 1.95], (-0.93, 0.033], (-0.93, 0.033], (-0.93, 0.033], (-0.93, 0.033], ..., (0.033, 0.99], (0.033, 0.99], (0.033, 0.99], (0.033, 0.99], (-0.93, 0.033]]
Length: 20
Categories (4, interval[float64]): [(-0.93, 0.033] < (0.033, 0.99] < (0.99, 1.95] < (1.95, 2.9]]
'''
# qcut bins on sample quantiles, so the bins come out roughly equal-sized
data = np.random.randn(1000)
cats = pd.qcut(data, 4)
print(cats)
'''
[(-3.288, -0.669], (-3.288, -0.669], (-3.288, -0.669], (-3.288, -0.669], (0.654, 3.66], ..., (-0.669, -0.0731], (-3.288, -0.669], (-0.669, -0.0731], (0.654, 3.66], (-3.288, -0.669]]
Length: 1000
Categories (4, interval[float64]): [(-3.288, -0.669] < (-0.669, -0.0731] < (-0.0731, 0.654] <
(0.654, 3.66]]
'''
print(pd.value_counts(cats))
'''
(0.654, 3.66] 250
(-0.0731, 0.654] 250
(-0.669, -0.0731] 250
(-3.288, -0.669] 250
'''
# Like cut, qcut also accepts custom quantiles (numbers between 0 and 1, endpoints included)
print(pd.qcut(data, [0,0.1,0.5,0.9,1.]))
'''
[(-1.245, -0.0731], (-1.245, -0.0731], (-3.288, -1.245], (-3.288, -1.245], (1.323, 3.66], ..., (-1.245, -0.0731], (-1.245, -0.0731], (-1.245, -0.0731], (-0.0731, 1.323], (-1.245, -0.0731]]
Length: 1000
Categories (4, interval[float64]): [(-3.288, -1.245] < (-1.245, -0.0731] < (-0.0731, 1.323] <
(1.323, 3.66]]
'''
3.6 Detecting and filtering outliers
# Filtering or transforming outliers is largely a matter of array operations
np.random.seed(12345)
data = DataFrame(np.random.randn(1000,4))
print(data.describe())  # inspect a statistical summary of the data
'''
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.067684 0.067924 0.025598 -0.002298
std 0.998035 0.992106 1.006835 0.996794
min -3.428254 -3.548824 -3.184377 -3.745356
25% -0.774890 -0.591841 -0.641675 -0.644144
50% -0.116401 0.101143 0.002073 -0.013611
75% 0.616366 0.780282 0.680391 0.654328
max 3.366626 2.653656 3.260383 3.927528
'''
col = data[3]
# Find the values in a column whose absolute value exceeds 3
print(col[np.abs(col) > 3])
'''
97 3.927528
305 -3.399312
400 -3.745356
Name: 3, dtype: float64
'''
# To select all rows containing a value exceeding 3 or -3, use a boolean DataFrame together with the any method
data_select = data[(np.abs(data)>3).any(1)]
print(data_select)
'''
0 1 2 3
5 -0.539741 0.476985 3.248944 -1.021228
97 -0.774363 0.552936 0.106061 3.927528
102 -0.655054 -0.565230 3.176873 0.959533
305 -2.315555 0.457246 -0.025907 -3.399312
324 0.050188 1.951312 3.260383 0.963301
400 0.146326 0.508391 -0.196713 -3.745356
499 -0.293333 -0.242459 -3.056990 1.918403
523 -3.428254 -0.296336 -0.439938 -0.867165
586 0.275144 1.179227 -3.184377 1.369891
808 -0.362528 -3.548824 1.553205 -2.186301
900 3.366626 -2.372214 0.851010 1.332846
'''
# Cap the values to the interval [-3, 3]
data[np.abs(data) > 3] = np.sign(data) *3
print(data.describe())
'''
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.067623 0.068473 0.025153 -0.002081
std 0.995485 0.990253 1.003977 0.989736
min -3.000000 -3.000000 -3.000000 -3.000000
25% -0.774890 -0.591841 -0.641675 -0.644144
50% -0.116401 0.101143 0.002073 -0.013611
75% 0.616366 0.780282 0.680391 0.654328
max 3.000000 2.653656 3.000000 3.000000
'''
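The same capping can be written more directly with the clip method; a sketch of an equivalent formulation, not from the original text:

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

np.random.seed(12345)
data = DataFrame(np.random.randn(1000, 4))
# clip caps every value into the interval [-3, 3] in one call,
# equivalent to data[np.abs(data) > 3] = np.sign(data) * 3
clipped = data.clip(-3, 3)
print(clipped.describe())
```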
3.7 Permutation and random sampling
# Calling permutation with the length of the axis to be permuted produces an array of integers describing the new ordering
sampler = np.random.permutation(5)
print(sampler) # [1 0 2 3 4]
df = DataFrame(np.arange(5 * 4).reshape((5, 4)))  # define the DataFrame being permuted
print(df)
'''
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
'''
print(df.take(sampler))
'''
0 1 2 3
1 4 5 6 7
0 0 1 2 3
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
'''
# To select a random subset without replacement, one way is to slice off the first k elements of the array returned by permutation
print(df.take(np.random.permutation(len(df))[:3]))
'''
0 1 2 3
1 4 5 6 7
3 12 13 14 15
4 16 17 18 19
'''
# To generate a sample with replacement, the fastest way is to draw a set of random integers with np.random.randint
bag = np.array([5, 7, -1, 6, 4])
sampler = np.random.randint(0, len(bag), size = 10)
print(sampler) # [4 4 2 2 2 0 3 0 4 1]
draws = bag.take(sampler)
print(draws) # [ 4 4 -1 -1 -1 5 6 5 4 7]
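In newer pandas (0.16+), the sample method covers both cases, with and without replacement; a brief aside beyond the book:

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame(np.arange(5 * 4).reshape((5, 4)))
# without replacement, like df.take(np.random.permutation(len(df))[:3])
print(df.sample(n=3))
# with replacement, like bag.take(np.random.randint(0, len(bag), size=10))
print(df.sample(n=10, replace=True))
```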
3.8 Computing indicator/dummy variables
# If a column in a DataFrame has k distinct values, a matrix or DataFrame with k columns can be derived from it
df = DataFrame({'key':['b', 'b', 'a', 'c', 'a', 'b'],
'data1':range(6)})
print(pd.get_dummies(df['key']))
'''
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
'''
# Add a prefix to the DataFrame's columns so the result can be merged with other data
dummies = pd.get_dummies(df['key'], prefix = 'key')
df_with_dumy = df[['data1']].join(dummies)
print(df_with_dumy)
'''
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
'''
# Things get more complicated when a row of the DataFrame belongs to multiple categories
mnames = ['movie_id','title','genres']
movies = pd.read_table('python_data/ch02/movielens/movies.dat',sep='::',header = None,
names = mnames)
print(movies[:10])
'''
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children's
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
'''
# Wrangle the genre data: collect all distinct genres
genre_iter = (set(x.split('|')) for x in movies.genres)
genres = sorted(set.union(*genre_iter))
# Start building the indicator DataFrame from a DataFrame of all zeros
dummies = DataFrame(np.zeros((len(movies), len(genres))), columns = genres)
# Iterate over each movie and set the appropriate entries of each row of dummies to 1
for i, gen in enumerate(movies.genres):
    dummies.loc[i, gen.split('|')] = 1  # a neat trick
movies_windic = movies.join(dummies.add_prefix('Genre'))
print(movies_windic.iloc[0])
'''
movie_id 1
title Toy Story (1995)
genres Animation|Children's|Comedy
GenreAction 0
GenreAdventure 0
GenreAnimation 1
GenreChildren's 1
GenreComedy 1
GenreCrime 0
GenreDocumentary 0
GenreDrama 0
GenreFantasy 0
GenreFilm-Noir 0
GenreHorror 0
GenreMusical 0
GenreMystery 0
GenreRomance 0
GenreSci-Fi 0
GenreThriller 0
GenreWar 0
GenreWestern 0
'''
# Combining get_dummies with a discretization function such as cut
values = np.random.rand(10)
print(values)
bins = [0,0.2,0.4,0.6,0.8,1]
print(pd.get_dummies(pd.cut(values, bins)))
'''
(0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]
0 0 0 0 1 0
1 0 0 0 0 1
2 0 0 0 0 1
3 1 0 0 0 0
4 0 0 0 0 1
5 0 0 0 1 0
6 1 0 0 0 0
7 0 1 0 0 0
8 0 0 0 0 1
9 0 0 1 0 0
'''
4. String manipulation
4.1 String object methods
val = 'a,b, guido'
# split breaks the string apart
print(val.split(',')) # ['a', 'b', ' guido']
# split is often combined with strip (which trims whitespace, including newlines)
pieces = [x.strip() for x in val.split(',')]
print(pieces) # ['a', 'b', 'guido']
# Addition can join these substrings with a double-colon delimiter
first, second, third = pieces
str_combine = first+'::'+second+"::"+third
print(str_combine) # a::b::guido
# Passing a list or tuple to the join method of the string '::' achieves the same result faster
str_combine = "::".join(pieces)
print(str_combine) # a::b::guido
# For locating substrings, detection is best done with Python's in keyword (index and find also work)
print('guido' in val) # True
print(val.index(','))  # 1, meaning the substring exists
print(val.find(':'))   # -1 means not found; rfind() returns the position of the last occurrence
# The difference between find and index: index raises an exception when the substring is not found
# val.index(":")
'''
Traceback (most recent call last):
File "C:/Users/Yasin/PycharmProjects/energy_save_optimization/chapter7.py", line 1321, in <module>
val.index(":")
ValueError: substring not found
'''
# count returns the number of occurrences of a particular substring
print(val.count(','))  # 2
# replace substitutes one pattern for another; passing an empty string deletes the pattern
print(val.replace(',', '::'))  # a::b:: guido
print(val.replace(',', ''))    # ab guido
4.2 Regular expressions
# Regular expressions provide a flexible way to search or match string patterns in text
# Python's built-in re module applies regular expressions to strings; its functions fall into three categories: pattern matching, substitution, and splitting
import re
text = "foo bar\t baz \tqux"
print(re.split(r'\s+', text))  # ['foo', 'bar', 'baz', 'qux']
# re.compile compiles the regex yourself, yielding a reusable regex object
regex = re.compile(r'\s+')
print(regex.split(text)) # ['foo', 'bar', 'baz', 'qux']
# To get all patterns matching the regex, use the findall method
print(regex.findall(text)) # [' ', '\t ', ' \t']
# match and search are related to findall: findall returns all matches in a string, while search returns only the first
# match is stricter still: it only matches at the beginning of the string
text = """Dava dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags = re.IGNORECASE)
ret = regex.findall(text)
print(ret)
'''['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']'''
# search returns the first email address in the text (as a special match object)
m = regex.search(text)
print(m) # <re.Match object; span=(5, 20), match='dave@google.com'>
print(text[m.start():m.end()]) # dave@google.com
# regex.match returns None, since it only matches a pattern occurring at the start of the string
print(regex.match(text)) # None
# sub replaces each matched pattern with the specified string and returns the new string
print(regex.sub('REDACTED', text))
'''
Dava REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
'''
# To find email addresses and also segment each one into its three parts, wrap the parts to segment in parentheses
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags = re.IGNORECASE)
# The groups method returns a tuple of the pattern's components
m = regex.match('wesm@bright.net')
print(m.groups()) # ('wesm', 'bright', 'net')
# For patterns with groups, findall returns a list of tuples
print(regex.findall(text))
'''[('dave', 'google', 'com'), ('steve', 'gmail', 'com'), ('rob', 'gmail', 'com'), ('ryan', 'yahoo', 'com')]'''
# sub can access the groups in each match using special symbols such as \1 and \2
print( regex.sub(r'Username:\1, Domain: \2, Suffix \3', text))
'''
Dava Username:dave, Domain: google, Suffix com
Steve Username:steve, Domain: gmail, Suffix com
Rob Username:rob, Domain: gmail, Suffix com
Ryan Username:ryan, Domain: yahoo, Suffix com
'''
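Named capture groups, (?P<name>...), make the segments of a match accessible by name instead of position; a small sketch using the same email pattern:

```python
import re

pattern = r'(?P<username>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
# groupdict maps each named group to its matched text
print(m.groupdict())
# {'username': 'wesm', 'domain': 'bright', 'suffix': 'net'}
```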
4.3 Vectorized string functions in pandas
data = {'Dave':'dave@google.com', 'Steve':'steve@gmail.com',
'Rob':'rob@gmail.com', 'Wes':np.nan}
data = Series(data)
print(data)
'''
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Wes NaN
'''
print(data.isnull())
'''
Dave False
Steve False
Rob False
Wes True
dtype: bool
'''
# Through data.map, all string and regex methods can be applied to each value, but they fail when NA values are present
# The Series str attribute solves this; str.contains checks whether each email address contains 'gmail'
print(data.str.contains('gmail'))
'''
Dave False
Steve True
Rob True
Wes NaN
'''
print(pattern) # ([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})
print(data.str.findall(pattern, flags = re.IGNORECASE))
'''
Dave [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Wes NaN
'''
# Slice out substrings
print(data.str[:5])
'''
Dave dave@
Steve steve
Rob rob@g
Wes NaN
'''
5. Example: USDA Food Database
import json
db = json.load(open('python_data/ch07/foods-2011-10-03.json'))
print(len(db)) # 6636
# Each entry in db is a dict containing all the data for a single food. The nutrients field is a list of dicts, one per nutrient
print(db[0].keys())
'''
dict_keys(['id', 'description', 'tags', 'manufacturer', 'group', 'portions', 'nutrients'])
'''
print(db[0]['nutrients'][0])
'''
{'value': 25.18, 'units': 'g', 'description': 'Protein', 'group': 'Composition'}
'''
nutrients = DataFrame(db[0]['nutrients'])
print(nutrients[:7])
'''
value units description group
0 25.18 g Protein Composition
1 29.20 g Total lipid (fat) Composition
2 3.06 g Carbohydrate, by difference Composition
3 3.28 g Ash Other
4 376.00 kcal Energy Energy
5 39.28 g Water Composition
6 1573.00 kJ Energy Energy
'''
# Extract the food name, group, id, and manufacturer
info_keys = ['description', 'group', 'id', 'manufacturer']
info = DataFrame(db, columns=info_keys)
print(info[:5])
'''
description ... manufacturer
0 Cheese, caraway ...
1 Cheese, cheddar ...
2 Cheese, edam ...
3 Cheese, feta ...
4 Cheese, mozzarella, part skim milk ...
'''
# value_counts shows the distribution of food groups
print(pd.value_counts(info.group)[:10])
'''
Vegetables and Vegetable Products 812
Beef Products 618
Baked Products 496
Breakfast Cereals 403
Fast Foods 365
Legumes and Legume Products 365
Lamb, Veal, and Game Products 345
Sweets 341
Pork Products 328
Fruits and Fruit Juices 328
'''
nutrients = []
for rec in db:
    fnuts = DataFrame(rec['nutrients'])
    fnuts['id'] = rec['id']
    nutrients.append(fnuts)
nutrients = pd.concat(nutrients, ignore_index = True)
print(nutrients)
'''
value units description group id
0 25.180 g Protein Composition 1008
1 29.200 g Total lipid (fat) Composition 1008
2 3.060 g Carbohydrate, by difference Composition 1008
3 3.280 g Ash Other 1008
4 376.000 kcal Energy Energy 1008
... ... ... ... ... ...
389350 0.000 mcg Vitamin B-12, added Vitamins 43546
389351 0.000 mg Cholesterol Other 43546
389352 0.072 g Fatty acids, total saturated Other 43546
389353 0.028 g Fatty acids, total monounsaturated Other 43546
389354 0.041 g Fatty acids, total polyunsaturated Other 43546
'''
print(nutrients.duplicated().sum()) # 14179
# Remove the duplicates
nutrients = nutrients.drop_duplicates()
# Both DataFrames have group and description columns, so rename the columns
col_mapping = {'description': 'food',
'group': 'fgroup'}
info =info.rename(columns=col_mapping, copy=False)
print(info)
'''
food ... manufacturer
0 Cheese, caraway ...
1 Cheese, cheddar ...
2 Cheese, edam ...
3 Cheese, feta ...
4 Cheese, mozzarella, part skim milk ...
... ... ... ...
6631 Bologna, beef, low fat ...
6632 Turkey and pork sausage, fresh, bulk, patty or... ...
6633 Babyfood, juice, pear ... None
6634 Babyfood, dessert, banana yogurt, strained ... None
6635 Babyfood, banana no tapioca, strained ... None
'''
col_mapping = {'description': 'nutrient',
'group': 'nutgroup'}
nutrients =nutrients.rename(columns=col_mapping, copy=False)
print(nutrients)
'''
value units nutrient nutgroup id
0 25.180 g Protein Composition 1008
1 29.200 g Total lipid (fat) Composition 1008
2 3.060 g Carbohydrate, by difference Composition 1008
3 3.280 g Ash Other 1008
4 376.000 kcal Energy Energy 1008
... ... ... ... ... ...
389350 0.000 mcg Vitamin B-12, added Vitamins 43546
389351 0.000 mg Cholesterol Other 43546
389352 0.072 g Fatty acids, total saturated Other 43546
389353 0.028 g Fatty acids, total monounsaturated Other 43546
389354 0.041 g Fatty acids, total polyunsaturated Other 43546
'''
ndata = pd.merge(nutrients, info, on='id', how = 'outer')
print(ndata)
'''
value units ... fgroup manufacturer
0 25.180 g ... Dairy and Egg Products
1 29.200 g ... Dairy and Egg Products
2 3.060 g ... Dairy and Egg Products
3 3.280 g ... Dairy and Egg Products
4 376.000 kcal ... Dairy and Egg Products
... ... ... ... ... ...
375171 0.000 mcg ... Baby Foods None
375172 0.000 mg ... Baby Foods None
375173 0.072 g ... Baby Foods None
375174 0.028 g ... Baby Foods None
375175 0.041 g ... Baby Foods None
'''
print(ndata.iloc[30000])
'''
value 0.04
units g
description Glycine
group Amino Acids
id 6158
food Soup, tomato bisque, canned, condensed
fgroup Soups, Sauces, and Gravies
manufacturer
Name: 30000, dtype: object
'''
import matplotlib.pyplot as plt
# Median zinc values by food group
result = ndata.groupby(['nutrient', 'fgroup'])['value'].quantile(0.5)
rest_sorted = result['Zinc, Zn'].sort_values()
plt.barh(rest_sorted.index, width=rest_sorted)  # changed from the original
plt.savefig('test.png', bbox_inches='tight')
plt.show()
# Find which food is most dense in each nutrient
by_nutrient = ndata.groupby(['nutgroup', 'nutrient'])
get_maximum = lambda x: x.xs(x.value.idxmax())
get_minimum = lambda x: x.xs(x.value.idxmin())
max_foods = by_nutrient.apply(get_maximum)[['value','food']]
max_foods.food = max_foods.food.str[:50]
# Show the 'Amino Acids' nutrient group
print(max_foods.loc['Amino Acids']['food'])
'''
nutrient
Alanine Gelatins, dry powder, unsweetened
Arginine Seeds, sesame flour, low-fat
Aspartic acid Soy protein isolate
Cystine Seeds, cottonseed flour, low fat (glandless)
Glutamic acid Soy protein isolate
Glycine Gelatins, dry powder, unsweetened
Histidine Whale, beluga, meat, dried (Alaska Native)
Hydroxyproline KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINA...
Isoleucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Leucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Lysine Seal, bearded (Oogruk), meat, dried (Alaska Na...
Methionine Fish, cod, Atlantic, dried and salted
Phenylalanine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Proline Gelatins, dry powder, unsweetened
Serine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Threonine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Tryptophan Sea lion, Steller, meat with fat (Alaska Native)
Tyrosine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Valine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
'''