Data Wrangling: Clean, Transform, Merge, Reshape (Notes on Chapter 7 of "Python for Data Analysis")

Combining and Merging Data Sets

pandas.merge connects rows in different DataFrames based on one or more keys.

pandas.concat stacks together multiple objects along an axis.

The combine_first instance method splices together overlapping data to fill in missing values, similar to a full outer join in a database.

Database-style DataFrame Merges

Merge or join operations combine data sets by linking rows using one or more keys. These operations are central to relational databases.

import pandas as pd
from pandas import DataFrame,Series
df1=DataFrame({'key':['b','b','a','c','a','a','b'],'data1':range(7)})
df2=DataFrame({'key':['a','b','d'],'data2':range(3)})
df1
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b
df2
   data2 key
0      0   a
1      1   b
2      2   d
# This is a many-to-one merge: df1 has multiple rows labeled a and b, while each value in df2's key column corresponds to exactly one row
pd.merge(df1,df2)
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0
# By default merge uses the overlapping column names as the join keys; it is best to specify the key explicitly
pd.merge(df1,df2,on='key')
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0
# If the column names differ between the two objects, they can be specified separately
df3=DataFrame({'lkey':['b','b','a','c','a','a','b'],'data1':range(7)})
df4=DataFrame({'rkey':['a','b','d'],'data2':range(3)})
pd.merge(df3,df4,left_on='lkey',right_on='rkey')
   data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a
# The 'c' and 'd' values and the data associated with them are missing from the result. By default merge does an 'inner' join, so the keys in the result
# are the intersection. The other options are 'left', 'right', and 'outer'; an outer join takes the union of the keys, combining the effect of the left and right joins
pd.merge(df1,df2,how='outer')
   data1 key  data2
0    0.0   b    1.0
1    1.0   b    1.0
2    6.0   b    1.0
3    2.0   a    0.0
4    4.0   a    0.0
5    5.0   a    0.0
6    3.0   c    NaN
7    NaN   d    2.0
# Many-to-many merges are also straightforward
df1=DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})
df2=DataFrame({'key':['a','b','a','b','d'],'data2':range(5)})
df1
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   b
df2
   data2 key
0      0   a
1      1   b
2      2   a
3      3   b
4      4   d
pd.merge(df1,df2,on='key',how='left')
    data1 key  data2
0       0   b    1.0
1       0   b    3.0
2       1   b    1.0
3       1   b    3.0
4       2   a    0.0
5       2   a    2.0
6       3   c    NaN
7       4   a    0.0
8       4   a    2.0
9       5   b    1.0
10      5   b    3.0
# A many-to-many join forms the Cartesian product of the rows. Since the left DataFrame has 3 'b' rows and the right has 2, the result
# contains 6 'b' rows. The join method only affects which distinct key values appear in the result
pd.merge(df1,df2,how='inner')
   data1 key  data2
0      0   b      1
1      0   b      3
2      1   b      1
3      1   b      3
4      5   b      1
5      5   b      3
6      2   a      0
7      2   a      2
8      4   a      0
9      4   a      2
# To merge on multiple keys, pass a list of column names
left=DataFrame({'key1':['foo','foo','bar'],'key2':['one','two','one']
                ,'lval':[1,2,3]})
right=DataFrame({'key1':['foo','foo','bar','bar'],'key2':['one','one','one','two']
                ,'rval':[4,5,6,7]})
pd.merge(left,right,on=['key1','key2'],how='outer')
  key1 key2  lval  rval
0  foo  one   1.0   4.0
1  foo  one   1.0   5.0
2  foo  two   2.0   NaN
3  bar  one   3.0   6.0
4  bar  two   NaN   7.0
# When you are joining columns on columns, the indexes of the DataFrames are discarded
# For overlapping column names, use merge's suffixes option, which specifies strings to append to the overlapping names in the left and right DataFrame objects
pd.merge(left,right,on='key1')
  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7
pd.merge(left,right,on='key1',suffixes=('_left','_right'))
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7

Merging on Index

In this case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key.

left1=DataFrame({'key':['a','b','a','a','b','c'],'value':range(6)})
right1=DataFrame({'group_val':[3.5,7]},index=['a','b'])
left1
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5
right1
   group_val
a        3.5
b        7.0
pd.merge(left1,right1,left_on='key',right_index=True)
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
# Since the default merge method intersects the join keys, you can instead form the union of them with an outer join
pd.merge(left1,right1,left_on='key',right_index=True,how='outer')
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN
# For hierarchically indexed data
import numpy as np
lefth=DataFrame({'key1':['Ohio','Ohio','Ohio','Nevada','Nevada'],
                 'key2':[2000,2001,2002,2001,2002],'data':np.arange(5.0)})
righth=DataFrame(np.arange(12).reshape((6,2)),index=[['Nevada','Nevada','Ohio','Ohio','Ohio','Ohio'],
                                                     [2001,2000,2000,2000,2001,2002]],columns=['event1','event2'])
lefth
   data    key1  key2
0   0.0    Ohio  2000
1   1.0    Ohio  2001
2   2.0    Ohio  2002
3   3.0  Nevada  2001
4   4.0  Nevada  2002
righth
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11
# In this case, you have to indicate the multiple columns to merge on as a list (note the handling of duplicate index values)
pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True)
   data    key1  key2  event1  event2
0   0.0    Ohio  2000       4       5
0   0.0    Ohio  2000       6       7
1   1.0    Ohio  2001       8       9
2   2.0    Ohio  2002      10      11
3   3.0  Nevada  2001       0       1
pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True,how='outer')
   data    key1  key2  event1  event2
0   0.0    Ohio  2000     4.0     5.0
0   0.0    Ohio  2000     6.0     7.0
1   1.0    Ohio  2001     8.0     9.0
2   2.0    Ohio  2002    10.0    11.0
3   3.0  Nevada  2001     0.0     1.0
4   4.0  Nevada  2002     NaN     NaN
4   NaN  Nevada  2000     2.0     3.0
# Using the indexes of both sides of the merge is also fine
left2=DataFrame([[1.,2.],[3.,4.],[5.,6.]],index=['a','c','e'],columns=['Ohio','Nevada'])
right2=DataFrame([[7.,8.],[9.,10.],[11.,12.],[13.,14.]],index=['b','c','d','e'],columns=['Missouri','Alabama'])
left2
   Ohio  Nevada
a   1.0     2.0
c   3.0     4.0
e   5.0     6.0
right2
   Missouri  Alabama
b       7.0      8.0
c       9.0     10.0
d      11.0     12.0
e      13.0     14.0
pd.merge(left2,right2,how='outer',left_index=True,right_index=True)
   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
# DataFrame also has a join instance method for more convenient merging by index. It can also be used to combine many DataFrame objects
# having the same or similar indexes, regardless of whether their columns overlap
left2.join(right2,how="outer")
   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
# It also supports joining the index of the passed DataFrame on one of the columns of the calling DataFrame
left1.join(right1,on='key')
  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN
# For simple index-on-index merges, you can pass a list of DataFrames to join (the concat function is more general)
another=DataFrame([[7.,8.],[9.,10.],[11.,12.],[16.,17.]],index=['a','c','e','f'],columns=['New York','Oregon'])
left2.join([right2,another])
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
c   3.0     4.0       9.0     10.0       9.0    10.0
e   5.0     6.0      13.0     14.0      11.0    12.0
left2.join([right2,another],how='outer')
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
b   NaN     NaN       7.0      8.0       NaN     NaN
c   3.0     4.0       9.0     10.0       9.0    10.0
d   NaN     NaN      11.0     12.0       NaN     NaN
e   5.0     6.0      13.0     14.0      11.0    12.0
f   NaN     NaN       NaN      NaN      16.0    17.0

Concatenating Along an Axis

Another kind of data combination operation is referred to as concatenation, binding, or stacking.

arr=np.arange(12).reshape((3,4))
arr
array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]])
# Concatenation with the NumPy function
np.concatenate([arr,arr],axis=1)
array([[ 0, 1, 2, 3, 0, 1, 2, 3], [ 4, 5, 6, 7, 4, 5, 6, 7], [ 8, 9, 10, 11, 8, 9, 10, 11]])
# For pandas objects such as Series and DataFrame, having labeled axes lets you further generalize array concatenation
s1=Series([0,1],index=['a','b'])
s2=Series([2,3,4],index=['c','d','e'])
s3=Series([5,6],index=['f','g'])
# Calling concat on these objects glues together the values and indexes
pd.concat([s1,s2,s3])
a 0 b 1 c 2 d 3 e 4 f 5 g 6 dtype: int64
# By default concat works along axis=0, producing another Series. If you pass axis=1, the result will instead be a DataFrame (axis=1 is the columns)
pd.concat([s1,s2,s3],axis=1)
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
# In this case there is no overlap on the other axis, which you can see from the sorted union (the outer join) of the indexes. Pass join='inner' to get their intersection
s4=pd.concat([s1*5,s3])
s4
a 0 b 5 f 5 g 6 dtype: int64
pd.concat([s1,s4],axis=1)
     0  1
a  0.0  0
b  1.0  5
f  NaN  5
g  NaN  6
pd.concat([s1,s4],axis=1,join='inner')
   0  1
a  0  0
b  1  5
# You can specify the indexes to be used on the other axes with join_axes
pd.concat([s1,s4],axis=1,join_axes=[['a','c','b','e']])
     0    1
a  0.0  0.0
c  NaN  NaN
b  1.0  5.0
e  NaN  NaN
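Note that the join_axes argument was removed in newer pandas releases (1.0 and later); under that assumption, reindexing the concatenated result gives the same answer:

# Equivalent to the join_axes call above in newer pandas: concat first, then reindex
pd.concat([s1,s4],axis=1).reindex(['a','c','b','e'])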
# If you want to create a hierarchical index on the concatenation axis, use the keys argument
result=pd.concat([s1,s1,s3],keys=['one','two','three'])
result
one a 0 b 1 two a 0 b 1 three f 5 g 6 dtype: int64
result.unstack()
         a    b    f    g
one    0.0  1.0  NaN  NaN
two    0.0  1.0  NaN  NaN
three  NaN  NaN  5.0  6.0
# When combining Series along axis=1, the keys become the DataFrame column headers
pd.concat([s1,s2,s3],axis=1,keys=['one','two','three'])
   one  two  three
a  0.0  NaN    NaN
b  1.0  NaN    NaN
c  NaN  2.0    NaN
d  NaN  3.0    NaN
e  NaN  4.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0
# The same logic applies to DataFrame objects
df1=DataFrame(np.arange(6).reshape((3,2)),index=['a','b','c'],columns=['one','two'])
df2=DataFrame(5+np.arange(4).reshape((2,2)),index=['a','c'],columns=['three','four'])
pd.concat([df1,df2],axis=1,keys=['level1','level2'])
  level1     level2     
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
# If you pass a dict of objects instead of a list, the dict's keys will be used for the keys option
pd.concat({'level1':df1,'level2':df2},axis=1)
  level1     level2     
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
# There are also two additional arguments governing how the hierarchical index is created
pd.concat([df1,df2],axis=1,keys=['level1','level2'],names=['upper','lower'])
upper level1     level2     
lower    one two  three four
a          0   1    5.0  6.0
b          2   3    NaN  NaN
c          4   5    7.0  8.0
# Consider DataFrames in which the row index is not meaningful for the analysis
df1=DataFrame(np.random.randn(3,4),columns=['a','b','c','d'])
df2=DataFrame(np.random.randn(2,3),columns=['b','d','a'])
df1
          a         b         c         d
0 -0.016526  0.670045  1.730025  0.710404
1 -0.979142  0.475495  1.008500  0.108170
2  0.173088 -0.215524 -0.712417 -0.224916
df2
          b         d         a
0  1.169713 -0.018349  0.039875
1  0.657877  0.621230 -0.080208
# In this case, just pass ignore_index=True
pd.concat([df1,df2],ignore_index=True)
          a         b         c         d
0 -0.016526  0.670045  1.730025  0.710404
1 -0.979142  0.475495  1.008500  0.108170
2  0.173088 -0.215524 -0.712417 -0.224916
3  0.039875  1.169713       NaN -0.018349
4 -0.080208  0.657877       NaN  0.621230

Combining Data with Overlap

Consider two data sets whose indexes overlap in full or in part.

a=Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index=['f','e','d','c','b','a'])
b=Series(np.arange(len(a)),dtype=np.float64,index=['f','e','d','c','b','a'])
b[-1]=np.nan
a
f NaN e 2.5 d NaN c 3.5 b 4.5 a NaN dtype: float64
b
f 0.0 e 1.0 d 2.0 c 3.0 b 4.0 a NaN dtype: float64
np.where(pd.isnull(a),b,a)
array([ 0. , 2.5, 2. , 3.5, 4.5, nan])
# Series has a combine_first method, which performs the equivalent of this operation plus data alignment
b[:-2].combine_first(a[2:])
a NaN b 4.5 c 3.0 d 2.0 e 1.0 f 0.0 dtype: float64
# With DataFrames, combine_first naturally does the same thing column by column, so you can think of it as 'patching' missing data in the calling object with data from the object you pass
df1=DataFrame({'a':[1.,np.nan,5.,np.nan],
               'b':[np.nan,2.,np.nan,6.],
               'c':range(2,18,4)})
df2=DataFrame({'a':[5.,4.,np.nan,3.,7.],
               'b':[np.nan,3.,4.,6.,8.]})
df1.combine_first(df2)
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN

Reshaping and Pivoting

There are a number of fundamental operations for rearranging tabular data. These are referred to as reshape or pivot operations.

Reshaping with Hierarchical Indexing

stack "rotates" the columns of the data into the rows; unstack "rotates" the rows into the columns.

data=DataFrame(np.arange(6).reshape((2,3)),index=pd.Index(['Ohio','Colorado'],name='state'),
               columns=pd.Index(['one','two','three'],name='number'))
data
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5
# Calling stack on this data pivots the columns into the rows, producing a Series
result=data.stack()
result
state number Ohio one 0 two 1 three 2 Colorado one 3 two 4 three 5 dtype: int32
# For a hierarchically indexed Series, you can rearrange it back into a DataFrame with unstack
result.unstack()
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5
# By default, stack and unstack operate on the innermost level; pass a level number or name to unstack a different level:
result.unstack(0)
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5
result.unstack('state')
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5
# Unstacking may introduce missing data if not all of the values in the level can be found in each of the subgroups:
s1=Series([0,1,2,3],index=['a','b','c','d'])
s2=Series([4,5,6],index=['c','d','e'])
data2=pd.concat([s1,s2],keys=['one','two'])
data2
one a 0 b 1 c 2 d 3 two c 4 d 5 e 6 dtype: int64
data2.unstack()
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0
# Stacking filters out missing data by default, so the operation is reversible
data2.unstack().stack()
one a 0.0 b 1.0 c 2.0 d 3.0 two c 4.0 d 5.0 e 6.0 dtype: float64
data2.unstack().stack(dropna=False)
one a 0.0 b 1.0 c 2.0 d 3.0 e NaN two a NaN b NaN c 4.0 d 5.0 e 6.0 dtype: float64
# When unstacking a DataFrame, the level unstacked becomes the lowest level in the result
df=DataFrame({'left':result,'right':result+5},columns=pd.Index(['left','right'],name='side'))
df
side             left  right
state    number             
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10
df.unstack('state')
side    left           right         
state   Ohio Colorado   Ohio Colorado
number                               
one        0        3      5        8
two        1        4      6        9
three      2        5      7       10
df.unstack('state').stack('side')
state         Ohio  Colorado
number side                 
one    left      0         3
       right     5         8
two    left      1         4
       right     6         9
three  left      2         5
       right     7        10

Pivoting "Long" to "Wide" Format

Time series data is commonly stored in databases or CSV files in so-called "long" or "stacked" format.

# Because of problems with the book's data source, only the code is shown here, not its output
ldata[:10]  # time series data in long format, with columns data, item, value
pivoted=ldata.pivot('data','item','value')
# The first two values passed are the columns to use as the row and column index, and the last is an optional value column with which to fill the DataFrame
pivoted.head()
ldata['value2']=np.random.randn(len(ldata))
# If you omit the last argument, you get a DataFrame with hierarchical columns
pivoted=ldata.pivot('data','item')
pivoted['value'][:5]
# pivot is really just a shortcut: create a hierarchical index with set_index, then reshape with unstack
unstacked=ldata.set_index(['data','item']).unstack('item')
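Since the book's ldata is unavailable here, a small self-contained frame (with made-up column names and values, purely for illustration) is enough to verify the pivot / set_index + unstack equivalence; note that newer pandas requires keyword arguments for pivot:

# A minimal, self-contained sketch with hypothetical long-format data
ldata=DataFrame({'date':pd.to_datetime(['2000-03-31']*3+['2000-06-30']*3),
                 'item':['realgdp','infl','unemp']*2,
                 'value':[2710.349,0.0,5.8,2778.801,2.34,5.1]})
pivoted=ldata.pivot(index='date',columns='item',values='value')   # long -> wide
unstacked=ldata.set_index(['date','item'])['value'].unstack('item')
unstacked.equals(pivoted)  # True: both spell the same reshaping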

Data Transformation

Removing Duplicates

data=DataFrame({'k1':['one']*3+['two']*4,'k2':[1,1,2,3,3,4,4]})
data
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4
# The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate
data.duplicated()
0 False 1 True 2 False 3 False 4 True 5 False 6 True dtype: bool
# Relatedly, drop_duplicates returns a DataFrame with the duplicate rows removed
data.drop_duplicates()
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4
# Both of these methods consider all of the columns by default; you can also specify a subset of columns to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates based only on the k1 column
data['v1']=range(7)
data.drop_duplicates(['k1'])
    k1  k2  v1
0  one   1   0
3  two   3   3
# duplicated and drop_duplicates keep the first observed value combination by default. Passing keep='last' keeps the last one instead
data.drop_duplicates(['k1','k2'],keep='last')
    k1  k2  v1
1  one   1   1
2  one   2   2
4  two   3   4
6  two   4   6

Transforming Data Using a Function or Mapping

from pandas import DataFrame,Series
import pandas as pd
import numpy as np
data=DataFrame({'food':['bacon','pulled pork','bacon','Pastrami','corned beef','Bacon','pastrami','honey ham','nova lox'],
                'ounces':[4,3,12,6,7.5,8,3,5,6]})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
meat_to_animal={'bacon':'pig','pulled pork':'pig','pastrami':'cow'
                ,'corned beef':'cow','honey ham':'pig','nova lox':'salmon'}
# The Series map method accepts a function or a dict-like object containing a mapping
data['animal']=data['food'].map(str.lower).map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
# We could also have passed a single function that does all the work
data['food'].map(lambda x:meat_to_animal[x.lower()])
0 pig 1 pig 2 pig 3 cow 4 cow 5 pig 6 cow 7 pig 8 salmon Name: food, dtype: object
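As a side note, if the food column could contain missing values, mapping with the plain Python function str.lower would raise an error on NaN; the vectorized string method passes NaN through instead. A small sketch of that variant:

# Vectorized variant that tolerates missing food names
data['food'].str.lower().map(meat_to_animal)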

Replacing Values

data=Series([1.,-999.,2.,-999.,-1000.,3.])
data
0 1.0 1 -999.0 2 2.0 3 -999.0 4 -1000.0 5 3.0 dtype: float64
data.replace(-999,np.nan)
0 1.0 1 NaN 2 2.0 3 NaN 4 -1000.0 5 3.0 dtype: float64
# To replace multiple values at once, pass a list of values to replace followed by the replacement value
data.replace([-999,-1000],np.nan)
0 1.0 1 NaN 2 2.0 3 NaN 4 NaN 5 3.0 dtype: float64
# To use a different replacement for each value, pass a list of substitutes
data.replace([-999,-1000],[np.nan,0])
0 1.0 1 NaN 2 2.0 3 NaN 4 0.0 5 3.0 dtype: float64
# The argument passed can also be a dict
data.replace({-999:np.nan,-1000:0})
0 1.0 1 NaN 2 2.0 3 NaN 4 0.0 5 3.0 dtype: float64

Renaming Axis Indexes

# Like values in a Series, axis labels can be transformed by a function or mapping to produce new, differently labeled objects. The axes can also be modified in place without creating a new data structure
data=DataFrame(np.arange(12).reshape((3,4)),index=['Ohio','Colorado','New York'],columns=['one','two','three','four'])
# Like a Series, the axis indexes have a map method
data.index.map(str.upper)
array(['OHIO', 'COLORADO', 'NEW YORK'], dtype=object)
# You can assign the result to index, modifying the DataFrame in place
data.index=data.index.map(str.upper)
data
          one  two  three  four
OHIO        0    1      2     3
COLORADO    4    5      6     7
NEW YORK    8    9     10    11
# If you want to create a transformed version of a data set without modifying the original, a useful method is rename
data.rename(index=str.title,columns=str.upper)
          ONE  TWO  THREE  FOUR
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
# rename can be used in conjunction with a dict-like object, providing new values for a subset of the axis labels
data.rename(index={'OHIO':'INDIANA'},columns={'three':'peekaboo'})
          one  two  peekaboo  four
INDIANA     0    1         2     3
COLORADO    4    5         6     7
NEW YORK    8    9        10    11
# rename saves you from having to copy the DataFrame manually and assign new values to its index and columns attributes. To modify a data set in place, pass inplace=True
# (note: the book states rename always returns a reference to the DataFrame; in recent pandas, inplace=True makes it return None)
_=data.rename(index={'OHIO':'INDIANA'},inplace=True)
data
          one  two  three  four
INDIANA     0    1      2     3
COLORADO    4    5      6     7
NEW YORK    8    9     10    11

Discretization and Binning

ages=[20,22,25,27,21,23,37,31,61,45,41,32]
bins=[18,25,35,60,100]
cats=pd.cut(ages,bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, object): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
# pandas returns a special Categorical object. You can treat it like an array of strings indicating the bin names; internally it contains a
# categories array of the distinct category names along with a codes attribute labeling the ages
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
Index(['(18, 25]', '(25, 35]', '(35, 60]', '(60, 100]'], dtype='object')
pd.value_counts(cats)
(18, 25] 5 (35, 60] 3 (25, 35] 3 (60, 100] 1 dtype: int64
# Consistent with mathematical notation for intervals, a parenthesis means the side is open and the square bracket means it is closed (inclusive). Which side is closed can be changed by passing right=False
pd.cut(ages,[18,26,36,61,100],right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, object): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
# You can also pass your own bin names by setting the labels option to a list or array
group_names=['Youth','YoungAdult','MiddleAged','Senior']
pd.cut(ages,bins,labels=group_names)
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
# If you pass cut an integer number of bins instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data.
data=np.random.randn(20)
pd.cut(data,4,precision=2)
[(-1.31, -0.47], (-1.31, -0.47], (-0.47, 0.37], (0.37, 1.21], (1.21, 2.049], ..., (-0.47, 0.37], (-1.31, -0.47], (0.37, 1.21], (-0.47, 0.37], (-0.47, 0.37]]
Length: 20
Categories (4, object): [(-1.31, -0.47] < (-0.47, 0.37] < (0.37, 1.21] < (1.21, 2.049]]
# qcut is closely related to cut but bins the data based on sample quantiles. Depending on the distribution of the data, cut will not usually
# result in each bin having the same number of data points, whereas qcut, by using sample quantiles, produces bins of roughly equal size
data=np.random.randn(1000)  # normally distributed
cats=pd.qcut(data,4)  # cut into quartiles
cats
[[-3.528, -0.738], (0.623, 3.469], (0.623, 3.469], (-0.092, 0.623], [-3.528, -0.738], ..., [-3.528, -0.738], (0.623, 3.469], (0.623, 3.469], [-3.528, -0.738], (0.623, 3.469]]
Length: 1000
Categories (4, object): [[-3.528, -0.738] < (-0.738, -0.092] < (-0.092, 0.623] < (0.623, 3.469]]
pd.value_counts(cats)
(0.623, 3.469] 250 (-0.092, 0.623] 250 (-0.738, -0.092] 250 [-3.528, -0.738] 250 dtype: int64
# As with cut, you can pass your own quantiles
pd.qcut(data,[0,0.1,0.5,0.9,1.])
[[-3.528, -1.309], (-0.092, 1.298], (-0.092, 1.298], (-0.092, 1.298], (-1.309, -0.092], ..., [-3.528, -1.309], (1.298, 3.469], (1.298, 3.469], (-1.309, -0.092], (-0.092, 1.298]]
Length: 1000
Categories (4, object): [[-3.528, -1.309] < (-1.309, -0.092] < (-0.092, 1.298] < (1.298, 3.469]]

Detecting and Filtering Outliers

Outliers are also called isolated or extreme points; filtering or transforming them is largely a matter of applying array operations.

import numpy as np
from pandas import DataFrame,Series
import pandas as pd
np.random.seed(12345)
data=DataFrame(np.random.randn(1000,4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.067684     0.067924     0.025598    -0.002298
std       0.998035     0.992106     1.006835     0.996794
min      -3.428254    -3.548824    -3.184377    -3.745356
25%      -0.774890    -0.591841    -0.641675    -0.644144
50%      -0.116401     0.101143     0.002073    -0.013611
75%       0.616366     0.780282     0.680391     0.654328
max       3.366626     2.653656     3.260383     3.927528
col=data[3]
col[np.abs(col)>3]
97 3.927528 305 -3.399312 400 -3.745356 Name: 3, dtype: float64
# To select all rows having a value exceeding 3 or -3 in any column, use a boolean DataFrame together with the any method
data[(np.abs(data)>3).any(1)]
            0         1         2         3
5   -0.539741  0.476985  3.248944 -1.021228
97  -0.774363  0.552936  0.106061  3.927528
102 -0.655054 -0.565230  3.176873  0.959533
305 -2.315555  0.457246 -0.025907 -3.399312
324  0.050188  1.951312  3.260383  0.963301
400  0.146326  0.508391 -0.196713 -3.745356
499 -0.293333 -0.242459 -3.056990  1.918403
523 -3.428254 -0.296336 -0.439938 -0.867165
586  0.275144  1.179227 -3.184377  1.369891
808 -0.362528 -3.548824  1.553205 -2.186301
900  3.366626 -2.372214  0.851010  1.332846
# Cap the values so they fall within the interval -3 to 3
data[np.abs(data)>3]=np.sign(data)*3  # np.sign is a ufunc returning an array of 1 and -1 depending on the sign of the original values
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.067623     0.068473     0.025153    -0.002081
std       0.995485     0.990253     1.003977     0.989736
min      -3.000000    -3.000000    -3.000000    -3.000000
25%      -0.774890    -0.591841    -0.641675    -0.644144
50%      -0.116401     0.101143     0.002073    -0.013611
75%       0.616366     0.780282     0.680391     0.654328
max       3.000000     2.653656     3.000000     3.000000
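The same capping can also be written more compactly with DataFrame.clip; a small sketch, assuming the same data as above:

# Equivalent capping: values below -3 become -3, values above 3 become 3
capped=data.clip(lower=-3,upper=3)
capped.describe().loc[['min','max']]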

Permutation and Random Sampling

Permuting (randomly reordering) the rows of a Series or DataFrame is easy to do using the numpy.random.permutation function.

df=DataFrame(np.arange(5*4).reshape(5,4))
sampler=np.random.permutation(5)
sampler
array([1, 0, 2, 3, 4])
# That array can then be used in ix-based indexing (loc/iloc in newer pandas) or with the take function
df
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
df.take(sampler)
    0   1   2   3
1   4   5   6   7
0   0   1   2   3
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
df.take(np.random.permutation(len(df))[:3])
    0   1   2   3
1   4   5   6   7
3  12  13  14  15
4  16  17  18  19
# To generate a sample with replacement, the fastest way is to use np.random.randint to draw random integers
bag=np.array([5,7,-1,6,4])
sampler=np.random.randint(0,len(bag),size=10)
sampler
array([4, 4, 2, 2, 2, 0, 3, 0, 4, 1])
draws=bag.take(sampler)
draws
array([ 4, 4, -1, -1, -1, 5, 6, 5, 4, 7])
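Newer pandas also ships a built-in sample method that covers both cases; a short sketch reusing df and bag from above:

# Sample 3 rows without replacement (like take(permutation(len(df))[:3]))
df.sample(n=3)
# Ten draws with replacement (like bag.take(randint(0,len(bag),10)))
Series(bag).sample(n=10,replace=True)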

Computing Indicator / Dummy Variables

df=DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})
pd.get_dummies(df['key'])
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
# You may want to add a prefix to the columns of the indicator DataFrame so they can be merged with other data; get_dummies has a prefix argument for this
dummies=pd.get_dummies(df['key'],prefix='key')
df_with_dummy=df[['data1']].join(dummies)
df_with_dummy
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
mnames=['movie_id','title','genres']
movies=pd.read_table('ch02/movielens/movies.dat',sep='::',header=None,names=mnames,engine='python')
movies[:10]
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller
genre_iter=(set(x.split('|')) for x in movies.genres)
genres=sorted(set.union(*genre_iter))
dummies=DataFrame(np.zeros((len(movies),len(genres))),columns=genres)
# Iterate over each movie and set the entries of the corresponding row of dummies to 1
for i,gen in enumerate(movies.genres):
    dummies.loc[i,gen.split('|')]=1  # .ix was removed in newer pandas; .loc does the same thing here
# Then join this with movies
movies_windic=movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Action                                   0
Genre_Adventure                                0
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Crime                                    0
Genre_Documentary                              0
Genre_Drama                                    0
Genre_Fantasy                                  0
Genre_Film-Noir                                0
Genre_Horror                                   0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Romance                                  0
Genre_Sci-Fi                                   0
Genre_Thriller                                 0
Genre_War                                      0
Genre_Western                                  0
Name: 0, dtype: object
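For a delimiter-separated column like genres, the whole indicator matrix can also be built in one call with the vectorized Series.str.get_dummies; a sketch on the same movies frame:

# Split on '|' and build the 0/1 genre matrix directly
genre_dummies=movies.genres.str.get_dummies('|')
movies_windic2=movies.join(genre_dummies.add_prefix('Genre_'))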
values=np.random.rand(10)
values
array([ 0.75603383, 0.90830844, 0.96588737, 0.17373658, 0.87592824, 0.75415641, 0.163486 , 0.23784062, 0.85564381, 0.58743194])
bins=[0,0.2,0.4,0.6,0.8,1]
pd.get_dummies(pd.cut(values,bins))
   (0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1]
0         0           0           0           1         0
1         0           0           0           0         1
2         0           0           0           0         1
3         1           0           0           0         0
4         0           0           0           0         1
5         0           0           0           1         0
6         1           0           0           0         0
7         0           1           0           0         0
8         0           0           0           0         1
9         0           0           1           0         0

String Manipulation

String Object Methods

val='a,b, guido'
val.split(',')
['a', 'b', ' guido']
pieces=[x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
first,second,third=pieces
first+'::'+second+'::'+third
'a::b::guido'
'::'.join(pieces)
'a::b::guido'
'guido' in val
True
val.index(',')
1
val.find(':')
-1
# Note the difference between find and index: index raises an exception if the string is not found (instead of returning -1)
val.index(':')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
      1 # Note the difference between find and index ...
----> 2 val.index(':')

ValueError: substring not found
# count returns the number of occurrences of a particular substring:
val.count(',')
2
# replace substitutes occurrences of one pattern for another; it is also commonly used to delete patterns by passing an empty string
val.replace(',','::')
'a::b:: guido'
val.replace(',','')
'ab guido'

Regular Expressions

# The functions in the re module fall into three categories: pattern matching, substitution, and splitting
import re
text="foo  bar\t vaz  \tqux"
re.split(r'\s+',text)
['foo', 'bar', 'vaz', 'qux']
regex=re.compile(r'\s+')
regex.split(text)
['foo', 'bar', 'vaz', 'qux']
# If instead you wanted to get a list of all patterns matching the regex, use the findall method:
regex.findall(text)
['  ', '\t ', '  \t']
text="""Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com"""
pattern=r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex=re.compile(pattern,flags=re.IGNORECASE)
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
# search returns only the first email address in the text (as a special match object)
m=regex.search(text)
m
text[m.start():m.end()]
'dave@google.com'
# regex.match returns None, because it will only match if the pattern occurs at the start of the string
print(regex.match(text))
None
# sub replaces occurrences of the pattern with the specified string, returning the new string
print(regex.sub('REDACTED',text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
# Segment an email address into its three components
pattern=r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex=re.compile(pattern,flags=re.IGNORECASE)
# A match object produced by this regex returns a tuple of the pattern components with its groups method
m=regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
# findall returns a list of tuples when the pattern has groups
regex.findall(text)
[('dave', 'google', 'com'), ('steve', 'gmail', 'com'), ('rob', 'gmail', 'com'), ('ryan', 'yahoo', 'com')]
# sub also has access to the groups in each match using special symbols like \1 and \2
print(regex.sub(r'Username:\1,Domain:\2,Suffix:\3',text))
Dave Username:dave,Domain:google,Suffix:com
Steve Username:steve,Domain:gmail,Suffix:com
Rob Username:rob,Domain:gmail,Suffix:com
Ryan Username:ryan,Domain:yahoo,Suffix:com
# A small variation on the regex: give each match group a name
regex=re.compile(r'''
(?P<username>[A-Z0-9._%+-]+)
@
(?P<domin>[A-Z0-9.-]+)
\.
(?P<suffix>[A-Z]{2,4})
''',flags=re.IGNORECASE|re.VERBOSE)
# A match object produced by such a regex can give you a handy dict of the named groups with its groupdict method
m=regex.match('wesm@bright.net')
m.groupdict()
{'domin': 'bright', 'suffix': 'net', 'username': 'wesm'}

Vectorized String Functions in pandas

data={'Dave':'dave@google.com','Steve':'steve@gmail.com',
      'Rob':'rob@gmial.com','Wes':np.nan}
data=Series(data)
data
Dave dave@google.com Rob rob@gmial.com Steve steve@gmail.com Wes NaN dtype: object
data.isnull()
Dave False Rob False Steve False Wes True dtype: bool
# With data.map, all string and regular expression methods can be applied to each value (passing a lambda
# or another function), but they will fail on NA values. To cope with this, Series has string operations
# that skip NA values; they are accessed through the Series str attribute
# For example, check whether each email address contains 'gmail' with str.contains
data.str.contains('gmail')
Dave False Rob False Steve True Wes NaN dtype: object
# Regular expressions can be used too
pattern
'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
data.str.findall(pattern,flags=re.IGNORECASE)
Dave [(dave, google, com)] Rob [(rob, gmial, com)] Steve [(steve, gmail, com)] Wes NaN dtype: object
# There are a couple of ways to do vectorized element retrieval: either use str.get or index into the str attribute
matches=data.str.match(pattern,flags=re.IGNORECASE)
matches
C:\Users\ZJL\AppData\Local\Programs\Python\Python35\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: In future versions of pandas, match will change to always return a bool indexer.
Dave     (dave, google, com)
Rob        (rob, gmial, com)
Steve    (steve, gmail, com)
Wes                      NaN
dtype: object
matches.str.get(1)
Dave google Rob gmial Steve gmail Wes NaN dtype: object
matches.str[0]
Dave dave Rob rob Steve steve Wes NaN dtype: object
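As the FutureWarning above indicates, str.match returns booleans in newer pandas; there, str.extract is the usual way to pull the groups out as columns. A sketch with the same pattern:

# Extract the regex groups into a DataFrame; rows with NaN stay NaN
parts=data.str.extract(pattern,flags=re.IGNORECASE)
parts[1]  # the domain component, analogous to matches.str.get(1)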
# You can slice strings with this syntax
data.str[:5]
Dave dave@ Rob rob@g Steve steve Wes NaN dtype: object

Example: USDA Food Database

import json
db=json.load(open('ch07/foods-2011-10-03.json'))
len(db)
6636
# Each entry in db is a dict containing all the data for a single food. The nutrients field is a list of dicts,
# one for each nutrient
db[0].keys()
dict_keys(['description', 'manufacturer', 'nutrients', 'group', 'id', 'portions', 'tags'])
db[0]['nutrients'][0]
{'description': 'Protein', 'group': 'Composition', 'units': 'g', 'value': 25.18}
nutrients=DataFrame(db[0]['nutrients'])
nutrients[:7]
                   description        group units    value
0                      Protein  Composition     g    25.18
1            Total lipid (fat)  Composition     g    29.20
2  Carbohydrate, by difference  Composition     g     3.06
3                          Ash        Other     g     3.28
4                       Energy       Energy  kcal   376.00
5                        Water  Composition     g    39.28
6                       Energy       Energy    kJ  1573.00
# When converting a list of dicts to a DataFrame, we can specify a subset of fields to extract.
# Here we take the food's name, group, id, and manufacturer
info_keys=['description','group','id','manufacturer']
info=DataFrame(db,columns=info_keys)
info[:5]
                          description                   group    id manufacturer
0                     Cheese, caraway  Dairy and Egg Products  1008             
1                     Cheese, cheddar  Dairy and Egg Products  1009             
2                        Cheese, edam  Dairy and Egg Products  1018             
3                        Cheese, feta  Dairy and Egg Products  1019             
4  Cheese, mozzarella, part skim milk  Dairy and Egg Products  1028             
pd.value_counts(info.group)[:10]
Vegetables and Vegetable Products    812
Beef Products                        618
Baked Products                       496
Breakfast Cereals                    403
Fast Foods                           365
Legumes and Legume Products          365
Lamb, Veal, and Game Products        345
Sweets                               341
Fruits and Fruit Juices              328
Pork Products                        328
Name: group, dtype: int64
# Now assemble the nutrients for every food into a single large table
nutrients=[]
for rec in db:
    fnuts=DataFrame(rec['nutrients'])
    fnuts['id']=rec['id']
    nutrients.append((fnuts))
nutrients=pd.concat(nutrients,ignore_index=True)    
nutrients.head()
                   description        group units   value    id
0                      Protein  Composition     g   25.18  1008
1            Total lipid (fat)  Composition     g   29.20  1008
2  Carbohydrate, by difference  Composition     g    3.06  1008
3                          Ash        Other     g    3.28  1008
4                       Energy       Energy  kcal  376.00  1008
# Drop the duplicate rows
nutrients.duplicated().sum()
14179
nutrients=nutrients.drop_duplicates()
# Since both DataFrames have 'group' and 'description' columns, we rename them for clarity
col_mapping={'description':'food','group':'fgroup'}
info=info.rename(columns=col_mapping,copy=False)
info.head()
                                 food                  fgroup    id manufacturer
0                     Cheese, caraway  Dairy and Egg Products  1008             
1                     Cheese, cheddar  Dairy and Egg Products  1009             
2                        Cheese, edam  Dairy and Egg Products  1018             
3                        Cheese, feta  Dairy and Egg Products  1019             
4  Cheese, mozzarella, part skim milk  Dairy and Egg Products  1028             
col_mapping={'description':'nutrient','group':'nutgroup'}
nutrients=nutrients.rename(columns=col_mapping,copy=False)
# Merge info and nutrients
ndata=pd.merge(nutrients,info,on='id',how='outer')
ndata.head()
                      nutrient     nutgroup units   value    id             food                  fgroup manufacturer
0                      Protein  Composition     g   25.18  1008  Cheese, caraway  Dairy and Egg Products             
1            Total lipid (fat)  Composition     g   29.20  1008  Cheese, caraway  Dairy and Egg Products             
2  Carbohydrate, by difference  Composition     g    3.06  1008  Cheese, caraway  Dairy and Egg Products             
3                          Ash        Other     g    3.28  1008  Cheese, caraway  Dairy and Egg Products             
4                       Energy       Energy  kcal  376.00  1008  Cheese, caraway  Dairy and Egg Products             
# Plot a chart of median values by food group and nutrient type
import matplotlib.pyplot as plt
result=ndata.groupby(['nutrient','fgroup'])['value'].quantile(0.5)
result['Energy'].sort_values().plot(kind='barh')
plt.show()

Figure: median values by food group, here for the Energy nutrient
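Since this chapter is about reshaping, the same table of medians can also be built with pivot_table; a sketch on the merged ndata:

# Equivalent median table via pivot_table, then the same horizontal bar plot
medians=ndata.pivot_table(values='value',index='fgroup',columns='nutrient',aggfunc='median')
medians['Energy'].sort_values().plot(kind='barh')
plt.show()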

# Find which food is most dense in each nutrient
by_nutrient=ndata.groupby(['nutgroup','nutrient'])
get_maximum=lambda x:x.xs(x.value.idxmax())
get_minimum=lambda x:x.xs(x.value.idxmin())
max_foods=by_nutrient.apply(get_maximum)[['value','food']]
# Truncate the food names so the column is a bit smaller
max_foods.food=max_foods.food.str[:50]
max_foods.loc['Amino Acids']['food']
nutrient
Alanine                           Gelatins, dry powder, unsweetened
Arginine                               Seeds, sesame flour, low-fat
Aspartic acid                                   Soy protein isolate
Cystine                Seeds, cottonseed flour, low fat (glandless)
Glutamic acid                                   Soy protein isolate
Glycine                           Gelatins, dry powder, unsweetened
Histidine                Whale, beluga, meat, dried (Alaska Native)
Hydroxyproline    KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINA...
Isoleucine        Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Leucine           Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Lysine            Seal, bearded (Oogruk), meat, dried (Alaska Na...
Methionine                    Fish, cod, Atlantic, dried and salted
Phenylalanine     Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Proline                           Gelatins, dry powder, unsweetened
Serine            Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Threonine         Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Tryptophan         Sea lion, Steller, meat with fat (Alaska Native)
Tyrosine          Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Valine            Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Name: food, dtype: object