1. Combining data with overlap
You may have two datasets whose indexes overlap in full or in part. NumPy's where function, which expresses a vectorized if-else, is one way to combine them:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
a=Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index=['f','e','d','c','b','a'])
b=Series(np.arange(len(a),dtype=np.float64),index=['f','e','d','c','b','a'])
b.iloc[-1]=np.nan
print(a)
print(b)
print(np.where(pd.isnull(a),b,a))
Output:
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
f 0
e 1
d 2
c 3
b 4
a NaN
dtype: float64
[ 0. 2.5 2. 3.5 4.5 nan]
Series has a combine_first method, which performs the equivalent of this operation and also aligns the data:
print(b[:-2].combine_first(a[2:]))
Output:
a NaN
b 4.5
c 3.0
d 2.0
e 1.0
f 0.0
dtype: float64
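Note the difference between the two approaches: the np.where call above is purely positional, while combine_first aligns by index label before patching. A small sketch with made-up data makes this visible:

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5], index=['x', 'y'])
b = pd.Series([1.0, 9.0], index=['y', 'x'])  # same labels, different order

# np.where ignores labels: position 0 of a is NaN, so it takes b's position 0
positional = np.where(pd.isnull(a), b, a)

# combine_first aligns label 'x' with 'x' first, then fills the hole
aligned = a.combine_first(b)
```

Here `positional` is `[1.0, 2.5]` (wrong label paired in), while `aligned` correctly gives `x: 9.0, y: 2.5`.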
With DataFrames, combine_first naturally does the same thing column by column, so you can think of it as "patching" the missing data in the calling object with data from the object you pass in:
df1=DataFrame({'a':[1,np.nan,5,np.nan],
'b':[np.nan,2,np.nan,6],
'c':range(2,18,4)})
df2=DataFrame({'a':[5,4,np.nan,3,7],
'b':[np.nan,3,4,6,8]})
print(df1)
print(df2)
print(df1.combine_first(df2))
Output:
a b c
0 1 NaN 2
1 NaN 2 6
2 5 NaN 10
3 NaN 6 14
a b
0 5 NaN
1 4 3
2 NaN 4
3 3 6
4 7 8
a b c
0 1 NaN 2
1 4 2 6
2 5 4 10
3 3 6 14
4 7 8 NaN
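DataFrame.fillna also accepts a DataFrame and patches holes by aligned label, but unlike combine_first it keeps only the caller's rows and columns rather than taking their union. A minimal sketch with made-up frames:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 2.0]})
df2 = pd.DataFrame({'a': [9.0, 9.0], 'c': [7.0, 7.0]})

patched = df1.combine_first(df2)  # columns a, b, c: union of both frames
filled = df1.fillna(df2)          # columns a, b only: df1's shape is kept
```

Both fill the NaN in df1['a'] from df2, but only combine_first brings in the extra column 'c'.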
2. Reshaping with hierarchical indexing
There are two primary actions: a. stack: "rotates" the columns in the data into the rows; b. unstack: "rotates" the rows into the columns.
data=DataFrame(np.arange(6).reshape((2,3)),index=pd.Index(['Ohio','Colorado'],name='state'),
columns=pd.Index(['one','two','three'],name='number'))
print(data)
Output:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
Using the stack method on this data pivots the columns into the rows, producing a Series:
result=data.stack()
print(result)
Output:
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with unstack:
print(result.unstack())
Output:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
By default, stack and unstack operate on the innermost level. You can operate on a different level by passing a level number or name:
print(result.unstack(0))
print(result.unstack('state'))
Output:
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
Unstacking may introduce missing data if not all of the level's values are found in each subgroup:
s1=Series([0,1,2,3],index=['a','b','c','d'])
s2=Series([4,5,6],index=['c','d','e'])
data2=pd.concat([s1,s2],keys=['one','two'])
print(data2)
print(data2.unstack())
Output:
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
a b c d e
one 0 1 2 3 NaN
two NaN NaN 4 5 6
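If the NaNs introduced this way are unwanted, unstack accepts a fill_value argument that fills the holes during the reshape itself. A sketch reusing the same data:

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])

# positions absent from a subgroup get 0 instead of NaN
wide = data2.unstack(fill_value=0)
```

This also keeps the result as integers, since no NaN forces an upcast to float.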
stack filters out missing data by default, so the operation is invertible:
print(data2.unstack().stack())
print(data2.unstack().stack(dropna=False))
Output:
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: float64
one a 0
b 1
c 2
d 3
e NaN
two a NaN
b NaN
c 4
d 5
e 6
dtype: float64
When you unstack a DataFrame, the level unstacked becomes the lowest level in the resulting columns:
df=DataFrame({'left':result,'right':result+5},
columns=pd.Index(['left','right'],name='side'))
print(result)
print(df)
Output:
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
side left right
state number
Ohio one 0 5
two 1 6
three 2 7
Colorado one 3 8
two 4 9
three 5 10
print(df.unstack('state'))
print(df.unstack('state').stack('side'))
Output:
side left right
state Ohio Colorado Ohio Colorado
number
one 0 3 5 8
two 1 4 6 9
three 2 5 7 10
state Ohio Colorado
number side
one left 0 3
right 5 8
two left 1 4
right 6 9
three left 2 5
right 7 10
3. Pivoting "long" format to "wide" format
Time series data is often stored in databases and CSV files in so-called "long" or "stacked" format:
ldata=pd.read_excel('data/data.xlsx')
print(ldata)
pivoted=ldata.pivot(index='date',columns='item',values='value')
print(pivoted)
Output:
date item value
0 1959-03-31 realgdp 2710.349
1 1959-03-31 infl 0.000
2 1959-03-31 unemp 5.800
3 1959-06-30 realgdp 2778.801
4 1959-06-30 infl 2.340
5 1959-06-30 unemp 5.100
6 1959-09-30 realgdp 2775.488
7 1959-09-30 infl 2.740
8 1959-09-30 unemp 5.300
9 1959-12-31 realgdp 2785.204
item infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3
1959-12-31 NaN 2785.204 NaN
The first two arguments are the columns to be used as the row and column index, and the last is an optional column with which to fill the DataFrame. Suppose you had two value columns that you wanted to reshape simultaneously:
ldata['value2']=np.random.randn(len(ldata))
print(ldata)
Output:
date item value value2
0 1959-03-31 realgdp 2710.349 -1.637471
1 1959-03-31 infl 0.000 1.396857
2 1959-03-31 unemp 5.800 -0.583512
3 1959-06-30 realgdp 2778.801 -0.556246
4 1959-06-30 infl 2.340 -2.283884
5 1959-06-30 unemp 5.100 0.822694
6 1959-09-30 realgdp 2775.488 -1.878675
7 1959-09-30 infl 2.740 -0.143054
8 1959-09-30 unemp 5.300 -0.497226
9 1959-12-31 realgdp 2785.204 0.297912
By omitting the last argument, you obtain a DataFrame with hierarchical columns:
pivoted=ldata.pivot(index='date',columns='item')
print(pivoted)
Output:
value value2
item infl realgdp unemp infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8 1.279229 -0.268940 -1.257790
1959-06-30 2.34 2778.801 5.1 0.001542 -0.327840 0.048699
1959-09-30 2.74 2775.488 5.3 0.672693 1.057676 -0.144023
1959-12-31 NaN 2785.204 NaN NaN 0.376359 NaN
Note that pivot is just a shortcut for creating a hierarchical index with set_index and then reshaping with unstack:
unstacked=ldata.set_index(['date','item']).unstack('item')
print(unstacked)
Output:
value value2
item infl realgdp unemp infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8 0.063869 0.968893 0.265408
1959-06-30 2.34 2778.801 5.1 0.996808 -1.172076 -0.714991
1959-09-30 2.74 2775.488 5.3 1.418273 -0.216747 0.741781
1959-12-31 NaN 2785.204 NaN NaN 1.526784 NaN
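As a sanity check, the equivalence can be demonstrated on a small inline frame (made-up values, since the Excel file above is not reproduced here):

```python
import pandas as pd

ldata = pd.DataFrame({
    'date': ['2000-01-01', '2000-01-01', '2000-01-02', '2000-01-02'],
    'item': ['infl', 'unemp', 'infl', 'unemp'],
    'value': [0.0, 5.8, 2.3, 5.1],
})

# the pivot call and the set_index + unstack spelling produce identical frames
via_pivot = ldata.pivot(index='date', columns='item', values='value')
via_unstack = ldata.set_index(['date', 'item'])['value'].unstack('item')
assert via_pivot.equals(via_unstack)
```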
4. Removing duplicates
The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate of one seen earlier:
data3=DataFrame({'k1':['one']*3+['two']*4,
'k2':[1,1,2,3,3,4,4]})
print(data3)
print(data3.duplicated())
Output:
k1 k2
0 one 1
1 one 1
2 one 2
3 two 3
4 two 3
5 two 4
6 two 4
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
Relatedly, drop_duplicates returns a DataFrame with the duplicated rows removed:
print(data3.drop_duplicates())
Output:
k1 k2
0 one 1
2 one 2
3 two 3
5 two 4
Both methods consider all of the columns by default; alternatively, you can specify a subset of them to use for detecting duplicates:
data3['v1']=range(7)
print(data3)
print(data3.drop_duplicates(['k1']))
Output:
k1 k2 v1
0 one 1 0
1 one 1 1
2 one 2 2
3 two 3 3
4 two 3 4
5 two 4 5
6 two 4 6
k1 k2 v1
0 one 1 0
3 two 3 3
duplicated and drop_duplicates keep the first observed value combination by default. Passing keep='last' keeps the last one instead:
print(data3.drop_duplicates(['k1','k2'],keep='last'))
Output:
k1 k2 v1
1 one 1 1
2 one 2 2
4 two 3 4
6 two 4 6
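Besides keep='first' and keep='last', passing keep=False drops every member of a duplicated group, keeping only the rows that were unique to begin with. A sketch on the same data:

```python
import pandas as pd

data3 = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                      'k2': [1, 1, 2, 3, 3, 4, 4]})

# keep=False discards all copies of any duplicated (k1, k2) combination;
# only ('one', 2) appears exactly once, so it is the sole survivor
unique_only = data3.drop_duplicates(['k1', 'k2'], keep=False)
```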
5. Transforming data using a function or mapping
The map method on a Series accepts a function or a dict-like object containing a mapping. There is one wrinkle here: some of the meat names are capitalized, so we need to convert each value to lowercase before mapping:
data4=DataFrame({'food':['bacon','pulled pork','bacon','Pastrami','corned beef','Bacon','pastrami','honey ham','nova lox'],
'ounces':[4,3,12,6,7.5,8,3,5,6]})
print(data4)
meat_to_animal={
'bacon':'pig',
'pulled pork':'pig',
'pastrami':'cow',
'corned beef':'cow',
'honey ham':'pig',
'nova lox':'salmon'
}
data4['animal']=data4['food'].map(str.lower).map(meat_to_animal)
print(data4)
Output:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
We could also have passed a function that does all the work at once:
print(data4['food'].map(lambda x:meat_to_animal[x.lower()]))
Output:
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
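A third option worth knowing: Series.str.lower is the vectorized string method, which (unlike mapping the built-in str.lower) propagates missing values instead of raising on them. A sketch on a shortened food list with a NaN added:

```python
import numpy as np
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow', 'nova lox': 'salmon'}
food = pd.Series(['bacon', 'Pastrami', 'nova lox', np.nan])

# .str.lower() skips the NaN; food.map(str.lower) would raise a TypeError on it
animal = food.str.lower().map(meat_to_animal)
```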
6. Replacing values
data5=Series([1,-999,2,-999,-1000,3])
print(data5.replace(-999,np.nan))
Output:
0 1
1 NaN
2 2
3 NaN
4 -1000
5 3
dtype: float64
To replace multiple values at once, pass a list:
print(data5.replace([-999,-1000],np.nan))
Output:
0 1
1 NaN
2 2
3 NaN
4 NaN
5 3
dtype: float64
To use a different replacement for each value, pass a list of substitutes:
print(data5.replace([-999,-1000],[np.nan,0]))
Output:
0 1
1 NaN
2 2
3 NaN
4 0
5 3
dtype: float64
The argument passed can also be a dict:
print(data5.replace({-999:np.nan,-1000:0}))
Output:
0 1
1 NaN
2 2
3 NaN
4 0
5 3
dtype: float64
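The same replace forms work on a whole DataFrame, applying across every column at once. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, -999], 'y': [-1000, 3]})

# dict keys are the values to find, dict values are their replacements,
# applied frame-wide rather than per column
cleaned = df.replace({-999: np.nan, -1000: 0})
```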
7. Renaming axis indexes
data6=DataFrame(np.arange(12).reshape((3,4)),
index=['Ohio','Colorado','New York'],
columns=['one','two','three','four'])
print(data6.index.map(str.upper))
Output:
['OHIO' 'COLORADO' 'NEW YORK']
You can assign the result to index, modifying the DataFrame in place:
data6.index=data6.index.map(str.upper)
print(data6)
Output:
one two three four
OHIO 0 1 2 3
COLORADO 4 5 6 7
NEW YORK 8 9 10 11
If you want to create a transformed version of the dataset without modifying the original, a useful method is rename:
print(data6.rename(index=str.title,columns=str.upper))
Output:
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
Notably, rename can be used with a dict-like object to update only a subset of the axis labels:
print(data6.rename(index={'OHIO':'INDIANA'},
columns={'three':'peekaboo'}))
Output:
one two peekaboo four
INDIANA 0 1 2 3
COLORADO 4 5 6 7
NEW YORK 8 9 10 11
rename saves you from copying the DataFrame manually and assigning new values to its index and columns attributes. To modify a dataset in place, pass inplace=True:
_=data6.rename(index={'OHIO':'INDIANA'},inplace=True)
print(data6)
Output:
one two three four
INDIANA 0 1 2 3
COLORADO 4 5 6 7
NEW YORK 8 9 10 11