1. Handling Missing Data
Missing data is common in many data-analysis tasks. One of pandas' goals is to make working with missing data as painless as possible; for example, all of the descriptive statistics on pandas objects exclude missing data by default. pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value, and it is easy to detect:
In [20]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
In [21]: string_data
Out[21]:
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
In [22]: string_data.isnull()
Out[22]:
0 False
1 False
2 True
3 False
dtype: bool
The built-in Python None value is also treated as NA in object arrays:
In [24]: string_data[0] = None
In [25]: string_data.isnull()
Out[25]:
0 True
1 False
2 True
3 False
dtype: bool
Filtering Out Missing Data
On a Series, dropna() returns a Series containing only the non-null data and its index values, for example:
In [26]: from numpy import nan as NA
In [27]: data = pd.Series([1, NA, 3.5, NA, 7])
In [28]: data.dropna()
Out[28]:
0 1.0
2 3.5
4 7.0
dtype: float64
This is equivalent to:
In [29]: data[data.notnull()]
Out[29]:
0 1.0
2 3.5
4 7.0
dtype: float64
On a DataFrame, dropna() by default drops any row containing a missing value:
In [30]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])
In [32]: cleaned = data.dropna()
In [33]: cleaned
Out[33]:
0 1 2
0 1.0 6.5 3.0
In [34]: data
Out[34]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
Passing how='all' will drop only the rows that are all NA:
In [35]: data.dropna(how = 'all')
Out[35]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
To drop columns in the same way, pass axis=1:
In [37]: data[4] = NA
In [38]: data
Out[38]:
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
In [39]: data.dropna(how='all',axis=1)
Out[39]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
You can also filter with the thresh argument: thresh=n keeps only the rows (or columns) that contain at least n non-NA values. For example, thresh=3 keeps a row only if it has at least 3 non-NA values.
In [43]: df = pd.DataFrame(np.random.randn(7, 3))
In [45]: df.iloc[:4,1] = NA
In [46]: df.iloc[:2,2] = NA
In [47]: df
Out[47]:
0 1 2
0 -1.609312 NaN NaN
1 0.241571 NaN NaN
2 0.764884 NaN -1.532268
3 -1.892363 NaN -0.423333
4 -1.984312 0.913495 -1.371564
5 -0.969473 -0.171819 0.418801
6 2.116447 0.831522 -1.888933
In [48]: df.dropna()
Out[48]:
0 1 2
4 -1.984312 0.913495 -1.371564
5 -0.969473 -0.171819 0.418801
6 2.116447 0.831522 -1.888933
In [49]: df.dropna(thresh=2)
Out[49]:
0 1 2
2 0.764884 NaN -1.532268
3 -1.892363 NaN -0.423333
4 -1.984312 0.913495 -1.371564
5 -0.969473 -0.171819 0.418801
6 2.116447 0.831522 -1.888933
In [50]: df.dropna(thresh=3)
Out[50]:
0 1 2
4 -1.984312 0.913495 -1.371564
5 -0.969473 -0.171819 0.418801
6 2.116447 0.831522 -1.888933
Filling In Missing Data
If you would rather fill in the holes than filter out missing data, use the fillna() function:
In [51]: df.fillna(0)
Out[51]:
0 1 2
0 -1.609312 0.000000 0.000000
1 0.241571 0.000000 0.000000
2 0.764884 0.000000 -1.532268
3 -1.892363 0.000000 -0.423333
4 -1.984312 0.913495 -1.371564
5 -0.969473 -0.171819 0.418801
6 2.116447 0.831522 -1.888933
Calling fillna with a dict lets you fill a different value for each column:
In [53]: df.fillna({1:20,2:600})
Out[53]:
0 1 2
0 -1.609312 20.000000 600.000000
1 0.241571 20.000000 600.000000
2 0.764884 20.000000 -1.532268
3 -1.892363 20.000000 -0.423333
4 -1.984312 0.913495 -1.371564
5 -0.969473 -0.171819 0.418801
6 2.116447 0.831522 -1.888933
Note
drop(), dropna(), and fillna() all return a new object (they do not modify the original). To modify the original object in place, pass inplace=True:
In [54]: df.fillna(0,inplace=True)
In [55]: df
Out[55]:
0 1 2
0 -1.609312 0.000000 0.000000
1 0.241571 0.000000 0.000000
2 0.764884 0.000000 -1.532268
3 -1.892363 0.000000 -0.423333
4 -1.984312 0.913495 -1.371564
5 -0.969473 -0.171819 0.418801
6 2.116447 0.831522 -1.888933
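Constants are not the only fill option: gaps can also be filled from neighboring values. A minimal sketch (not in the transcript above) using the ffill method, which propagates the last valid observation forward:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan, 4.0]})

# Forward fill: each NaN takes the last valid value above it.
filled = df['a'].ffill()
print(filled.tolist())  # [1.0, 1.0, 1.0, 4.0]

# limit caps how many consecutive NaNs get filled.
limited = df['a'].ffill(limit=1)
```

With limit=1, only the first of the two consecutive NaNs is filled; the second stays NaN.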
2. Data Transformation
DataFrame's duplicated method returns a boolean Series indicating whether each row is a duplicate (i.e., identical to a row observed earlier):
In [58]: data = pd.DataFrame({'k1':['one','two'] * 3 + ['two'],'k2':[1,1,2,3,3,4,4]})
In [59]: data
Out[59]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
In [60]: data.duplicated()
Out[60]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
The related drop_duplicates() method drops the duplicated rows:
In [63]: data.drop_duplicates()
Out[63]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
Both duplicated() and drop_duplicates() consider all of the columns by default. If you do not want that, pass a subset of columns to restrict the duplicate check, for example:
In [65]: data.duplicated(['k1'])
Out[65]:
0 False
1 False
2 True
3 True
4 True
5 True
6 True
dtype: bool
In [66]: data.drop_duplicates(['k1'])
Out[66]:
k1 k2
0 one 1
1 two 1
Both duplicated() and drop_duplicates() keep the first observed value by default; pass keep='last' to keep the last one instead:
In [69]: data.duplicated(keep='last')
Out[69]:
0 False
1 False
2 False
3 False
4 False
5 True
6 False
dtype: bool
In [63]: data.drop_duplicates()
Out[63]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
In [67]: data.drop_duplicates(keep='last')
Out[67]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
6 two 4
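The subset and keep arguments can be combined; a small sketch reusing the same data:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# Keep the *last* occurrence of each distinct 'k1' value.
result = data.drop_duplicates(['k1'], keep='last')
print(result)
```

This keeps row 4 (the last 'one') and row 6 (the last 'two'), dropping everything else.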
Transforming Data Using a Function or Mapping
In [70]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
...: 'Pastrami', 'corned beef', 'Bacon',
...: 'pastrami', 'honey ham', 'nova lox'],
...: 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
...:
In [71]: data
Out[71]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
Suppose you wanted to add a column indicating the type of animal each food came from. Let's first write a mapping from each meat type to its animal:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
The map method on a Series accepts a function or a dict-like object containing a mapping:
In [80]: data['animal'] = data['food'].map(meat_to_animal)
In [81]: data
Out[81]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 NaN
4 corned beef 7.5 cow
5 Bacon 8.0 NaN
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
A function achieves the same result:
In [88]: data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[88]:
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
Note: the lambda lowercases each value first; in the plain dict version above, the capitalized entries ('Pastrami', 'Bacon') have no matching key in meat_to_animal, which is why they mapped to NaN.
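Equivalently, the case normalization can be done with the Series str.lower accessor before mapping; a small sketch using a trimmed-down dict:

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}
food = pd.Series(['bacon', 'Pastrami', 'Bacon'])

# Lowercase first, then map through the dict; unmatched keys would become NaN.
animal = food.str.lower().map(meat_to_animal)
print(animal.tolist())  # ['pig', 'cow', 'pig']
```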
Replacing Values
Use the replace() method:
In [92]: data = pd.Series([1., -999., 2., -999., -1000., 3.])
In [93]: data
Out[93]:
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
In [94]: data.replace(-999,100)
Out[94]:
0 1.0
1 100.0
2 2.0
3 100.0
4 -1000.0
5 3.0
dtype: float64
You can also replace multiple values at once by passing a list (here -100 does not occur in the data, so only the -999 values are replaced):
In [95]: data.replace([-999,-100],np.nan)
Out[95]:
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
To use a different replacement for each value, pass a list of substitutes:
In [96]: data.replace([-999,-1000],[np.nan,0])
Out[96]:
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
The argument can also be a dict.
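A minimal sketch of the dict form (not part of the transcript above), where each key is replaced by its value:

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# Dict form of replace: key -> replacement value.
result = data.replace({-999: np.nan, -1000: 0})
print(result.tolist())  # [1.0, nan, 2.0, nan, 0.0, 3.0]
```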
Renaming Axis Indexes
Use rename(), which returns a new object:
In [97]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
...: index=['Ohio', 'Colorado', 'New York'],
...: columns=['one', 'two', 'three', 'four'])
In [98]: data
Out[98]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
In [100]: data.rename(index=str.title,columns=str.upper)
Out[100]:
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
Notably, rename can be used with a dict-like object to update only a subset of the axis labels (matching is exact: 'OHIO' below does not match the existing label 'Ohio', so the row labels come back unchanged):
In [112]: data.rename(index ={'OHIO':'INDIANA'},columns={'three':'peelbk'})
Out[112]:
one two peelbk four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
To modify the DataFrame in place, just pass inplace=True.
Binning Data into Groups
In [113]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
In [114]: bins = [18, 25, 35, 60, 100]  # the bin edges
In [116]: cats = pd.cut(ages,bins)
In [117]: cats
Out[117]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
In [119]: cats.codes
Out[119]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
In [120]: cats.categories
Out[120]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
closed='right',
dtype='interval[int64]')
In [122]: pd.value_counts(cats)  # counts per bin of the pandas.cut result
Out[122]:
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
Which side of each interval is closed can be changed by passing right=False (making the intervals closed on the left and open on the right):
In [123]: pd.cut(ages,bins,right=False)
Out[123]:
[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]
You can set your own bin names by passing a list or array to labels:
In [124]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
In [125]: pd.cut(ages,bins,labels=group_names)
Out[125]:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
If you pass cut an integer number of bins instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data. In the example below, some uniformly distributed data is cut into four groups:
In [85]: data = np.random.rand(20)
In [86]: pd.cut(data, 4, precision=2)  # precision=2 limits the bin edges to two decimal digits
Out[86]:
[(0.34, 0.55], (0.34, 0.55], (0.76, 0.97], (0.76, 0.97], (0.34, 0.55], ..., (0.34
, 0.55], (0.34, 0.55], (0.55, 0.76], (0.34, 0.55], (0.12, 0.34]]
Length: 20
Categories (4, interval[float64]): [(0.12, 0.34] < (0.34, 0.55] < (0.55, 0.76] <
(0.76, 0.97]]
qcut is a closely related function that bins the data based on sample quantiles. Depending on the distribution of the data, cut will not usually give each bin the same number of data points; because qcut uses sample quantiles instead, it produces bins of roughly equal size:
In [87]: data = np.random.randn(1000) # Normally distributed
In [88]: cats = pd.qcut(data, 4) # Cut into quartiles
In [129]: cats
Out[129]:
[(-0.00264, 0.68], (0.68, 3.714], (0.68, 3.714], (-0.00264, 0.68], (0.68, 3.714], ..., (0.68, 3.714], (-0.708, -0.00264], (-3.294, -0.708], (0.68, 3.714], (-0.708, -0.00264]]
Length: 1000
Categories (4, interval[float64]): [(-3.294, -0.708] < (-0.708, -0.00264] < (-0.00264, 0.68] <
(0.68, 3.714]]
In [90]: pd.value_counts(cats)
Out[90]:
(0.62, 3.928] 250
(-0.0265, 0.62] 250
(-0.68, -0.0265] 250
(-2.95, -0.68] 250
dtype: int64
Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive of the endpoints):
In [133]: pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
Out[133]:
[(-0.00264, 1.363], (-0.00264, 1.363], (-0.00264, 1.363], (-0.00264, 1.363], (-0.00264, 1.363], ..., (-0.00264, 1.363], (-1.251, -0.00264], (-1.251, -0.00264], (-0.00264, 1.363], (-1.251, -0.00264]]
Length: 1000
Categories (4, interval[float64]): [(-3.294, -1.251] < (-1.251, -0.00264] < (-0.00264, 1.363] <
(1.363, 3.714]]
Detecting and Filtering Outliers
To find the values whose absolute value exceeds 2:
In [142]: data = pd.DataFrame(np.random.randn(100, 4))
In [143]: data[np.abs(data)>2]
Out[143]:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 -2.782408 NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN -2.473652 NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
11 NaN NaN NaN NaN
12 NaN 2.231629 NaN NaN
13 NaN NaN NaN NaN
14 NaN NaN NaN NaN
15 NaN NaN NaN NaN
16 NaN NaN NaN NaN
17 NaN NaN NaN NaN
18 NaN NaN NaN -2.082585
19 NaN NaN NaN NaN
20 NaN NaN NaN NaN
21 NaN NaN NaN NaN
22 NaN NaN NaN NaN
23 NaN NaN NaN NaN
24 NaN NaN NaN NaN
25 NaN NaN NaN NaN
26 NaN NaN NaN NaN
27 NaN NaN NaN NaN
28 NaN NaN NaN NaN
29 NaN NaN NaN NaN
.. ... ... ... ...
70 NaN NaN NaN NaN
71 NaN NaN NaN NaN
72 NaN NaN NaN NaN
73 NaN NaN NaN NaN
74 NaN NaN NaN NaN
75 NaN -2.974081 NaN NaN
76 NaN NaN NaN NaN
77 NaN NaN NaN NaN
78 NaN NaN NaN NaN
79 NaN NaN NaN NaN
80 NaN NaN NaN NaN
81 NaN NaN NaN NaN
82 NaN NaN NaN NaN
83 NaN NaN NaN NaN
84 NaN NaN NaN NaN
85 NaN NaN NaN -2.195861
86 NaN NaN NaN NaN
87 NaN NaN NaN NaN
88 NaN NaN NaN NaN
89 NaN NaN NaN NaN
90 NaN NaN NaN 2.260145
91 NaN NaN NaN NaN
92 NaN 2.260530 NaN NaN
93 NaN NaN NaN NaN
94 NaN NaN NaN NaN
95 NaN NaN NaN NaN
96 NaN NaN NaN NaN
97 NaN NaN NaN NaN
98 NaN NaN NaN NaN
99 2.957832 NaN NaN NaN
[100 rows x 4 columns]
To select all rows having a value whose absolute value exceeds 2:
In [146]: data[(np.abs(data) > 2).any(axis=1)]
Out[146]:
0 1 2 3
5 -2.782408 -0.987358 -1.551619 -1.526386
8 0.724653 0.384637 -2.473652 0.813940
12 -1.283665 2.231629 -0.998333 0.429922
18 0.178538 -1.130435 -0.014909 -2.082585
32 0.866329 0.374657 2.587866 -0.488257
64 0.231327 0.624478 -2.408233 1.623262
66 -1.283825 -0.406443 2.732196 -0.789880
67 -0.795824 -2.844512 -0.632061 -0.129175
68 1.078212 1.464731 -0.607920 2.617917
75 1.304676 -2.974081 0.262357 0.519151
85 -0.027685 -0.542753 1.026821 -2.195861
90 -1.313300 0.146344 -1.850061 2.260145
92 -0.179965 2.260530 0.064706 -0.491307
99 2.957832 -0.382531 -0.679123 -1.211802
np.sign(data) produces 1 and -1 values depending on whether each value is positive or negative:
In [147]: np.sign(data).head()  # show only the first five rows
Out[147]:
0 1 2 3
0 1.0 -1.0 1.0 -1.0
1 -1.0 1.0 1.0 -1.0
2 1.0 1.0 1.0 -1.0
3 -1.0 -1.0 -1.0 1.0
4 -1.0 -1.0 1.0 -1.0
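np.sign is typically combined with the boolean mask above to cap outlying values at an interval edge; a sketch (not in the transcript) capping values outside [-2, 2]:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((100, 4)))

# Replace every value with |value| > 2 by +/-2, preserving its sign.
data[np.abs(data) > 2] = np.sign(data) * 2

print(data.abs().max().max())  # no value exceeds 2 any more
```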
Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows of a DataFrame is easy using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering.
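The workflow described above can be sketched as follows (a minimal example, not from the original transcript):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape((5, 4)))

# An array of the integers 0..4 in random order: the new row order.
sampler = np.random.permutation(5)

# take (or .iloc) applies that ordering to the rows.
shuffled = df.take(sampler)

# For random sampling without replacement, DataFrame.sample is simpler still.
subset = df.sample(n=3)
```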
Computing Indicator/Dummy Variables
Another transformation commonly used for statistical modeling or machine learning is converting a categorical variable into a "dummy" or "indicator" matrix (also known as feature factorization).
If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns containing all 1s and 0s. pandas has a get_dummies function for doing this (though devising one yourself is not difficult). Reusing an earlier DataFrame example:
In [157]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
In [158]: df
Out[158]:
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
In [159]: pd.get_dummies(df['key'], prefix='key')  # prefix adds a prefix to the indicator column names
Out[159]:
key_a key_b key_c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
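A common follow-up pattern (sketched here; not in the transcript above) is to join the indicator columns back onto the rest of the DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# Build the indicator matrix, then join it onto the numeric column.
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
print(df_with_dummy.columns.tolist())  # ['data1', 'key_a', 'key_b', 'key_c']
```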
3. String Manipulation
In [160]: val = 'a,b, guido'
In [161]: val.split(',')
Out[161]: ['a', 'b', ' guido']
split is often combined with strip to trim whitespace (including newlines):
In [165]: arr = [x.strip() for x in val.split(',')]
In [166]: arr
Out[166]: ['a', 'b', 'guido']
Use the join() method to concatenate strings:
In [167]: '::'.join(arr)
Out[167]: 'a::b::guido'
The in keyword is the easiest way to test membership; here it checks whether 'guido' is an element of the list arr (on a string, in tests for a substring instead):
In [168]: 'guido' in arr
Out[168]: True
index() and find() locate the position of a substring:
In [169]: val.index(',')
Out[169]: 1
In [170]: val.find(':')
Out[170]: -1
Note the difference between find and index: index raises an exception when the string isn't found (rather than returning -1):
In [172]: val.index(':')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-172-2c016e7367ac> in <module>()
----> 1 val.index(':')
ValueError: substring not found
Relatedly, the count() method returns the number of occurrences of a particular substring:
In [173]: val.count(',')
Out[173]: 2
The replace() method substitutes one substring for another:
In [174]: val.replace(',','::')
Out[174]: 'a::b:: guido'
Regular Expressions
Regular expressions provide a flexible way to search for or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python's built-in re module is responsible for applying regular expressions to strings; its functions fall into three categories: pattern matching, substitution, and splitting.
Let's look at a simple example first: suppose we want to split a string on a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is \s+:
In [175]: import re
In [176]: text = 'foo bar\t baz \tqux'
In [177]: re.split(r'\s+', text)
Out[177]: ['foo', 'bar', 'baz', 'qux']
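When a pattern will be reused on many strings, compiling it with re.compile avoids re-parsing it each time. A short sketch covering all three operation families mentioned above (splitting, matching, substitution):

```python
import re

text = 'foo  bar\t baz  \tqux'
regex = re.compile(r'\s+')  # compile once, reuse for many strings

pieces = regex.split(text)     # splitting on every whitespace run
runs = regex.findall(text)     # matching: the whitespace runs themselves
joined = regex.sub('-', text)  # substitution

print(pieces)  # ['foo', 'bar', 'baz', 'qux']
print(joined)  # foo-bar-baz-qux
```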
I'll pause this chapter here and continue tomorrow.