"Python for Data Analysis" study notes — Data Cleaning and Preparation

1. Handling Missing Data

Missing data occurs frequently in many data analysis applications. One of pandas' goals is to make working with missing data as painless as possible; for example, all of the descriptive statistics on pandas objects exclude missing data by default. pandas uses the floating-point value NaN (Not a Number) to represent missing data. It is a sentinel value that can be easily detected:

In [20]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [21]: string_data
Out[21]:
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [22]: string_data.isnull()
Out[22]:
0    False
1    False
2     True
3    False
dtype: bool

The built-in Python None value is also treated as NA in object arrays:

In [24]: string_data[0] = None

In [25]: string_data.isnull()
Out[25]:
0     True
1    False
2     True
3    False
dtype: bool

Filtering Out Missing Data

For a Series, dropna() returns a Series containing only the non-null data and index values, e.g.:

In [26]: from numpy import nan as NA

In [27]: data = pd.Series([1, NA, 3.5, NA, 7])

In [28]: data.dropna()
Out[28]:
0    1.0
2    3.5
4    7.0
dtype: float64

This is equivalent to:

In [29]: data[data.notnull()]
Out[29]:
0    1.0
2    3.5
4    7.0
dtype: float64

With a DataFrame, dropna() by default drops any row containing a missing value:

In [30]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])

In [32]: cleaned = data.dropna()

In [33]: cleaned
Out[33]:
     0    1    2
0  1.0  6.5  3.0

In [34]: data
Out[34]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

But passing how='all' will drop only the rows that are all NA:

In [35]: data.dropna(how = 'all')
Out[35]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

To drop columns in the same way, pass axis=1:

In [37]: data[4] = NA

In [38]: data
Out[38]:
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN

In [39]: data.dropna(how='all',axis=1)
Out[39]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

The thresh argument can also be used for filtering: thresh=n keeps only the rows (or columns) that have at least n non-NA values. For example, thresh=3 keeps a row only when it contains at least 3 non-NA values.

In [43]: df = pd.DataFrame(np.random.randn(7, 3))

In [45]: df.iloc[:4,1] = NA
In [46]: df.iloc[:2,2] = NA
In [47]: df
Out[47]:
          0         1         2
0 -1.609312       NaN       NaN
1  0.241571       NaN       NaN
2  0.764884       NaN -1.532268
3 -1.892363       NaN -0.423333
4 -1.984312  0.913495 -1.371564
5 -0.969473 -0.171819  0.418801
6  2.116447  0.831522 -1.888933

In [48]: df.dropna()
Out[48]:
          0         1         2
4 -1.984312  0.913495 -1.371564
5 -0.969473 -0.171819  0.418801
6  2.116447  0.831522 -1.888933

In [49]: df.dropna(thresh=2)
Out[49]:
          0         1         2
2  0.764884       NaN -1.532268
3 -1.892363       NaN -0.423333
4 -1.984312  0.913495 -1.371564
5 -0.969473 -0.171819  0.418801
6  2.116447  0.831522 -1.888933

In [50]: df.dropna(thresh=3)
Out[50]:
          0         1         2
4 -1.984312  0.913495 -1.371564
5 -0.969473 -0.171819  0.418801
6  2.116447  0.831522 -1.888933

Filling In Missing Data

If, instead of filtering out missing data, you want to fill in the holes in some way, use the fillna() function:

In [51]: df.fillna(0)
Out[51]:
          0         1         2
0 -1.609312  0.000000  0.000000
1  0.241571  0.000000  0.000000
2  0.764884  0.000000 -1.532268
3 -1.892363  0.000000 -0.423333
4 -1.984312  0.913495 -1.371564
5 -0.969473 -0.171819  0.418801
6  2.116447  0.831522 -1.888933

Calling fillna with a dict lets you use a different fill value for each column:

In [53]: df.fillna({1:20,2:600})
Out[53]:
          0          1           2
0 -1.609312  20.000000  600.000000
1  0.241571  20.000000  600.000000
2  0.764884  20.000000   -1.532268
3 -1.892363  20.000000   -0.423333
4 -1.984312   0.913495   -1.371564
5 -0.969473  -0.171819    0.418801
6  2.116447   0.831522   -1.888933

Note

drop(), dropna(), and fillna() all return a new object (they do not modify the original). To modify the original object in place, pass inplace=True:

In [54]: df.fillna(0,inplace=True)

In [55]: df
Out[55]:
          0         1         2
0 -1.609312  0.000000  0.000000
1  0.241571  0.000000  0.000000
2  0.764884  0.000000 -1.532268
3 -1.892363  0.000000 -0.423333
4 -1.984312  0.913495 -1.371564
5 -0.969473 -0.171819  0.418801
6  2.116447  0.831522 -1.888933

2. Data Transformation

The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (i.e., identical to a row observed earlier):

In [58]: data = pd.DataFrame({'k1':['one','two'] * 3 + ['two'],'k2':[1,1,2,3,3,4,4]})

In [59]: data
Out[59]:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4

In [60]: data.duplicated()
Out[60]:
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Relatedly, the drop_duplicates() method drops the duplicate rows:

In [63]: data.drop_duplicates()
Out[63]:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4

duplicated() and drop_duplicates() by default consider all of the columns; to detect duplicates based on a subset of columns, pass that subset as an argument, for example:

In [65]: data.duplicated(['k1'])
Out[65]:
0    False
1    False
2     True
3     True
4     True
5     True
6     True
dtype: bool

In [66]: data.drop_duplicates(['k1'])
Out[66]:
    k1  k2
0  one   1
1  two   1

duplicated() and drop_duplicates() by default keep the first observed value; passing keep='last' keeps the last one instead:

In [69]: data.duplicated(keep='last')
Out[69]:
0    False
1    False
2    False
3    False
4    False
5     True
6    False
dtype: bool

In [67]: data.drop_duplicates(keep='last')
Out[67]:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
6  two   4

Transforming Data Using a Function or Mapping

In [70]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
    ...:                               'Pastrami', 'corned beef', 'Bacon',
    ...:                               'pastrami', 'honey ham', 'nova lox'],
    ...:                      'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
    ...:

In [71]: data
Out[71]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

Suppose you wanted to add a column indicating the type of animal that each food came from. Let's first write down a mapping of each distinct meat type to the kind of animal:

meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

The map method on a Series accepts a function or a dict-like object containing a mapping:

In [80]: data['animal'] = data['food'].map(meat_to_animal)

In [81]: data
Out[81]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     NaN
4  corned beef     7.5     cow
5        Bacon     8.0     NaN
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

A function can achieve the same effect:

In [88]: data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[88]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Note: some of the food values here are capitalized ('Pastrami', 'Bacon'), so the plain dict lookup above produced NaN for them; lowercasing each value first, as the lambda does, makes every lookup succeed.

Replacing Values

Values are replaced with the replace() method:

In [92]: data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [93]: data
Out[93]:
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [94]: data.replace(-999,100)
Out[94]:
0       1.0
1     100.0
2       2.0
3     100.0
4   -1000.0
5       3.0
dtype: float64

You can also replace multiple values at once by passing a list of the values to be replaced:

In [95]: data.replace([-999,-100],np.nan)
Out[95]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes as well:

In [96]: data.replace([-999,-1000],[np.nan,0])
Out[96]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dict.
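For example, a dict maps each value to its own replacement; a small sketch reusing the same Series as above:

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# A dict argument gives each sentinel value its own replacement
result = data.replace({-999: np.nan, -1000: 0})
print(result)
```

This does in one call what the two-list form above does, with the mapping stated explicitly.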

Renaming Axis Indexes

rename() returns a new dataset with transformed axis labels:

In [97]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
    ...:                     index=['Ohio', 'Colorado', 'New York'],
    ...:                     columns=['one', 'two', 'three', 'four'])

In [98]: data
Out[98]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

In [100]: data.rename(index=str.title,columns=str.upper)
Out[100]:
          ONE  TWO  THREE  FOUR
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

Notably, rename can be used in conjunction with a dict-like object to update a subset of the axis labels:

    In [112]: data.rename(index ={'OHIO':'INDIANA'},columns={'three':'peelbk'})
    Out[112]:
              one  two  peelbk  four
    Ohio        0    1       2     3
    Colorado    4    5       6     7
    New York    8    9      10    11

To modify a DataFrame in place, just pass inplace=True.
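A minimal sketch of in-place renaming; note that dict keys must match the existing labels exactly, including case (the 'OHIO' key in the session above did not match 'Ohio', so the index there was left unchanged):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# With inplace=True, rename modifies data directly and returns None
data.rename(index={'Ohio': 'INDIANA'}, inplace=True)
print(data.index.tolist())
```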

Discretization and Binning (dividing data into groups)

In [113]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [114]: bins = [18, 25, 35, 60, 100]   # the bin edges
In [116]: cats = pd.cut(ages,bins)

In [117]: cats
Out[117]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
In [119]: cats.codes
Out[119]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [120]: cats.categories
Out[120]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')
              
In [122]: pd.value_counts(cats)     # value_counts() gives the bin counts for the pandas.cut() result
Out[122]:
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

Which side of each interval is closed can be changed by passing right=False (making the bins closed on the left and open on the right):

In [123]: pd.cut(ages,bins,right=False)
Out[123]:
[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]

You can set your own bin names by passing a list or array to labels:

In [124]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [125]: pd.cut(ages,bins,labels=group_names)
Out[125]:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

If you pass cut an integer number of bins instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data. In the following example, we cut some uniformly distributed data into four groups:

In [85]: data = np.random.rand(20)
In [86]: pd.cut(data, 4, precision=2)  # precision=2 limits the decimal precision to two digits
Out[86]: 
[(0.34, 0.55], (0.34, 0.55], (0.76, 0.97], (0.76, 0.97], (0.34, 0.55], ..., (0.34, 0.55], (0.34, 0.55], (0.55, 0.76], (0.34, 0.55], (0.12, 0.34]]
Length: 20
Categories (4, interval[float64]): [(0.12, 0.34] < (0.34, 0.55] < (0.55, 0.76] < (0.76, 0.97]]

In addition, qcut is a closely related function that bins the data based on sample quantiles. Depending on the distribution of the data, cut will not usually result in each bin having the same number of data points; qcut, because it uses sample quantiles, produces bins of roughly equal size:

In [87]: data = np.random.randn(1000)  # Normally distributed

In [88]: cats = pd.qcut(data, 4)  # Cut into quartiles

In [129]: cats
Out[129]:
[(-0.00264, 0.68], (0.68, 3.714], (0.68, 3.714], (-0.00264, 0.68], (0.68, 3.714], ..., (0.68, 3.714], (-0.708, -0.00264], (-3.294, -0.708], (0.68, 3.714], (-0.708, -0.00264]]
Length: 1000
Categories (4, interval[float64]): [(-3.294, -0.708] < (-0.708, -0.00264] < (-0.00264, 0.68] <
                                    (0.68, 3.714]]

In [90]: pd.value_counts(cats)
Out[90]:
(0.68, 3.714]          250
(-0.00264, 0.68]       250
(-0.708, -0.00264]     250
(-3.294, -0.708]       250
dtype: int64

As with cut, you can also pass your own quantiles (numbers between 0 and 1, inclusive):

In [133]: pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
Out[133]:
[(-0.00264, 1.363], (-0.00264, 1.363], (-0.00264, 1.363], (-0.00264, 1.363], (-0.00264, 1.363], ..., (-0.00264, 1.363], (-1.251, -0.00264], (-1.251, -0.00264], (-0.00264, 1.363], (-1.251, -0.00264]]
Length: 1000
Categories (4, interval[float64]): [(-3.294, -1.251] < (-1.251, -0.00264] < (-0.00264, 1.363] <
                                    (1.363, 3.714]]

Detecting and Filtering Outliers

Find the values whose absolute value exceeds 2:

In [142]: data = pd.DataFrame(np.random.randn(100, 4))

In [143]: data[np.abs(data)>2]
Out[143]:
           0         1         2         3
0        NaN       NaN       NaN       NaN
1        NaN       NaN       NaN       NaN
2        NaN       NaN       NaN       NaN
3        NaN       NaN       NaN       NaN
4        NaN       NaN       NaN       NaN
5  -2.782408       NaN       NaN       NaN
6        NaN       NaN       NaN       NaN
7        NaN       NaN       NaN       NaN
8        NaN       NaN -2.473652       NaN
9        NaN       NaN       NaN       NaN
10       NaN       NaN       NaN       NaN
11       NaN       NaN       NaN       NaN
12       NaN  2.231629       NaN       NaN
13       NaN       NaN       NaN       NaN
14       NaN       NaN       NaN       NaN
15       NaN       NaN       NaN       NaN
16       NaN       NaN       NaN       NaN
17       NaN       NaN       NaN       NaN
18       NaN       NaN       NaN -2.082585
19       NaN       NaN       NaN       NaN
20       NaN       NaN       NaN       NaN
21       NaN       NaN       NaN       NaN
22       NaN       NaN       NaN       NaN
23       NaN       NaN       NaN       NaN
24       NaN       NaN       NaN       NaN
25       NaN       NaN       NaN       NaN
26       NaN       NaN       NaN       NaN
27       NaN       NaN       NaN       NaN
28       NaN       NaN       NaN       NaN
29       NaN       NaN       NaN       NaN
..       ...       ...       ...       ...
70       NaN       NaN       NaN       NaN
71       NaN       NaN       NaN       NaN
72       NaN       NaN       NaN       NaN
73       NaN       NaN       NaN       NaN
74       NaN       NaN       NaN       NaN
75       NaN -2.974081       NaN       NaN
76       NaN       NaN       NaN       NaN
77       NaN       NaN       NaN       NaN
78       NaN       NaN       NaN       NaN
79       NaN       NaN       NaN       NaN
80       NaN       NaN       NaN       NaN
81       NaN       NaN       NaN       NaN
82       NaN       NaN       NaN       NaN
83       NaN       NaN       NaN       NaN
84       NaN       NaN       NaN       NaN
85       NaN       NaN       NaN -2.195861
86       NaN       NaN       NaN       NaN
87       NaN       NaN       NaN       NaN
88       NaN       NaN       NaN       NaN
89       NaN       NaN       NaN       NaN
90       NaN       NaN       NaN  2.260145
91       NaN       NaN       NaN       NaN
92       NaN  2.260530       NaN       NaN
93       NaN       NaN       NaN       NaN
94       NaN       NaN       NaN       NaN
95       NaN       NaN       NaN       NaN
96       NaN       NaN       NaN       NaN
97       NaN       NaN       NaN       NaN
98       NaN       NaN       NaN       NaN
99  2.957832       NaN       NaN       NaN

[100 rows x 4 columns]

To select all rows containing a value whose absolute value exceeds 2:

In [146]: data[(np.abs(data) > 2).any(1)]
Out[146]:
           0         1         2         3
5  -2.782408 -0.987358 -1.551619 -1.526386
8   0.724653  0.384637 -2.473652  0.813940
12 -1.283665  2.231629 -0.998333  0.429922
18  0.178538 -1.130435 -0.014909 -2.082585
32  0.866329  0.374657  2.587866 -0.488257
64  0.231327  0.624478 -2.408233  1.623262
66 -1.283825 -0.406443  2.732196 -0.789880
67 -0.795824 -2.844512 -0.632061 -0.129175
68  1.078212  1.464731 -0.607920  2.617917
75  1.304676 -2.974081  0.262357  0.519151
85 -0.027685 -0.542753  1.026821 -2.195861
90 -1.313300  0.146344 -1.850061  2.260145
92 -0.179965  2.260530  0.064706 -0.491307
99  2.957832 -0.382531 -0.679123 -1.211802

np.sign(data) produces 1 and -1 values according to whether each value in data is positive or negative:

In [147]: np.sign(data).head()    # show only the first five rows
Out[147]:
     0    1    2    3
0  1.0 -1.0  1.0 -1.0
1 -1.0  1.0  1.0 -1.0
2  1.0  1.0  1.0 -1.0
3 -1.0 -1.0 -1.0  1.0
4 -1.0 -1.0  1.0 -1.0
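One common use of np.sign, a sketch rather than the only approach, is to cap values outside an interval such as [-2, 2]:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.randn(100, 4))

# Boolean-mask assignment: every value whose absolute value exceeds 2
# is replaced by +2 or -2, keeping its original sign
data[np.abs(data) > 2] = np.sign(data) * 2

# After capping, no column's absolute maximum can exceed 2
print(data.abs().max())
```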

Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows of a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering.
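The idea can be sketched as follows; sample(n) is the related DataFrame method for drawing a random subset without replacement:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape((5, 4)))

# An array of the integers 0..4 in random order
sampler = np.random.permutation(5)

# take reorders the rows according to that new ordering
shuffled = df.take(sampler)

# sample(n) draws n rows at random without replacement
subset = df.sample(n=3)
print(sampler)
```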

Computing Indicator/Dummy Variables

Another type of transformation commonly used for statistical modeling or machine learning is converting a categorical variable into a "dummy" or "indicator" matrix (feature one-hot encoding).
If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns whose values are all 1s and 0s. pandas has a get_dummies function for doing this (though devising one yourself is not difficult). Using an earlier example DataFrame:

In [157]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
In [158]: df
Out[158]:
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5

In [159]: pd.get_dummies(df['key'],prefix='key')   # prefix adds a prefix to the indicator column names
Out[159]:
   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0
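A common follow-up step, sketched here with the same df, is joining the indicator columns back onto the rest of the data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# Build the indicator matrix for the categorical column
dummies = pd.get_dummies(df['key'], prefix='key')

# Keep the numeric column and attach one indicator column per category
df_with_dummy = df[['data1']].join(dummies)
print(df_with_dummy.columns.tolist())
```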

3. String Manipulation

In [160]: val = 'a,b,    guido'

In [161]: val.split(',')
Out[161]: ['a', 'b', '    guido']

split is often combined with strip to trim whitespace (including line breaks):

In [165]: arr = [x.strip() for x in val.split(',')]

In [166]: arr
Out[166]: ['a', 'b', 'guido']

Strings are concatenated with the join() method:

In [167]: '::'.join(arr)
Out[167]: 'a::b::guido'

The in keyword tests whether one string is contained in another:

In [168]: 'guido' in arr
Out[168]: True

index() and find() locate the position of a substring:

In [169]: val.index(',')
Out[169]: 1

In [170]: val.find(':')
Out[170]: -1

Note the difference between find and index: if the string isn't found, index raises an exception (rather than returning -1):

In [172]: val.index(':')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-172-2c016e7367ac> in <module>()
----> 1 val.index(':')

ValueError: substring not found

Relatedly, the count() method returns the number of occurrences of a particular substring:

In [173]: val.count(',')
Out[173]: 2

The replace() method substitutes one substring for another:

In [174]: val.replace(',','::')
Out[174]: 'a::b::    guido'

Regular Expressions

Regular expressions provide a flexible way to search for or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python's built-in re module applies regular expressions to strings; its operations fall into three categories: pattern matching, substitution, and splitting.
Let's look at a simple example: suppose we want to split a string whose separator is a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is \s+:

In [175]: import re

In [176]: text = 'foo    bar\t baz    \tqux'

In [177]: re.split(r'\s+',text)
Out[177]: ['foo', 'bar', 'baz', 'qux']
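The other two operation categories mentioned above, matching and substitution, can be sketched with a compiled regex (compiling the pattern once is worthwhile when it is reused):

```python
import re

text = 'foo    bar\t baz    \tqux'

# Compile the whitespace pattern once for reuse
regex = re.compile(r'\s+')

# split: same result as re.split(r'\s+', text)
parts = regex.split(text)

# findall: return every run of whitespace that matches the pattern
gaps = regex.findall(text)

# sub: replace each match with a single space
normalized = regex.sub(' ', text)
print(normalized)
```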

I'll set this chapter aside here and continue tomorrow.
