4.2 Data Transformation
This section covers using pandas to filter, clean, and otherwise transform data, which is another important family of operations.
Removing Duplicates
The DataFrame method duplicated returns a Boolean Series indicating whether each row is a duplicate, that is, identical to a row that appeared earlier:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: data = pd.DataFrame({
   ...:     'k1': ['one', 'two'] * 3 + ['two'],
   ...:     'k2': [1, 1, 2, 3, 3, 4, 4],
   ...: })
In [4]: data
Out[4]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
In [5]: data.duplicated()
Out[5]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True # only this row is identical to row 5 above, so it returns True
dtype: bool
drop_duplicates returns a DataFrame containing only the rows for which duplicated returned False:
In [6]: data.drop_duplicates()
Out[6]:
k1 k2
0 one 1 # the result contains no duplicate rows
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
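The relationship between the two methods can be checked directly. The sketch below rebuilds the same frame and compares filtering with the inverted duplicated mask against drop_duplicates; the names mask, filtered, and deduped are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({
    'k1': ['one', 'two'] * 3 + ['two'],
    'k2': [1, 1, 2, 3, 3, 4, 4],
})

# Boolean mask: True where a row repeats an earlier row
mask = data.duplicated()

# Keep only the rows where the mask is False
filtered = data[~mask]

# drop_duplicates() should produce the same frame
deduped = data.drop_duplicates()
print(filtered.equals(deduped))
# True
```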
Next, add an extra column and drop duplicates based only on the 'k1' column:
In [7]: data['v1'] = range(7)
In [8]: data
Out[8]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6
In [9]: data.drop_duplicates(['k1'])
Out[9]:
k1 k2 v1
0 one 1 0
1 two 1 1
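To see why only rows 0 and 1 survive, it can help to look at the Boolean mask directly. The following is a minimal sketch that rebuilds the same frame and passes the column list through the subset parameter of duplicated; the variable name flags is only illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'k1': ['one', 'two'] * 3 + ['two'],
    'k2': [1, 1, 2, 3, 3, 4, 4],
    'v1': range(7),
})

# Only 'k1' is compared, so every row after the first 'one'
# and the first 'two' is flagged as a duplicate.
flags = data.duplicated(subset=['k1'])
print(flags.tolist())
# [False, False, True, True, True, True, True]
```

Because the other columns are ignored, the values of k2 and v1 in the dropped rows are simply discarded.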
By default, both duplicated and drop_duplicates keep the first observed occurrence. Passing keep='last' keeps the last occurrence instead:
In [10]: data.drop_duplicates(['k1', 'k2'], keep='last')
Out[10]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4 # row 5 was dropped
6 two 4 6
In [11]: data
Out[11]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6
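As the last output shows, drop_duplicates returns a new DataFrame and leaves data itself unchanged. The sketch below compares the available keep options on the same frame; besides 'first' and 'last', pandas also accepts keep=False, which drops every member of a duplicated group. The variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'k1': ['one', 'two'] * 3 + ['two'],
    'k2': [1, 1, 2, 3, 3, 4, 4],
    'v1': range(7),
})

first = data.drop_duplicates(['k1', 'k2'])              # keeps row 5, drops row 6
last = data.drop_duplicates(['k1', 'k2'], keep='last')  # keeps row 6, drops row 5
none = data.drop_duplicates(['k1', 'k2'], keep=False)   # drops both rows 5 and 6

print(list(first.index))  # [0, 1, 2, 3, 4, 5]
print(list(last.index))   # [0, 1, 2, 3, 4, 6]
print(list(none.index))   # [0, 1, 2, 3, 4]
print(len(data))          # 7, the original frame is untouched
```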
Transforming Data Using a Function or Mapping
Consider the following hypothetical data collected about various kinds of meat:
data = pd.DataFrame({
'food' : ['bacon', 'pulled pork', 'bacon', 'Pastrami',
'corned beef', 'Bacon', 'pastrami', 'honey ham',
'nova lox'], 'ounces' : [4,3,12,7,7.5,8,3,5,6]})
print(data)
'''
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 7.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
'''
meat_to_animal = {
'bacon' : 'pig',
'pulled pork' : 'pig',
'pastrami' : 'cow',
'corned beef' : 'pig',
'honey ham' : 'pig',
'nova lox' : 'salmon'
}
Some of the food names above are capitalized and others are not, so use the Series method str.lower to convert every value to lowercase:
print(data['food'])
lowercased = data['food'].str.lower()
print(lowercased)
'''
0 bacon
1 pulled pork
2 bacon
3 Pastrami
4 corned beef
5 Bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
0 bacon # every uppercase letter in the food column is now lowercase
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
'''
data['animal'] = lowercased.map(meat_to_animal) # look up each lowercased food name in meat_to_animal
print(data)
'''
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 7.0 cow
4 corned beef 7.5 pig
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
'''
print(data['food'].map(lambda x : meat_to_animal[x.lower()]))
'''
0 pig
1 pig
2 pig
3 cow
4 pig
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
'''
print(data)
'''
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 7.0 cow
4 corned beef 7.5 pig
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
'''
Using map is a convenient way to perform element-wise transformations and other data-cleaning operations.
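One difference between the two styles shown above appears when a value has no entry in the mapping. The sketch below uses a small hypothetical Series, where 'tofu' is a made-up unmapped value and the 'unknown' fallback label is an assumption for illustration. Passing a plain dict to map yields NaN for missing keys, whereas a function that indexes the dict with square brackets would raise KeyError; dict.get with a default avoids that.

```python
import pandas as pd

# 'tofu' is a hypothetical value with no entry in the mapping
foods = pd.Series(['bacon', 'Pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# Mapping with a plain dict: unmatched values become NaN
by_dict = foods.str.lower().map(meat_to_animal)
print(by_dict.isna().tolist())
# [False, False, True]

# Mapping with a function: dict.get supplies an explicit fallback
by_func = foods.map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))
print(by_func.tolist())
# ['pig', 'cow', 'unknown']
```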
Replacing Values
In [