python清洗数据 food ounces animal_python+数据过滤、清理、转换

本篇博客主要内容

1)移除重复数据(duplicated)

2)利用函数或映射进行数据转换(map)

3)替换值(replace)

4)重命名轴索引

5)检测和过滤异常值(逻辑索引)

6)随机采样或选出随机子集

7)计算哑变量(get_dummies)

1)移除重复数据

检查某列数据是否重复可用.is_unique

检查某行数据是否重复可用.duplicatedimport pandas as pd

import numpy as np

data = pd.DataFrame({'k1':['one'] * 3 + ['two'] * 4,'k2':[1,1,2,3,3,4,4]})data

k1k20one1

1one1

2one2

3two3

4two3

5two4

6two4

检查列以及行中重复数据data.index.is_unique#检查列Truedata.k1.is_unique#检查列Falsedata['k2'].is_unique#检查列Falsedata.is_unique---------------------------------------------------------------------------

AttributeError Traceback (most recent call last)

in ()

----> 1 data.is_unique

C:\Program Files\anaconda\lib\site-packages\pandas\core\generic.pyc in __getattr__(self, name)

2670 if name in self._info_axis:

2671 return self[name]

-> 2672 return object.__getattribute__(self, name)

2673

2674 def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'is_unique'data.duplicated()#检查行0 False

1 True

2 False

3 False

4 True

5 False

6 True

dtype: bool

移除重复行data.drop_duplicates()

k1k20one1

2one2

3two3

5two4set(data.k1)#保留唯一的列属性值{'one', 'two'}

移除重复值小结

1) drop_duplicates、duolicated函数只能用于DataFrame

2) is_unique不能用于DataFrame

2)利用函数或映射进行数据转换data1 = pd.DataFrame({'food':['bacon','pork','bacon','Pastrami', 'beef','Bacon','pastrami','ham','lox'], 'ounces':[4,3,12,6,7.5,8,3,5,6]})

data1

foodounces0bacon4.0

1pork3.0

2bacon12.0

3Pastrami6.0

4beef7.5

5Bacon8.0

6pastrami3.0

7ham5.0

8lox6.0

添加一列表示肉类来源的动物类型#step1:构建肉类到动物的映射

meat_to_animal = {'bacon':'pig','pork':'pig','pastrami':'cow','beef':'cow','ham':'pig', 'lox':'salmon'}

Series的map方法可以接受一个函数或含有映射关系的字典型对象,字符的大小写要一致#step2:映射

data1['animal'] = data1['food'].map(str.lower).map(meat_to_animal)

data1

foodouncesanimal0bacon4.0pig

1pork3.0pig

2bacon12.0pig

3Pastrami6.0cow

4beef7.5cow

5Bacon8.0pig

6pastrami3.0cow

7ham5.0pig

8lox6.0salmon#step2的另一种实现方法

data1['food'].map(lambda x:meat_to_animal[x.lower()])0 pig

1 pig

2 pig

3 cow

4 cow

5 pig

6 cow

7 pig

8 salmon

Name: food, dtype: objectdata1

foodouncesanimal0bacon4.0pig

1pork3.0pig

2bacon12.0pig

3Pastrami6.0cow

4beef7.5cow

5Bacon8.0pig

6pastrami3.0cow

7ham5.0pig

8lox6.0salmon

map是一种实现元素级转换记忆其他数据清理工作的便捷方式

map会改变原始数据集

3)替换值

替换缺失值的方法:

1)fillna

2)含有重复索引的合并combine_first

3)replacedata2 = pd.Series([1.,-999,2,-999,-1000,3.])

data20 1.0

1 -999.0

2 2.0

3 -999.0

4 -1000.0

5 3.0

dtype: float64

-999可能是一个表示缺失数据的标记值,要将其替换为pandas能够理解的NA值,可以利用replacedata2.replace(-999,np.nan)0 1.0

1 NaN

2 2.0

3 NaN

4 -1000.0

5 3.0

dtype: float64data20 1.0

1 -999.0

2 2.0

3 -999.0

4 -1000.0

5 3.0

dtype: float64

replace不改变原数据集

一次性替换多个值data2.replace([-999,-1000],np.nan)#一次传入一个列表即可0 1.0

1 NaN

2 2.0

3 NaN

4 NaN

5 3.0

dtype: float64data2.replace([-999,-1000],[np.nan,0])0 1.0

1 NaN

2 2.0

3 NaN

4 0.0

5 3.0

dtype: float64data2.replace({-999:np.nan,-1000:0})0 1.0

1 NaN

2 2.0

3 NaN

4 0.0

5 3.0

dtype: float64

4)重命名轴索引

跟Series中的值一样,轴标签也可以通过函数或映射进行转换,从而得到一个新对象,轴还可以被就地修改,而无需新建一个数据结构data3 = pd.DataFrame(np.arange(12).reshape(3,4),index = ['a','b','c'],columns = ['one','two','three','four'])data3

onetwothreefoura0123

b4567

c891011data3.index.map(str.upper)array(['A', 'B', 'C'], dtype=object)data3

onetwothreefoura0123

b4567

c891011data3.index = data3.index.map(str.upper)#修改了data3

onetwothreefourA0123

B4567

C891011

还可以通过rename结合字典型对象实现对部分轴标签的更新data3.rename(index = {'A':'aaa'},columns = {'three':'liu'})

onetwoliufouraaa0123

B4567

C891011data3#不改变原数据

onetwothreefourA0123

B4567

C891011data3 = data3.rename(index = {'A':'aaa'},columns = {'three':'liu'})data3

onetwoliufouraaa0123

B4567

C891011

5)检测和过滤异常值

这里的异常值的阈值已知,因此,异常值的过滤或变换运算很大程度上其实就是逻辑数组运算。data4 = pd.DataFrame(np.random.randn(1000,4))data4.info()data4.describe()

0123count1000.0000001000.0000001000.0000001000.000000

mean0.023986-0.0140490.032299-0.037661

std0.9945711.0035221.0099391.017361

min-3.526970-3.298974-3.429383-3.421995

25%-0.632426-0.685564-0.665548-0.756219

50%0.0133260.006130-0.017911-0.015297

75%0.6332790.6702610.6738490.665360

max3.5496203.1425033.9910283.086376

找出某列绝对值大于3的值data4[3][np.abs(data4[3]) > 3]189 -3.421995

335 3.086376

590 -3.388477

778 -3.100379

Name: 3, dtype: float64

找出全部或含有“超过3或-3的值”的行(np.abs(data4) > 3).any(1).head()0 False

1 False

2 False

3 False

4 False

dtype: booldata4[(np.abs(data4) > 3).any(1)]

01231093.549620-0.943976-0.0584900.941503

189-0.071249-1.3503610.385375-3.421995

2912.3379613.142503-0.208999-0.485979

3350.230998-1.3972592.7342293.086376

447-3.526970-0.2894671.0994871.206039

4640.011728-0.3987393.1044700.459924

5460.3579440.0070633.9910280.722481

573-3.019947-0.982651-1.7272891.484966

5900.2110690.3440590.656351-3.388477

6600.9301033.117643-1.372034-1.208730

6630.362668-3.298974-1.0331280.900985

7780.0941720.8279372.617724-3.100379

814-1.450645-1.131513-3.429383-0.828139

8531.188536-3.069987-0.7467000.745037

8992.4490300.4299593.025705-1.571179

替换异常值data4[np.abs(data4) > 3] = np.sign(data) * 3data4.isnull().sum()#有空值0 3

1 4

2 4

3 4

dtype: int64data4 = data4.replace(np.nan,0)data4.isnull().sum()#无空值0 0

1 0

2 0

3 0

dtype: int64data4.describe()###?????????

0123count1000.0000001000.0000001000.0000001000.000000

mean0.026983-0.0139410.025608-0.030836

std0.9771520.9834210.9868310.996554

min-2.749595-2.799638-2.943564-2.743207

25%-0.630318-0.682237-0.663014-0.739291

50%0.0124450.000613-0.017171-0.004484

75%0.6311460.6680230.6602360.659204

max2.8298042.9150312.9076552.679495

6)排列和随机采样

1)numpy.random.permutation函数

2)np.random.randint生成随机数df = pd.DataFrame(np.arange(5 *4).reshape(5,4))

sampler = np.random.permutation(5)df

012300123

14567

2891011

312131415

416171819samplerarray([0, 1, 3, 4, 2])df.ix[sampler]

012300123

14567

312131415

416171819

2891011df.take(sampler)

012300123

14567

312131415

416171819

2891011

通过np.random.randint得到一组随机整数sampler1 = np.random.randint(0,len(df),size = 4)sampler1array([2, 2, 3, 0])df1 = df.take(sampler1)df1

01232891011

2891011

312131415

00123

7)计算指标/哑变量

将分类变量(categorical variable)转换为(哑变量矩阵,dummy matrix)或(指标矩阵,indicator matrix)是常用于统计学习建模或机器学习的转换方式。

即 DataFrame的某一列中含有k个不同的值,则可以派生出一个k列矩阵或DataFrame(其值为1或0)。

pandas中的get_dummies函数可以实现以上功能df2 = pd.DataFrame({'key':['b','a','b','c','a','b'],'data1':range(6)})

df2

data1key00b

11a

22b

33c

44a

55bpd.get_dummies(df2.key)

abc00.01.00.0

11.00.00.0

20.01.00.0

30.00.01.0

41.00.00.0

50.01.00.0pd.get_dummies(df2['key'],prefix = 'key')

key_akey_bkey_c00.01.00.0

11.00.00.0

20.01.00.0

30.00.01.0

41.00.00.0

50.01.00.0## get_dummies矩阵和原数据连接

dummies = pd.get_dummies(df2['key'],prefix = 'key')

pd.concat([df2['data1'],dummies],axis = 1)

data1key_akey_bkey_c000.01.00.0

111.00.00.0

220.01.00.0

330.00.01.0

441.00.00.0

550.01.00.0df2[['data1']].join(dummies)#Series没有join

data1key_akey_bkey_c000.01.00.0

111.00.00.0

220.01.00.0

330.00.01.0

441.00.00.0

550.01.00.0df2[['data1']]#选出的是DataFrame

data100

11

22

33

44

55df2['data1']#选出的是Series0 0

1 1

2 2

3 3

4 4

5 5

Name: data1, dtype: int64

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值