Data Cleaning and Preparation
Revised, with added code and comments
xiaoyao
import numpy as np
import pandas as pd
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)
import warnings
warnings.filterwarnings('ignore')
Handling Missing Data
All descriptive statistics on pandas objects exclude missing data by default. pandas uses the floating-point value NaN (Not a Number) to represent missing data.
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
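pandas adopts the convention used in the R language of referring to missing data as NA, which stands for "not available". In statistical applications, NA data may either be data that does not exist or data that exists but was not observed (for example, because of problems during data collection).
When cleaning data for analysis, it is often worth analyzing the missing data itself to identify data collection problems or potential biases caused by the missing values. The built-in Python None value is also treated as NA in object arrays.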
string_data[0] = None
string_data.isnull()
0 True
1 False
2 True
3 False
dtype: bool
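NA handling methods
Method   Description
dropna   Filter axis labels based on whether the values for each label have missing data, with a threshold for how much missing data to tolerate
fillna   Fill in missing data with a specified value or with an interpolation method such as ffill or bfill
isnull   Return an object of Boolean values indicating which values are missing (NA), with the same type as the source object
notnull  Negation of isnull
Filtering Out Missing Data
There are several ways to filter out missing data. You can do it by hand with pandas.isnull and Boolean indexing, but dropna is often more practical. On a Series, dropna returns a Series containing only the non-null data and their index values:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()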
0 1.0
2 3.5
4 7.0
dtype: float64
data[data.notnull()]
0 1.0
2 3.5
4 7.0
dtype: float64
With DataFrame objects, things are a bit more complex. By default, dropna drops any row containing a missing value.
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned
     0    1    2
0  1.0  6.5  3.0
Passing how='all' drops only the rows in which every value is NA:
data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
To drop columns in the same way, pass axis=1:
data[4] = NA
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
data.dropna(axis=1, how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
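A related way to filter out DataFrame rows tends to come up with time series data. Suppose you want to keep only the rows that contain at least a certain number of observations; the thresh argument does this.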
df = pd.DataFrame(np.random.randn(7, 3))
df
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.000940  1.343810
3 -0.713544 -0.831154 -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df[0]
0 0.476985
1 -0.577087
2 0.523772
3 -0.713544
4 -1.860761
5 -1.265934
6 0.332883
Name: 0, dtype: float64
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
          0         1         2
0  0.476985       NaN       NaN
1 -0.577087       NaN       NaN
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df.dropna()
          0         1         2
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df.dropna(thresh=2)
          0         1         2
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
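To make the meaning of thresh concrete, here is a minimal sketch on a small hand-built frame (the column names a, b, c and the values are made up for illustration). It counts the non-missing values per row and shows that thresh=2 keeps exactly the rows with at least two of them.

```python
import numpy as np
import pandas as pd

# Illustrative data with a known pattern of missing values.
frame = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, 2.0, np.nan, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Number of non-missing values in each row.
non_na_per_row = frame.notna().sum(axis=1)

# thresh=2 keeps only rows with at least two non-missing values.
kept = frame.dropna(thresh=2)

print(non_na_per_row.tolist())
print(kept.index.tolist())
```

In other words, thresh is a lower bound on the number of observed values, not an upper bound on the number of missing ones.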
Filling In Missing Data
Rather than filtering out missing data, you may want to fill in the "holes" in some way. For most purposes, the fillna method is the workhorse function. Calling fillna with a constant replaces missing values with that value:
df
          0         1         2
0  0.476985       NaN       NaN
1 -0.577087       NaN       NaN
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df.fillna(0)
          0         1         2
0  0.476985  0.000000  0.000000
1 -0.577087  0.000000  0.000000
2  0.523772  0.000000  1.343810
3 -0.713544  0.000000 -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df.fillna({1: 0.5, 2: 0})
          0         1         2
0  0.476985  0.500000  0.000000
1 -0.577087  0.500000  0.000000
2  0.523772  0.500000  1.343810
3 -0.713544  0.500000 -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
fillna returns a new object by default, but you can also modify the existing object in place:
_ = df.fillna(0, inplace=True)
df
          0         1         2
0  0.476985  0.000000  0.000000
1 -0.577087  0.000000  0.000000
2  0.523772  0.000000  1.343810
3 -0.713544  0.000000 -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df = pd.DataFrame(np.random.randn(6, 3))
df
          0         1         2
0  0.862580 -0.010032  0.050009
1  0.670216  0.852965 -0.955869
2 -0.023493 -2.304234 -0.652469
3 -1.218302 -1.332610  1.074623
4  0.723642  0.690002  1.001543
5 -0.503087 -0.622274 -0.921169
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
          0         1         2
0  0.862580 -0.010032  0.050009
1  0.670216  0.852965 -0.955869
2 -0.023493       NaN -0.652469
3 -1.218302       NaN  1.074623
4  0.723642       NaN       NaN
5 -0.503087       NaN       NaN
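A frame with trailing runs of missing values like the one above is the usual setup for forward filling, although the fill call itself is not shown here. The following is a minimal sketch on made-up data (the column names x and y are hypothetical). It uses the ffill method; the older spelling fillna(method='ffill') does the same thing but is deprecated in recent pandas releases.

```python
import numpy as np
import pandas as pd

# Illustrative data: each column ends in a run of missing values.
frame = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, np.nan],
    "y": [5.0, 6.0, np.nan, np.nan],
})

# Propagate the last valid observation downward through each gap.
filled = frame.ffill()

# limit=1 fills at most one consecutive missing value per gap.
limited = frame.ffill(limit=1)

print(filled)
print(limited)
```

With limit=1, the second and later missing values in a run stay missing, which is useful when stale observations should not be carried forward indefinitely.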
data = pd.Series([1., NA, 3.5, NA, 7])
data.mean()
3.8333333333333335
data.fillna(data.mean())
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
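fillna function arguments
Argument  Description
value     Scalar value or dict-like object used to fill missing values
method    Interpolation method, such as 'ffill' or 'bfill'; no method is applied unless one is specified
axis      Axis to fill on; default axis=0
inplace   Modify the calling object in place without producing a copy
limit     For forward and backward filling, maximum number of consecutive values to fill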
Data Transformation
So far we have been concerned with rearranging data. Filtering, cleaning, and other transformations are another class of important operations.
Removing Duplicates
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
data.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
data['v1'] = range(7)
data
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6
data.drop_duplicates(['k1'])
    k1  k2  v1
0  one   1   0
1  two   1   1
data.drop_duplicates(['k1', 'k2'], keep='last')
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
Transforming Data Using a Function or Mapping
For many datasets, you may wish to perform a transformation based on the values in an array, Series, or DataFrame column. Consider the following data about various kinds of meat:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
# Some of the meats are capitalized and others are not,
# so first convert each value to lowercase with the Series str.lower method:
lowercased = data['food'].str.lower()
lowercased
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
You can also pass a function that does all of the work; here an anonymous function is used:
data['food'].map(lambda x: meat_to_animal[x.lower()])
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
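The dict and function forms of map behave differently when a value has no entry in the mapping. The sketch below uses a shortened mapping and a made-up 'tofu' entry (both are assumptions for illustration): mapping with a dict yields NaN for the unknown value, while a function can choose its own fallback through dict.get. A lambda that indexes the dict directly, as above, would raise KeyError instead.

```python
import pandas as pd

# Shortened mapping and sample values, for illustration only.
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}
food = pd.Series(["Bacon", "pastrami", "tofu"])

# Dict mapping: values without a matching key become NaN.
by_dict = food.str.lower().map(meat_to_animal)

# Function mapping: the function decides what to do with unknown values.
by_func = food.map(lambda x: meat_to_animal.get(x.lower(), "unknown"))

print(by_dict)
print(by_func)
```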
Replacing Values
Filling in missing data with the fillna method is a special case of more general value replacement. As shown earlier, map can be used to modify a subset of values in an object, but replace provides a simpler and more flexible way to do so.
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
data.replace(-999, np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
data.replace([-999, -1000], np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
data.replace([-999, -1000], [np.nan, 0])
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
data.replace({-999: np.nan, -1000: 0})
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
The data.replace method is distinct from data.str.replace, which performs element-wise string substitution.
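The difference is easiest to see on string data. In this sketch (the sample values are made up), Series.replace with a plain string only swaps elements that equal the target exactly, whereas Series.str.replace substitutes the target wherever it occurs inside each string.

```python
import pandas as pd

# Illustrative string data.
codes = pd.Series(["-999", "a-999", "b"])

# replace matches whole values: only the first element changes.
whole = codes.replace("-999", "missing")

# str.replace works inside each string: the substring is removed everywhere.
partial = codes.str.replace("-999", "", regex=False)

print(whole.tolist())
print(partial.tolist())
```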
Renaming Axis Indexes
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
Like values in a Series, axis labels can be transformed by a function or mapping to produce a new, differently labeled object. The axes can also be modified in place without creating a new data structure.
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
transform = lambda x: x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
data.rename(index=str.title, columns=str.upper)
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data
         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11
Discretization and Binning
Continuous data is often discretized or otherwise separated into "bins" for analysis.
Suppose you have data about a group of people and want to group them into age buckets:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
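The object pandas returns is a special Categorical object. The output shows the bins computed by pandas.cut. You can treat it like an array of strings indicating the bin names.
Internally it contains a categories array specifying the distinct category names, along with a labeling of the ages data in the codes attribute.
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories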
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
closed='right',
dtype='interval[int64]')
pd.value_counts(cats)
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
data = np.random.rand(20)
pd.cut(data, 4, precision=2)
[(0.49, 0.72], (0.02, 0.26], (0.02, 0.26], (0.49, 0.72], (0.49, 0.72], ..., (0.49, 0.72], (0.49, 0.72], (0.26, 0.49], (0.72, 0.96], (0.49, 0.72]]
Length: 20
Categories (4, interval[float64]): [(0.02, 0.26] < (0.26, 0.49] < (0.49, 0.72] < (0.72, 0.96]]
The precision=2 option limits the decimal precision of the bin edges to two digits.
data = np.random.randn(1000)
cats = pd.qcut(data, 4)
cats
[(-0.0453, 0.604], (-2.9499999999999997, -0.686], (-0.0453, 0.604], (-0.0453, 0.604], (-2.9499999999999997, -0.686], ..., (-0.686, -0.0453], (0.604, 3.928], (0.604, 3.928], (-0.0453, 0.604], (-0.686, -0.0453]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.686] < (-0.686, -0.0453] < (-0.0453, 0.604] < (0.604, 3.928]]
pd.value_counts(cats)
(0.604, 3.928] 250
(-0.0453, 0.604] 250
(-0.686, -0.0453] 250
(-2.9499999999999997, -0.686] 250
dtype: int64
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
[(-0.0453, 1.289], (-1.191, -0.0453], (-0.0453, 1.289], (-0.0453, 1.289], (-2.9499999999999997, -1.191], ..., (-1.191, -0.0453], (1.289, 3.928], (1.289, 3.928], (-0.0453, 1.289], (-1.191, -0.0453]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -1.191] < (-1.191, -0.0453] < (-0.0453, 1.289] < (1.289, 3.928]]
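The contrast between cut and qcut can be checked directly. The sketch below is an illustration under assumptions: it uses NumPy's default_rng with an arbitrary seed rather than the session's random state, and it counts with the Series.value_counts method, since the top-level pd.value_counts function is deprecated in recent pandas releases.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
values = pd.Series(rng.standard_normal(1000))

# cut with an integer splits the observed range into equal-width bins.
equal_width = pd.cut(values, 4)

# qcut places the bin edges at sample quantiles, here the quartiles.
equal_count = pd.qcut(values, 4)

# sort=False keeps the counts in bin order instead of sorting by frequency.
width_counts = equal_width.value_counts(sort=False)
quartile_counts = equal_count.value_counts(sort=False)

print(width_counts)
print(quartile_counts)
```

Equal-width bins on normally distributed data are heavily populated in the middle and sparse at the extremes, while the quartile bins each receive a quarter of the points.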
Detecting and Filtering Outliers
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.043288     0.046433     0.026352    -0.010204
std       0.998391     0.999185     1.010005     0.992779
min      -3.428254    -3.645860    -3.184377    -3.745356
25%      -0.740152    -0.599807    -0.612162    -0.699863
50%      -0.085000     0.043663    -0.008168    -0.031732
75%       0.625698     0.746527     0.690847     0.692355
max       3.366626     2.653656     3.525865     2.735527
col = data[2]
col[np.abs(col) > 3]
50 3.260383
225 -3.056990
312 -3.184377
772 3.525865
Name: 2, dtype: float64
data[(np.abs(data) > 3).any(axis=1)]
            0         1         2         3
31  -2.315555  0.457246 -0.025907 -3.399312
50   0.050188  1.951312  3.260383  0.963301
126  0.146326  0.508391 -0.196713 -3.745356
225 -0.293333 -0.242459 -3.056990  1.918403
249 -3.428254 -0.296336 -0.439938 -0.867165
312  0.275144  1.179227 -3.184377  1.369891
534 -0.362528 -3.548824  1.553205 -2.186301
626  3.366626 -2.372214  0.851010  1.332846
772 -0.658090 -0.207434  3.525865  0.283070
793  0.599947 -3.645860  0.255475 -0.549574
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.043227     0.047628     0.025807    -0.009059
std       0.995841     0.995170     1.006769     0.988960
min      -3.000000    -3.000000    -3.000000    -3.000000
25%      -0.740152    -0.599807    -0.612162    -0.699863
50%      -0.085000     0.043663    -0.008168    -0.031732
75%       0.625698     0.746527     0.690847     0.692355
max       3.000000     2.653656     3.000000     2.735527
np.sign(data).head()
     0    1    2    3
0 -1.0 -1.0 -1.0 -1.0
1 -1.0  1.0 -1.0 -1.0
2  1.0 -1.0 -1.0  1.0
3  1.0  1.0  1.0 -1.0
4  1.0  1.0  1.0  1.0
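Capping values to the interval from -3 to 3 with np.sign, as above, can also be written with the DataFrame.clip method. The following sketch is an alternative formulation rather than the approach used above; it draws its own data with an arbitrary seed and checks that the two forms agree.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only
frame = pd.DataFrame(rng.standard_normal((1000, 4)))

# Sign-based capping: overwrite outliers with +3 or -3 depending on sign.
capped_sign = frame.copy()
capped_sign[capped_sign.abs() > 3] = np.sign(capped_sign) * 3

# Equivalent capping with clip, which bounds every value to [-3, 3].
capped_clip = frame.clip(lower=-3, upper=3)

print(capped_clip.abs().max())
```

clip returns a new object and leaves frame untouched, which makes it convenient when the original values are still needed.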
Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows of a DataFrame is easy to do with the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering.
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler
array([2, 0, 3, 4, 1])
df
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
df.take(sampler)
    0   1   2   3
2   8   9  10  11
0   0   1   2   3
3  12  13  14  15
4  16  17  18  19
1   4   5   6   7
df.sample(n=3)
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws
4 4
4 4
1 7
3 6
4 4
3 6
4 4
4 4
3 6
2 -1
dtype: int64
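The sample method covers the three cases in this section: a full permutation, a subset without replacement, and draws with replacement. The sketch below is illustrative; the random_state values are arbitrary and are only there to make the results repeatable.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# frac=1 samples every row once, which amounts to a random permutation.
shuffled = frame.sample(frac=1, random_state=0)

# n=3 without replacement: three distinct rows.
subset = frame.sample(n=3, random_state=0)

# replace=True allows repeats, so n may exceed the number of rows.
draws = frame[0].sample(n=10, replace=True, random_state=0)

print(shuffled.index.tolist())
print(subset.index.tolist())
print(draws.index.tolist())
```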
Computing Indicator/Dummy Variables
Another type of transformation, common in statistical modeling and machine learning, is converting a categorical variable into a "dummy" or "indicator" matrix.
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
pd.get_dummies(df['key'])
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
df
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns containing only 1s and 0s.
# Sometimes you may want to add a prefix to the columns of the indicator
# DataFrame so that it can be merged with other data.
# The prefix argument of get_dummies does this.
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
mnames = ['movie_id', 'title', 'genres']
# A multi-character separator requires the Python parsing engine.
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames, engine='python')
movies[:10]
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
'Western'], dtype=object)
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
gen = movies.genres[0]
gen.split('|')
dummies.columns.get_indexer(gen.split('|'))
array([0, 1, 2], dtype=int64)
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id 1
title Toy Story (1995)
genres Animation|Children's|Comedy
Genre_Animation 1
Genre_Children's 1
...
Genre_War 0
Genre_Musical 0
Genre_Mystery 0
Genre_Film-Noir 0
Genre_Western 0
Name: 0, Length: 21, dtype: object
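For much larger data, this way of constructing indicator variables with multiple membership becomes very slow. It is better to use a lower-level function that writes directly to a NumPy array and then wrap the result in a DataFrame.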
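One possible shape for such a lower-level approach is sketched below, assuming a small inline sample of genre strings instead of the MovieLens file. It computes all (row, column) positions up front and sets them in a single NumPy assignment, avoiding a pandas indexing call per row; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Small stand-in for the genres column (the real data comes from movies.dat).
genres_col = pd.Series([
    "Animation|Children's|Comedy",
    "Adventure|Children's|Fantasy",
    "Comedy|Romance",
])

split = [g.split("|") for g in genres_col]

# Distinct labels in order of first appearance, and their column positions.
labels = list(dict.fromkeys(x for row in split for x in row))
position = {label: j for j, label in enumerate(labels)}

# Row and column coordinates of every (movie, genre) membership.
rows = np.repeat(np.arange(len(split)), [len(row) for row in split])
cols = np.array([position[x] for row in split for x in row])

# Fill the indicator matrix in one vectorized assignment, then wrap it.
matrix = np.zeros((len(split), len(labels)), dtype=np.uint8)
matrix[rows, cols] = 1
dummies = pd.DataFrame(matrix, columns=labels, index=genres_col.index)

print(dummies)
```

For delimiter-separated strings like these, genres_col.str.get_dummies('|') produces the same indicators in one call, with the columns sorted alphabetically.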
np.random.seed(12345)
values = np.random.rand(10)
values
array([0.9296, 0.3164, 0.1839, 0.2046, 0.5677, 0.5955, 0.9645, 0.6532,
0.7489, 0.6536])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           0           0           0           0           1
1           0           1           0           0           0
2           1           0           0           0           0
3           0           1           0           0           0
4           0           0           1           0           0
5           0           0           1           0           0
6           0           0           0           0           1
7           0           0           0           1           0
8           0           0           0           1           0
9           0           0           0           1           0
String Manipulation
Python itself handles strings and text well; more complex pattern matching and text manipulation call for regular expressions. pandas builds on this by letting you apply string and regular expression operations concisely to whole arrays of data, while also handling the annoyance of missing data.
String Object Methods
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
first, second, third = pieces
first + '::' + second + '::' + third
'a::b::guido'
'::'.join(pieces)
'a::b::guido'
'guido' in val
True
val.index(',')
1
val.find(':')
-1
The difference between find and index is that index raises an exception if the string is not found, instead of returning -1:
val.index(':')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-108-2c016e7367ac> in <module>
----> 1 val.index(':')
ValueError: substring not found
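val.count(',')
2
val.replace(',', '::')
'a::b:: guido'
val.replace(',', '')
'ab guido'
Python built-in string methods
Method       Description
count        Return the number of non-overlapping occurrences of a substring in the string
endswith     Return True if the string ends with the given suffix
startswith   Return True if the string starts with the given prefix
find, rfind  find returns the position of the first occurrence of a substring in the string, or -1 if it is not found; rfind returns the position of the last occurrence
Regular Expressions
The functions in the re module fall into three categories: pattern matching, substitution, and splitting.
import re
text = "foo bar\t baz \tqux"
re.split(r'\s+', text)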
['foo', 'bar', 'baz', 'qux']
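When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text.
regex = re.compile(r'\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
regex.findall(text)
[' ', '\t ', ' \t']
To avoid unwanted escaping with \ in a regular expression, use raw string literals such as:
r'C:\x'
If you intend to apply the same regular expression to many strings, creating a regex object with re.compile is recommended, since it saves CPU time.
findall returns all matches in a string, whereas search returns only the first match. match is stricter: it only matches at the beginning of the string.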
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
m = regex.search(text)
m
<re.Match object; span=(5, 20), match='dave@google.com'>
text[m.start():m.end()]
'dave@google.com'
print(regex.match(text))
None
print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
regex.findall(text)
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
Vectorized String Functions in pandas
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Wes NaN
dtype: object
data.isnull()
Dave False
Steve False
Rob False
Wes True
dtype: bool
data.str.contains('gmail')
Dave False
Steve True
Rob True
Wes NaN
dtype: object
pattern
data.str.findall(pattern, flags=re.IGNORECASE)
Dave [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Wes NaN
dtype: object
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
Dave True
Steve True
Rob True
Wes NaN
dtype: object
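# matches holds Booleans, so take the captured groups from findall instead
data.str.findall(pattern, flags=re.IGNORECASE).str[0]
data.str[:5]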
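Because str.match returns Booleans here, the captured groups have to come from elsewhere. One option, sketched below as a suggestion and not taken from the walkthrough above, is Series.str.extract, which returns one column per capture group. The column names user, domain, and suffix are hypothetical labels added for readability.

```python
import re

import numpy as np
import pandas as pd

pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
emails = pd.Series({"Dave": "dave@google.com", "Steve": "steve@gmail.com",
                    "Rob": "rob@gmail.com", "Wes": np.nan})

# One column per capture group; missing input rows stay missing.
parts = emails.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ["user", "domain", "suffix"]  # hypothetical column names

print(parts)
```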
pd.options.display.max_rows = PREVIOUS_MAX_ROWS
Conclusion