Data Cleaning and Preparation
Handling Missing Data
import pandas as pd
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data:
string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
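A brief aside on why isnull is needed at all: NaN is a floating-point sentinel and, under IEEE 754 rules, it never compares equal to anything, including itself, so an ordinary equality test cannot find missing values. A minimal sketch of this pitfall (plain NumPy/pandas behavior, not part of the original notes):
np.nan == np.nan        # False: NaN is unequal even to itself
pd.isnull(np.nan)       # True: always use isnull/notnull to detect NA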
Missing values are referred to as NA, which stands for not available. In statistics applications, NA data may either be data that does not exist or data that exists but was not observed (through problems with data collection, for example). The built-in Python None value is also treated as NA in object arrays:
string_data[0] = None
string_data.isnull()
0 True
1 False
2 True
3 False
dtype: bool
Functions for handling missing data:
dropna: filter axis labels based on whether values for each label have missing data, with a threshold (thresh) for how much missing data to tolerate
fillna: fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'
isnull: return boolean values indicating which values are missing (NA); the result has the same shape as the input
notnull: negation of isnull
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
0 1.0
2 3.5
4 7.0
dtype: float64
This is equivalent to:
data[data.notnull()]
0 1.0
2 3.5
4 7.0
dtype: float64
With DataFrame objects, dropna by default drops any row containing a missing value:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned
     0    1    2
0  1.0  6.5  3.0
Passing how='all' will only drop rows that are all NA:
cleaned_how = data.dropna(how='all')
cleaned_how
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
data[4] = NA
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
To drop columns in the same way, pass axis=1:
data.dropna(how='all', axis=1)
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
A related way to filter DataFrame rows tends to come up with time series data. Suppose you want to keep only rows containing a certain number of observations; you can indicate this with the thresh argument, which keeps only the rows (or columns) having at least n non-NaN values:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
          0         1         2
0  1.219978       NaN       NaN
1  0.341182       NaN       NaN
2  0.782306       NaN  0.402269
3  0.033353       NaN  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
df.dropna()
          0         1         2
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
df.dropna(thresh=2)
          0         1         2
2  0.782306       NaN  0.402269
3  0.033353       NaN  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
Filling In Missing Data with fillna()
df.fillna(0)
          0         1         2
0  1.219978  0.000000  0.000000
1  0.341182  0.000000  0.000000
2  0.782306  0.000000  0.402269
3  0.033353  0.000000  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
Calling fillna with a dict lets you use a different fill value for each column:
df.fillna({1: 0.5, 2: 0})
          0         1         2
0  1.219978  0.500000  0.000000
1  0.341182  0.500000  0.000000
2  0.782306  0.500000  0.402269
3  0.033353  0.500000  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
fillna returns a new object by default, leaving the original untouched:
_ = df.copy()
_
          0         1         2
0  1.219978       NaN       NaN
1  0.341182       NaN       NaN
2  0.782306       NaN  0.402269
3  0.033353       NaN  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
_.fillna(0)
          0         1         2
0  1.219978  0.000000  0.000000
1  0.341182  0.000000  0.000000
2  0.782306  0.000000  0.402269
3  0.033353  0.000000  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
_
          0         1         2
0  1.219978       NaN       NaN
1  0.341182       NaN       NaN
2  0.782306       NaN  0.402269
3  0.033353       NaN  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
You can, however, modify the existing object in place with inplace=True:
_.fillna(0, inplace=True)
_
          0         1         2
0  1.219978  0.000000  0.000000
1  0.341182  0.000000  0.000000
2  0.782306  0.000000  0.402269
3  0.033353  0.000000  0.666443
4 -0.761581 -1.232945 -0.291452
5 -0.516256 -0.442507  0.850908
6  1.827264  0.286749  0.924544
The same interpolation methods available for reindexing can be used with fillna:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
          0        1         2
0 -1.029961 -0.41851  0.634309
1 -0.621635 -0.24739  0.783342
2 -1.659875      NaN -0.231234
3  0.513173      NaN -1.094123
4  1.787183      NaN       NaN
5 -0.611099      NaN       NaN
df.fillna(method='ffill')
          0        1         2
0 -1.029961 -0.41851  0.634309
1 -0.621635 -0.24739  0.783342
2 -1.659875 -0.24739 -0.231234
3  0.513173 -0.24739 -1.094123
4  1.787183 -0.24739 -1.094123
5 -0.611099 -0.24739 -1.094123
A fill value can be combined with limit to cap how many missing values get filled in each column:
df.fillna(0, limit=2)
          0        1         2
0 -1.029961 -0.41851  0.634309
1 -0.621635 -0.24739  0.783342
2 -1.659875  0.00000 -0.231234
3  0.513173  0.00000 -1.094123
4  1.787183      NaN  0.000000
5 -0.611099      NaN  0.000000
fillna arguments:
value: scalar value or dict-like object to use to fill missing values
method: interpolation method; 'ffill' by default
axis: axis to fill on; default axis=0
inplace: modify the calling object without producing a copy
limit: for forward and backward filling, the maximum number of consecutive periods to fill
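One common pattern worth knowing that the notes above do not show (a hedged sketch of standard pandas behavior, not from the original): passing a Series of statistics to fillna fills each column with its own statistic, such as the column mean:
df.fillna(df.mean())    # fill each column's NaNs with that column's mean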
Data Transformation
Removing Duplicate Data
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate of one seen earlier:
data.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
Relatedly, drop_duplicates returns a DataFrame with the duplicated rows dropped:
data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
Both methods consider all of the columns by default; alternatively, you can specify a subset of columns to detect duplicates on. Suppose we add a column of values and want to filter duplicates based only on the 'k1' column:
data['v1'] = range(7)
data
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6
data.drop_duplicates(['k1'])
    k1  k2  v1
0  one   1   0
1  two   1   1
duplicated and drop_duplicates by default keep the first observed value combination; pass keep='last' to keep the last one instead:
data.drop_duplicates(['k1', 'k2'], keep='last')
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
Transforming Data Using a Function or Mapping
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
                              'corned beef', 'Bacon', 'pastrami', 'honey ham',
                              'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
Suppose you wanted to add a column indicating the animal each food came from. Write down a mapping:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
Some of the meats are capitalized while others are not, so the values first need to be lowercased with the str.lower Series method:
lowercased = data['food'].str.lower()
lowercased
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
The map method on a Series accepts a function or a dict-like object containing a mapping:
data['animal'] = lowercased.map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
We could also have passed a function that does all the work:
data['food'].map(lambda x: meat_to_animal[x.lower()])
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
Replacing Values with replace
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
The -999 values might be sentinel values for missing data. To replace them with NA:
data.replace(-999, np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
To replace multiple values at once, pass a list:
data.replace([-999, -1000], np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
To use a different replacement for each value, pass a list of substitutes:
data.replace([-999, -1000], [np.nan, 0])
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
The argument passed can also be a dict:
data.replace({-999: np.nan, -1000: 0})
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
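replace also works on string data, and with regex=True the values to replace are treated as regular expressions. A small hedged sketch (this Series is illustrative, not from the original notes):
s = pd.Series(['foo', 'fuz', np.nan])
s.replace('f.', 'ba', regex=True)   # 'foo' -> 'bao', 'fuz' -> 'baz'; NaN passes through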
Renaming Axis Indexes
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
Like a Series, an axis index has a map method:
data.index.map(lambda x: x[:4].upper())
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
You can assign the result to index, modifying the DataFrame in place:
data.index = data.index.map(lambda x: x[:4].upper())
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
If you want to create a transformed version of a dataset without modifying the original, a useful method is rename:
data.rename(index=str.upper, columns=str.upper)
      ONE  TWO  THREE  FOUR
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
rename can also be used with a dict-like object providing new values for a subset of the axis labels. Note that the index entries are now uppercase ('OHIO'), so the key 'Ohio' matches nothing here and the row labels come through unchanged:
data.rename(index={'Ohio': 'INDIANA'},
            columns={'three': 'peekaboo'})
      one  two  peekaboo  four
OHIO    0    1         2     3
COLO    4    5         6     7
NEW     8    9        10    11
rename also accepts inplace=True to modify the dataset in place; again, 'Ohio' no longer appears in the uppercased index, so nothing actually changes here:
data.rename(index={'Ohio': 'indiana'}, inplace=True)
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
Discretization and Binning
To ease analysis, continuous data is often discretized or otherwise split into "bins" using pandas's cut function:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
The object cut returns is a special Categorical object; its codes attribute gives the bin code of each value:
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')
pd.value_counts(cats)
(18, 25] 5
(25, 35] 3
(35, 60] 3
(60, 100] 1
Name: count, dtype: int64
Consistent with mathematical interval notation, a parenthesis means the side is open (exclusive) while a square bracket means it is closed (inclusive). You can change which side is closed by passing right=False:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
You can set your own bin names by passing a list or array to the labels option:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']
data = np.random.rand(20)
data
array([0.50910844, 0.01886219, 0.95908375, 0.72900936, 0.88044385,
0.94608156, 0.13493984, 0.91195245, 0.46857512, 0.38525391,
0.02991488, 0.31362695, 0.15493992, 0.74873532, 0.6170826 ,
0.84356457, 0.09466064, 0.01974264, 0.97598584, 0.43164735])
If you pass an integer number of bins to cut instead of explicit bin edges, it computes equal-length bins based on the minimum and maximum values in the data:
The precision=2 option limits the decimal precision of the interval edges to two digits:
pd.cut(data, 4, precision=2)
[(0.5, 0.74], (0.018, 0.26], (0.74, 0.98], (0.5, 0.74], (0.74, 0.98], ..., (0.74, 0.98], (0.018, 0.26], (0.018, 0.26], (0.74, 0.98], (0.26, 0.5]]
Length: 20
Categories (4, interval[float64, right]): [(0.018, 0.26] < (0.26, 0.5] < (0.5, 0.74] < (0.74, 0.98]]
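Equal-length bins do not imply equal counts. Tallying the bins for the 20 values above shows the imbalance (a sketch, worked out by hand from the data shown rather than captured output):
pd.value_counts(pd.cut(data, 4, precision=2))
# roughly 7, 6, 4, 3 values land in the four equal-length bins;
# contrast this with qcut below, which equalizes the counts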
qcut is a closely related function that bins the data based on sample quantiles. Depending on the distribution of the data, cut will usually not result in each bin having the same number of data points; since qcut uses sample quantiles instead, it yields roughly equal-size bins:
data = np.random.randn(1000)
cats = pd.qcut(data, 4)
cats
[(-0.601, -0.0125], (-2.885, -0.601], (-0.0125, 0.673], (-0.0125, 0.673], (-2.885, -0.601], ..., (-2.885, -0.601], (-2.885, -0.601], (-0.0125, 0.673], (-0.0125, 0.673], (0.673, 3.875]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.885, -0.601] < (-0.601, -0.0125] < (-0.0125, 0.673] < (0.673, 3.875]]
pd.value_counts(cats)
(-2.885, -0.601] 250
(-0.601, -0.0125] 250
(-0.0125, 0.673] 250
(0.673, 3.875] 250
Name: count, dtype: int64
Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive):
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
[(-1.22, -0.0125], (-2.885, -1.22], (-0.0125, 1.303], (-0.0125, 1.303], (-1.22, -0.0125], ..., (-1.22, -0.0125], (-2.885, -1.22], (-0.0125, 1.303], (-0.0125, 1.303], (-0.0125, 1.303]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.885, -1.22] < (-1.22, -0.0125] < (-0.0125, 1.303] < (1.303, 3.875]]
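Checking the bin sizes makes the effect of the custom quantiles clear: with cut points at the 10th, 50th, and 90th percentiles of 1000 draws, the counts must come out 100/400/400/100. A sketch rather than captured output:
pd.value_counts(pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.]))
# counts: 100, 400, 400, 100 for the four quantile bins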
Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.001687    -0.036570     0.049180     0.010509
std       1.007410     0.971646     1.013227     0.982367
min      -3.127882    -2.643439    -2.949846    -2.962251
25%      -0.678444    -0.719395    -0.650478    -0.636513
50%       0.001463    -0.022189     0.061264     0.046104
75%       0.675910     0.629861     0.710325     0.642668
max       3.162745     4.108418     3.597951     4.410464
Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:
col = data[2]
col[np.abs(col) > 3]
565 3.597951
Name: 2, dtype: float64
To select all rows having a value exceeding 3 or -3, use the any method on a boolean DataFrame:
data[(np.abs(data) > 3).any(axis=1)]
            0         1         2         3
16  -3.010992 -0.122886  1.194125  0.702766
111 -1.152743  4.108418 -2.097178  0.831827
219 -3.127882  1.781813  0.011281  0.587799
565  0.099141 -1.705600  3.597951  0.345174
596  3.162745 -1.597465 -0.552896 -2.756078
625 -0.042392  3.189888  0.723891 -0.670110
835 -1.125737 -0.699685 -1.730857  4.410464
Values can be set based on these criteria. Here is code to cap values outside the interval -3 to 3:
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.001711    -0.037868     0.048582     0.009099
std       1.006490     0.966920     1.011305     0.977041
min      -3.000000    -2.643439    -2.949846    -2.962251
25%      -0.678444    -0.719395    -0.650478    -0.636513
50%       0.001463    -0.022189     0.061264     0.046104
75%       0.675910     0.629861     0.710325     0.642668
max       3.000000     3.000000     3.000000     3.000000
np.sign(data) produces 1 and -1 values according to whether each value is positive or negative:
np.sign(data).head()
     0    1    2    3
0  1.0  1.0  1.0 -1.0
1 -1.0  1.0  1.0 -1.0
2  1.0  1.0  1.0 -1.0
3  1.0  1.0  1.0 -1.0
4 -1.0 -1.0 -1.0 -1.0
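The capping above could equally have been written with the clip method, which bounds every value to an interval (a hedged sketch, not part of the original notes):
data.clip(lower=-3, upper=3)    # same effect as data[np.abs(data) > 3] = np.sign(data) * 3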
Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows of a DataFrame is easy with the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an integer array describing the new ordering:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
df
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
sampler = np.random.permutation(5)
sampler
array([4, 0, 3, 1, 2])
That array can then be used with iloc-based indexing or the equivalent take function:
df.take(sampler)
    0   1   2   3
4  16  17  18  19
0   0   1   2   3
3  12  13  14  15
1   4   5   6   7
2   8   9  10  11
To select a random subset without replacement, use the sample method on Series or DataFrame:
df.sample(n=3)
To generate a sample with replacement (allowing repeat choices), pass replace=True:
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws
0 5
2 -1
0 5
3 6
0 5
3 6
1 7
1 7
2 -1
0 5
dtype: int64
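take also accepts an axis argument, so the same permutation trick can reorder columns; a short sketch along the lines of the row example above (not in the original notes):
col_sampler = np.random.permutation(4)    # df above has 4 columns
df.take(col_sampler, axis='columns')      # or equivalently axis=1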
Computing Dummy/Indicator Variables
A common transformation for statistical modeling or machine learning is converting a categorical variable into a "dummy" or "indicator" matrix: if a column contains k distinct values, derive a matrix with k columns of indicator values. pandas's get_dummies function does this:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
df
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
pd.get_dummies(df['key'])
       a      b      c
0  False   True  False
1  False   True  False
2   True  False  False
3  False  False   True
4   True  False  False
5  False   True  False
You may want to add a prefix to the columns of the indicator DataFrame so it can be merged with other data; get_dummies has a prefix argument for this:
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
   data1  key_a  key_b  key_c
0      0  False   True  False
1      1  False   True  False
2      2   True  False  False
3      3  False  False   True
4      4   True  False  False
5      5  False   True  False
If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Consider the MovieLens 1M dataset:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('F:/项目学习/利用Pyhon进行数据分析(第二版)/利用Pyhon进行数据分析/pydata-book-2nd-edition/datasets/movielens/movies.dat',
                       sep='::', header=None, names=mnames, encoding='ISO-8859-1')
movies[:10]
C:\Users\Dell\AppData\Local\Temp\ipykernel_26068\3411970987.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
movies = pd.read_table('F:/项目学习/利用Pyhon进行数据分析(第二版)/利用Pyhon进行数据分析/pydata-book-2nd-edition/datasets/movielens/movies.dat', sep='::', header=None, names=mnames,encoding='ISO-8859-1')
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller
Adding indicator variables for each genre requires a little wrangling. First, extract the list of genres appearing in the dataset:
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
all_genres
['Animation',
 "Children's",
 'Comedy',
 'Adventure',
 "Children's",
 'Fantasy',
 'Comedy',
 'Romance',
 ...]
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
'Western'], dtype=object)
One way to construct the indicator DataFrame is to start with a DataFrame of all zeros:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
Take the genres of the first movie:
gen = movies.genres[0]
gen.split('|')
['Animation', "Children's", 'Comedy']
dummies.columns can be used to compute the column indices for each genre:
dummies.columns.get_indexer(gen.split('|'))
array([0, 1, 2], dtype=int64)
Then iterate over every movie, setting the matching entries in each row of dummies to 1:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
Then, as before, combine this with movies:
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id 1
title Toy Story (1995)
genres Animation|Children's|Comedy
Genre_Animation 1.0
Genre_Children's 1.0
Genre_Comedy 1.0
Genre_Adventure 0.0
Genre_Fantasy 0.0
Genre_Romance 0.0
Genre_Drama 0.0
Genre_Action 0.0
Genre_Crime 0.0
Genre_Thriller 0.0
Genre_Horror 0.0
Genre_Sci-Fi 0.0
Genre_Documentary 0.0
Genre_War 0.0
Genre_Musical 0.0
Genre_Mystery 0.0
Genre_Film-Noir 0.0
Genre_Western 0.0
Name: 0, dtype: object
A much simpler route to the same result is the Series.str.get_dummies method, which understands delimiter-separated membership strings directly:
dummies_demo = movies['genres'].str.get_dummies('|')
prefix = 'genre_'
dummies_demo = dummies_demo.add_prefix(prefix)
merged_df = pd.concat([movies, dummies_demo], axis=1)
merged_df
      movie_id                               title                        genres  genre_Action  ...  genre_Thriller  genre_War  genre_Western
0            1                    Toy Story (1995)   Animation|Children's|Comedy             0  ...               0          0              0
1            2                      Jumanji (1995)  Adventure|Children's|Fantasy             0  ...               0          0              0
2            3             Grumpier Old Men (1995)                Comedy|Romance             0  ...               0          0              0
3            4            Waiting to Exhale (1995)                  Comedy|Drama             0  ...               0          0              0
4            5  Father of the Bride Part II (1995)                        Comedy             0  ...               0          0              0
...        ...                                 ...                           ...           ...  ...             ...        ...            ...
3878      3948             Meet the Parents (2000)                        Comedy             0  ...               0          0              0
3879      3949          Requiem for a Dream (2000)                         Drama             0  ...               0          0              0
3880      3950                    Tigerland (2000)                         Drama             0  ...               0          0              0
3881      3951             Two Family House (2000)                         Drama             0  ...               0          0              0
3882      3952               Contender, The (2000)                Drama|Thriller             0  ...               1          0              0

3883 rows × 21 columns
A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut:
np.random.seed(12345)
values = np.random.rand(10)
values
array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0       False       False       False       False        True
1       False        True       False       False       False
2        True       False       False       False       False
3       False        True       False       False       False
4       False       False        True       False       False
5       False       False        True       False       False
6       False       False       False       False        True
7       False       False       False        True       False
8       False       False       False        True       False
9       False       False       False        True       False
String Manipulation
Part of why Python is a popular data-processing language is its simple, easy-to-use string and text processing facilities.
String Object Methods
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
split is often combined with strip to trim whitespace (including line breaks):
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
These substrings could be concatenated together with a double-colon delimiter using addition:
first, second, third = pieces
first + '::' + second + '::' + third
'a::b::guido'
A faster and more Pythonic way is to pass a list or tuple to the join method of the delimiter string:
'::'.join(pieces)
'a::b::guido'
The best way to detect a substring is with Python's in keyword, though index and find can also be used:
'guido' in val
True
val.index(',')
1
val.find(':')
-1
Note the difference between find and index: index raises an exception when the string is not found, rather than returning -1:
val.index(':')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[88], line 1
----> 1 val.index(':')
ValueError: substring not found
count returns the number of occurrences of a particular substring:
val.count(',')
2
replace substitutes occurrences of one pattern with another; passing an empty string is a common way to delete a pattern:
val.replace(',', '::')
'a::b:: guido'
val.replace(',', '')
'ab guido'
Python built-in string methods
count: return the number of non-overlapping occurrences of a substring in the string
endswith, startswith: return True if the string ends with (starts with) the given suffix (prefix)
join: use the string as a delimiter for concatenating a sequence of other strings
index: return the position of the first character of a substring if found; raise ValueError if not found
find: return the position of the first character of the first occurrence of a substring; return -1 if not found
rfind: return the position of the first character of the last occurrence of a substring; return -1 if not found
replace: replace occurrences of one string with another
strip, rstrip, lstrip: trim whitespace (including newlines) from both sides, the right side, or the left side
split: break the string into a list of substrings using the passed delimiter
lower, upper: convert to lowercase or uppercase
ljust, rjust: left- or right-justify, padding the opposite side with spaces (or some other fill character) to return a string of at least the given width
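Most of these behave as their names suggest; a few of the less obvious entries, sketched on the val string from above (outputs worked out by hand, not captured):
val.endswith('guido')    # True
val.rfind(',')           # 3: index of the last comma
'hi'.rjust(5, '*')       # '***hi': pad on the left to a minimum width of 5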
Regular Expressions
The functions in the re module fall into three categories: pattern matching, substitution, and splitting:
import re
text = "foo bar\t baz \tqux"
re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']
You can compile a regex yourself with re.compile, obtaining a reusable regex object:
regex = re.compile(r'\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
If instead you want a list of all substrings matching the regex, use the findall method; here it returns the whitespace runs themselves:
regex.findall(text)
[' ', '\t ', ' \t']
match and search are closely related to findall: findall returns every match in a string, search returns only the first, and match only matches at the beginning of the string. Consider a block of text and a regular expression for identifying most email addresses:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
search returns a special match object for only the first match in the text, which tells us the start and end position of the pattern in the string:
m = regex.search(text)
m
<re.Match object; span=(5, 20), match='dave@google.com'>
text[m.start():m.end()]
'dave@google.com'
regex.match returns None because the pattern does not occur at the start of the string:
print(regex.match(text))
None
The sub method returns a new string with occurrences of the pattern replaced by a given string:
print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
Suppose you wanted to find email addresses and also segment each address into username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
A match object produced by this modified regex returns a tuple of the pattern components with its groups method:
m.groups()
('wesm', 'bright', 'net')
findall returns a list of tuples when the pattern has groups:
regex.findall(text)
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
sub also has access to groups in each match using special symbols like \1 and \2:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
Regular expression methods
findall, finditer: return all non-overlapping matches of the pattern in a string; findall returns a list of all of them, while finditer returns them one by one through an iterator
match: match the pattern at the start of the string and optionally segment the pattern components into groups; returns a match object on success, otherwise None
search: scan the string for a match to the pattern, returning a match object if found; unlike match, the match can be anywhere in the string, not only at the beginning
split: break the string into pieces at each occurrence of the pattern
sub, subn: replace all (sub) or the first n (subn) occurrences of the pattern with a replacement expression; use the symbols \1, \2, ... to refer to match groups in the replacement string
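finditer is the one method in this table not demonstrated above; it yields match objects lazily, which is convenient for large texts. A brief sketch using the compiled grouped regex and the email text from earlier:
for m in regex.finditer(text):
    print(m.groups())    # ('dave', 'google', 'com'), then each following address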
Vectorized String Functions in pandas
Cleaning up a messy dataset for analysis often requires a lot of string regularization. To complicate matters, a column containing strings sometimes has missing data:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Wes NaN
dtype: object
data.isnull()
Dave False
Steve False
Rob False
Wes True
dtype: bool
Series has array-oriented string methods that skip over NA values, accessed through its str attribute. For example, we can check whether each address contains 'gmail' with str.contains:
data.str.contains('gmail')
Dave False
Steve True
Rob True
Wes NaN
dtype: object
Regular expressions can be used too, together with options like IGNORECASE. Recall the grouped pattern from earlier:
pattern
'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)
Dave [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Wes NaN
dtype: object
There are two ways to do vectorized element retrieval: use str.get, or index into the str attribute:
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
Dave True
Steve True
Rob True
Wes NaN
dtype: object
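Note that in recent pandas versions str.match returns booleans, as shown above, so it is not itself useful for element retrieval; str.get or slicing via the str attribute covers that. A hedged sketch of the slicing form (output worked out by hand):
data.str[:5]
# Dave     dave@
# Steve    steve
# Rob      rob@g
# Wes        NaN
# dtype: object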