啃书 《利用python进行数据分析》第七章数据清洗与准备
7.1处理缺失值
NA值处理方法
函数名 | 描述 |
---|---|
dropna | 根据值是否确实来筛选轴标签 |
fillna | 用某些值填充缺失的数据或使用插值法:‘ffill’或’bfill’ |
isnull | 返回表明哪些值是缺失值的布尔值 |
notnull | 返回表明哪些值不是缺失值的布尔值 |
string_data=pd.Series(['aardvs','fdssd',np.nan,'sdfsdvx'])
string_data
Out[4]:
0 aardvs
1 fdssd
2 NaN
3 sdfsdvx
dtype: object
string_data.isnull()
Out[5]:
0 False
1 False
2 True
3 False
dtype: bool
string_data[0]=None
string_data.notnull()
Out[7]:
0 False
1 True
2 False
3 True
dtype: bool
7.1.1过滤缺失值 dropna()
1.对于Series:直接删除该值以及对应索引
2.对于DataFrame:
全部参数默认:删除所有有NA值的行
axis: =1就是对列操作,默认等于0,默认对行操作
how:=’all’ 只有全部为NA才会过滤
thresh:相当于阈值,只有当每行(或每列)的缺失值个数达到这个值时,才会过滤掉
from numpy import nan as NA
#对于Series:
data=pd.Series([1,NA,3.5,NA,7])
data.dropna()
Out[10]:
0 1.0
2 3.5
4 7.0
dtype: float64
data[data.notnull()]
Out[11]:
0 1.0
2 3.5
4 7.0
dtype: float64
#对于DataFrame
data=pd.DataFrame([[1.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,6.5,3.]])
cleaned=data.dropna()
data
Out[14]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
cleaned
Out[16]:
0 1 2
0 1.0 6.5 3.0
data.dropna(how='all')
Out[17]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
data[4]=NA
data
Out[19]:
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
data.dropna(axis=1,how='all')
Out[20]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
df=pd.DataFrame(np.random.randn(7,3))
df.iloc[:4,1]=NA
df.iloc[:2,2]=NA
df
Out[25]:
0 1 2
0 -0.858483 NaN NaN
1 0.638135 NaN NaN
2 -1.000207 NaN 0.560223
3 0.563660 NaN -1.117940
4 1.079167 0.634505 0.102725
5 0.852507 -0.023895 -0.689219
6 -1.141851 0.833480 0.092697
df.dropna(thresh=2)
Out[26]:
0 1 2
2 -1.000207 NaN 0.560223
3 0.563660 NaN -1.117940
4 1.079167 0.634505 0.102725
5 0.852507 -0.023895 -0.689219
6 -1.141851 0.833480 0.092697
7.1.2补全缺失值 fillna()
fillna函数参数
参数 | 描述 |
---|---|
value | 标量值或字典型对象用于填充缺失值 |
method | 插值方法’fill’,‘bfill’;如果没有其他参数默认是’ffill’ |
axis | 需要填充的轴,默认axis=0 |
inplace | 是否在原对象上修改 |
limit | 用于前向或后向填充时最大的填充范围 |
df
Out[28]:
0 1 2
0 -0.858483 NaN NaN
1 0.638135 NaN NaN
2 -1.000207 NaN 0.560223
3 0.563660 NaN -1.117940
4 1.079167 0.634505 0.102725
5 0.852507 -0.023895 -0.689219
6 -1.141851 0.833480 0.092697
#标量填充
df.fillna(0)
Out[29]:
0 1 2
0 -0.858483 0.000000 0.000000
1 0.638135 0.000000 0.000000
2 -1.000207 0.000000 0.560223
3 0.563660 0.000000 -1.117940
4 1.079167 0.634505 0.102725
5 0.852507 -0.023895 -0.689219
6 -1.141851 0.833480 0.092697
#字典型填充缺失值
df.fillna({1:0.5,2:0})
Out[30]:
0 1 2
0 -0.858483 0.500000 0.000000
1 0.638135 0.500000 0.000000
2 -1.000207 0.500000 0.560223
3 0.563660 0.500000 -1.117940
4 1.079167 0.634505 0.102725
5 0.852507 -0.023895 -0.689219
6 -1.141851 0.833480 0.092697
#直接在原对象上修改
_=df.fillna(0,inplace=True)
df
Out[32]:
0 1 2
0 -0.858483 0.000000 0.000000
1 0.638135 0.000000 0.000000
2 -1.000207 0.000000 0.560223
3 0.563660 0.000000 -1.117940
4 1.079167 0.634505 0.102725
5 0.852507 -0.023895 -0.689219
6 -1.141851 0.833480 0.092697
df=pd.DataFrame(np.random.randn(6,3))
df.iloc[2:,1]=NA
df.iloc[4:,2]=NA
df
Out[36]:
0 1 2
0 0.366302 -1.029316 1.463896
1 1.316928 0.133006 -1.839687
2 -0.718139 NaN 0.295226
3 0.278662 NaN -0.434239
4 -1.191636 NaN NaN
5 0.359061 NaN NaN
#前向填充
df.fillna(method='ffill')
Out[37]:
0 1 2
0 0.366302 -1.029316 1.463896
1 1.316928 0.133006 -1.839687
2 -0.718139 0.133006 0.295226
3 0.278662 0.133006 -0.434239
4 -1.191636 0.133006 -0.434239
5 0.359061 0.133006 -0.434239
df.fillna(method='ffill',limit=2)
Out[38]:
0 1 2
0 0.366302 -1.029316 1.463896
1 1.316928 0.133006 -1.839687
2 -0.718139 0.133006 0.295226
3 0.278662 0.133006 -0.434239
4 -1.191636 NaN -0.434239
5 0.359061 NaN -0.434239
data=pd.Series([1.,NA,3.5,NA,7])
data.fillna(data.mean())
Out[40]:
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
7.2数据转换
7.2.1删除重复值 duplicated()/drop_duplicates()
df.duplicated()返回一个布尔值Series,反应每一列是否存在重复(与之前出现过的行相同,第一次出现的是False)
df.drop_duplicates()返回duplicated中是False的那部分内容
可以基于子列进行去重操作,也可以用keep参数=‘last’选择保留最后一次重复的值
data=pd.DataFrame({'k1':['one','two']*3+['two'],'k2':[1,1,2,3,3,4,4]})
data.duplicated()
Out[43]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
data.drop_duplicates()
Out[45]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
data=pd.DataFrame({'k1':['one','two']*3+['two'],'k2':[1,1,2,3,3,4,4]})
data.duplicated()
Out[43]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
data.drop_duplicates
Out[44]:
<bound method DataFrame.drop_duplicates of k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4>
data.drop_duplicates()
Out[45]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
data['v1']=range(7)
data.drop_duplicates(['k1'])
Out[47]:
k1 k2 v1
0 one 1 0
1 two 1 1
data.drop_duplicates(['k1','k2'],keep='last')
Out[48]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6
7.2.2使用函数或映射进行数据转换 map()
series.map(字典)
返回一个新的Series:利用series的值根据字典中的键,找到字典中对应的值。
(打个比方,Serise是一组英语单词,字典就是英汉字典,最后返回的series就是对应的汉语)
data=pd.DataFrame({'food':['bacon','pulled pork','bacon','Pastrami','corned beef','Bacon','pastrami','honey ham','nova lox'],'ounces':[4,3,12,6,7.5,8,3,5,6]})
data
Out[59]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
meat_to_animal={'bacon':'pig','pulled pork':'pig','pastrami':'cow','corned beef':'cow','honey ham':'pig','nova lox':'salmon'}
lowercased=data['food'].str.lower()
data['animal']=lowercased.map(meat_to_animal)
data
Out[63]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
#也可以传入一个能完成所有工作的函数
data['food'].map(lambda x:meat_to_animal[x.lower()])
Out[64]:
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
代码中涉及的函数:
Series的str.lower方法将每个值都转为小写
7.2.3代替值 replace()
1.replace(旧值,新值)
2.旧值可以是某个具体值,也可以是被替换掉列表。如,data.replace([-999,-1000],np.nan)
3.可以利用列表或字典实现对应替代。如data.replace([-999,-1000],[np.nan,0]);data.replace({-999:np.nan,-1000:0})
都是将-999替换成np.nan,将-1000替换成0
data=pd.Series([1.,-999,2.,-999.,-1000.,3.])
data
Out[66]:
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
data.replace(-999,np.nan)
Out[67]:
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
data.replace([-999,-1000],np.nan)
Out[68]:
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
#对应替换
data.replace([-999,-1000],[np.nan,0])
Out[69]:
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
data.replace({-999:np.nan,-1000:0})
Out[70]:
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
7.2.4重命名轴索引 rename()
1.轴索引当然也可以利用map函数实现对应的修改
data=pd.DataFrame(np.arange(12).reshape(3,4),index=['Ohio','Colorado','New York'],columns=['one','two','three','four'])
transform=lambda x:x[:4].upper()
data.index.map(transform)
Out[73]: Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index=data.index.map(transform)
data
Out[75]:
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
这种适合具有普遍性规律的。
2.rename(),能够创建数据集转换后的版本,并且不修改原有数据集
可以实现用函数进行集体修改
可以只改某个指定值
inplace参数决定是否修改原数据
data.rename(index=str.title,columns=str.upper)
Out[76]:
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11
data.rename(index={'OHIO':'INDIANA'},columns={'three':'peekaboo'})
Out[77]:
one two peekaboo four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
data.rename(index={'OHIO':'INDIANA'},inplace=True)
data
Out[79]:
one two three four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
7.2.5离散化和分箱cut()\qcut()
就是将连续值离散化,或者分离成‘箱子’进行分析。好比直方图中对数据分组
cut()函数考虑值的大小
qcut()函数考虑的是数据的分布
方法一cut():给定分箱的各个临界点
1.传入的参数cut(待分箱对象,分箱节点)
2.分箱出的结果:返回一个特殊的Categorical对象。计算出原对象各个元素所属的箱
3.分箱结果对象的codes属性:将各个分箱结果用数字打标签
分箱结果对象的categories属性:分箱种类
4.labels参数:自定义箱名
ages=[20,22,25,27,21,23,37,31,61,45,41,32]
bins=[18,25,35,60,100]
cats=pd.cut(ages,bins)
cats
Out[83]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
cats.codes
Out[84]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
Out[85]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
closed='right',
dtype='interval[int64]')
pd.value_counts(cats)
Out[86]:
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
pd.cut(ages,[18,26,36,61,100],right=False)
Out[87]:
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
group_names=['Youth','YoungAdult','MiddleAged','Senior']
pd.cut(ages,bins,labels=group_names)
Out[89]:
['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']
方法二cut():给定箱子个数,根据最大值最小值计算等长的箱,适合均匀分布的数据
data=np.random.rand(20)
data
Out[92]:
array([0.0300036 , 0.02586357, 0.86888746, 0.22032555, 0.66613459,
0.17344557, 0.25763526, 0.84981327, 0.45797582, 0.50000199,
0.28144401, 0.53081092, 0.68836217, 0.82392281, 0.21856934,
0.01294174, 0.43667986, 0.18565561, 0.53055634, 0.47388021])
pd.cut(data,4,precision=2)
Out[93]:
[(0.012, 0.23], (0.012, 0.23], (0.65, 0.87], (0.012, 0.23], (0.65, 0.87], ..., (0.012, 0.23], (0.23, 0.44], (0.012, 0.23], (0.44, 0.65], (0.44, 0.65]]
Length: 20
Categories (4, interval[float64]): [(0.012, 0.23] < (0.23, 0.44] < (0.44, 0.65] < (0.65, 0.87]]
方法三qcut():给定分箱个数,根据分位数分箱,考虑样本的分布,各个箱的样本数量相同
data=np.random.randn(1000)
cats=pd.qcut(data,4)
cats
Out[96]:
[(0.621, 3.873], (0.621, 3.873], (-0.0264, 0.621], (-3.4899999999999998, -0.674], (-0.674, -0.0264], ..., (-0.674, -0.0264], (0.621, 3.873], (-0.0264, 0.621], (0.621, 3.873], (0.621, 3.873]]
Length: 1000
Categories (4, interval[float64]): [(-3.4899999999999998, -0.674] < (-0.674, -0.0264] < (-0.0264, 0.621] < (0.621, 3.873]]
pd.value_counts(cats)
Out[97]:
(0.621, 3.873] 250
(-0.0264, 0.621] 250
(-0.674, -0.0264] 250
(-3.4899999999999998, -0.674] 250
dtype: int64
方法四qcut():给定分箱分位数,根据分位数分箱,考虑样本的分布,各个箱的样本数量相同
pd.qcut(data,[0,0.1,0.5,0.9,1.])
Out[98]:
[(-0.0264, 1.357], (-0.0264, 1.357], (-0.0264, 1.357], (-1.232, -0.0264], (-1.232, -0.0264], ..., (-1.232, -0.0264], (1.357, 3.873], (-0.0264, 1.357], (-0.0264, 1.357], (1.357, 3.873]]
Length: 1000
Categories (4, interval[float64]): [(-3.4899999999999998, -1.232] < (-1.232, -0.0264] < (-0.0264, 1.357] < (1.357, 3.873]]
7.2.6检测和过滤异常值
这里讲到的比较基础,思路比较简单……进本上就是,自己找异常值,然后自己修改。
知识点就是.any().all()方法
df.any(),针对列,只要有一个True,就都是True
df.any(1),针对每行,只要有一个True,就都是True
同理.all()就是需要全为True是才会返回True
这类函数用于Series,返回一个布尔值
用于dataframe时,对每一列进行运算,返回一个布尔值Series,个数与dataframe列数相同。
参数axis=1时(df.any(1)就是这种情况),对每一行进行运算,返回一个布尔值Series,个数与dataframe行数相同。
data=pd.DataFrame(np.random.randn(1000,4))
data.describe()
Out[102]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.027491 0.021373 -0.021642 -0.032679
std 0.990526 0.977964 0.995485 1.009552
min -3.513944 -2.782106 -2.914202 -3.147163
25% -0.721894 -0.615292 -0.712474 -0.726641
50% -0.014292 0.026422 -0.001167 0.023098
75% 0.652872 0.652959 0.651896 0.650330
max 2.949306 3.172291 3.358915 2.938032
col=data[2]
col[np.abs(col)>3]
Out[104]:
213 3.358915
Name: 2, dtype: float64
data[(np.abs(data)>3).any(1)]
Out[105]:
0 1 2 3
130 -3.513944 0.216267 -0.152251 1.394209
213 0.272360 0.428468 3.358915 0.847807
248 -0.749901 3.172291 -0.156018 -0.390211
867 -1.266390 -0.651224 -0.760254 -3.147163
#将大于3的值修改为3,将小于-3的值修改为-3
data[np.abs(data)>3]=np.sign(data*3)
data.describe()
Out[112]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.024978 0.019201 -0.024001 -0.030532
std 0.984842 0.973360 0.990247 1.005195
min -2.923770 -2.782106 -2.914202 -2.927380
25% -0.721894 -0.615292 -0.712474 -0.726641
50% -0.014292 0.026422 -0.001167 0.023098
75% 0.652872 0.652959 0.651896 0.650330
max 2.949306 2.764249 2.781202 2.938032
代码中的np.sign函数是符号函数,在第四章的numpy中通用函数部分提到过。正数返回1,负数返回-1,0返回0
7.2.7置换和随机抽样
numpy.random.permutation:根据轴长度产生一个表示新顺序的整数数组(就是随机排序)
df.take()对数据框的行进行重新排序
df.sample(n=),对数据框随机取n行,并打乱顺序。replace=True,能够生成允许重复的样本
df=pd.DataFrame(np.arange(5*4).reshape(5,4))
sampler=np.random.permutation(5)
sampler
Out[118]: array([1, 2, 0, 3, 4])
df
Out[119]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
df.take(sampler)
Out[120]:
0 1 2 3
1 4 5 6 7
2 8 9 10 11
0 0 1 2 3
3 12 13 14 15
4 16 17 18 19
df.sample(n=3)
Out[121]:
0 1 2 3
1 4 5 6 7
0 0 1 2 3
2 8 9 10 11
choices=pd.Series([5,7,-1,6,4])
draws=choices.sample(n=10,replace=True)
draws
Out[124]:
3 6
0 5
2 -1
3 6
2 -1
2 -1
0 5
1 7
4 4
0 5
dtype: int64
choices.sample(n=4,replace=True)
Out[125]:
3 6
3 6
0 5
4 4
dtype: int64
7.2.8计算指标/虚拟变量
将分类变量转变成一种01矩阵,在统计建模和机器学习中很常见。
如Dataframe中的一列有k个不同的值,则可以衍生一个k列的值为0和1的矩阵。
可以用**pd.get_dummies()**函数实现该功能。
df=pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})
pd.get_dummies(df['key'])
Out[6]:
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
dummies=pd.get_dummies(df['key'],prefix='key')
df_with_dummy=df[['data1']].join(dummies)
df_with_dummy
Out[9]:
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
参数**prefix=**可以给生成的列的列名加上一个前缀
**join()**方法能够将两个dataframe连起来。(但是列名不能有重复)
【书中这个地方还举了一个例子,由于没有数据就不做了】
get_dummies和cut函数联合使用:
np.random.seed(12345)
values=np.random.rand(10)
values
Out[29]:
array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])
bins=[0,0.2,0.4,0.6,0.8,1]
pd.get_dummies(pd.cut(values,bins))
Out[31]:
(0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]
0 0 0 0 0 1
1 0 1 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 1 0 0
6 0 0 0 0 1
7 0 0 0 1 0
8 0 0 0 1 0
9 0 0 0 1 0
7.3字符串操作
7.3.1字符串对象方法
python内建的字符串方法:
方法 | 描述 |
---|---|
count | 返回子字符串在字符串中的出现次数 |
endswith | 是否字符串以后缀结尾,返回True或False |
startswith | 是否字符串以前缀开始,返回True或False |
join | 使用字符串作为间隔符,粘合其他字符串序列 |
index | 如果在字符串中找到,则返回子字符串中的第一个字符的位置,找不到则引发ValueError |
find | 返回字符串中第一个出现子字符串的第一个字符的位置,没找到返回-1 |
rfind | 返回字符串中最后一次出现子字符串时第一个字符的位置,没找到返回-1 |
replace | 使用一个字符串代替另一个字符串 |
strip rstrip lstrip | 删除空格,包括换行符 |
split | 使用分隔符将字符串拆分为子字符串列表 |
lower | 将大写字母转换为小写字母 |
upper | 将小写字母转换为大写字母 |
casefold | 将字符串转为小写,并将任何特定与区域的变量字符组合转换为常见的可比较形式 |
ljust,rjust | 左对齐或右对齐;用空格(或其他字符)填充字符串的相反侧以返回具有最小宽度的字符串 |
val='a,b, guido'
val.split(',')
Out[33]: ['a', 'b', ' guido']
pieces=[x.strip() for x in val.split(',')]
pieces
Out[37]: ['a', 'b', 'guido']
'::'.join(pieces)
Out[38]: 'a::b::guido'
'guido' in val
Out[39]: True
val.index(',')
Out[41]: 1
val.find(':')
Out[42]: -1
val.count(',')
Out[43]: 2
val.replace(',',':')
Out[44]: 'a:b: guido'
val.replace(',','')
Out[45]: 'ab guido'
7.3.2正则表达式
re模块主要有三个主题:模式匹配、替代、拆分
1.正则表达式对象的获取re.compile()
如re.compile(‘\s+’):一个或多个空白字符的正则表达式对象。括号里的是一个正则表达式
2.正则表达式的各种匹配
3.正则表达式对象的方法
一个简单的实例
import re
text='foo bar\t baz \tqux'
re.split('\s+',text)#注意是re模块的方法
Out[48]: ['foo', 'bar', 'baz', 'qux']
regex=re.compile('\s+')
regex.split(text)
Out[50]: ['foo', 'bar', 'baz', 'qux']
regex.findall(text)
Out[51]: [' ', '\t ', ' \t']
text="""Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern=r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
#re.IGNORECASE正则表达式不区分大小写
regex=re.compile(pattern,flags=re.IGNORECASE)
regex.findall(text)
Out[57]: ['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
regex.search(text)
Out[58]: <re.Match object; span=(5, 20), match='dave@google.com'>
#search返回第一个匹配项
m=regex.search(text)
#m.start()是匹配项的第一个字符在text中的位置
#m.end()是20,对应'\n'
text[m.start():m.end()]
Out[60]: 'dave@google.com'
#match只在字符串的起始位置进行匹配
print(regex.match(text))
None
print(regex.sub('REDACTED',text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
#用括号将模式包起来,可以进行分组
pattern=r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex=re.compile(pattern,flags=re.IGNORECASE)
m=regex.match('wesm@bright.net')
m.groups()
Out[70]: ('wesm', 'bright', 'net')
regex.findall(text)
Out[71]:
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
7.3.3pandas中的向量化字符串函数
向量化字符串的理解:
向量化字符串就是元素类型是字符串的向量。(也比如一列全是字符串元素的列表啦)
numpy和pandas很好的解决了对一行或一列数值数据,只用一个函数来进行处理。如果是列字符串该怎么做呢?难道要for循环再使用字符串的函数?有没有什么办法简化?
简化方法就是:pd.Series的str属性
这样我们对于前面讲的很多字符串操作,向量化字符串也有相应的解决办法啦
data=pd.Series({'Dave':'dave@google.com','Steve':'steve@gmail.com','Rob':'rob@gmail.com','Wes':np.nan})
data
Out[74]:
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Wes NaN
dtype: object
data.isnull()
Out[75]:
Dave False
Steve False
Rob False
Wes True
dtype: bool
data.str.contains('gmail')
Out[76]:
Dave False
Steve True
Rob True
Wes NaN
dtype: object
pattern
Out[77]: '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
data.str.findall(pattern,flags=re.IGNORECASE)
Out[78]:
Dave [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Wes NaN
dtype: object
matches=data.str.match(pattern,flags=re.IGNORECASE)
matches
Out[80]:
Dave True
Steve True
Rob True
Wes NaN
dtype: object
x=data.str.findall(pattern,flags=re.IGNORECASE)
x
Out[91]:
Dave [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Wes NaN
dtype: object
x.str.get(1)
Out[92]:
Dave NaN
Steve NaN
Rob NaN
Wes NaN
dtype: float64
x.str.get(0)
Out[93]:
Dave (dave, google, com)
Steve (steve, gmail, com)
Rob (rob, gmail, com)
Wes NaN
dtype: object
x.str[0]
Out[94]:
Dave (dave, google, com)
Steve (steve, gmail, com)
Rob (rob, gmail, com)
Wes NaN
dtype: object
x.str[0,1]
Out[95]:
Dave NaN
Steve NaN
Rob NaN
Wes NaN
dtype: float64
x.str[0][1]
Out[96]: ('steve', 'gmail', 'com')
data
Out[97]:
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Wes NaN![image-20220501165036295](C:\Users\章ky\AppData\Roaming\Typora\typora-user-images\image-20220501165036295.png)![image-20220501165040324](C:\Users\章ky\AppData\Roaming\Typora\typora-user-images\image-20220501165040324.png)
dtype: object
data.str[:5]
Out[98]:
Dave dave@
Steve steve
Rob rob@g
Wes NaN
dtype: object
向量化字符串的操作