Text Data

spring小郭

Category: pandas  Tags: python, big data

Article link: https://blog.csdn.net/Ilovechase/article/details/106973995

Task02: Text Data

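I. Properties of the string dtype

1. Differences between string and object

The string dtype differs from object in three ways:
① String accessor methods (for example str.count) return the corresponding nullable dtype on string data, whereas on object data the returned dtype changes depending on whether missing values are present.
② Some Series methods cannot be used on string data, for example Series.str.decode(), because what is stored is strings rather than bytes.
③ Missing values in the string dtype, whether stored or produced by an operation, are represented as pd.NA rather than the float np.nan.
Everything else behaves identically in the current version, but in keeping with the direction pandas is taking, we still use the string dtype throughout for string handling.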
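
The first and third differences can be checked directly. The following is a minimal sketch, assuming a reasonably recent pandas release; the variable names `obj` and `st` and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same values, stored once as object and once as the string dtype.
obj = pd.Series(['aa', 'b', np.nan], dtype="object")
st = pd.Series(['aa', 'b', np.nan], dtype="string")

# On object data the result dtype depends on whether missing values exist.
obj_counts = obj.str.count('a')                  # NaN present
obj_counts_clean = obj.dropna().str.count('a')   # no NaN

# On string data the result is a nullable integer dtype either way.
st_counts = st.str.count('a')

print(obj_counts.dtype, obj_counts_clean.dtype, st_counts.dtype)
print(st[2])  # the missing value as stored in the string Series
```

Printing the three dtypes side by side is a quick way to see why the nullable result is easier to work with: it does not change when a missing value appears in the data.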

2. Converting to the string dtype

In [1]: import pandas as pd
   ...: import numpy as np

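First convert to a str-typed object Series, then convert that to the string dtype (in the early releases of the string dtype, calling astype('string') directly on non-string values raised an error):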

In [3]: pd.Series([4, '5.']).astype('str').astype('string')
Out[3]: 
0     4
1    5.
dtype: string

In [4]: pd.Series([5, 6]).astype('str').astype('string')
Out[4]: 
0    5
1    6
dtype: string

In [5]: pd.Series([True, False]).astype('str').astype('string')
Out[5]: 
0     True
1    False
dtype: string

II. Splitting and Joining

1. The str.split method

2.1.1 Delimiter and positional element selection with str

In [6]: s = pd.Series(['acd', 'fdx', np.nan, 'qwe'], dtype="string")
   ...: s
Out[6]: 
0     acd
1     fdx
2    <NA>
3     qwe
dtype: string

Note: the dtype after split is object, because the elements of the Series are no longer strings but lists, and the string dtype can only hold strings.
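
A short sketch of splitting and then selecting an element by position. The underscore-separated sample values and the names `parts` and `second` are assumptions made for illustration, since the Series shown above contains no delimiter.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with '_' as the delimiter.
s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")

# Each non-missing element becomes a list of pieces.
parts = s.str.split('_')

# The str accessor also indexes into each list by position.
second = parts.str[1]

print(parts.dtype)
print(second)
```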

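III. Replacement

Replacement in the broad sense means applying the str.replace function; fillna is replacement aimed specifically at missing values.

1. Common uses of str.replace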

In [24]: t = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', '', np.nan, 'CABA', 'dog', 'cat'], dtype="string")
    ...: t
Out[24]: 
0       A
1       B
2       C
3    Aaba
4    Baca
5        
6    <NA>
7    CABA
8     dog
9     cat
dtype: string

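The first argument is the regular expression, written as a raw string with an r prefix; the second is the replacement string.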

In [25]: t.str.replace(r'^[AB]', '***', regex=True)
Out[25]: 
0       ***
1       ***
2         C
3    ***aba
4    ***aca
5          
6      <NA>
7      CABA
8       dog
9       cat
dtype: string

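2. Notes on str.replace

First of all, str.replace and replace are not the same thing.
str.replace works on the object or string dtype. At the time of writing it treated the pattern as a regular expression by default (since pandas 2.0 the default is regex=False, so pass regex=True explicitly), and it cannot be used on a DataFrame.
replace works on a Series or DataFrame of any dtype. To replace by regular expression you must set regex=True, and by passing a dict it supports replacement across multiple columns.
However, because the string dtype has only just been introduced, some problems have appeared in its usage; these issues are expected to be fixed in later versions.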
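
The contrast between the two methods can be sketched as follows. The sample values and the names `by_accessor`, `whole_value`, `by_regex`, `df`, and `multi` are illustrative assumptions; object-dtype data is used for replace to stay clear of the string-dtype problems mentioned above.

```python
import pandas as pd

t = pd.Series(['A', 'Aaba', 'dog'], dtype="string")
obj = pd.Series(['A', 'Aaba', 'dog'], dtype="object")

# str.replace rewrites the matching part of each string.
by_accessor = t.str.replace(r'^A', '***', regex=True)

# Series.replace matches whole values unless regex=True is set.
whole_value = obj.replace('A', '***')
by_regex = obj.replace(r'^A', '***', regex=True)

# DataFrame.replace accepts a nested dict: {column: {old: new}}.
df = pd.DataFrame({'x': ['A', 'B'], 'y': ['A', 'C']})
multi = df.replace({'x': {'A': 'X'}, 'y': {'A': 'Y'}})

print(by_accessor.tolist(), whole_value.tolist(), by_regex.tolist())
print(multi)
```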

IV. Substring Matching and Extraction

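1. The expand parameter (default True)
For a Series and a pattern with one capture group, setting expand to False returns a Series; with more than one capture group the expand parameter has no effect and a DataFrame is always returned.
For an Index and a pattern with one capture group, setting expand to False returns the extracted Index; with more than one capture group and expand set to False, an error is raised.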
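
The rules above can be tried out with str.extract. This is a minimal sketch; the sample values, the patterns, and all variable names are assumptions made for illustration.

```python
import pandas as pd

s = pd.Series(['A_q1', 'B_e2', 'C_r3'], dtype="string")
idx = pd.Index(['A_q1', 'B_e2', 'C_r3'])

# One capture group on a Series.
one_group_frame = s.str.extract(r'([A-Z])_')                  # expand=True (default)
one_group_series = s.str.extract(r'([A-Z])_', expand=False)

# Two capture groups on a Series: expand=False makes no difference.
two_groups = s.str.extract(r'([A-Z])_(\w+)', expand=False)

# One capture group on an Index with expand=False.
idx_result = idx.str.extract(r'([A-Z])_', expand=False)

# Two capture groups on an Index with expand=False is rejected.
raised = False
try:
    idx.str.extract(r'([A-Z])_(\w+)', expand=False)
except ValueError:
    raised = True

print(type(one_group_frame).__name__, type(one_group_series).__name__)
print(type(two_groups).__name__, type(idx_result).__name__, raised)
```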
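2. The str.extractall method
Unlike extract, which matches only the first occurrence of the pattern, extractall finds every matching substring and builds a MultiIndex (even when only one match is found).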
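
A small comparison of extract and extractall on the same data. The sample values and the names `first` and `all_matches` are illustrative assumptions.

```python
import pandas as pd

s = pd.Series(['a1b2', 'c3', 'none'], dtype="string")

# extract keeps only the first match in each row.
first = s.str.extract(r'(\d)')

# extractall keeps every match; rows without a match are dropped and
# the result is indexed by (original label, match number).
all_matches = s.str.extractall(r'(\d)')

print(first)
print(all_matches)
```

The second index level produced by extractall is named `match`, which makes it straightforward to pick, say, the second match of each row with `all_matches.xs(1, level='match')`.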
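3. str.contains and str.match
The former checks whether a string contains a given regular expression pattern. str.match differs in that it relies on Python's re.match, so it checks whether the string matches the pattern starting from the beginning.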
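
The difference shows up as soon as the pattern occurs somewhere other than the start of the string. A minimal sketch, with sample values and variable names chosen for illustration:

```python
import pandas as pd

s = pd.Series(['dog', 'hotdog', pd.NA], dtype="string")

# contains: the pattern may appear anywhere in the string.
has_dog = s.str.contains(r'dog')

# match: the pattern must match starting at the first character.
starts_with_dog = s.str.match(r'dog')

print(has_dog)
print(starts_with_dog)
```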