4.3 String Manipulation

skywuuuu

Category: Notes, Summaries, and Examples for "Python for Data Analysis". Tags: python, regular expressions, strings

Article link: https://blog.csdn.net/skywuuu/article/details/109600726

String Object Methods

Some built-in methods

split

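Python's split() breaks a string into a list of substrings on the given separator. If the optional maxsplit argument (called num in some tutorials) is given, at most maxsplit splits are made, producing at most maxsplit+1 substrings. The default is -1, which means splitting at every occurrence of the separator.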

val = 'a,b,  guido'

val.split(',')

['a', 'b', '  guido']
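As a small illustration of the maxsplit behavior described above, the sketch below compares a full split with a split limited to one cut. The variable names are only for illustration.

```python
val = 'a,b,  guido'

# Default (maxsplit=-1): split at every comma.
all_parts = val.split(',')
print(all_parts)        # ['a', 'b', '  guido']

# maxsplit=1: cut once, so the remainder stays in one piece.
first_cut = val.split(',', 1)
print(first_cut)        # ['a', 'b,  guido']

# With no separator, split() cuts on runs of whitespace and drops empty strings.
words = 'a  b \t c'.split()
print(words)            # ['a', 'b', 'c']
```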

strip

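Python's strip() removes the given characters from the beginning and end of a string (by default whitespace, including spaces and newlines). The argument is treated as a set of characters, not as a fixed prefix or suffix. Note: it only removes characters at the start or end of the string, never characters in the middle.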

pieces = [x.strip() for x in val.split(',')]

pieces

['a', 'b', 'guido']
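The following sketch shows the points made about strip(): the one-sided variants, the fact that the argument acts as a set of characters, and that interior characters are left alone. The sample strings are made up for illustration.

```python
raw = '  guido \n'

both_sides = raw.strip()     # 'guido'
left_only = raw.lstrip()     # 'guido \n'
right_only = raw.rstrip()    # '  guido'

# The argument is a set of characters: any mix of 'x' and 'y' is removed from both ends.
trimmed = 'xxyHELLOyxx'.strip('xy')   # 'HELLO'

# Whitespace in the middle is never touched.
inner = ' a b c '.strip()             # 'a b c'

print(repr(both_sides), repr(left_only), repr(right_only), trimmed, repr(inner))
```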

The plus sign

For strings, the plus sign means concatenation.

a, b, c = pieces

a + '::' + b + '::' + c

'a::b::guido'

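The join method

Description: Python's join() concatenates the elements of a sequence into a new string, placing the given separator between them.
Usage: str.join(sequence), where str is the separator and sequence is the sequence of strings to be joined.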

'::'.join(pieces)

'a::b::guido'
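One detail worth sketching: join() only accepts strings, so other values have to be converted first. The numbers list below is a made-up example.

```python
pieces = ['a', 'b', 'guido']
joined = '::'.join(pieces)
print(joined)                 # a::b::guido

numbers = [1, 2, 3]

# Joining non-string elements raises TypeError ...
try:
    ','.join(numbers)
    error_type = None
except TypeError as exc:
    error_type = type(exc).__name__

# ... so convert each element to str first.
csv_line = ','.join(str(n) for n in numbers)
print(error_type, csv_line)   # TypeError 1,2,3
```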

The in keyword

'guido' in pieces

True
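Note that the check above runs against the list pieces, so it tests whether some element equals 'guido'. On a string, in tests for a substring instead. A minimal sketch of the difference:

```python
val = 'a,b,  guido'
pieces = [x.strip() for x in val.split(',')]

# On a list: True only if an element is equal to the value.
whole_in_list = 'guido' in pieces    # True
part_in_list = 'gui' in pieces       # False, no element equals 'gui'

# On a string: True if the value occurs anywhere as a substring.
part_in_string = 'gui' in val        # True

print(whole_in_list, part_in_list, part_in_string)
```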

index

Raises ValueError if the substring is not found.

val.index('b')

val.index('::')

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-11-941a327f542d> in <module>
----> 1 val.index('::')


ValueError: substring not found

find

Returns -1 if the substring is not found, instead of raising an error.

val.find('b')

val.find('b:')

-1
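Because find() signals "not found" with -1, its result should be compared explicitly rather than used as a truth value (0 is a valid position and is falsy, while -1 is truthy). The helper position_or_none below is a hypothetical name used only for this sketch.

```python
val = 'a,b,  guido'

def position_or_none(text, sub):
    """Hypothetical helper: return the index of sub in text, or None if absent."""
    pos = text.find(sub)
    return pos if pos != -1 else None

found = position_or_none(val, 'b')      # 2
missing = position_or_none(val, '::')   # None
at_start = position_or_none(val, 'a')   # 0, a valid position even though it is falsy

# index() reports the same positions but raises ValueError when the substring is absent.
try:
    val.index('::')
    raised = False
except ValueError:
    raised = True

print(found, missing, at_start, raised)
```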

count

Counts the occurrences of a substring.

val.count(',')

replace

Replaces occurrences of one substring with another.
Replacing with the empty string deletes the matched characters.

val.replace(',',':')

'a:b:  guido'
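A short sketch of the deletion case mentioned above, plus the optional third argument of replace() that limits how many occurrences are replaced:

```python
val = 'a,b,  guido'

# Replacing with the empty string deletes every comma.
no_commas = val.replace(',', '')         # 'ab  guido'

# The optional count argument limits the number of replacements.
first_only = val.replace(',', ':', 1)    # 'a:b,  guido'

# Strings are immutable: replace() returns a new string and leaves val unchanged.
print(no_commas, '|', first_only, '|', val)
```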

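Python's built-in string methods

Method	Description
count	Return the number of non-overlapping occurrences of a substring in the string
endswith, startswith	Return True if the string ends with the given suffix (or starts with the given prefix)
join	Use the string as the separator for concatenating a sequence of other strings
index	Return the position of the first character of the first occurrence of the substring; raise ValueError if it is not found
find	Return the position of the first character of the first occurrence of the substring; return -1 if it is not found
rfind	Return the position of the first character of the last occurrence of the substring; return -1 if it is not found
replace	Replace occurrences of a substring with another string
strip, rstrip, lstrip	Trim whitespace, including newlines, from both sides, the right side, or the left side, respectively
split	Break the string into a list of substrings using the given separator
lower, upper	Convert alphabetic characters to lowercase or uppercase, respectively
ljust, rjust	Left-justify or right-justify, respectively: pad the opposite side with spaces (or another fill character) to return a string of at least the given width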
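The methods in the table that were not demonstrated earlier can be sketched as follows. The file names and values are made up for illustration.

```python
name = 'report.CSV'

# lower() plus endswith() gives a case-insensitive suffix check.
is_csv = name.lower().endswith('.csv')                   # True

# startswith() also accepts a tuple of candidate prefixes.
known_prefix = name.startswith(('report', 'summary'))    # True

# rfind() locates the last occurrence, handy for taking the final path component.
path = 'a/b/c.txt'
last_slash = path.rfind('/')                             # 3
filename = path[last_slash + 1:]                         # 'c.txt'

# rjust()/ljust() pad to a minimum width with an optional fill character.
right_aligned = '42'.rjust(5, '0')                       # '00042'
left_aligned = 'ab'.ljust(4, '.')                        # 'ab..'

# count() counts non-overlapping occurrences.
pairs = 'aaaa'.count('aa')                               # 2

print(is_csv, known_prefix, filename, right_aligned, left_aligned, pairs)
```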

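Regular Expressions

The functions of the re module fall into three categories:

Pattern matching
Substitution
Splitting

import re  # import the regular expression module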

text = "foo   bar\t baz    \tqux"

re.split (split on a separator pattern)

re.split(r'\s+', text)  # \s+ matches one or more whitespace characters

['foo', 'bar', 'baz', 'qux']

re.compile

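If the same regular expression is applied to many strings, compiling it once with re.compile and reusing the compiled object avoids repeating the compilation work.

erase_space = re.compile(r'\s+')

re.split(erase_space, text)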

['foo', 'bar', 'baz', 'qux']
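A compiled pattern object also has split, sub, findall and similar methods of its own, so it can be called directly instead of being passed to the module-level functions. A minimal sketch, with a made-up list of lines:

```python
import re

text = "foo   bar\t baz    \tqux"
whitespace = re.compile(r'\s+')   # compile once, reuse below

# Equivalent to re.split(whitespace, text).
parts = whitespace.split(text)
print(parts)                      # ['foo', 'bar', 'baz', 'qux']

# Collapse every run of whitespace into a single space.
collapsed = whitespace.sub(' ', text)
print(collapsed)                  # foo bar baz qux

# Reuse the same compiled object across many strings.
lines = ['a  b', 'c\td', 'e \n f']
split_lines = [whitespace.split(line) for line in lines]
print(split_lines)                # [['a', 'b'], ['c', 'd'], ['e', 'f']]
```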

re.findall (match by regular expression)

Returns a list of all substrings that match the regex (regular expression).

erase_space.findall(text)

['   ', '\t ', '    \t']

Example:

text="""Dave dave@google.com 
Steve steve@gmail.com 
Rob rob@gmail.com 
Ryan ryan@yahoo.com
"""

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

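About the r'' prefix: in Python, putting r in front of a string literal makes it a raw string, in which backslashes are kept as literal characters instead of starting escape sequences.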
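The effect of the r prefix can be checked directly by comparing lengths. This is a small sketch with made-up sample strings:

```python
import re

plain = '\n'     # one character: a newline
raw = r'\n'      # two characters: a backslash followed by 'n'
print(len(plain), len(raw))          # 1 2

# A raw string saves doubling the backslashes in regex patterns.
same_pattern = (r'\d+' == '\\d+')    # True

digits = re.findall(r'\d+', 'a1b22')
print(same_pattern, digits)          # True ['1', '22']
```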

regex = re.compile(pattern, flags=re.IGNORECASE)
# re.IGNORECASE makes matching case-insensitive, so A-Z in the pattern is enough (adding a-z as well would do no harm)

regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

search

m = regex.search(text)

m  # the returned object m is a match object

<re.Match object; span=(5, 20), match='dave@google.com'>

m.string

'Dave dave@google.com \nSteve steve@gmail.com \nRob rob@gmail.com \nRyan ryan@yahoo.com\n'

text

'Dave dave@google.com \nSteve steve@gmail.com \nRob rob@gmail.com \nRyan ryan@yahoo.com\n'

text[m.start():m.end()]  # the match covers positions 5 up to (not including) 20

'dave@google.com'

match

Matches only if the pattern occurs at the very start of the string.

print(regex.match(text))

None
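To make the difference concrete, the sketch below runs match, search and fullmatch (a related method not covered in the text above) on two made-up strings with the same email pattern.

```python
import re

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

line = 'Dave dave@google.com'

# match() anchors at position 0, where 'Dave ' is not an email address.
at_start = regex.match(line)               # None

# search() scans the whole string and finds the address later on.
anywhere = regex.search(line).group()      # 'dave@google.com'

# match() succeeds when the string begins with the pattern, even with trailing text.
prefix = regex.match('dave@google.com, and more').group()

# fullmatch() requires the entire string to match.
whole = regex.fullmatch('dave@google.com, and more')   # None

print(at_start, anywhere, prefix, whole)
```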

sub

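Replaces every occurrence of the pattern (email addresses in this example) with the given string.
In the replacement string, sub can also refer to the groups of each match with special symbols such as \1 and \2: \1 is the first matched group, \2 the second, and so on.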

regex.sub('REDACTED', text)

'Dave REDACTED \nSteve REDACTED \nRob REDACTED \nRyan REDACTED\n'

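# Note: \1, \2 and \3 need the version of regex with capture groups that is compiled further below;
# with the ungrouped pattern above, this call fails with an "invalid group reference" error.
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))  # username, domain name, domain suffix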

Dave Username: dave, Domain: google, Suffix: com 
Steve Username: steve, Domain: gmail, Suffix: com 
Rob Username: rob, Domain: gmail, Suffix: com 
Ryan Username: ryan, Domain: yahoo, Suffix: com

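Suppose you want not only to find the email addresses but also to split each address into three parts: username, domain name, and domain suffix. To do this, wrap each part of the pattern you want to capture in parentheses.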

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

regex = re.compile(pattern,flags=re.IGNORECASE)

m = regex.match('john9000@cba.edu.cn')

m.groups()

('john9000', 'cba.edu', 'cn')

regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
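As a variation that is not in the original notes, the groups can be given names with the (?P<name>...) syntax, so each match can be read as a dictionary instead of a positional tuple. The group names user, domain and suffix are chosen here only for illustration.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
"""

# Same pattern as above, but each group carries a name.
named = re.compile(
    r'(?P<user>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})',
    flags=re.IGNORECASE,
)

# finditer() yields match objects; groupdict() maps group names to matched text.
records = [m.groupdict() for m in named.finditer(text)]
print(records)

# Named groups can be referenced in a replacement string with \g<name>.
masked = named.sub(r'\g<user>@***', text)
print(masked)
```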

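Summary:

Method	Description
findall, finditer	Return all non-overlapping matches of the pattern in a string. findall returns a list of all matches, while finditer returns them one by one through an iterator
match	Match the pattern at the start of the string, optionally capturing parts of the pattern as groups. Return a match object if the pattern matches, otherwise None
search	Scan the whole string for the pattern and return a match object if it is found. Unlike match, the match can be anywhere in the string, not only at the start
split	Break the string into pieces at each occurrence of the pattern
sub, subn	Replace occurrences of the pattern in the string with the given expression. sub returns the new string; subn returns a tuple of the new string and the number of replacements made. The optional count argument limits how many occurrences are replaced. In the replacement string, symbols such as \1 and \2 refer to the matched groups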
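The methods in the summary that were not shown above (subn, the count argument of sub, and finditer) can be sketched like this, using a made-up one-line text:

```python
import re

text = 'Dave dave@google.com, Rob rob@gmail.com, Ryan ryan@yahoo.com'
regex = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', flags=re.IGNORECASE)

# subn() returns the new string together with the number of replacements.
new_text, n_replaced = regex.subn('REDACTED', text)
print(new_text, n_replaced)

# count=1 replaces only the first occurrence.
first_only = regex.sub('REDACTED', text, count=1)
print(first_only)

# finditer() yields match objects lazily; span() gives (start, end) positions.
spans = [m.span() for m in regex.finditer(text)]
print(spans)
```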

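Vectorized String Functions

When the data contains missing values, applying string methods through data.map and similar approaches is awkward, because the function fails on the NA (NaN) values.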

import numpy as np
import pandas as pd

data = {'Dave': 'dave@google.com', 'Rob':'rob@gmail.com', 'Steve':'steve@gmail.com', 'Wes':np.nan}

data = pd.Series(data)

data

Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

data.isnull()

Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool

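Series has array-oriented methods for string operations that skip NA values. They are accessed through the Series's str attribute.
My understanding: the str accessor applies string methods element-wise to the values, so most of the built-in str methods are available, along with some that the built-in str does not have (contains and findall, for example).

Series.str.contains()

data.str.contains('gmail')  # whether each value contains the string 'gmail'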

Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object

Series.str.findall()

pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object

Series.str.match()

data.str.match(pattern,flags=re.IGNORECASE)

Dave     True
Rob      True
Steve    True
Wes       NaN
dtype: object

matches = data.str.match(pattern, flags=re.IGNORECASE)

matches

Dave     True
Rob      True
Steve    True
Wes       NaN
dtype: object

The following two calls, written as in the book, raise an error:

matches.str.get(1)  # raises AttributeError: Can only use .str accessor with string values!

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-113-bd3a29697b06> in <module>
----> 1 matches.str.get(1)


F:\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5268             or name in self._accessors
   5269         ):
-> 5270             return object.__getattribute__(self, name)
   5271         else:
   5272             if self._info_axis._can_hold_identifiers_and_holds_name(name):


F:\Anaconda3\lib\site-packages\pandas\core\accessor.py in __get__(self, obj, cls)
    185             # we're accessing the attribute of the class, i.e., Dataset.geo
    186             return self._accessor
--> 187         accessor_obj = self._accessor(obj)
    188         # Replace the property with the accessor object. Inspired by:
    189         # http://www.pydanny.com/cached-property.html


F:\Anaconda3\lib\site-packages\pandas\core\strings.py in __init__(self, data)
   2039 
   2040     def __init__(self, data):
-> 2041         self._inferred_dtype = self._validate(data)
   2042         self._is_categorical = is_categorical_dtype(data)
   2043         self._is_string = data.dtype.name == "string"


F:\Anaconda3\lib\site-packages\pandas\core\strings.py in _validate(data)
   2096 
   2097         if inferred_dtype not in allowed_types:
-> 2098             raise AttributeError("Can only use .str accessor with string values!")
   2099         return inferred_dtype
   2100 


AttributeError: Can only use .str accessor with string values!

matches.str[0]  # raises the same AttributeError, with the same traceback as above

Next, drop the last entry (the NaN) so that the result has bool dtype, and try str.get() again:

matches2 = data.drop('Wes').str.match(pattern,flags=re.IGNORECASE)

matches2

Dave     True
Rob      True
Steve    True
dtype: bool

matches2.str.get(1)  # raises the same AttributeError, with the same traceback as above

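So the str accessor cannot be used on a Series of boolean values, and in this pandas version str.match returns booleans rather than the matched groups. Switching to str.findall, which returns lists, works: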

matches = data.str.findall(pattern, flags=re.IGNORECASE)

matches  # the elements of matches are lists of tuples, not strings

Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object

matches.str.get(1)

Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64

matches.str[1]

Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64
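The NaN results above are consistent with each element being a list that holds a single tuple: index 1 of that list does not exist, so str.get(1) yields NaN. One possible way to reach the parts, sketched here as my own illustration rather than the book's code, is to take element 0 of each list first and then index into the tuple. This assumes pandas and numpy are installed.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Rob': 'rob@gmail.com',
                  'Steve': 'steve@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# Each non-missing element is a list containing one (user, domain, suffix) tuple.
matches = data.str.findall(pattern, flags=re.IGNORECASE)

# Step 1: take the first (and only) tuple out of each list.
first_match = matches.str[0]

# Step 2: index into the tuple; position 1 is the domain name.
domains = first_match.str[1]
print(domains)
```

The missing value for Wes stays NaN through both steps, so no separate handling is needed.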

Slicing with str[]

data.str[:5]

Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object

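Summary

Method	Description
cat	Concatenate strings element-wise, with an optional separator
contains	Return a boolean array indicating whether each string contains the given pattern
count	Count the occurrences of the pattern in each string
extract	Use a regular expression with groups to extract one or more strings from a Series of strings; the result is a DataFrame with one column per group
endswith	Equivalent to x.endswith(pattern) for each element
startswith	Equivalent to x.startswith(pattern) for each element
findall	Compute the list of all occurrences of the pattern for each string
get	Index into each element (retrieve the i-th item)
isalnum	Equivalent to the built-in str.isalnum
isalpha	Equivalent to the built-in str.isalpha
isdecimal	Equivalent to the built-in str.isdecimal
isdigit	Equivalent to the built-in str.isdigit
islower	Equivalent to the built-in str.islower
isnumeric	Equivalent to the built-in str.isnumeric
isupper	Equivalent to the built-in str.isupper
join	Join the strings in each element of the Series with the given separator
len	Compute the length of each string
lower, upper	Convert case. Equivalent to x.lower() or x.upper() for each element
match	Apply re.match to each element with the given regular expression. The book describes the result as a list of the matched groups; in the pandas version used in these notes it returns booleans
pad	Add whitespace to the left side, right side, or both sides of strings
center	Equivalent to pad(side='both')
repeat	Duplicate values. For example, s.str.repeat(3) is equivalent to x * 3 for each string
replace	Replace occurrences of the pattern with the given string
slice	Slice each string in the Series
split	Split strings on a separator or regular expression
strip	Trim whitespace from both sides, including newlines
rstrip	Trim whitespace on the right side
lstrip	Trim whitespace on the left side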
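As a sketch of the extract entry in the table, the grouped email pattern from earlier can be passed to str.extract to get one column per group. The column names user, domain and suffix are assigned here only for readability, and the example assumes pandas and numpy are installed.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Rob': 'rob@gmail.com',
                  'Steve': 'steve@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# One row per element, one column per capture group; missing input gives NaN in every column.
parts = data.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ['user', 'domain', 'suffix']   # illustrative names, not from the source
print(parts)

# Other vectorized methods from the table skip the missing value in the same way.
lengths = data.str.len()
upper = data.str.upper()
print(lengths['Dave'], upper['Rob'])
```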
