NLP Basics Notes (1): String Handling in Python

＿Carpediem

Last modified 2022-07-19 22:13:24

Tags: python, natural language processing

First published 2022-07-18 15:01:23

Article link: https://blog.csdn.net/chariyan/article/details/125829989

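Text processing in Python: built-in str methods + the regular-expression module (re)

string: s = "a1a2...an"

A string is a finite sequence, so adjacent characters stand in a predecessor/successor relationship

null string: s = "" / s = Φ

Blank string: a string made up only of spaces (it has nonzero length and may contain more than one space)

Main string: a string that contains a substring

String comparison: Python compares strings character by character by Unicode code point; for digits and English letters this coincides with ASCII order, and Chinese characters are ordered by their Unicode code points

ASCII: standard ASCII uses 7 bits per character (128 characters); the 8-bit extended sets hold 256 characters

Unicode: the original 16-bit design held 65,536 characters (about 65 thousand); today's Unicode code space has 1,114,112 code points

The first 128 Unicode code points are identical to ASCII (and the first 256 to Latin-1)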
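The comparison rule above can be checked directly. The following is a minimal sketch using only the built-in `ord`; the sample characters are chosen here purely for illustration.

```python
# Code points behind a few sample characters (ord is a Python built-in)
code_points = {ch: ord(ch) for ch in ["1", "A", "a", "中"]}

# Comparisons follow code-point order, one character at a time
digits_before_letters = "9" < "A" < "a"
ascii_before_chinese = "z" < "中"
first_difference_decides = "abc" < "abd"
shorter_prefix_first = "ab" < "abc"

print(code_points)
print(digits_before_letters, ascii_before_chinese,
      first_difference_decides, shorter_prefix_first)
```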

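Logical structure of a string: similar to a linear list

Linear list: operations focus on inserting, deleting, updating and looking up single elements; string: operations focus on substrings (finding a substring's position, extracting a substring, replacing a substring, and so on)
s = "123" --> here 123 is three characters, not a number

Sequential storage of a string: a fixed-length array

Linked storage of a string: one node can hold several characters

Linked storage is mostly used for concatenating strings; other operations usually use sequential storage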

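1️⃣ Common Python string methods

1. Case conversion

# these are methods of the built-in str type, so no import is needed
s = "abc"
print(s.upper())      # ABC  returns an uppercase copy
s.upper()             # the result is discarded; strings are immutable
print(s)              # abc  the original string is unchanged

print(s.lower())      # abc  returns a lowercase copy, s is not modified
print("DEF".lower())  # def  also works directly on a literal

str1 = "abc,dEf"  # separated by a comma
str2 = "abc dEf"  # separated by a space
print(str1.title())       # Abc,Def  first letter of every word uppercase, the rest lowercase
print(str1.capitalize())  # Abc,def  only the first character of the string uppercase, the rest lowercase
print(str1.swapcase())    # ABC,DeF  upper and lower case swapped

print(str2.title())       # Abc Def
print(str2.capitalize())  # Abc def
print(str2.swapcase())    # ABC DeF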

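2. is* predicate methods

print("123".isdecimal())
# True   every character is a decimal digit
print("ab1".isalpha())
# False  not every character is a letter
print("ab1\\".isalnum())
# False  the backslash is neither a letter nor a digit
print("ABC".isupper())
# True
print("Abc".islower())
# False
print("Abc,Def".istitle())
# True
print("\n".isspace())
# True   every character is whitespace (space, newline \n, tab \t)
print("\t".isprintable())
# False  of the whitespace characters only the space is printable; newline and tab are not
print("w12".isidentifier())
# True   valid identifier (starts with a letter or _, then only letters, digits and _)

Three methods test whether a string consists only of digits: isdecimal(), isdigit() and isnumeric(). For the differences, see the CSDN blog post "python-10.菜鸟教程-6-str.isdecimal () 与str.isdigit()的区别" by LTCM_SAKURA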
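As a quick illustration of how the three digit tests differ, here is a small sketch. The sample characters (a superscript two and a vulgar fraction) are picked here as examples and do not come from the referenced post.

```python
# Each sample is tested with isdecimal(), isdigit() and isnumeric(), in that order
samples = ["123", "²", "½"]
table = {s: (s.isdecimal(), s.isdigit(), s.isnumeric()) for s in samples}

for s, flags in table.items():
    print(repr(s), flags)
```

Roughly speaking, isdecimal() is the strictest test and isnumeric() the most permissive, with isdigit() in between.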

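3. Padding

center(width, fillchar): pads on both sides of the string

ljust(width, fillchar): left-justifies the string, so the padding goes on the right

rjust(width, fillchar): right-justifies the string, so the padding goes on the left

Padding happens only when the string is shorter than width; otherwise the original string is returned
fillchar is the padding character (exactly one character); it defaults to a space
print("ww".center(8, "-"))
# ---ww---

zfill(width): right-justifies the string and pads on the left, always with 0

A leading "+" or "-" sign is recognised and kept in front; the zeros are inserted after it
print("12".zfill(8))
# 00000012
print("-12".zfill(8))
# -0000012
print("+1b".zfill(8))
# +000001b
print("#12".zfill(8))
# 00000#12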
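The section shows center() and zfill() in action but not ljust() and rjust(); a short sketch of those two, with widths chosen only for illustration:

```python
left = "ww".ljust(6, "-")        # text on the left, padding on the right
right = "ww".rjust(6, "-")       # text on the right, padding on the left
unchanged = "toolong".center(3)  # already wider than width, returned as is

print(left, right, unchanged)
```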

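4. Counting substrings

count(sub[, start[, end]]): returns the number of non-overlapping occurrences of sub in the string (0 if there is none)

Meaning of [, start[, end]]:

start and end bound the search; if both are omitted, the whole string is searched
start defaults to 0 and end defaults to the length of the string
a single number is taken as start, and the search runs to the last character
print("whyiswhywh".count("why"))
# 2
print("whyiswhywh".count("why", 1))
# 1  indexing starts at 0, so the search begins at the second character, i.e. in "hyiswhywh"
print("whyiswhywh".count("why", 1, 5))
# 0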

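5. Prefix and suffix tests (return a bool)

startswith(prefix[, start[, end]]): tests whether the string starts with prefix

endswith(suffix[, start[, end]]): tests whether the string ends with suffix

print("whyiswhywh".startswith("hy", 1))
# True   the test is applied to "hyiswhywh"
print("whyiswhywh".endswith("why", 8))
# False  the test is applied to "wh"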

6. Substring position

find(sub[, start[, end]]): returns the index where the first occurrence of sub starts, or -1 if sub is not found

rfind(sub[, start[, end]]): returns the index where the last (rightmost) occurrence of sub starts, or -1 if sub is not found

index(sub[, start[, end]]): like find, but raises an error if sub is not found

rindex(sub[, start[, end]]): like rfind, but raises an error if sub is not found

s = "whyiswhywh"
print(s.find("hy"))
# 1
print(s.rfind("wh"))
# 8
print(s.rfind("who"))
# -1
print(s.index("who"))
# ValueError: substring not found
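Since index() raises where find() returns -1, calling code has to pick one convention. A minimal sketch of handling the exception follows; `locate` is a hypothetical helper named here only for illustration.

```python
text = "whyiswhywh"

def locate(haystack, needle):
    """Return the first index of needle, or None when it is absent (hypothetical helper)."""
    try:
        return haystack.index(needle)
    except ValueError:
        return None

found = locate(text, "is")
missing = locate(text, "who")
contains = "is" in text  # a plain membership test needs no index at all

print(found, missing, contains)
```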

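7. Replacement

replace(old, new[, count]):

old is the substring to replace, new is its replacement, and the optional count limits how many occurrences are replaced
if old does not occur in the string, the original string is returned unchanged

s = "whyiswhywh"
print(s.replace("wh", "12", 2))
# 12yis12ywh
print(s.replace("ad", "12"))
# whyiswhywh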

8. Splitting

partition(sep): splits the string at the first occurrence of sep into three parts: the part before sep, sep itself, and the part after sep

rpartition(sep): returns the same three parts, but splits at the last occurrence of sep

Difference when sep does not occur in the string:

partition(sep): returns (original string, '', '')
rpartition(sep): returns ('', '', original string)
s = "whyiswhywh"
print(s.partition("is"))
# ('why', 'is', 'whywh')
print(s.rpartition("is"))
# ('why', 'is', 'whywh')
print(s.rpartition("am"))
# ('', '', 'whyiswhywh')

split(sep=None, maxsplit=-1): cuts the string at sep at most maxsplit times and returns the list of pieces

rsplit(sep=None, maxsplit=-1): same, but scans from right to left

sep is the delimiter; with the default None the string is split on runs of whitespace
maxsplit is the maximum number of cuts; maxsplit=-1 (the default) cuts at every sep

s = "a;b;c"  # the original read this value with input()
print(s.split(";", 0))
# ['a;b;c']
print(s.split(";", 1))
# ['a', 'b;c']
print(s.split(";", -1))
# ['a', 'b', 'c']
print(s.split(";", 1)[1])
# b;c

# store the pieces in separate variables
u1, u2, u3 = s.split(";", -1)
print(u1)
# a

# split on newlines
str1 = '''hello
python'''
print(str1)
# hello
# python
print(str1.split('\n'))
# ['hello', 'python']

# chaining splits
str2 = "hello<[www.csdn.cn]>bye"
print(str2.split('[')[1].split(']')[0])
# www.csdn.cn
print(str2.split('[')[1].split(']')[0].split('.'))
# ['www', 'csdn', 'cn']
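Two details of split() are easy to miss: the default separator collapses runs of whitespace, and rsplit() only differs from split() when maxsplit is limited. A small sketch, with sample strings made up for illustration:

```python
line = "  a  b\tc \n"
default_split = line.split()      # no sep: runs of whitespace act as one separator
space_split = "a  b".split(" ")   # explicit sep: every single space cuts

record = "a;b;c"
from_left = record.split(";", 1)   # one cut, counted from the left
from_right = record.rsplit(";", 1) # one cut, counted from the right

print(default_split, space_split, from_left, from_right)
```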

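9. Concatenation

(1) Concatenation with +

str1 = "123"
str2 = "abc"
print(str1 + str2)
# 123abc

(2) join(): joins the items of an iterable, using the string as the separator

Iterables: string, list, tuple, dict (its keys), set; every item must be a str, numbers are not accepted

a = "why"  # str
print('_'.join(a))
# w_h_y

b = ('a', 'b', 'c')  # tuple
print('='.join(b))
# a=b=c

c = {"why", "xy"}  # set
print(" ".join(c))  # note: the separator here is an explicit space character
# why xy   (a set is unordered, so "xy why" is also possible)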
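Because join() accepts only str items, numbers have to be converted first. The sketch below shows the failure and one common workaround using `map(str, ...)`; the sample data is made up.

```python
numbers = [1, 2, 3]

# Joining non-str items raises TypeError
try:
    ",".join(numbers)
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

joined = ",".join(map(str, numbers))      # convert each item to str first
keys_joined = "-".join({"a": 1, "b": 2})  # iterating a dict yields its keys

print(error_name, joined, keys_joined)
```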

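10. Trimming

strip([chars]): removes leading and trailing characters that appear in chars

lstrip([chars]): removes matching characters only from the left (start) of the string

rstrip([chars]): removes matching characters only from the right (end) of the string

With no argument, whitespace (spaces, tabs, newlines) is removed
Note: chars is treated as a set of characters, not as a prefix or suffix; removal stops at the first character that is not in chars
a = "    whyiswhy    "
print(a.strip())
# whyiswhy
print("wwhyiswhy    ".lstrip('w'))
# hyiswhy      (the trailing spaces are kept)
print("  whyiswhy".rstrip('why'))
#   whyis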
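The "set of characters" behaviour is a frequent source of surprises when the intent is to drop an exact suffix. A short sketch; `drop_suffix` is a hypothetical helper written here for illustration, not a str method.

```python
host = "www.example.com".strip("cmowz.")  # strips any of c, m, o, w, z, . from both ends

over_stripped = "happy.py".rstrip(".py")  # removes trailing ., p and y characters, not the suffix

def drop_suffix(text, suffix):
    """Remove suffix only when text really ends with it (hypothetical helper)."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

exact = drop_suffix("happy.py", ".py")

print(host, over_stripped, exact)
```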

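2️⃣ Python regular expressions: the re module

1. Overview

○ Regular expression (regex, RE): an expression that concisely describes the features of a set of strings; it is mainly used for string pattern matching.

○ Steps for using a regular expression:

(1) Find the regularity in the text

(2) Express the regularity with regex symbols

(3) Extract the information (a single character that does not fit makes the match fail)

○ Regex symbols:

Special characters:

'.': matches any single character except the newline ('\n')

'*': the preceding subexpression may occur 0 to unlimited times

'?': the preceding subexpression may occur 0 or 1 time

'$': matches the end of a line (normally written at the very end of the expression)

'^': matches the start of a line (normally written at the very beginning of the expression)

'+': the preceding subexpression may occur 1 to unlimited times

'|': alternation; matches either the expression on its left or the one on its right

'()': groups a subexpression and captures what it matched

'[ ]': a character set

[abc]: matches any single one of the listed characters
[a-z0-9]: matches a single character in the given ranges; a leading ^ negates the set, e.g. [^a-z0-9]
[2-9][1-3]: sets can be combined, here one character from 2-9 followed by one from 1-3

'{}': specifies how many times the preceding subexpression occurs

{n,m}: at least n and at most m times
{n,}: at least n times, with no upper limit
{n}: exactly n times

Predefined character classes:

'\d': matches a decimal digit 0-9

'\D': matches any non-digit character (so it also matches the underscore)

'\s': matches a whitespace character (space, TAB, etc.)

'\S': matches any non-whitespace character (so it also matches the underscore)

'\w': matches a word character: letters, Chinese characters, digits and the underscore (a-z A-Z 0-9 _)

'\W': matches any character that is not a word character; since \w includes the underscore, \W does not match it

A single backslash cannot stand alone in a regular expression (or in an ordinary Python string literal), so '\' has to be escaped, either by doubling it or by using a raw string:

s = "\\123 233"  # \123 233

s = r"\123 233"  # \123 233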
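The symbols listed above can be tried out with re.findall. The following is a small sketch; the sample strings are invented for illustration and each line exercises one symbol.

```python
import re

optional = re.findall(r"colou?r", "color colour colouur")  # ? : the u is optional
repeats = re.findall(r"\d{2,3}", "1 22 333 4444")           # {n,m}: 2 to 3 digits, greedy
either = re.findall(r"cat|dog", "cat dog bird")             # | : alternation
negated = re.findall(r"[^a-z0-9]", "a1_B")                  # [^...]: negated set
word_chars = re.findall(r"\w+", "snake_case 中文 x-1")       # \w includes _ and Chinese characters
non_word = re.findall(r"\W", "a_b-c")                       # \W does not match the underscore

# A raw string and a doubled backslash spell the same pattern
raw_equal = "\\d" == r"\d"

print(optional, repeats, either, negated, word_chars, non_word, raw_equal)
```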

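○ General steps for using the re module:

(1) Compile the regular expression string into a Pattern instance

(2) Use the Pattern instance to process the text and obtain the result (a Match instance)

(3) Use the Match instance to read information and carry out further operations

import re

pattern = re.compile(r'hello.*!')  # (1)
match = pattern.match('hello,why! How are you?')  # (2)
# match holds the result; it is None when the pattern does not match

if match:
    print(match.group())  # (3) hello,why!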

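2. Pattern

A Pattern object is a compiled regular expression; its methods match and search text

A Pattern cannot be instantiated directly; it has to be built with re.compile()

Pattern attributes:

1. pattern: the expression string used at compile time.
2. flags: the matching flags used at compile time, as an integer.
3. groups: the number of capturing groups in the expression.
4. groupindex: a dictionary mapping the name of each named group to its group number; unnamed groups are not included.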
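The four attributes can be inspected on any compiled pattern. A minimal sketch, where the pattern and the group name `second` are chosen here only as an example:

```python
import re

p = re.compile(r"(\w+) (?P<second>\w+)", re.I)

info = {
    "pattern": p.pattern,                  # the source string
    "ignorecase": bool(p.flags & re.I),    # flags is an integer bit mask
    "groups": p.groups,                    # number of capturing groups
    "groupindex": dict(p.groupindex),      # named groups only
}
print(info)
```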

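Pattern methods:

1. match()
Note: this is not a full match. If the pattern is exhausted while string still has characters left, the match still counts as successful. For a full match, append the boundary anchor '$' to the expression.
2. search()
Starting at index pos of string, this method tries to match pattern. If the pattern matches there, a Match object is returned; otherwise pos is advanced by 1 and the attempt is repeated, and if nothing has matched by the time pos reaches endpos, None is returned.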
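The difference between match() and search(), and the effect of the optional pos argument, can be seen in a few lines. A sketch with made-up sample strings; fullmatch() is included as an alternative to appending '$'.

```python
import re

digits = re.compile(r"\d+")

at_start = digits.match("abc123")           # None: no digits at index 0
anywhere = digits.search("abc123").group()  # search scans forward
from_pos = digits.match("abc123", 3).group()  # pos=3 moves the starting index

prefix_ok = digits.match("123abc").group()        # match() accepts leftover text
anchored = re.compile(r"\d+$").match("123abc")    # None: '$' forces the end of the string
full = digits.fullmatch("123abc")                 # None: the whole string must match

print(at_start, anywhere, from_pos, prefix_ok, anchored, full)
```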

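3. re.compile(pattern[, flags]): compiles the regular expression string pattern into a Pattern object

pattern: the regular expression
flags: the matching mode

Possible flag values:

re.I (re.IGNORECASE): ignore case

re.M (re.MULTILINE): multi-line mode; changes the behaviour of '^' and '$' so that they match at every line

re.S (re.DOTALL): dot-matches-all mode; changes the behaviour of '.' so that it also matches the newline

re.L (re.LOCALE): makes the predefined classes (\b, \B, \w, etc.) depend on the current locale; in Python 3 it is only allowed with bytes patterns

re.U (re.UNICODE): makes the predefined classes (\b, \B, \w, etc.) depend on Unicode character properties; in Python 3 this is already the default for str patterns

re.X (re.VERBOSE): verbose mode (the expression may span several lines and contain comments; whitespace is ignored)
# the following two regular expressions are equivalent
regex_1 = re.compile(r"""\d +  # integer part
                         \.    # decimal point
                         \d *  # fractional digits""", re.X)
regex_2 = re.compile(r"\d+\.\d*")

Return value: a Pattern object (compile() by itself does no matching; combine it with findall(), search() or match())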
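To make the effect of the most common flags concrete, here is a short sketch on a two-line sample string invented for the purpose; flags are combined with `|`.

```python
import re

text = "Hello\nworld"

ignore_case = re.findall(r"hello", text, re.I)       # case is ignored
line_starts = re.findall(r"^\w+", text, re.M)        # ^ matches at every line start
without_m = re.findall(r"^\w+", text)                # ^ matches only at the string start
dot_default = re.search(r"Hello.world", text)        # '.' does not match '\n' by default
dot_all = re.search(r"Hello.world", text, re.S)      # with re.S it does
combined = re.findall(r"^WORLD$", text, re.I | re.M) # two flags at once

print(ignore_case, line_starts, without_m, dot_default, dot_all is not None, combined)
```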

compile() + findall(): returns a list

import re
def main():
    context = "Hello, I am why, from dalian, nice to meet you……"
    regex = re.compile(r'\w*o\w*')
    x = regex.findall(context)
    print(x)
if __name__ == '__main__':
    main()
# ['Hello', 'from', 'to', 'you']

compile() + match(): returns a Match object, whose methods yield str, tuple and dict values
Note: match() starts matching at position 0 and returns None when nothing matches there (None has no span/group methods); on success, group() returns the single word 'Hello' and matching stops there
import re
def main():
    context = 'Hello, I am why, nice to meet you……'
    regex = re.compile(r'\w*o\w*')
    y = regex.match(context)
    print(y)              # <re.Match object; span=(0, 5), match='Hello'>  (Python 3.7+; older versions show _sre.SRE_Match)
    print(type(y))        # <class 're.Match'>
    print(y.group())      # Hello
    print(y.span())       # (0, 5)
    print(y.groupdict())  # {}

if __name__ == '__main__':
    main()

compile() + search(): returns the same kind of object as match()
Note: search() does not have to match at position 0, but it also stops after the first matching word

import re
def main():
    context = 'Hello, I am why, nice to meet you……'
    regex = re.compile(r'\w*o\w*')
    z = regex.search(context)
    print(z)              # <re.Match object; span=(0, 5), match='Hello'>  (Python 3.7+; older versions show _sre.SRE_Match)
    print(type(z))        # <class 're.Match'>
    print(z.group())      # Hello
    print(z.span())       # (0, 5)
    print(z.groupdict())  # {}

if __name__ == '__main__':
    main()

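Implicit compile(): instead of calling re.compile, call the module-level function directly as re.<function>(pattern, string, flags=0) [reason: these functions compile the pattern themselves]

import re

texts = ["text one", "text two"]  # stands in for a list of a million strings
pattern = re.compile('regular expression')
for text in texts:
    pattern.search(text)
# re.compile is executed once

for text in texts:
    re.search('regular expression', text)
# re.compile is not executed a million times here either, because the internal _compile keeps a cache:
# for the same regular expression and the same flags, the second call reads from the cache
# explicit precompilation only starts to matter when a project makes millions of regex queries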

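4. re.match(pattern, string, flags=0):

Tries to match the regular expression starting at position 0 of the string; returns a Match object on success and None on failure

Parameters:

pattern: the regular expression

string: the string to match against

flags: the matching mode

Attributes of the Match object:

re: the Pattern object used for the match

string: the text that was passed in for matching

pos/endpos: the indices in the text at which the regex search starts/ends; the values are those of the same-named parameters of Pattern.match() and Pattern.search()

lastindex: the group number of the last captured group; None if no group was captured

lastgroup: the name of the last captured group; None if that group has no name or no group was captured
import re
m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello Why!')
# ( ) marks a group
# ?P<name> gives the group a key: the name goes inside <>, and the value is the text the group matched
print('m.string:', m.string)
# m.string: hello Why!
print('m.re:', m.re)
# m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
print('m.pos:', m.pos)
# m.pos: 0
print('m.endpos:', m.endpos)
# m.endpos: 10
print('m.lastindex:', m.lastindex)
# m.lastindex: 3
# the regular expression has three pairs of parentheses
print('m.lastgroup:', m.lastgroup)
# m.lastgroup: sign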
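Methods of the Match object:

group([group1, ...]): returns the text matched by one or more groups (with more than one argument the result is a tuple)

an argument can be a group number or a group name given as a string
group(): with no argument the default is 0, which returns the whole match
a group that did not take part in the match returns None; a group that matched several times returns its last matched substring
Note: group() returns text only; positions come from span()

groups(): returns the text of all groups as a tuple [equivalent to calling group(1, 2, ..., last)]

groups([default]): default is the value used for groups that did not take part in the match; it defaults to None

groupdict(): returns a dictionary whose keys are the names of the named groups and whose values are the substrings they matched

span([group]): returns (start([group]), end([group]))

start([group]): returns the start index in string of the substring matched by the given group; group defaults to 0
end([group]): same as above, but returns the end index (index of the substring's last character + 1)

expand(template): substitutes the matched groups into template and returns the result

\id, \g<id> and \g<name> reference groups; \0 is not a group reference (use \g<0> for the whole match)
\10: the tenth group
\g<1>0: \1 followed by the character '0' (all of the above are forms allowed in template)
import re
m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello Why!')
print("m.group():", m.group())
# m.group(): hello Why!
print("m.group(1,2):", m.group(1, 2))
# m.group(1,2): ('hello', 'Why')
print("m.groups():", m.groups())
# m.groups(): ('hello', 'Why', '!')
print("m.groupdict():", m.groupdict())
# m.groupdict(): {'sign': '!'}
# in ?P<sign>, sign is the key and the value is the text matched by that group
print("m.start(2):", m.start(2))
# m.start(2): 6
print("m.end(2):", m.end(2))
# m.end(2): 9
print("m.span(2):", m.span(2))
# m.span(2): (6, 9)
print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))
# m.expand(r'\2 \1\3'): Why hello!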
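The template forms with `\g<...>` and the default argument of groups() are not exercised in the example above. A short sketch that reuses the same pattern; the second pattern `(a)(b)?` is added here only to produce a group that does not take part in the match.

```python
import re

m = re.match(r"(\w+) (\w+)(?P<sign>.*)", "hello Why!")

by_name = m.expand(r"\g<2> \g<1>\g<sign>")  # numbered and named references
digit_after = m.expand(r"\g<1>0")           # group 1 followed by a literal 0
whole = m.expand(r"<\g<0>>")                # \g<0> is the whole match

opt = re.match(r"(a)(b)?", "a")             # the second group does not participate
default_none = opt.groups()
default_dash = opt.groups("-")

print(by_name, digit_after, whole, default_none, default_dash)
```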

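5. re.search(pattern, string, flags=0): takes the same arguments and returns the same kind of result as match

Difference from match: match only checks whether the RE matches at the start of string, whereas search scans the whole string for a match (in other words, match returns an object only when the match succeeds at position 0)

6. re.findall(pattern, string, flags=0): scans the whole string and returns all matches in a list; if nothing matches, an empty list is returned

Note: matches do not overlap; text consumed by one match takes no part in the next
Note: with an empty pattern the result is a list of empty strings, one more than the string has characters

import re
print(re.findall(r"", "a2b3c4d5"))  # empty pattern
# ['', '', '', '', '', '', '', '', '']    8 characters, 9 empty strings
print(re.findall(r"\d+\w\d+", "a2b3c4d5"))  # no group
# ['2b3', '4d5']
print(re.findall(r"(ca)*", "ca2b3caa4d5"))  # one group (like groups(): only the group's text is returned)
# ['ca', '', '', '', 'ca', '', '', '', '', '']
print(re.findall(r"(a)(\w+)", "ca2b3 caa4d5"))  # several groups
# [('a', '2b3'), ('a', 'a4d5')]
print(re.findall(r"(a)(\w+(b))", "ca2b3 caa4d5"))  # a group nested inside a group
# [('a', '2b', 'b')]
print(re.findall(r"a(?:\w+)", "ca2b3 caa4d5"))  # non-capturing group (?:...)
# ['a2b3', 'aa4d5']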

7. re.finditer(pattern, string, flags=0): searches string and returns an iterator that yields each match (a Match object) in order

import re
p = re.compile(r'\d+')
for m in p.finditer("one1two2three3"):
    print(m.group())
#1
#2
#3

8. re.split(pattern, string, maxsplit=0, flags=0): splits the string at the regex matches and returns the list of pieces [maxsplit: the maximum number of splits; if omitted, every match is split on]

import re
print(re.split(r"a\w", 'whyabxyacyacde'))
# ['why', 'xy', 'y', 'de']
print(re.split(r"a\w", 'whyabxyacyacde', maxsplit=2))
# ['why', 'xy', 'yacde']
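One behaviour of re.split worth knowing is that capturing groups in the pattern keep the separators in the result. A brief sketch with invented sample strings:

```python
import re

plain = re.split(r"\d", "a1b2c")           # separators are dropped
with_seps = re.split(r"(\d)", "a1b2c")     # a capturing group keeps them in the list
multi = re.split(r"[;,\s]+", "a, b;c  d")  # several delimiter characters at once

print(plain, with_seps, multi)
```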

9. re.sub(pattern, repl, string[, count]): replaces every match of pattern in string with repl and returns the resulting string

when repl is a string: groups can be referenced with \id, \g<id> or \g<name>; \0 is not a group reference
when repl is a function: it takes a single argument (a Match object) and returns the string to substitute
count: the maximum number of replacements; if omitted, every match is replaced

10. re.subn(pattern, repl, string[, count]): returns (sub(repl, string[, count]), number of replacements)

import re
p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello Why!'
print(re.sub(r'(\w+) (\w+)', r'\2 \1', s))
# say i, Why hello!
print(re.sub(r'(\w+) (\w+)', '', s))  # in data cleaning, useless substrings are usually replaced with the empty string
# , !
print(re.subn(r'(\w+) (\w+)', r'\2 \1', s))
# ('say i, Why hello!', 2)
print(re.subn(r'(\w+) (\w+)', '', s))
# (', !', 2)

# repl as a function
def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(p.sub(func, s))
# I Say, Hello Why!
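The count argument and named group references are described above but not shown. A minimal sketch; the sample text and the group names `first` and `second` are made up for illustration.

```python
import re

text = "a1b2c3"

first_two = re.sub(r"\d", "#", text, count=2)  # only the first two matches are replaced
swapped = re.sub(r"(?P<first>\w+) (?P<second>\w+)",
                 r"\g<second> \g<first>", "hello world")  # named references
doubled = re.sub(r"\d", lambda m: str(int(m.group()) * 2), text)  # function as repl
cleaned, n = re.subn(r"\d", "", text)  # result plus number of replacements

print(first_two, swapped, doubled, cleaned, n)
```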

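3️⃣ A simple regular-expression example in Python

After crawling data, we usually clean it or extract the useful information with regular expressions, and end up with 'clean' data.

There are of course many ways to clean and extract, such as BeautifulSoup or CSS selectors, but those are useful only in particular situations, whereas regular expressions are general-purpose.

import requests as rq  # third-party HTTP library
import re

page = rq.get("https://baike.sogou.com/v231013.htm")  # send the request (Sogou Baike)

print(page.status_code)  # 200 means the request succeeded

# extract from the page text with regular expressions
title_pattern = re.compile(r'<h1 id="title".*?>(.*?)</h1>')
title = title_pattern.search(page.text)
if title:  # search() returns None when the pattern is not found
    print(title.group(1))

# extract the entry paragraphs; as written, the pattern matches a literal "<\/p>",
# i.e. a closing tag whose slash is backslash-escaped in the page source
content_pattern = re.compile(r'<p>(.*?)<\\/p>')
contents = content_pattern.findall(page.text)
print(contents)

# strip <a ...> opening tags and the escaped closing tags <\/a> and <\/b>
print([re.sub(r"<a .*?>|<\\/[ab]>", "", x) for x in contents])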
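The example above depends on a live page whose markup may change. The same extract-then-clean steps can be tried offline; the sketch below runs them on a hard-coded HTML snippet that is invented here and uses ordinary (unescaped) closing tags, so it is an illustration of the technique and not a reproduction of the Sogou page.

```python
import re

# Made-up sample markup standing in for a downloaded page
sample = ('<h1 id="title" class="x">Python</h1>'
          '<p>A <a href="/v1">language</a>.</p>'
          '<p>Second <b>bold</b> one.</p>')

# Step 1: extract the title with a lazy group
title_match = re.search(r'<h1 id="title".*?>(.*?)</h1>', sample)
title = title_match.group(1) if title_match else None

# Step 2: extract every paragraph body
paragraphs = re.findall(r"<p>(.*?)</p>", sample)

# Step 3: clean the leftover inline tags out of each paragraph
cleaned = [re.sub(r"<a .*?>|</?[ab]>", "", p) for p in paragraphs]

print(title)
print(cleaned)
```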