From C to Py: Strings and Regular Expressions in Python

zhuoer_GG

Article link: https://blog.csdn.net/zhuoer_GG/article/details/138013871

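This post covers strings and regular expressions in Python. Earlier posts touched on strings only briefly and left gaps; this one walks through the common operations and many of the related functions in more detail.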

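Common String Operations

A string is an immutable data type in Python, so the methods below return a new object instead of modifying the original.

1. String methods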

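str.lower() Converts all characters to lowercase and returns a new string.

str.upper() Converts all characters to uppercase and returns a new string.

str.split(sep=None) Splits str on sep and returns the pieces as a list.

str.count(sub) Counts how many times sub occurs in str.

str.find(sub) Checks whether sub occurs in str: returns -1 if it does not, otherwise the index of the first occurrence.

str.index(sub) Same as find(), but raises an error when sub is not found.

str.startswith(s) Checks whether str starts with s.

str.endswith(s) Checks whether str ends with s.

str.replace(old, new) Replaces every occurrence of old in str with new and returns a new string. An optional third argument limits the number of replacements; by default all occurrences are replaced.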

s='Helloworld'
new_s=s.replace('o', '你好', 1)
print(new_s)
# Output: Hell你好world
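
As a quick check of the methods listed so far, the sketch below runs several of them on one sample string. The sample text and the variable names are made up for illustration.

```python
text = 'Hello World, hello Python'

lowered = text.lower()                 # new string, all lowercase
uppered = text.upper()                 # new string, all uppercase
words = text.split()                   # no sep given: split on whitespace
hello_count = lowered.count('hello')   # occurrences of 'hello'
first_o = text.find('o')               # index of the first 'o'
missing = text.find('z')               # -1 because 'z' does not occur

try:
    text.index('z')                    # index() raises instead of returning -1
    index_failed = False
except ValueError:
    index_failed = True

print(lowered, uppered, words, sep='\n')
print(hello_count, first_o, missing, index_failed)
print(text.startswith('Hello'), text.endswith('Python'))
```

The original string is never changed by any of these calls, which is what immutability means in practice.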

str.center(width, fillchar) Centers str in a field of the given width; the remaining space is padded with fillchar (a space by default).

s='Helloworld'
print(s.center(20))
print(s.center(20, '*'))
# Output:
#     Helloworld     
#*****Helloworld*****

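str.join(iter) Concatenates the elements of iter into one string, inserting str between adjacent elements.

str.strip(chars) Removes from both ends of str any characters listed in chars (whitespace by default). chars is treated as a set of characters, not as a substring.

str.lstrip(chars) Same as strip(), but only on the left side.

str.rstrip(chars) Same as strip(), but only on the right side.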

s='   Hello   World   '
print(s.strip())
print(s.lstrip())
print(s.rstrip())

#Hello   World
#Hello   World   
#   Hello   World
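
The example above strips whitespace only. As a small sketch of the chars argument, the following uses a made-up string padded with x and y to show that chars acts as a set of characters.

```python
s = 'xxyHello Worldyxx'

both = s.strip('xy')    # drop any leading/trailing 'x' or 'y'
left = s.lstrip('xy')   # left side only
right = s.rstrip('xy')  # right side only
print(both)    # Hello World
print(left)    # Hello Worldyxx
print(right)   # xxyHello World

# 'abc' removes a, b and c in any order, not the substring 'abc'
unordered = 'cbaHello'.lstrip('abc')
print(unordered)  # Hello
```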

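2. Three ways to format strings

① Placeholders (the % operator)

%s: string

%d: decimal integer

%f: floating-point number

These placeholders work much like the conversion specifiers of printf in C.

② f-string

Prefix the string literal with f and put the values to substitute inside {}. This requires Python 3.6 or later.

③ str.format

template_string.format(comma-separated arguments)

A few examples: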

name='abc'
age=18
score=98.2
print('Name:%s,Age:%d,Score:%.1f' %(name, age, score))

print(f'Name:{name},Age:{age},Score:{score}')
# The f prefix before the string is required
print('Name:{0},Age:{1},Score:{2}'.format(name, age, score))
print('Name:{2},Age:{0},Score:{1}'.format(age, score, name))
# The numbers in {} are indexes into the arguments of format()
# All four statements print:
# Name:abc,Age:18,Score:98.2

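3. Format specification

A format specification follows the colon inside {} and is built from these optional parts, in order:

: Introduces the format specification

fill A single character used for padding

align <: left-align >: right-align ^: center

width The width of the output

, Thousands separator (integers and floats only)

.precision Number of digits after the decimal point for floats, or the maximum output length for strings

type Integers: b, d, o, x, X Floats: e, E, f, %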
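
The parts listed above can be combined as in the following sketch. The values and variable names are invented for illustration; the same specifications work in both f-strings and str.format.

```python
name = 'abc'
value = 1234567.891

left = f'{name:*<10}'        # fill '*', left-align, width 10
right = f'{name:*>10}'       # right-align
center = f'{name:*^10}'      # center
grouped = f'{value:,.2f}'    # thousands separator, 2 decimal places
bases = f'{255:b} {255:o} {255:x} {255:X}'  # binary, octal, hex
percent = f'{0.256:.1%}'     # percentage with 1 decimal place
sci = '{:e}'.format(value)   # scientific notation via str.format
clipped = f'{"Helloworld":.5}'  # precision truncates a string

for item in (left, right, center, grouped, bases, percent, sci, clipped):
    print(item)
```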

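4. Encoding and decoding

Encoding converts str to bytes; decoding converts bytes back to str.

① Encoding:

str.encode(encoding='utf-8', errors='strict')

② Decoding:

bytes.decode(encoding='utf-8', errors='strict')

encoding is the name of the codec. errors selects how failures are handled and can be 'strict', 'ignore', or 'replace': strict (the default) raises an exception, ignore skips the offending characters, and replace substitutes a replacement marker (? when encoding).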

s='伟大的党,中国'
scode=s.encode(errors='replace')
print(scode)
# Encoded with the default codec (UTF-8): b'\xe4\xbc\x9f\xe5\xa4\xa7\xe7\x9a\x84\xe5\x85\x9a,\xe4\xb8\xad\xe5\x9b\xbd'
scode_gbk=s.encode('gbk', errors='replace')
print(scode_gbk)
# Encoded as GBK: b'\xce\xb0\xb4\xf3\xb5\xc4\xb5\xb3,\xd6\xd0\xb9\xfa'
ss='耶✌︎︎'
scode_error=ss.encode('gbk', errors='ignore')
print(scode_error)
# With errors='ignore', the characters GBK cannot represent (the ✌ symbol here) are silently dropped
print(scode_gbk.decode('gbk'))
print(scode.decode('utf-8'))
# After decoding:
# 伟大的党,中国
# 伟大的党,中国
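
To make the three errors modes easier to compare, here is a minimal sketch that encodes one made-up string to ASCII, which cannot represent the ✌ symbol, and then decodes UTF-8 bytes with the wrong codec.

```python
text = 'Python✌'

# strict (the default) raises when a character cannot be encoded
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = text.encode('ascii', errors='ignore')    # the symbol is dropped
replaced = text.encode('ascii', errors='replace')  # the symbol becomes ?
print(strict_failed, ignored, replaced)

# Decoding with the wrong codec: replace substitutes U+FFFD markers
data = '中'.encode('utf-8')
decoded = data.decode('ascii', errors='replace')
print(decoded)
```

When decoding, replace uses the Unicode replacement character U+FFFD rather than a question mark.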

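Data Validation

Data validation means checking that the data a user enters is "legal" before the program uses it.

Each of the following methods tests the characters of a string and returns True when the condition holds (the result is always a bool).

str.isdigit() All characters are digits (decimal Arabic numerals)

str.isnumeric() All characters are numeric, including Chinese numerals such as 一 and 二

str.isalpha() All characters are letters (Chinese characters count as letters)

str.isalnum() All characters are letters or digits (Chinese characters included)

str.islower() All cased characters are lowercase

str.isupper() All cased characters are uppercase

str.istitle() Every word starts with an uppercase letter followed by lowercase letters

str.isspace() All characters are whitespace (\n, \t, and so on)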
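
The checks above can be tried out as follows. This is only a sketch: the sample strings are arbitrary, and read_age is a hypothetical helper added to show how a check might guard a conversion.

```python
digit_checks = ('123'.isdigit(), '一二三'.isdigit())
numeric_checks = ('123'.isnumeric(), '一二三'.isnumeric())
alpha_check = 'abc中文'.isalpha()
alnum_checks = ('abc123'.isalnum(), 'abc 123'.isalnum())
title_checks = ('Hello World'.istitle(), 'Hello world'.istitle())
space_check = ' \t\n'.isspace()
print(digit_checks, numeric_checks, alpha_check)
print(alnum_checks, title_checks, space_check)


def read_age(raw):
    """Hypothetical helper: return the age as int, or None if raw is not all digits."""
    raw = raw.strip()
    return int(raw) if raw.isdigit() else None


print(read_age(' 18 '), read_age('18a'), read_age(''))
```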

String Processing

1. Concatenation

There are three ways to concatenate strings:

① str.join()

s='world'
print('*'.join(['hello', 'abc', s]))
# Output:
# hello*abc*world

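② Direct concatenation (adjacent string literals are joined automatically; the + operator does the same for variables)

print('hello''world')

③ Concatenation with string formatting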

s1='hello'
s2='world'
print('%s%s' % (s1, s2))
print(f'{s1}{s2}')
print('{0}{1}'.format(s1, s2))
# All three print helloworld

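2. Removing duplicates

Duplicate characters can be removed while keeping first-occurrence order in two ways: by building a new string and testing each character with not in, or by converting to a set and sorting by each character's index in the original string. Both are shown below.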

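s='slfnvndfoieuhvitbnlvmzpmvauubi'
# An arbitrary sample string
# Method 1: append each character only if it has not been seen yet
new_s=''
for item in s:
    if item not in new_s:
        new_s+=item
print(new_s)
# Method 2: deduplicate with a set, then restore the original order
news=set(s)
lst=list(news)
lst.sort(key=s.index)
print(''.join(lst))
# Both print slfnvdoieuhtbmzpa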
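
A third option, not part of the two approaches above, is dict.fromkeys. The sketch below assumes Python 3.7 or later, where dictionaries preserve insertion order.

```python
s = 'slfnvndfoieuhvitbnlvmzpmvauubi'

# dict keys are unique and keep the order in which they were first inserted
deduped = ''.join(dict.fromkeys(s))
print(deduped)  # slfnvdoieuhtbmzpa
```

This avoids the repeated not in scans on the growing result string, which may matter for long inputs.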

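Regular Expressions

A regular expression is a special sequence of characters that makes it easy to check whether a string matches a given format.

1. Metacharacters

Metacharacters are characters with a special meaning, for example ^ and $, which match the start and the end of a string. The main ones are: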

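Metacharacter	Description	Example	Matches
.	Any character except \n	'p\nytho\tn'	p, y, t, h, o, \t, n
\w	A letter, digit, or underscore	'python\n123'	p, y, t, h, o, n, 1, 2, 3
\W	Anything other than a letter, digit, or underscore	'python\n123'	\n
\s	Any whitespace character	'python\t123'	\t
\S	Any non-whitespace character	'python\t123'	p, y, t, h, o, n, 1, 2, 3
\d	Any decimal digit	'python\t123'	1, 2, 3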
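
The table rows can be checked with re.findall, which is introduced later in this post. The sample strings in this sketch are made up.

```python
import re

text = 'python\t123\n_ok'

digits = re.findall(r'\d', text)        # decimal digits
spaces = re.findall(r'\s', text)        # whitespace characters
non_word = re.findall(r'\W', text)      # not a letter, digit, or underscore
dots = re.findall(r'.', 'a\nb')         # . skips the newline
word_chars = re.findall(r'\w', 'a_1 !') # letters, digits, underscore

print(digits, spaces, non_word, dots, word_chars, sep='\n')
```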

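2. Quantifiers

Quantifiers limit how many times the preceding character may match.

Quantifier	Description	Example	Matches
?	The preceding character 0 or 1 times	colou?r	color or colour
+	The preceding character 1 or more times	colou+r	colour or colouu...r (but not color)
*	The preceding character 0 or more times	colou*r	color, colour, or colouu...r
{n}	The preceding character exactly n times	colou{2}r	colouur
{n,}	The preceding character at least n times	colou{2,}r	colouur or colouuu...r
{n,m}	The preceding character at least n and at most m times	colou{2,4}r	colouur, colouuur, or colouuuur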
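
To see the difference between the quantifiers, the following sketch tests each pattern from the table against a few spellings. The matches helper and the word list are hypothetical; re.fullmatch is used so that the whole word has to match.

```python
import re


def matches(pattern, word):
    """Hypothetical helper: True when the whole word matches the pattern."""
    return re.fullmatch(pattern, word) is not None


words = ['color', 'colour', 'colouur', 'colouuuuur']
patterns = [r'colou?r', r'colou+r', r'colou*r',
            r'colou{2}r', r'colou{2,}r', r'colou{2,4}r']

# Map each pattern to the words it accepts
results = {p: [w for w in words if matches(p, w)] for p in patterns}
for pattern, accepted in results.items():
    print(pattern, accepted)
```

Note that colou+r rejects color, because + needs at least one u.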

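3. Other constructs

Construct	Description	Example	Matches
Range []	Any one of the characters listed in []	[.?!] [0-9]	[.?!] matches a period, question mark, or exclamation mark; [0-9] matches any digit from 0 to 9
Range []	Any single Chinese character	[\u4e00-\u9fa5]	Any one character from U+4E00 to U+9FA5
Negation ^	Any character not listed in []	[^0-9]	Any character other than the digits 0 to 9
Alternation |	The expression on either side of |	\d{18}|\d{15}	An 18-digit or a 15-digit ID number
Escape \	Same as the escape character in Python	\.	Treats . as an ordinary character
Grouping ()	Changes what a quantifier or | applies to	six|fourth (six|four)th	six|fourth matches six or fourth; (six|four)th matches sixth or fourth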
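
The constructs in this table can be exercised as in the sketch below. All sample strings are invented, and the 15-digit number is a placeholder, not a real ID.

```python
import re

punct = re.findall(r'[.?!]', 'Hi! Really? Yes.')      # characters in the set
non_digits = re.findall(r'[^0-9]', 'a1b2')            # negated set
hanzi = re.findall(r'[\u4e00-\u9fa5]', 'Py中文3')      # Chinese characters
escaped = re.findall(r'\d\.\d', '3.1 3x1')            # \. is a literal dot

# Grouping changes what the alternation covers
grouped = re.fullmatch(r'(six|four)th', 'sixth') is not None
ungrouped = re.fullmatch(r'six|fourth', 'sixth') is not None

# Alternation between an 18-digit and a 15-digit form
id_ok = re.fullmatch(r'\d{18}|\d{15}', '1' * 15) is not None
id_bad = re.fullmatch(r'\d{18}|\d{15}', '1' * 16) is not None

print(punct, non_digits, hanzi, escaped)
print(grouped, ungrouped, id_ok, id_bad)
```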

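4. The re module

re is a module that ships with Python and implements regular expression operations.

Function	Description
re.match(pattern, string, flags=0)	Matches at the start of the string only; returns a Match object on success, otherwise None
re.search(pattern, string, flags=0)	Scans the whole string for the first match; returns a Match object on success, otherwise None
re.findall(pattern, string, flags=0)	Finds every match of the pattern in the whole string and returns them as a list
re.sub(pattern, repl, string, count=0, flags=0)	Replaces the substrings that match pattern with repl
re.split(pattern, string, maxsplit=0, flags=0)	Splits the string on matches of pattern, like the split() method of str

The following examples show these functions in use: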

import re
# Import the module (modules are covered in a later post)
pattern=r'\d\.\d+'   # pattern string; the r prefix keeps the backslashes literal
s='I study Python 3.12 every day'
match=re.match(pattern, s, re.I)
print(match)  # None
ss='3.12 Python I study every day'
match2=re.match(pattern, ss)
print(match2)  # <re.Match object; span=(0, 4), match='3.12'>

print('Start index of the match:', match2.start())
print('End index of the match:', match2.end())
print('Span of the match:', match2.span())
print('String that was searched:', match2.string)
print('Matched text:', match2.group())
# Run this part yourself to see the output
print('-'*50)
s='3.12 I study Python everyday and I Python 2.7 love you'
match=re.search(pattern, s)
print(match)  # <re.Match object; span=(0, 4), match='3.12'>
print(match.group())  # 3.12
lst=re.findall(pattern, s)
print(lst)  # ['3.12', '2.7']
print('-'*50)

pattern='hacker|crack|scrape'
s='I want to learn Python and crack some VIP videos. Can Python scrape without limits?'
new_s=re.sub(pattern, 'XXX', s)
print(new_s)  # I want to learn Python and XXX some VIP videos. Can Python XXX without limits?

s='https://blog.csdn.net/zhuoer_GG/article/details/136120487?spm=1001.2014.3001.5501'
pattern='[?|=]'   # inside [] the | is a literal character: split on ?, | or =
lst=re.split(pattern, s)
print(lst)
# ['https://blog.csdn.net/zhuoer_GG/article/details/136120487', 'spm', '1001.2014.3001.5501']
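
The examples above leave the count, maxsplit, and flags arguments at their defaults. A short sketch of what they change, using a made-up string:

```python
import re

s = 'a1b22c333'

all_replaced = re.sub(r'\d+', '#', s)             # replace every run of digits
first_only = re.sub(r'\d+', '#', s, count=1)      # replace only the first run
all_split = re.split(r'\d+', s)                   # trailing match leaves ''
one_split = re.split(r'\d+', s, maxsplit=1)       # split at most once

# re.I makes the match case-insensitive
any_case = re.findall(r'python', 'Python PYTHON python', flags=re.I)

print(all_replaced, first_only)
print(all_split, one_split)
print(any_case)
```

Passing count, maxsplit, and flags by keyword keeps the calls readable and avoids mixing up their positions.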