Python Practice Handbook: String Processing and Regular Expressions

Latest recommended article published 2022-04-19 15:15:14

J_Xiong0117

Article link: https://blog.csdn.net/u013010473/article/details/103848356

1. Reversing a string

st = "this is a string"
## Method 1
''.join(reversed(st))

'gnirts a si siht'

## Method 2
st[::-1]

'gnirts a si siht'

2. Joining strings

my_str = ['this','is','a','string']
' '.join(my_str) ## join the strings with a single space

'this is a string'

3. Byte length of a string

length1 = len('this is a string')
length2 = len('this is a string'.encode('utf-8'))
print(f'length1:{length1}')
print(f'length2 (byte length):{length2}')

length1:16
length2 (byte length):16

length3 = len('中国梦')
length4 = len('中国梦'.encode('utf-8'))
print(f'length3:{length3}')
print(f'length4 (byte length):{length4}')

length3:3
length4 (byte length):9
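
The gap between the two lengths comes from UTF-8 using a variable number of bytes per character. A minimal sketch that lists the byte count of each character; the helper name utf8_sizes is hypothetical and introduced only for illustration:

```python
def utf8_sizes(text):
    """Return (character, UTF-8 byte count) pairs for each character in text."""
    return [(ch, len(ch.encode('utf-8'))) for ch in text]

sizes = utf8_sizes('a中')
print(sizes)  # [('a', 1), ('中', 3)]
```

ASCII letters take one byte each, while each of the three Chinese characters in the example takes three bytes, which is where the total of 9 comes from.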

4. Finding the index of a regex match

import re

s1 = 'my name is jack,my code is 9527'
pat = 'jack'
r = re.search(pat,s1)  # re.search() finds the first match
print(r.span())

(11, 15)

s2 = '你好！我想了解下健康保险和医疗保险的区别'
pat = '保险'
r = re.finditer(pat,s2)
for i in r:
    print(i)

<re.Match object; span=(10, 12), match='保险'>
<re.Match object; span=(15, 17), match='保险'>
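
When only the positions are needed, the match objects yielded by finditer can be reduced to their spans. A short sketch reusing the same string and pattern as above:

```python
import re

s2 = '你好！我想了解下健康保险和医疗保险的区别'
# Collect the (start, end) index pair of every occurrence.
spans = [m.span() for m in re.finditer('保险', s2)]
print(spans)  # [(10, 12), (15, 17)]
```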

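5. Matching digits [0-9]

findall returns every non-overlapping match in the string. \d is the character class for a decimal digit (0-9; in Python 3 str patterns it also matches other Unicode decimal digits), and + means one or more repetitions.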

s = '一共20行代码运行时间13.59s'
pat = r'\d+'  
r = re.findall(pat,s)
print(r)

['20', '13', '59']

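6. Matching floating-point numbers

? means zero or one occurrence of the preceding item. Note that this pattern requires at least two digits in total, so the lone 8 at the end of the string is not matched.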

s = '一共20行代码运行时间13.59s,8'
pat = r'\d+\.?\d+'  
r = re.findall(pat,s)
print(r)

['20', '13.59']
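
The pattern above skips single-digit numbers because it demands a digit on both sides of the optional dot. One possible variant, offered as a sketch rather than the author's method, makes the whole fractional part optional with a non-capturing group:

```python
import re

s = '一共20行代码运行时间13.59s,8'
# One or more digits, optionally followed by a dot and more digits.
pat = r'\d+(?:\.\d+)?'
numbers = re.findall(pat, s)
print(numbers)  # ['20', '13.59', '8']
```

The group is written as (?:...) so that findall still returns the full match instead of only the group's contents.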

7. Matching the start of a string (^)

s1 = 'my name is jack,my code is 9527'
pat = r'^[emrt]'
r = re.findall(pat,s1)
print(r)

['m']
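
By default ^ anchors only at the very beginning of the string. With the re.M (re.MULTILINE) flag it also anchors after each newline. A small sketch with a made-up three-line string:

```python
import re

text = 'my name is jack\nmy code is 9527\nyour code is 1024'
first_only = re.findall(r'^my', text)        # ^ matches at index 0 only
every_line = re.findall(r'^my', text, re.M)  # ^ matches at each line start
print(first_only)  # ['my']
print(every_line)  # ['my', 'my']
```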

8. Ignoring case (re.I)

s = 'That'
pat = r't'
r = re.findall(pat,s,re.I)
print(r)

['T', 't']

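9. What compile is for

If a pattern will be matched many times, compile it once and reuse the compiled object (\W matches any character that is not a word character, i.e. not a letter, digit, or underscore; Chinese characters count as letters).

import re

pat = re.compile(r'\W+')
## First use of the compiled pattern
has_special_chars = pat.findall('ed#2@ed*&$cz中国梦！')
if has_special_chars:
    print(f'The sentence contains special characters: {has_special_chars}')

## Second use of the compiled pattern
again_pattern = pat.findall("jack18588951684@gmail.com")
if '@' in again_pattern:
    print('This may be an email address!')

The sentence contains special characters: ['#', '@', '*&$', '！']
This may be an email address!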
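
One detail worth checking by hand is that \W treats the underscore and Chinese characters as word characters, so neither is reported as special. A quick sketch with made-up inputs:

```python
import re

pat = re.compile(r'\W+')
underscore_case = pat.findall('user_name-01')  # only the hyphen is a non-word character
chinese_case = pat.findall('中国梦')            # Chinese characters are word characters
print(underscore_case)  # ['-']
print(chinese_case)     # []
```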

10. Capturing words with () while leaving out the whitespace

s = 'This module provides regular expression matching operations similar to those found in Perl'
pat = r'\s([a-zA-Z]+)'
r = re.findall(pat,s)
print(r)

['module', 'provides', 'regular', 'expression', 'matching', 'operations', 'similar', 'to', 'those', 'found', 'in', 'Perl']

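The result is missing the first word, because no whitespace precedes it for \s to match. Making the whitespace optional with ? (zero or one occurrence of the preceding item) fixes this. Be careful, though: when ? follows another quantifier (as in *? or +?), it instead switches that quantifier from greedy to non-greedy matching.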

s = 'This module provides regular expression matching operations similar to those found in Perl'
pat = r'\s?([a-zA-Z]+)' 
r = re.findall(pat,s)
print(r)

['This', 'module', 'provides', 'regular', 'expression', 'matching', 'operations', 'similar', 'to', 'those', 'found', 'in', 'Perl']
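
Another way to pick out whole words without touching the whitespace at all is the \b word-boundary assertion. This is an alternative sketch, not the approach used above, shown on a shortened version of the sentence:

```python
import re

s = 'This module provides regular expression matching operations'
# \b matches the empty position between a word character and a non-word character.
words = re.findall(r'\b[a-zA-Z]+\b', s)
print(words)
# ['This', 'module', 'provides', 'regular', 'expression', 'matching', 'operations']
```

Because \b consumes no characters, no capture group is needed to strip the spaces.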

11. Splitting English text into words

## Method 1: re.split
s = 'This module provides regular expression matching operations similar to those found in Perl'
pat = r'\s+'
r = re.split(pat,s)
print(r)

['This', 'module', 'provides', 'regular', 'expression', 'matching', 'operations', 'similar', 'to', 'those', 'found', 'in', 'Perl']

## Method 2: the built-in str.split
res = s.split(' ')
print(res)

['This', 'module', 'provides', 'regular', 'expression', 'matching', 'operations', 'similar', 'to', 'those', 'found', 'in', 'Perl']

However, when the separators are more varied, str.split with a single fixed separator is not enough, and a regular expression is the practical choice.

s = 'This,,,  module ; \t  provides|| regular; '
words = re.split(r'[,\s;|]+',s)
print(f"words:{words}")
words = [i for i in words if len(i)>0]
print(f"words (empty strings removed):{words}")

words:['This', 'module', 'provides', 'regular', '']
words (empty strings removed):['This', 'module', 'provides', 'regular']
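
A related point about method 2: split(' ') splits on every single space, while split() with no argument splits on any run of whitespace and drops empty strings. A small sketch with a made-up string containing a double space, a tab, and a newline:

```python
s = 'This  module\tprovides\nregular'
by_space = s.split(' ')    # splits only on ' ', keeps the empty piece between the two spaces
by_whitespace = s.split()  # splits on runs of spaces, tabs, and newlines
print(by_space)       # ['This', '', 'module\tprovides\nregular']
print(by_whitespace)  # ['This', 'module', 'provides', 'regular']
```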

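12. Matching from a given start position with match

match: matches only at the beginning of the string, or at the position given by pos

import re

mystr = 'This'
pat = re.compile('hi')
r1 = pat.match(mystr)
print(f'r1:{r1}')
r2 = pat.match(mystr,1) # start matching at position 1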
print(f'r2:{r2}')

r1:None
r2:<re.Match object; span=(1, 3), match='hi'>

search: scans the string and returns the first match found at any position

mystr = 'This'
pat = re.compile('hi')
r3 = pat.search(mystr)
print(f'r3:{r3}')

r3:<re.Match object; span=(1, 3), match='hi'>
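
For a side-by-side view, the sketch below runs match, search, and fullmatch of the same compiled pattern against the same string. The optional pos and endpos arguments of fullmatch restrict the region that must be covered:

```python
import re

pat = re.compile('hi')
text = 'This'

m_match = pat.match(text)           # None: 'hi' does not start at index 0
m_search = pat.search(text)         # found at span (1, 3)
m_full = pat.fullmatch(text)        # None: 'hi' does not cover the whole string
m_region = pat.fullmatch(text, 1, 3)  # text[1:3] is exactly 'hi'

print(m_match, m_full)
print(m_search.span(), m_region.span())  # (1, 3) (1, 3)
```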

13. Replacing matched substrings

sub replaces every match of the pattern with the replacement string

import re

content = "你好！我刚来中国不久，我想了解下你们公司都有哪些类型的保险？"
source = ['中国','保险']
target = ['China','Insurance']
for i in range(len(source)):
    pat = re.compile(source[i]) # the part to be replaced
    content = pat.sub(target[i],content)
print(content)

你好！我刚来China不久，我想了解下你们公司都有哪些类型的Insurance？
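
The loop above makes one pass over the text per word. As a possible alternative (a sketch, not the author's code), all replacements can be done in a single pass by joining the keys into one alternation and passing a function to sub; the name mapping is introduced here for illustration:

```python
import re

content = "你好！我刚来中国不久，我想了解下你们公司都有哪些类型的保险？"
mapping = {'中国': 'China', '保险': 'Insurance'}

# One alternation of all keys; re.escape guards against keys with metacharacters.
pat = re.compile('|'.join(re.escape(key) for key in mapping))
# sub calls the function for each match and inserts its return value.
result = pat.sub(lambda m: mapping[m.group(0)], content)
print(result)
```

A single pass also avoids a replacement produced by one rule being rewritten by a later rule.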

14. Greedy and non-greedy capture

content='<h>ddedadsad</h><div>graph</div>bb<div>math</div>cc'
pat1 = re.compile(r"<div>(.*)</div>")  # greedy (.*)
pat2 = re.compile(r"<div>(.*?)</div>") # non-greedy (.*?)
res1 = pat1.findall(content)
res2 = pat2.findall(content)
print(f"content: {content}")
print(f"greedy result: {res1}")
print(f"non-greedy result: {res2}")

content: <h>ddedadsad</h><div>graph</div>bb<div>math</div>cc
greedy result: ['graph</div>bb<div>math']
non-greedy result: ['graph', 'math']
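
The same greedy versus non-greedy difference shows up with any quantifier, and a negated character class is a common alternative when the content cannot contain the delimiter. A short sketch under that assumption (no nested tags inside the div):

```python
import re

greedy_digits = re.findall(r'\d+', '2024')   # takes as many digits as possible
lazy_digits = re.findall(r'\d+?', '2024')    # takes as few digits as possible

content = '<h>ddedadsad</h><div>graph</div>bb<div>math</div>cc'
# [^<]* cannot run past the next '<', so it stops at the first closing tag.
negated = re.findall(r'<div>([^<]*)</div>', content)

print(greedy_digits)  # ['2024']
print(lazy_digits)    # ['2', '0', '2', '4']
print(negated)        # ['graph', 'math']
```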

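15. Common metacharacters

[Figure: table of metacharacters; image not preserved]

16. Common character classes

[Figure: table of character classes; image not preserved]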
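
Since the original tables are not available here, the sketch below only illustrates a handful of standard items from Python's re module (\d, \w, \s, ^, $ and an escaped dot). The sample string is made up and the selection is not meant to reproduce the tables:

```python
import re

sample = 'Order 66: 3 items, total_42 ok.'

digits = re.findall(r'\d+', sample)      # \d: decimal digits
words = re.findall(r'\w+', sample)       # \w: letters, digits, underscore
spaces = re.findall(r'\s', sample)       # \s: whitespace characters
first_word = re.findall(r'^\w+', sample) # ^: start of the string
last_dot = re.findall(r'\.$', sample)    # \.: literal dot, $: end of the string

print(digits)       # ['66', '3', '42']
print(words)        # ['Order', '66', '3', 'items', 'total_42', 'ok']
print(len(spaces))  # 5
print(first_word)   # ['Order']
print(last_dot)     # ['.']
```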

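17. Password safety check

1) Password requirements: a) 6 to 20 characters long; b) only English letters and digits.

2) Use fullmatch, the strictest option, to check that the entire string matches the pattern.

pat = re.compile(r'\w{6,20}') # Wrong! \w matches letters, digits, and underscores, but underscores are not allowed here
pat = re.compile(r'[\da-zA-Z]{6,20}')
res1 = pat.fullmatch('qaz12') # shorter than 6 characters
print(f"'res1':{res1}")
res2 = pat.fullmatch('qaz12wsxedcrfvtgb67890942234343434') # longer than 20 characters
print(f"'res2':{res2}")
res3 = pat.fullmatch('qaz_123') # contains an underscore
print(f"'res3':{res3}")
res4 = pat.fullmatch('n0passw0Rd') # meets the requirements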
print(f"'res4':{res4}")

'res1':None
'res2':None
'res3':None
'res4':<re.Match object; span=(0, 10), match='n0passw0Rd'>
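
One caveat: in Python 3 str patterns \d also matches non-ASCII decimal digits such as fullwidth digits. If the requirement is strictly ASCII letters and digits, spelling the class out as [0-9a-zA-Z] is a possible tightening. The helper name is_valid_password below is hypothetical and this is only a sketch:

```python
import re

# Assumption: "letters and digits" means ASCII only, so \d is replaced by 0-9.
PASSWORD_PAT = re.compile(r'[0-9a-zA-Z]{6,20}')

def is_valid_password(password):
    """Return True if the whole string is 6 to 20 ASCII letters or digits."""
    return PASSWORD_PAT.fullmatch(password) is not None

for candidate in ['qaz12', 'qaz_123', 'n0passw0Rd', '１２３４５６']:
    print(candidate, is_valid_password(candidate))
```

Only n0passw0Rd passes; the last candidate consists of fullwidth digits, which the \d version of the pattern would accept.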

18. Fetching the title of the Baidu home page

import re
from urllib import request

data = request.urlopen("http://www.baidu.com/").read().decode()

pat = r'<title>(.*?)</title>'

result = re.search(pat,data)
print(result)
print(result.group())  # group() is the whole match; group(1) would be only the text inside <title>

<re.Match object; span=(1389, 1413), match='<title>百度一下，你就知道</title>'>
<title>百度一下，你就知道</title>
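
To get just the title text and to cope with pages that have no title, the extraction step can be wrapped in a small function. The sketch below works on an HTML string that is already in memory, so it needs no network access; the name extract_title and the sample HTML are made up for illustration:

```python
import re

def extract_title(html):
    """Return the text inside the first <title> element, or None if absent."""
    # re.I accepts <TITLE>; re.S lets . match newlines inside the element.
    m = re.search(r'<title[^>]*>(.*?)</title>', html, re.I | re.S)
    return m.group(1).strip() if m else None

html = '<html><head><TITLE>\n Example page </TITLE></head></html>'
print(extract_title(html))        # Example page
print(extract_title('<p>x</p>'))  # None
```

For anything beyond a quick check, an HTML parser is generally more robust than a regular expression.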

19. Batch conversion to camelCase

import re

def camel(s):
    s = re.sub(r"(\s|_|-)+"," ",s).title().replace(" ","")
    return s[0].lower()+s[1:]

def batch_camel(slist):
    return [camel(s) for s in slist]

s = batch_camel(['student_id', 'student\tname', 'student-add'])
print(s)

['studentId', 'studentName', 'studentAdd']
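
The camel function above raises IndexError on an empty string because of s[0]. A possible guard, sketched under the hypothetical name camel_safe, uses slicing instead of indexing:

```python
import re

def camel_safe(s):
    """Like camel(), but returns '' when nothing is left after removing separators."""
    joined = re.sub(r"(\s|_|-)+", " ", s).title().replace(" ", "")
    # joined[:1] is '' for an empty string, so no IndexError is raised.
    return joined[:1].lower() + joined[1:]

converted = [camel_safe(s) for s in ['student_id', '', '__']]
print(converted)  # ['studentId', '', '']
```

Note that str.title() lowercases every letter after the first in each word, so an input such as 'userID_no' becomes 'useridNo' with either version.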
