Regular Expressions

xiaogeldx

Article link: https://blog.csdn.net/xiaogeldx/article/details/86094582

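Regular expressions are used to match strings or to extract parts of them
Checking whether a string matches a given format
Extracting information from a string according to a given format
A regex call takes two arguments
- the regular expression (the rule)
- the string to match against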
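
As a minimal sketch of the two uses just listed (checking a format and extracting information), the snippet below uses `re.fullmatch` and `re.findall` from the standard library. The sample text is hypothetical and chosen only for illustration.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "order 42 shipped on 2022-12-08"

# Use 1: check whether a whole string fits a format.
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2022-12-08") is not None

# Use 2: extract every piece of a string that fits a format.
numbers = re.findall(r"\d+", text)

print(is_date)   # True
print(numbers)   # ['42', '2022', '12', '08']
```

In both calls the first argument is the rule and the second is the string to examine.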

Matching a NetEase (163) email address

import re
a = '123456778@163.com'
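my_mail = re.findall(r'^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.com$',a)
# The r prefix switches off Python's string escaping; the \ switches off the regex meaning of the dot
print(my_mail)  # ['123456778@163.com']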

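If the format does not fit, nothing is matched and findall returns an empty list ([])
By writing your own rules you can match exactly the strings you want. For comparison, here is a phone-number check written without regex: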
```
  nu = input("Please enter a phone number: ")
  def phone_number(num):
      if len(num) == 11 and num.isdigit() and num.startswith("1"):
          print("Phone Number is Right")
          return True
      else:
          print("Phone Number is Error")
          return False
  a = phone_number(nu)
```
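By comparison, the regex approach below matches a phone number far more concisely
import re
r = re.match(r"1\d{10}",nu)
- re.match returns a Match object, or None when nothing matches
- re.findall stores its results in a list and returns [] when nothing matches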
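
The comparison can be made concrete with a small sketch. The helper name `is_phone_number` is hypothetical; it uses `re.fullmatch` rather than `re.match`, because `re.match` only anchors the start of the string and would accept extra trailing characters.

```python
import re

def is_phone_number(num: str) -> bool:
    """Return True if num is exactly 11 digits and starts with 1."""
    # fullmatch requires the whole string to fit the pattern
    return re.fullmatch(r"1\d{10}", num) is not None

print(is_phone_number("15636196961"))    # True
print(is_phone_number("156361969612"))   # False: 12 digits
# match only anchors the start, so the 12-digit string still matches
print(re.match(r"1\d{10}", "156361969612") is not None)  # True
```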
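Regular expressions:
A regular expression is a general-purpose notation for concisely describing a set of strings. It is therefore independent of Python and works in much the same way in other languages and systems
Do not add spaces to a regular expression casually, because a space is matched literally
A regular expression can be matched against any existing string
With regex matching you can quickly filter out all or part of the strings you need, or search text for characteristic values (such as virus signatures)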

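Metacharacters

Metacharacters can be freely combined with one another
re.search("a","abc") # matches a
re.search(".","ab.cd.de") # matches a, not the . (dot)
This does not mean the dot cannot be matched. The dot has been given a special meaning, "match any character", which makes . a metacharacter
Metacharacters are what make regular expressions powerful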

\ Backslash:

Escaping in regex
import re
my_str = '相亲竟2不可.接近3缘分.或我5应该相2信是缘分'
d = re.findall(r'\.',my_str)
print(d) # ['.', '.']
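
A short sketch of the difference between an unescaped and an escaped dot. The sample string is hypothetical. `re.escape` is a standard-library helper, not mentioned in the original, that adds the backslashes automatically.

```python
import re

my_str = "a.b.c"  # hypothetical sample

any_char = re.findall(r".", my_str)    # unescaped dot: any character
literal = re.findall(r"\.", my_str)    # escaped dot: a literal dot only
escaped_pattern = re.escape("a.b")     # builds the escaped pattern for you

print(any_char)         # ['a', '.', 'b', '.', 'c']
print(literal)          # ['.', '.']
print(escaped_pattern)  # a\.b
```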

[ ]:

Character set
my_str = '相亲竟2不可接近3缘分，或我5应该相2信是缘分'
a = re.findall(r'缘.',my_str)
b = re.findall(r'[2]',my_str)
c = re.findall(r'[0-9]',my_str)
print(a,b,c) # ['缘分', '缘分'] ['2', '2'] ['2', '3', '5', '2']

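. Dot:

Wildcard; matches any single character
a = re.findall(r".","hc")
b = re.findall(r".","\nhc")
c = re.findall(r".","^hc")
print(a,b,c) # ['h', 'c'] ['h', 'c'] ['^', 'h', 'c']
Matches every character except the newline
\d matches a digit 0-9: re.search(r"\d","ab12")
\s matches any whitespace character, including space, tab, and newline: re.search(r"\s","ab 12")
\w matches a letter, digit, underscore, or other word character such as a Chinese character: re.search(r"\w","ab 12")
\. matches the dot itself: re.search(r"\.","adc.123")
\D, \S, \W, \B do the opposite of their lowercase counterparts: re.search(r"\D","adc123")
\D matches any character that is not a digit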
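
To see the predefined classes and their uppercase opposites side by side, here is a small sketch on a hypothetical sample string:

```python
import re

sample = "ab 12_x"  # hypothetical sample string

digits = re.findall(r"\d", sample)       # digits
non_digits = re.findall(r"\D", sample)   # everything that is not a digit
spaces = re.findall(r"\s", sample)       # whitespace
word_chars = re.findall(r"\w", sample)   # letters, digits, underscore
non_word = re.findall(r"\W", sample)     # everything else

print(digits)      # ['1', '2']
print(non_digits)  # ['a', 'b', ' ', '_', 'x']
print(spaces)      # [' ']
print(word_chars)  # ['a', 'b', '1', '2', '_', 'x']
print(non_word)    # [' ']
```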

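Anchor metacharacters

^ Caret:

Matches only at the start of the string
Inside [ ] it means negation, i.e. "not"
import re
my_str = '缘分相亲竟2不可.接近3缘分.或我5应该相2信是缘分'
a = re.findall(r'^缘分',my_str)
print(a) # ['缘分']; returns [] if the string does not start with 缘分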

$ Dollar sign:

Matches only at the end of the string
import re
my_str = '缘分相亲竟2不可.接近3缘分.或我5应该相2信是缘分'
a = re.findall(r'缘分$',my_str)
print(a) # ['缘分']; returns [] if the string does not end with 缘分
b = re.findall(r'^缘分$',my_str)
print(b) # []; ^缘分$ matches only when the whole string is exactly 缘分
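
The same anchor behaviour on a hypothetical ASCII sample, including the different meaning of ^ inside brackets:

```python
import re

s = "cat concat cat"  # hypothetical sample

at_start = re.findall(r"^cat", s)    # only the occurrence at the very start
at_end = re.findall(r"cat$", s)      # only the occurrence at the very end
whole = re.findall(r"^cat$", s)      # the whole string would have to be "cat"
negated = re.findall(r"[^cat ]", s)  # inside [], ^ negates the set

print(at_start)  # ['cat']
print(at_end)    # ['cat']
print(whole)     # []
print(negated)   # ['o', 'n']
```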

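Word boundaries (not a metacharacter)

\b matches at a word boundary, between a \w character and a non-\w character, which makes it handy for matching whole English words. Using it on only one side constrains only that side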

  import re
  m_str = 'hello world ahello dor'
  a = re.findall(r'hello',m_str)
  b = re.findall(r'\bhello\b',m_str)
  c = re.findall(r'hello\b',m_str)
  d = re.findall(r'\bhello',m_str)
  print(a,b,c,d)		#['hello', 'hello'] ['hello'] ['hello', 'hello'] ['hello']

{} Curly braces:

Control the number of repetitions

  import re
  m_str = 'abbbbbbc'
  a = re.findall(r'ab{6}c',m_str)	# {count}: applies to the element before it
  b = re.findall(r'ab{2,6}c',m_str)	# a range of 2 to 6, both ends included
  d = re.findall(r'ab{2,}c',m_str)	# 2 or more
  print(a,b,d)		#['abbbbbbc']	['abbbbbbc'] ['abbbbbbc']
  # ---- separator ----
  m_str = 'ab53bb6bbb889c34678532'
  c = re.findall(r'[0-9]{2,6}',m_str)	# runs of 2 to 6 consecutive digits
  print(c)		#	['53', '889', '346785', '32']

A quantifier always applies to the element immediately before it

* Asterisk:

Matches the preceding element zero or more times (any number)

  my_str = 'ac'
  m_str = 'abbbbbbc'
  a = re.findall(r'ab*c',my_str)
  b = re.findall(r'ab*c',m_str)
  print(a,b)		# ['ac'] ['abbbbbbc']

+ Plus sign:

Matches the preceding element one or more times. It works like * except that zero occurrences are not allowed

  my_str = 'ac'
  m_str = 'abbbbbbc'
  a = re.findall(r'ab+c',my_str)
  b = re.findall(r'ab+c',m_str)
  print(a,b)		# [ ] ['abbbbbbc']

? Question mark:

Matches the preceding element zero times or once, never more than once

  my_str = 'ac'
  m_str = 'abbbbbbc'
  a = re.findall(r'ab?c',my_str)
  b = re.findall(r'ab?c',m_str)
  print(a,b)		# ['ac'] [ ]
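
The three shorthand quantifiers can each be written with braces: * is {0,}, + is {1,}, and ? is {0,1}. The sketch below checks this on a few hypothetical sample strings; `matching` is a helper name introduced only for this illustration.

```python
import re

samples = ["ac", "abc", "abbc"]  # hypothetical samples

def matching(pattern: str) -> list:
    """Return the samples that fit the pattern as a whole."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(matching(r"ab*c"))     # ['ac', 'abc', 'abbc']
print(matching(r"ab{0,}c"))  # ['ac', 'abc', 'abbc']
print(matching(r"ab+c"))     # ['abc', 'abbc']
print(matching(r"ab{1,}c"))  # ['abc', 'abbc']
print(matching(r"ab?c"))     # ['ac', 'abc']
print(matching(r"ab{0,1}c")) # ['ac', 'abc']
```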

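Greedy mode (the default)

Matches as much as possible. The quantifiers shown so far are all greedy

Non-greedy mode (add a question mark)

Stops as soon as the pattern is satisfied, matching as little as possible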

  import re
  my_str = 'ac'
  m_str = 'abbbbbbc'
  a = re.findall(r'ab*?',my_str)	# non-greedy
  b = re.findall(r'ab*',m_str)	# greedy
  c = re.findall(r'ab+?',m_str)	# non-greedy
  print(a,b,c)		#['a'] ['abbbbbb'] ['ab']
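
The difference matters most when a pattern has a closing delimiter. A small sketch on a hypothetical HTML fragment:

```python
import re

html = "<b>one</b><b>two</b>"  # hypothetical fragment

greedy = re.findall(r"<b>.*</b>", html)   # .* runs to the last </b>
lazy = re.findall(r"<b>.*?</b>", html)    # .*? stops at the first </b>

print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```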

| Alternation metacharacter, equivalent to "or"

The pipe symbol

  import re
  my_str = 'hello world python world hello python python world hello'
  a = re.findall(r'hello|python',my_str)
  b = re.findall(r'hello python',my_str)
  print(a,b)
  #['hello', 'python', 'hello', 'python', 'python', 'hello'] ['hello python']

[ ] Character group

The whole bracket expression stands for a single character

  import re
  my_str = 'abbc abbd abbe'
  a = re.findall(r'[abc]',my_str)	# a or b or c
  b = re.findall(r'abb[ace]',my_str)
  print(a,b)		#['a', 'b', 'b', 'c', 'a', 'b', 'b', 'a', 'b', 'b'] ['abbc', 'abbe']

  import re
  my_str = 'ab2bc abb4d abbe'
  c = re.findall(r'[0-9][a-z]',my_str)
  d = re.findall(r'[0-9A-Za-z]',my_str)		# any single letter or digit
  a = re.findall(r'[0-9A-Za-z]{4}',my_str)		# four consecutive letters or digits
  b = re.findall(r'[^0-9A-Za-z]',my_str)		# negated character class: anything except these
  print(c,d,a,b)		#['2b', '4d'] ['a', 'b', '2', 'b', 'c', 'a', 'b', 'b', '4', 'd', 'a', 'b', 'b', 'e'] ['ab2b', 'abb4', 'abbe']	[' ', ' ']

Negated character class [^ ]

Put ^ as the first character inside the brackets

  import re
  b = re.findall(r'[^0-9A-Za-z]',my_str)		# negated character class: anything except these (my_str from the previous example)
  print(b)  #[' ', ' ']

Predefined characters

() Parentheses:

Group matching

If the regular expression contains parentheses, findall returns only the content of the group rather than the full match

  import re
  my_str = '<link rel="stylesheet" type="text/css" href="//img1.bdstatic.com/static/common/pkg/co_2e2be3c.css"/><link rel="stylesheet" type="text/css" href="//img1.bdstatic.com/static/searchresult/pkg/result_d23990a.css"/><link rel="stylesheet" type="text/css" href="//img1.bdstatic.com/static/common/widget/ui/slider/slider_ecce195.css"/><link rel="stylesheet" type="text/css" href="//img0.bdstatic.com/static/common/widget/ui/userInfo/userInfo_5bd6198.css"/><link rel="stylesheet" type="text/css" href="//img0.bdstatic.com/static/searchresult/widget/ui/base/view/AvFilterView/AvFilterView_5709328.css"/><link rel="stylesheet" type="text/css" '
  a = re.findall(r'href="(.*?)"',my_str)
  for i in a:
  	print("http:" + i)
  # http://img1.bdstatic.com/static/common/pkg/co_2e2be3c.css
  # http://img1.bdstatic.com/static/searchresult/pkg/result_d23990a.css
  # http://img1.bdstatic.com/static/common/widget/ui/slider/slider_ecce195.css
  # http://img0.bdstatic.com/static/common/widget/ui/userInfo/userInfo_5bd6198.css
  # http://img0.bdstatic.com/static/searchresult/widget/ui/base/view/AvFilterView/AvFilterView_5709328.css

.*? means any number of characters, matched non-greedily (as few as possible)
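
How findall treats groups depends on how many the pattern has. The sketch below uses a hypothetical shortened string. The non-capturing form (?:...) is standard regex syntax that the original does not cover; it groups without changing what findall returns.

```python
import re

text = 'href="/a.css" href="/b.css"'  # hypothetical shortened sample

no_group = re.findall(r'href=".*?"', text)          # full matches
one_group = re.findall(r'href="(.*?)"', text)       # only the group content
two_groups = re.findall(r'(href)="(.*?)"', text)    # one tuple per match
non_capturing = re.findall(r'(?:href)="(.*?)"', text)

print(no_group)       # ['href="/a.css"', 'href="/b.css"']
print(one_group)      # ['/a.css', '/b.css']
print(two_groups)     # [('href', '/a.css'), ('href', '/b.css')]
print(non_capturing)  # ['/a.css', '/b.css']
```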

match

import re

# Match a string
text = 'hello'
ret = re.match('he',text)   # match starts matching at the beginning of the string
ret1 = re.match('e',text)
# print(ret1)     # None, because e is not at the beginning
# print(ret)  #<_sre.SRE_Match object; span=(0, 2), match='he'> (shown as re.Match on Python 3.7+); None when nothing matches
# print(ret.group())  # he; group() returns the matched text and cannot be called on None

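# .: matches any single character except the newline \n
# When the subject string is a non-raw literal containing a backslash, note that:
# 1. a recognized escape is matched as the character it stands for, e.g. \t is one tab
# 2. an unrecognized escape such as \m keeps its backslash, so . matches the \
text = 'hello'
a = '+sdfsfdwfds'
b = '\m'
ret = re.match('.',text)
ra = re.match('.',a)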
rb = re.match('.',b)
# print(ret.group(),ra.group(),rb.group())   #h + \

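# \d: matches any digit
a = '42345324'  # must be a str, not an int; re functions operate on strings
a1 = 'saf42345324'  # the first character is not a digit, so there is no match
ra = re.match('\d',a)   # matches the first character, a digit
ra1 = re.match('\d',a1)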
# print(ra.group(),ra1)   # 4 None

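# \D: matches any non-digit, the opposite of \d
a = '42345324'  # the first character is a digit, so there is no match
a1 = 'saf42345324'  # the first character is not a digit, so it matches
a2 = '\saf42345324'  # the first character is a backslash, not a digit, so it matches
ra = re.match('\D',a)
ra1 = re.match('\D',a1) # matches the first character, a non-digit
ra2 = re.match('\D',a2)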
# print(ra,ra1.group(),ra2.group())   # None s \

# \s: matches whitespace characters (\n, \t, \r, space)
a = ' 42345324'
a1 = '\nsaf42345324'
a2 = '\taf42345324'
a3 = '\raf42345324'
ra = re.match('\s',a)
ra1 = re.match('\s',a1)
ra2 = re.match('\s',a2)
ra3 = re.match('\s',a3)
# print(ra)   #<_sre.SRE_Match object; span=(0, 1), match=' '>
# print(ra1)  #<_sre.SRE_Match object; span=(0, 1), match='\n'>
# print(ra2)  #<_sre.SRE_Match object; span=(0, 1), match='\t'>
# print(ra3)  #<_sre.SRE_Match object; span=(0, 1), match='\r'>

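# \w: matches a-z, A-Z, digits, and underscore (for str patterns, other Unicode word characters such as Chinese characters as well)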
a = 'v2345324'
a1 = 'Saf42345324'
a2 = '_af42345324'
a3 = '2af42345324'
a4 = '*2af42345324'
ra = re.match('\w',a)
ra1 = re.match('\w',a1)
ra2 = re.match('\w',a2)
ra3 = re.match('\w',a3)
ra4 = re.match('\w',a4)
# print(ra)   #<_sre.SRE_Match object; span=(0, 1), match='v'>
# print(ra1)  #<_sre.SRE_Match object; span=(0, 1), match='S'>
# print(ra2)  #<_sre.SRE_Match object; span=(0, 1), match='_'>
# print(ra3)  #<_sre.SRE_Match object; span=(0, 1), match='2'>
# print(ra4)  #None

# \W: the exact opposite of \w
a = 'v2345324'
a1 = 'Saf42345324'
a2 = '_af42345324'
a3 = '2af42345324'
a4 = '*2af42345324'
ra = re.match('\W',a)
ra1 = re.match('\W',a1)
ra2 = re.match('\W',a2)
ra3 = re.match('\W',a3)
ra4 = re.match('\W',a4)
# print(ra)   #None
# print(ra1)  #None
# print(ra2)  #None
# print(ra3)  #None
# print(ra4)  #<_sre.SRE_Match object; span=(0, 1), match='*'>

# []: a character set; any character listed inside the brackets can match
a = '0416-2348975'
a1 = 's-42345324'
a2 = '041a6-2m348s975'
a3 = '242345s324'
a4 = '2af42345324'
a5 = '2af42345324'
a6 = 'af42345324'
a7 = '*(-\&%3s%$*'
# +: one or more. Any run of consecutive digits or - is matched, but the run must begin at the start of the string and stops at the first other character
ra = re.match('[\-\d]+',a)
ra1 = re.match('[\d\-]+',a1)
ra2 = re.match('[\d\-]+',a2)
ra3 = re.match('[0-9]+',a3)
ra4 = re.match('[0-9a-z]+',a4)
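# ^ inside [] means "not", i.e. it negates the set
ra5 = re.match('[^0-9]+',a5)
ra6 = re.match('[^0-9]+',a6)
# Matches anything that is not a digit or a lowercase letter; the second ^ is a literal, so ^ itself is excluded too
ra7 = re.match('[^0-9^a-z]+',a7)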
# print(ra)   #<_sre.SRE_Match object; span=(0, 12), match='0416-2348975'>
# print(ra1)  #None   the string starts with a letter, so there is no match
# print(ra2)  #<_sre.SRE_Match object; span=(0, 3), match='041'>  the letter a follows 041, so matching stops there
# print(ra3)  #<_sre.SRE_Match object; span=(0, 6), match='242345'>  digits 0-9 match until the letter s, where matching stops
# print(ra4)  #<_sre.SRE_Match object; span=(0, 11), match='2af42345324'>
# print(ra5)  #None  the string starts with a digit, which [^0-9] excludes
# print(ra6)  #<_sre.SRE_Match object; span=(0, 2), match='af'>  non-digits match until the first digit, where matching stops
# print(ra7)  #<_sre.SRE_Match object; span=(0, 6), match='*(-\\&%'>

# *: matches any number of consecutive occurrences, including zero
a = '0416-2348975'
a1 = re.match('\d*',a)
a2 = re.match('\D*',a)
# print(a1.group())   #0416
# print(a2)   #<_sre.SRE_Match object; span=(0, 0), match=''>

# +: matches one or more consecutive occurrences
a = '0416-2348975'
a1 = re.match('\d+',a)
a2 = re.match('\D+',a)
# print(a1.group())   #0416
# print(a2)   #None

# ?: matches zero or one occurrence
a = '0416-2348975'
a1 = re.match('\D?',a)  # without the ?, match returns None and a1.group() raises AttributeError
a2 = re.match('\d?',a)  
# print(a1.group()) #
# print(a2.group()) #0

# {m}: matches exactly m occurrences
a1 = re.match('\d{3}',a)
# print(a1.group())   #041

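# {m,n}: matches m to n occurrences, taking as many as possible; if fewer than m are available, match returns None
a1 = re.match('\d{1,4}',a)  # at most 4 digits are available here; a minimum above 4, such as {5,7}, gives None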
# print(a1.group())   #0416

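# $: ends with ...; ^: starts with ...
a = 'xiaogeldx@google.com'
# $ goes at the end of the pattern
# \ is the escape character: in a regex . means any single character, so \. is used for the literal dot in the address (the result here happens to be the same either way)
rep1 = re.search('^xiaoge\w+@google\.com$',a)
# match behaves as if the pattern began with ^, because match always starts at the beginning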
rep2 = re.match('^xiaoge\w+@google\.com$',a)
# print(rep1.group())   #xiaogeldx@google.com
# print(rep2.group())   #xiaogeldx@google.com

# Greedy and non-greedy modes
a = '15636196961'
# Greedy: take as many matching characters as possible
rep1 = re.match('\d+',a)
# Non-greedy: take as few as possible; the minimum is 1 for + and 0 for *
rep2 = re.match('\d+?',a)
rep3 = re.match('\d*?',a)
print(rep1.group()) #15636196961
print(rep2.group()) #1
print(rep3.group()) #

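# Raw strings: an r prefix marks a string literal as raw, meaning its backslashes are left unprocessed
# \ is the escape character both in Python string literals and in regexes
# A doubled backslash \\ cancels the escape
a = '\n'
b = r'\n'
c = '\\n'
# print(a,b,c)    #
#                 \n \n
a = '\\c'
# In Python, the literal '\\c' produces the two characters \c
# In a regex, a literal backslash must itself be escaped, so the non-raw pattern is '\\\\c' (every two \ yield one \)
rep1 = re.match('\\\\c',a)
# With r (raw), Python's escaping is switched off, so r'\\c' only needs the regex-level escape
rep2 = re.match(r'\\c',a)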
print(rep1.group()) #\c
print(rep2.group()) #\c
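
A compact sketch of the two escaping layers described in the comments. The `path` value is hypothetical data that contains one literal backslash.

```python
import re

# Both literals hold the same two characters: a backslash and the letter d.
same = ("\\d" == r"\d")
print(same)   # True

# Hypothetical data containing one literal backslash.
path = "C:\\temp"

# The regex needs \\ to match one backslash; the r prefix means Python
# passes those two characters to the regex engine unchanged.
found = re.search(r"\\temp", path).group()
print(found)  # \temp
```

Writing every pattern as a raw string avoids having to think about the Python layer at all.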

Small examples

# 1. Validate a mobile phone number
text = '15636196961'
# the first digit is 1, the second is one of 3-8, followed by any nine digits
rep = re.match('1[3-8]\d{9}',text)
# print(rep.group())

# 2. Validate an email address
text = 'xiaogeldx@163.com'
rep = re.match('\w+@[a-zA-Z0-9]+\.[a-z]+',text)
# print(rep.group())

# 3. Validate a URL
url = 'https://baike.baidu.com/item/%E5%91%A8%E6%98%9F%E9%A9%B0/169917?fr=aladdin'
# |: or; ^ inside []: negation
rep = re.match('(http|https|ftp)://[^\s]+',url)
# print(rep.group())

# 4. Validate an ID card number
id1 = '23220219090305581X'
id2 = '232202190903055818'
rep1 = re.match('\d{17}[\dxX]',id1)
rep2 = re.match('\d{17}[\dxX]',id2)
print(rep1.group())
print(rep2.group())

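search and groups

Matches at any position, returns only the first match, and returns None when nothing is found

The only difference from match: match starts at the beginning of the string, while search can start at any position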

  import re
  a = "012qwerAI"
  e = re.search("([0-9]*)([a-z]*)([A-Z]*)",a).group()
  b = re.search("([0-9]*)([a-z]*)([A-Z]*)",a).group(0)	# same as e
  c = re.search("([0-9]*)([a-z]*)([A-Z]*)",a).group(1)
  d = re.search("([0-9]*)([a-z]*)([A-Z]*)",a).group(2)
  print(b)   # 012qwerAI, the whole match
  print(c)  # 012
  print(d)  # qwer
  print(e)  # 012qwerAI

a = 'apple is $20, orange is $10'
rep = re.search('.*(\$\d+).*(\$\d+)',a)
print(rep.group())  # apple is $20, orange is $10
print(rep.group(0)) # apple is $20, orange is $10
print(rep.group(1)) # $20
print(rep.group(2)) # $10
# Get the first and second groups
print(rep.group(1,2))   # ('$20', '$10')
# Get the content of all groups
print(rep.groups())   # ('$20', '$10')

# group(0) and group() are equivalent: both return the whole matched string.
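
Groups can also be given names with the (?P<name>...) syntax, which the original does not cover but which is part of Python's re module. A minimal sketch, with the group names `apple` and `orange` chosen only for illustration:

```python
import re

a = 'apple is $20, orange is $10'
m = re.search(r'apple is (?P<apple>\$\d+), orange is (?P<orange>\$\d+)', a)

print(m.group('apple'))   # $20
print(m.group(2))         # $10 (named groups are still numbered)
print(m.groupdict())      # {'apple': '$20', 'orange': '$10'}
```

Names make the code easier to read than numeric indexes when a pattern has several groups.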

findall

Finds all matches and returns them as a list

# findall
a = 'apple is $20, orange is $10'
rep1 = re.findall('\$\d+',a)
print(rep1)	#['$20', '$10']

sub

Replaces every match

  import re
  my_str = '<link rel="stylesheet" type="text/css" href="//img1.bdstatic.com/static/common/pkg/co_2e2be3c.css"/><link rel="stylesheet" type="text/css" href="//img1.bdstatic.com/static/searchresult/pkg/result_d23990a.css"/><link rel="stylesheet"'
  a = re.sub(r'link','a',my_str)
  print(a)		#<a rel="stylesheet" type="text/css" href="//img1.bdstatic.com/static/common/pkg/co_2e2be3c.css"/><a rel="stylesheet" type="text/css" href="//img1.bdstatic.com/static/searchresult/pkg/result_d23990a.css"/><a rel="stylesheet"

html = '''
<span><a href="//www.51job.com/changsha/">长沙</a></span>
<span><a href="//www.51job.com/chengdu/">成都</a></span>
<span><a href="//www.51job.com/chongqing/">重庆</a></span>
<span><a href="//www.51job.com/changzhou/">常州</a></span>
<span><a href="//www.51job.com/changde/">常德</a></span>
<span><a href="//www.51job.com/changshu/">常熟</a></span>
<span><a href="//www.51job.com/cangzhou/">沧州</a></span>
<span><a href="//www.51job.com/chaozhou/">潮州</a></span>
<span><a href="//www.51job.com/chenzhou/">郴州</a></span>
<span><a href="//www.51job.com/chifeng/">赤峰</a></span>
<span><a href="//www.51job.com/chuzhou/">滁州</a></span>
<span><a href="//www.51job.com/changzhi/">长治</a></span>
'''
# Replace everything from a < up to the next > (inclusive) with an empty string
rep = re.sub('<.+?>','',html)
print(rep)  #长沙
            # 成都
            # 重庆
            # 常州
            # 常德
            # 常熟
            # 沧州
            # 潮州
            # 郴州
            # 赤峰
            # 滁州
            # 长治
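
Two further options of `re.sub`, shown on a hypothetical sample: the `count` argument limits how many replacements are made, and the replacement string can refer back to groups with \1, \2.

```python
import re

s = "a-b-c"  # hypothetical sample

replace_all = re.sub(r"-", "+", s)             # every match is replaced
replace_one = re.sub(r"-", "+", s, count=1)    # only the first match
swapped = re.sub(r"(\w)-(\w)", r"\2-\1", s)    # \1 and \2 refer to the groups

print(replace_all)  # a+b+c
print(replace_one)  # a+b-c
print(swapped)      # b-a-c
```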

split

Works like Python's str.split and likewise returns a list, except that the separator is a regular expression

a = 'apple is $10, orange is $5, banana is $11'
ret = re.split(',',a)
print(ret)  #['apple is $10', ' orange is $5', ' banana is $11']
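
Because the separator is a pattern, one call can split on several delimiters and swallow the whitespace after them. A sketch on a hypothetical variant of the string above that mixes commas and semicolons:

```python
import re

a = 'apple is $10, orange is $5;banana is $11'  # hypothetical mixed separators
parts = re.split(r'[,;]\s*', a)

print(parts)  # ['apple is $10', 'orange is $5', 'banana is $11']
```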

compile

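For frequently used regular expressions, compile them once with compile and reuse the compiled object later, which can run slightly faster. compile also accepts flags=re.VERBOSE, which lets you annotate the pattern with comments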

a = 'apple is $10, orange is $5, banana is $11'
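# Compile the regular expression; the commented form below does the same thing and is easier to read
#ret = re.compile('\$\d+')
ret = re.compile(r'''
    \$  # a literal dollar sign (escaped)
    \d+ # the amount
''',re.VERBOSE)
# Use the compiled pattern: the first argument is the compiled regex, the second is the target string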
ret2 = re.findall(ret,a)
print(ret2)	#['$10', '$5', '$11']
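
A compiled pattern also exposes the matching functions as methods, so it does not have to be passed back into `re.findall`. A minimal sketch; the name `price` and the added group are illustrative choices, not part of the original.

```python
import re

price = re.compile(r'\$(\d+)')  # the group captures the amount without the $
a = 'apple is $10, orange is $5, banana is $11'

amounts = price.findall(a)          # group content only
first = price.search(a).group(0)    # the whole first match
total = sum(int(n) for n in amounts)

print(amounts)  # ['10', '5', '11']
print(first)    # $10
print(total)    # 26
```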

Example:

Get the first word of a string, for example: "Hello world", " a word ", "... and so on ...", "don't touch it"
```
import re

def first_word(text: str) -> str:
    # search finds the first run of word characters or apostrophes
    return re.search(r"([\w']+)", text).group(1)
```

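# () marks a group; [ ] is a character set; \w is any letter, digit, or underscore; [\w'] is any such character or an apostrophe; [\w']+ is one or more of them. If [\w']+ is changed to [a-zA-Z']+, input that starts with digits is still handled, because search skips ahead to the first letter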
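
The function can be checked against the four sample inputs listed in the task. The variant below is a sketch that additionally returns an empty string when no word is found; that fallback is an assumption, since the original does not say what should happen in that case.

```python
import re

def first_word(text: str) -> str:
    """Return the first run of word characters or apostrophes, or '' if none."""
    match = re.search(r"[\w']+", text)
    return match.group() if match else ""

print(first_word("Hello world"))        # Hello
print(first_word(" a word "))           # a
print(first_word("... and so on ..."))  # and
print(first_word("don't touch it"))     # don't
print(repr(first_word("...")))          # ''
```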

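Gushiwen (gushiwen.org)

This only fetches the first page. For the other pages, wrap the code below in a function, write a second function that produces each page's URL, and pass each URL to the first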

import requests
import re

url = 'https://www.gushiwen.org/default_1.aspx' # URL of the first page
headers = {
    'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'https://www.gushiwen.org/default_2.aspx',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
}
session = requests.Session()
response = session.get(url,headers=headers)
html_str = response.content.decode('utf8')
# print(html_str)
# re.DOTALL: makes . match every character, including \n
title_list = re.findall('<div class="cont">.*?<b>(.*?)</b>',html_str,re.DOTALL)
year_list = re.findall('<div class="cont">.*?<p class="source"><a href=".*?>(.*?)</a>',html_str,re.DOTALL)
author_list = re.findall('<div class="cont">.*?<p class="source">.*?<span>：</span><a href=".*?>(.*?)</a>',html_str,re.DOTALL)
content_list = re.findall('<div class="cont">.*?<div class="contson".*?>(.*?)</div>',html_str,re.DOTALL)
contents = []
poem_list = []
# Approach 1: equivalent to the two for loops in approach 2
# for index, content in enumerate(content_list):
#     content = re.sub(r'<.*?>','',content).strip()
#     contents.append(content)
#     poems = {
#         'title': title_list[index],
#         'year': year_list[index],
#         'author': author_list[index],
#         'content': contents[index]
#         }
#     poem_list.append(poems)

# Approach 2
for content in content_list:
    content = re.sub(r'<.*?>','',content).strip()
    contents.append(content)
# zip() takes several lists and yields tuples that pair up the elements at the same index
for value in zip(title_list,year_list,author_list,contents):
    # Unpack the tuple into the four fields
    title,year,author,content = value
    poem = {
        'title': title,
        'year': year,
        'author': author,
        'content': content
    }
    poem_list.append(poem)
print(poem_list)
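
The multi-page approach suggested before the script (one function that parses a page, another that produces each page's URL) could be sketched as follows. The function names are hypothetical, and the `default_<n>.aspx` pattern is an assumption inferred from the two URLs that appear in the script. The parsing function is exercised on a made-up HTML fragment instead of a live request.

```python
import re

def page_url(page: int) -> str:
    """Build the URL of a listing page (assumed default_<n>.aspx pattern)."""
    return f"https://www.gushiwen.org/default_{page}.aspx"

def parse_titles(html_str: str) -> list:
    """Extract poem titles with the same title pattern as the script above."""
    return re.findall(r'<div class="cont">.*?<b>(.*?)</b>', html_str, re.DOTALL)

# Hypothetical HTML fragment standing in for a downloaded page.
sample_html = '<div class="cont">\n<p><a href="#"><b>Title A</b></a></p>\n</div>'

print(page_url(2))                # https://www.gushiwen.org/default_2.aspx
print(parse_titles(sample_html))  # ['Title A']
```

In the real script, each `page_url(n)` would be fetched with `session.get` and the decoded response passed to the parsing function.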
