Scraping Data with Regular Expressions: Regex for Web Crawlers

小发猫

Published 2024-03-01 13:55:53

Article link: https://blog.csdn.net/i_like_cpp/article/details/136394341

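Hello everyone. This post shows how to scrape data with regular expressions, with a detailed walk-through below.

The material is adapted from another source; I have simplified and reorganized it for my own reference.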

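This article covers the following:

Getting the content between <tr></tr> tags
Getting the content between <a href...></a> hyperlink tags
Using the last segment of a URL to name an image or pass a parameter
Extracting every URL link on a web page
Two ways to extract a page's title
Locating a table and extracting attribute-value pairs
Filtering <span></span> and similar tags
Getting the content of <script></script> and similar tags
Filtering <br /> tags with the replace function
Stripping HTML tags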

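1. Getting the content between <tr> tags

Core code:

   s = re.findall(r'<tr>(.*?)</tr>', language, re.S | re.M)

Here language is whatever variable holds the HTML text to search.

An example: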

import re

doc = '''<tr><th>性別：</th><td>男</td></tr><tr>'''

string1 = re.findall(r'<tr>(.*?)</tr>', doc, re.S | re.M)
print(string1[0])

string2 = re.findall(r'<td>(.*?)</td>', doc, re.S | re.M)
print(string2)  # findall returns a list

Output:
>>>
<th>性別：</th><td>男</td>
['男']
>>>
#  re.I: ignore case
#  re.M: multi-line mode; changes the behavior of '^' and '$'
#  re.S: dot-matches-all mode; '.' also matches newlines
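
To make the effect of re.S concrete, the sketch below runs the same pattern against a row that spans several lines, once without the flag and once with it. The sample string html is made up for illustration.

```python
import re

# Hypothetical sample: one table row spread over several lines.
html = "<tr>\n<th>Name</th>\n<td>Alice</td>\n</tr>"

pattern = r'<tr>(.*?)</tr>'

# Without re.S, '.' stops at newlines, so the row cannot be matched.
without_dotall = re.findall(pattern, html)

# With re.S, '.' also matches '\n', so the whole row is captured.
with_dotall = re.findall(pattern, html, re.S)

print(without_dotall)
print(with_dotall)
```

Since the pattern contains neither '^' nor '$', re.M makes no difference here; re.S is the flag doing the work.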

2. Getting the content between <a href=..></a> hyperlink tags

Core code:

    url = re.findall(r'<a .*?>(.*?)</a>', content, re.S|re.M)

An example:

import re

doc = '''
<td>
<a href="https://www.baidu.com/articles/zj.html" title="浙江省">浙江省主题介绍</a>
<a href="https://www.baidu.com//articles/gz.html" title="贵州省">贵州省主题介绍</a>
</td>
'''

# Get the text between <a href> and </a>
print('Link text:')
s1 = re.findall(r'<a .*?>(.*?)</a>', doc, re.I|re.S|re.M)
for i in s1:
    print(i)

# Get each complete <a href>...</a> element
print('\nFull link elements:')
s2 = re.findall(r"<a href=.*?</a>", doc, re.I|re.S|re.M)
for i in s2:
    print(i)

# Get the URL inside each <a href>
print('\nLink URLs:')
s = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
s3 = re.findall(s, doc, re.I|re.S|re.M)
for i in s3:
    print(i)

Output:
>>>
Link text:
浙江省主题介绍
贵州省主题介绍

Full link elements:
<a href="https://www.baidu.com/articles/zj.html" title="浙江省">浙江省主题介绍</a>
<a href="https://www.baidu.com//articles/gz.html" title="贵州省">贵州省主题介绍</a>

Link URLs:
https://www.baidu.com/articles/zj.html
https://www.baidu.com//articles/gz.html
>>>
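
The three patterns above return the text and the URL in separate lists. As a possible alternative, one pattern with capture groups can return both together. This is only a sketch: the name link_re and the single-quoted second href are assumptions added for illustration.

```python
import re

doc = '''
<td>
<a href="https://www.baidu.com/articles/zj.html" title="浙江省">浙江省主题介绍</a>
<a href='https://www.baidu.com/articles/gz.html' title="贵州省">贵州省主题介绍</a>
</td>
'''

# Group 1: the opening quote character; group 2: the URL; group 3: the link text.
# \1 requires the closing quote to be the same character as the opening one.
link_re = re.compile(r'<a\s[^>]*?href=(["\'])(.*?)\1[^>]*>(.*?)</a>', re.I | re.S)

links = [(m.group(2), m.group(3)) for m in link_re.finditer(doc)]
for url, text in links:
    print(url, text)
```

Keeping the URL and the text in one tuple avoids having to line up two lists by index afterwards.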

3. Using the last segment of a URL to name an image or pass a parameter

An example:

url = "http://baidu.com.cn/file/2021/Img1415_00.jpg"
value = url.split('/')[-1]
print(value)

Output:
>>>
Img1415_00.jpg
>>>
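
Splitting on '/' works for plain URLs, but it keeps any query string or fragment in the result. A minimal sketch using the standard library's urllib.parse follows; the helper name last_segment and the ?size=big#top suffix are made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit

def last_segment(url):
    """Return the last path segment of a URL, ignoring ?query and #fragment."""
    path = urlsplit(url).path
    return posixpath.basename(path)

plain = last_segment("http://baidu.com.cn/file/2021/Img1415_00.jpg")
with_query = last_segment("http://baidu.com.cn/file/2021/Img1415_00.jpg?size=big#top")
naive = "http://baidu.com.cn/file/2021/Img1415_00.jpg?size=big#top".split('/')[-1]

print(plain)       # file name only
print(with_query)  # query and fragment are dropped
print(naive)       # the plain split keeps them
```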

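4. Extracting every URL link on a web page

An example:

import re
import urllib.request

url = "http://www.csdn.net/"
doc = urllib.request.urlopen(url).read().decode('utf-8')

# Every complete <a ... href=...>...</a> element
anchors = re.findall(r"<a.*?href=.*?</a>", doc, re.I)
for anchor in anchors:
    print(anchor)

# Every quoted href value (this also picks up <link href=...> and similar tags)
link_list = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", doc)
for link in link_list:
    print(link)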

Output:
>>>
<a href="https://live.csdn.net" target="_blank" data-report-click="{&quot;spm&quot;:&quot;1000.2115.3001.4124&quot;,&quot;dest&quot;:&quot;https://live.csdn.net&quot;,&quot;extra&quot;:&quot;{\&quot;fId\&quot;:558,\&quot;fName\&quot;:\&quot;floor-www-index\&quot;,\&quot;compName\&quot;:\&quot;www-interaction\&quot;,\&quot;compDataId\&quot;:\&quot;index-nav-www\&quot;,\&quot;fTitle\&quot;:\&quot;\&quot;,\&quot;pageId\&quot;:141}&quot;}" data-report-query="spm=1000.2115.3001.4124"><img src="https://img-home.csdnimg.cn/images/20210708022656.png" alt="直播"> <span>直播</span></a>
<a href="https://blink.csdn.net/" target="_blank" data-report-click="{&quot;spm&quot;:&quot;1000.2115.3001.4124&quot;,&quot;dest&quot;:&quot;https://blink.csdn.net/&quot;,&quot;extra&quot;:&quot;{\&quot;fId\&quot;:558,\&quot;fName\&quot;:\&quot;floor-www-index\&quot;,\&quot;compName\&quot;:\&quot;www-interaction\&quot;,\&quot;compDataId\&quot;:\&quot;index-nav-www\&quot;,\&quot;fTitle\&quot;:\&quot;\&quot;,\&quot;pageId\&quot;:141}&quot;}" data-report-query="spm=1000.2115.3001.4124"><img src="https://img-home.csdnimg.cn/images/20200629060547.png" alt="动态"> <span>动态</span></a>
…………
https://g.csdnimg.cn/static/logo/favicon32.ico
https://www.csdn.net
…………
>>>
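
Real pages often mix absolute and relative links and repeat the same link several times. The sketch below shows one way to normalize and de-duplicate the extracted href values with urllib.parse.urljoin. It works on a small made-up HTML fragment instead of a live download, so the links inside doc are assumptions.

```python
import re
from urllib.parse import urljoin

base_url = "http://www.csdn.net/"  # the page the HTML was fetched from

# Hypothetical HTML fragment standing in for the downloaded page.
doc = '''
<a href="https://live.csdn.net">Live</a>
<a href="/nav/python">Python</a>
<a href="https://live.csdn.net">Live again</a>
'''

hrefs = re.findall(r'href=["\'](.*?)["\']', doc, re.I)

seen = set()
absolute_links = []
for href in hrefs:
    full = urljoin(base_url, href)  # resolves relative paths against base_url
    if full not in seen:            # keep first occurrence, preserve order
        seen.add(full)
        absolute_links.append(full)

print(absolute_links)
```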

5. Two ways to extract a page's title

An example:

import re
import urllib.request

url = "http://www.csdn.net/"
doc = urllib.request.urlopen(url).read().decode('utf-8')

print('Method 1:')
ti = re.compile(r'(?<=<title>).*?(?=</title>)', re.M|re.S)
tit = re.search(ti, doc)
print(tit.group())
# group(): the whole match, same as group(0)
# group(1): the part matched by the first parenthesized group; group(2): the second, and so on

print('Method 2:')
title = re.findall(r'<title>(.*?)</title>', doc)
print(title[0])

Output:
>>>
Method 1:
CSDN - 专业开发者社区
Method 2:
CSDN - 专业开发者社区
>>>
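
Both methods raise an exception when the page has no title (tit is None, or title is an empty list). A slightly more defensive sketch is shown below; the helper name get_title and the sample strings are assumptions for illustration.

```python
import re

def get_title(html):
    """Return the page title, or None when no <title> element is found."""
    # [^>]* tolerates attributes on the tag; re.I accepts <TITLE>; re.S allows newlines.
    m = re.search(r'<title[^>]*>(.*?)</title>', html, re.I | re.S)
    return m.group(1).strip() if m else None

found = get_title('<html><head><TITLE>\n CSDN - 专业开发者社区 \n</TITLE></head></html>')
missing = get_title('<html><body>no title here</body></html>')

print(found)
print(missing)
```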

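6. Locating a table and extracting attribute-value pairs

You can locate a specific table with the string method find() and then apply regular expressions only to that slice of the page.

An example:

# content holds the HTML of the page
start = content.find('<table class="infobox vevent"')  # where the table starts
end = content.find('</table>', start)  # first closing tag after the start
print(content[start:end])

A td tag may carry other attributes, and the content between <td> and </td> may itself need processing. First, an example of extracting td values: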
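
The same idea can be wrapped in a small helper that searches for the closing tag only after the start position, returns None when the table is missing, and includes the closing tag in the result. The function name slice_table and the sample page string are made up for illustration, and nested tables are not handled.

```python
def slice_table(content, marker='<table class="infobox vevent"'):
    """Return the first table that starts with marker, or None if not found."""
    start = content.find(marker)
    if start == -1:
        return None
    closing = '</table>'
    end = content.find(closing, start)  # search only after the table start
    if end == -1:
        return None
    return content[start:end + len(closing)]

# Hypothetical page with an unrelated table before the one we want.
page = ('<table id="nav"><tr><td>x</td></tr></table><p>text</p>'
        '<table class="infobox vevent"><tr><td>y</td></tr></table>')

table = slice_table(page)
print(table)
```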

import re

doc = '''<table>  
<tr>  
<td>序列号</td><td>DEIN3-39CD3-2093J3</td>  
<td>日期</td><td>2013年1月22日</td>  
<td>售价</td><td>392.70 元</td>  
<td>说明</td><td>仅限5用户使用</td>  
</tr>  
</table>
'''

s = r'<td>(.*?)</td><td>(.*?)</td>'
m = re.findall(s, doc, re.S | re.M)
for line in m:
    print(line[0],line[1])

Output:
>>>
序列号 DEIN3-39CD3-2093J3
日期 2013年1月22日
售价 392.70 元
说明 仅限5用户使用
>>>

If the tag has the form <td id="">, the regular expression becomes r'<td id=.*?>(.*?)</td>'.
If the tag does not begin with an id attribute, use r'<td .*?>(.*?)</td>'.
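
As a sketch of how both cases might be covered by one pattern, <td\b[^>]*> matches a td tag with or without attributes. The attributes in the sample below (id, class, style) are invented for illustration.

```python
import re

# Hypothetical rows whose td tags carry different attributes.
doc = '''<tr>
<td id="k1">序列号</td><td class="v">DEIN3-39CD3-2093J3</td>
<td>日期</td><td style="color:red">2013年1月22日</td>
</tr>'''

# <td, a word boundary, then any attributes up to the closing '>'.
cell = r'<td\b[^>]*>(.*?)</td>'

# Two consecutive cells form one (name, value) pair.
pairs = re.findall(cell + r'\s*' + cell, doc, re.S)
info = dict(pairs)
print(info)
```

The \b keeps the pattern from matching other tags whose names merely start with td.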

7. Filtering <span></span> and similar tags

An example:

import re

doc = '''
<table class="infobox bordered vcard" style="width: 21em; font-size: 89%; text-align: left;" cellpadding="3">
<tr>
<th>異名：</th>
<td><span class="nickname">(字) 翔宇</span></td>
</tr>
<tr>
<th>籍貫：</th>
<td><a href="../articles/%E81.html" title="浙江省">浙江省</a><a href="../articles/%E7%BB%8D82.html" title="绍兴市">绍兴市</a></td>
</tr>
</table>
'''

# Get the content of each tr in the table
rows = re.findall(r'<tr>(.*?)</tr>', doc, re.S | re.M)
for line in rows:
    # Get the td cell in the second column
    td = re.findall(r'<td>(.*?)</td>', line, re.S|re.M)
    for n in td:
        if "span" in n:  # unwrap the <span> tag
            value = re.findall(r'<span .*?>(.*?)</span>', n, re.S|re.M)
            for i in value:
                print(i)

Output:
>>>
(字) 翔宇
>>>
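
The example above prints only cells wrapped in <span>, so the second row (wrapped in <a> tags) is skipped. One possible variation, sketched below, pairs each th with its td and removes whatever tags wrap the value using re.sub. This swaps in the tag-stripping technique from section 10 and is not the method used above; the name row_re is an assumption.

```python
import re

doc = '''
<tr>
<th>異名：</th>
<td><span class="nickname">(字) 翔宇</span></td>
</tr>
<tr>
<th>籍貫：</th>
<td><a href="../articles/%E81.html" title="浙江省">浙江省</a><a href="../articles/%E7%BB%8D82.html" title="绍兴市">绍兴市</a></td>
</tr>
'''

# Group 1: the header cell; group 2: the raw value cell, tags included.
row_re = re.compile(r'<th>(.*?)</th>\s*<td>(.*?)</td>', re.S)

rows = []
for key, raw in row_re.findall(doc):
    value = re.sub(r'<[^>]+>', '', raw)  # drop whatever tags wrap the value
    rows.append((key, value))
    print(key, value)
```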

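8. Getting the content of <script></script> and similar tags

The original images of a gallery are stored inside a <script> block. Each entry holds three URLs: original (the full-size image), thumb (the thumbnail), and big (the large image). Only original is needed, so a regular expression extracts that URL and the image is downloaded:

import re
import urllib.request
import os

content = '''
<script>var images = [  
{ "big":"http://i-2.yxdown.com/2015/3/18/KDkwMHgp/6381cce.jpg",  
  "thumb":"http://i-2.yxdown.com/2015/3/18/KHgxMjAp/6381ccce.jpg",  
  "original":"http://i-2.yxdown.com/2015/3/18/6381ccc03e.jpg", }  
</script>  
'''  # Simplified to show the method only; these URLs do not exist and return 404

scripts = re.findall(r'<script>(.*?)</script>', content, re.S | re.M)
for script in scripts:
    m = re.findall(r'"original":"(.*?)"', script)
    for i in m:
        print(i)
        filename = os.path.basename(i)  # strip the directory path, keep the file name
        urllib.request.urlretrieve(i, 'E:\\' + filename)  # download the image

Output: the URL is printed and the image is downloaded to drive E:.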
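
If all three sizes are of interest, a single pattern can collect them into a dictionary. The sketch below does no downloading and only parses the text; the variable names entries and images are assumptions, and the pattern assumes the keys and values are double-quoted as in the sample above.

```python
import re

content = '''
var images = [
{ "big":"http://i-2.yxdown.com/2015/3/18/KDkwMHgp/6381cce.jpg",
  "thumb":"http://i-2.yxdown.com/2015/3/18/KHgxMjAp/6381ccce.jpg",
  "original":"http://i-2.yxdown.com/2015/3/18/6381ccc03e.jpg", }
'''

# Each match is a (key, url) tuple; only the three known keys are accepted.
entries = re.findall(r'"(big|thumb|original)"\s*:\s*"(.*?)"', content)
images = dict(entries)

print(images["original"])
```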

9. Filtering <br /> tags with replace

Core code:

    if '<br />' in value:
        value = value.replace('<br />','')   # remove the tag
        value = value.replace('\n',' ')      # replace newlines with spaces so the text stays on one line
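
str.replace only removes the exact spelling '<br />'. Pages also write the tag as <br>, <br/>, or <BR>, so a regular expression may be more robust. A minimal sketch with a made-up sample string:

```python
import re

# Hypothetical text containing several spellings of the line-break tag.
value = "a<br />b<BR>c<br/>\nd"

# Matches <br>, <br/>, and <br /> in any letter case.
value = re.sub(r'<br\s*/?>', '', value, flags=re.I)
value = value.replace('\n', ' ')  # newlines become spaces, as in the code above

print(value)
```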

10. Stripping HTML tags

Core code:

    value = re.sub('<[^>]+>','', doc)

An example:

import re

doc = '''
<table class="infobox" style="width: 21em; text-align: left;" cellpadding="3">
<tr bgcolor="#CDDBE8">
<th colspan="2">
<center class="role"><b>中華民國政治人士</b><br /></center>
</th>
</tr>
<tr>
<th>政黨：</th>
<td><span class="org">
<img alt="中國國民黨" src="../../../../images/Kuomintang.svg.png" width="19" height="19" border="0" />
<a href="../../../../articles/%8B%E6%B0%91%E9%BB%A8.html" title="中國國民黨">中國國民黨</a></span></td>
</tr>
</table>
'''

value = re.sub('<[^>]+>', '', doc)  # strip the HTML tags
print(value)

Output:
>>>



中華民國政治人士



政黨：


中國國民黨
>>>
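
As the output shows, stripping the tags leaves many blank lines behind, and entities such as &amp; would stay encoded. The sketch below adds two clean-up steps using the standard html module. The helper name html_to_text and the "&amp; others" text in the sample are invented for illustration.

```python
import html
import re

def html_to_text(fragment):
    """Strip tags, decode entities, and drop empty lines."""
    text = re.sub(r'<[^>]+>', '', fragment)  # remove the tags
    text = html.unescape(text)               # &amp; -> &, &lt; -> <, and so on
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)

sample = '''
<tr>
<th>政黨：</th>
<td><span class="org">
<a href="#" title="中國國民黨">中國國民黨</a> &amp; others</span></td>
</tr>
'''

result = html_to_text(sample)
print(result)
```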

Recommended reading: Python正则表达式指南 (A Guide to Python Regular Expressions)
