Extracting data from crawled web pages: 3 ways to scrape web page data

Latest recommended article published on 2024-07-16 15:49:34

Pop_Rain

Category: python

Article link: https://blog.csdn.net/pop_rain/article/details/72550663

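This article introduces two common methods for scraping web page data: regular expressions and the lxml library. It explains how to use regular expressions in detail and compares the two approaches in terms of processing efficiency and results.

Summary generated automatically by CSDN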

1. Regular expressions

(1)

re.findall(r'<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)
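
To make it easier to see what the pattern in (1) captures, here is a minimal sketch that runs it against a small hand-written fragment. The `html` string and the area value in it are assumptions made for illustration; the original post does not show the page it downloaded.

```python
import re

# Hypothetical HTML fragment standing in for the downloaded page (assumption).
html = (
    '<table>'
    '<tr id="places_area__row">'
    '<td class="w2p_fl"><label>Area: </label></td>'
    '<td class="w2p_fw">244,820 square kilometres</td>'
    '</tr>'
    '</table>'
)

# Raw string so that \s reaches the regex engine unchanged.
# ["\'] accepts either quote style around the class attribute.
pattern = r'<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>'

# findall returns the text captured by the single group for every match.
matches = re.findall(pattern, html)
print(matches)
```

Note that `.` does not match newlines by default, so on a page where the row spans several lines the call would probably need `flags=re.DOTALL`.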

(2)

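import re
pattern = re.compile("hello")
# match_list = re.findall(pattern, "hello world! hello")  # finds every match and returns them as a list
match = pattern.match("hello world! hello")  # match() only tries at the start of the string: returns a Match object, or None if there is no match
print(match)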
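
The difference between `match`, `search`, and `findall` is easy to miss, so the following short sketch compares them on the same compiled pattern. The second test string, "say hello", is an assumption added for illustration.

```python
import re

pattern = re.compile("hello")
text = "hello world! hello"

# match() anchors at position 0 of the string.
at_start = pattern.match(text)
not_at_start = pattern.match("say hello")

# search() scans the whole string and returns the first match it finds.
anywhere = pattern.search("say hello")

# findall() returns every non-overlapping match as a list of strings.
all_matches = pattern.findall(text)

print(at_start)       # a Match object
print(not_at_start)   # None, because the string does not start with "hello"
print(anywhere)       # a Match object found further into the string
print(all_matches)
```

In scraping code, `search` or `findall` is usually the better fit, since the target text rarely sits at the very beginning of the page.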

2. BeautifulSoup(bs4)

Reference: a very detailed tutorial on using the Beautiful Soup library in Python: http://www.jb51.net/article/65287.htm

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")    # parse the downloaded HTML text with the built-in html.parser
>>> tr = soup.find(attrs={"id":"places_area__row"})
>>> tr
<tr id="places_area__row"
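
As a self-contained sketch of the same lookup, the script below finds the row by its `id` and then reads the text of the `w2p_fw` cell inside it. The `html` fragment and its area value are assumptions for illustration (the session above is cut off before the full output), and the script requires the third-party `beautifulsoup4` package.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the downloaded page (assumption).
html = (
    '<table>'
    '<tr id="places_area__row">'
    '<td class="w2p_fl"><label>Area: </label></td>'
    '<td class="w2p_fw">244,820 square kilometres</td>'
    '</tr>'
    '</table>'
)

soup = BeautifulSoup(html, "html.parser")

# Locate the row by id, then the value cell by class inside that row.
tr = soup.find(attrs={"id": "places_area__row"})
td = tr.find(attrs={"class": "w2p_fw"})

# .text returns the concatenated text content of the tag.
area = td.text
print(area)
```

Compared with the regular expression in section 1, this lookup does not depend on attribute order, quoting style, or whitespace inside the tags.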