Scraping tables from web pages with BeautifulSoup: four ways to scrape website data with Python that are worth knowing

Latest recommended article published on 2024-06-20 10:22:07

weixin_39834984




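This article presents four ways to scrape web page data with Python: synchronous (requests + BeautifulSoup), concurrent (concurrent.futures), asynchronous (aiohttp + asyncio), and the Scrapy framework. It compares their running time and efficiency, and stresses that in practice the right method depends on the requirements.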


Preface

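First, the idea behind the crawler: fetch the first page (https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0) to obtain the URLs of 500 famous people, then scrape the name and description from each of those 500 pages, skipping any page that has no description. We then walk through four ways to implement this crawler and look at the strengths and weaknesses of each, in the hope of giving readers a better feel for web scraping. The four approaches are: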

General method (synchronous, requests + BeautifulSoup)
Concurrency (the concurrent.futures module with requests + BeautifulSoup)
Asynchronous (aiohttp + asyncio + requests + BeautifulSoup)
The Scrapy framework
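
The first step described above, collecting the item URLs from the list page, can be tried offline before any request is sent. The following is only a minimal sketch: `SAMPLE_LIST_HTML` is a hand-written stand-in for the real page, `extract_item_urls` is a hypothetical helper name, and the built-in `html.parser` backend is used here instead of `lxml` purely to avoid an extra dependency. The container id `mw-whatlinkshere-list` is the one the article's code looks up.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.wikidata.org"

# Hand-written stand-in for the list page (an assumption, not the real markup).
SAMPLE_LIST_HTML = """
<ul id="mw-whatlinkshere-list">
  <li><a href="/wiki/Q23">George Washington</a></li>
  <li><a href="/wiki/Q42">Douglas Adams</a></li>
  <li>entry without a link</li>
</ul>
"""


def extract_item_urls(html, base_url=BASE_URL):
    """Return the absolute URL of the first link in each list entry."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id="mw-whatlinkshere-list")
    if container is None:
        # The page layout differs from what we expect: nothing to collect.
        return []
    urls = []
    for li in container.find_all("li"):
        link = li.find("a", href=True)
        if link is not None:
            # urljoin turns the relative href into an absolute URL.
            urls.append(urljoin(base_url, link["href"]))
    return urls


item_urls = extract_item_urls(SAMPLE_LIST_HTML)
print(item_urls)
```

Guarding against a missing container or a missing link keeps the crawler from raising on entries that do not match the expected layout.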

General method

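The general method is the synchronous one: it mainly uses requests + BeautifulSoup and processes the pages one after another. The complete Python code is as follows: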

```python
import time

import requests
from bs4 import BeautifulSoup

# Start time
t1 = time.time()
print('#' * 50)

url = "https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
# Request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
# Send the HTTP request
req = requests.get(url, headers=headers)
# Parse the page
soup = BeautifulSoup(req.text, "lxml")
# Find the list entries that link to each person's page
human_list = soup.find(id='mw-whatlinkshere-list').find_all('li')

urls = []
# Collect the URLs
for human in human_list:
    href = human.find('a')['href']
    urls.append('https://www.wikidata.org' + href)


# Fetch the name and description from each page
def parser(url):
    req = requests.get(url)
    # Parse the fetched text as HTML with BeautifulSoup
    soup = BeautifulSoup(req.text, "lxml")
    # Extract the name and description
    name = soup.find('span', class_="wikibase-title-label")
    desc = soup.find('span', class_="wikibase-descriptionview-text")
    # Skip pages that lack a name or a description
    if name is not None and desc is not None:
        print('%-40s,%s' % (name.text, desc.text))


for url in urls:
    parser(url)

t2 = time.time()  # End time
print('General method, total time: %s' % (t2 - t1))
print('#' * 50)
```

The output is as follows (the middle of the output is omitted and replaced with ......):

```
##################################################
George Washington                       ,    first President of the United States
Douglas Adams                           ,    British author and humorist (1952–2001)
......
Willoughby Newton                       ,    Politician from Virginia, USA
Mack Wilberg                            ,    American conductor
General method, total time: 724.9654655456543
##################################################
```
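
The "skip if there is no description" rule inside `parser` can be checked without touching the network. The sketch below is an illustration under stated assumptions: the two HTML snippets are hand-written samples that reuse the class names from the code above, `parse_item` is a hypothetical name, and `html.parser` stands in for `lxml`. It returns the pair instead of printing it so the result is easy to inspect.

```python
from bs4 import BeautifulSoup

# Hand-written sample pages (assumptions, not real Wikidata markup).
SAMPLE_WITH_DESC = """
<span class="wikibase-title-label">George Washington</span>
<span class="wikibase-descriptionview-text">first President of the United States</span>
"""
SAMPLE_WITHOUT_DESC = '<span class="wikibase-title-label">Entry without description</span>'


def parse_item(html):
    """Return (name, description), or None when either element is missing."""
    soup = BeautifulSoup(html, "html.parser")
    name = soup.find("span", class_="wikibase-title-label")
    desc = soup.find("span", class_="wikibase-descriptionview-text")
    if name is None or desc is None:
        # Mirrors the article's rule: pages without a description are skipped.
        return None
    return name.get_text(strip=True), desc.get_text(strip=True)


print(parse_item(SAMPLE_WITH_DESC))
print(parse_item(SAMPLE_WITHOUT_DESC))
```

Separating parsing from fetching in this way also makes it possible to reuse the same parsing function in a concurrent or asynchronous variant.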

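The synchronous method takes about 725 seconds in total, a little over 12 minutes. The general method is simple to reason about and easy to implement, but it is inefficient and slow. So let's see what concurrency can do.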
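
Roughly 725 seconds for about 500 pages works out to around 1.45 seconds per page, and most of that time is presumably spent waiting on the network rather than computing. The sketch below is not the article's concurrent code; it is a small stdlib-only illustration of why a thread pool from `concurrent.futures` helps in that situation. `fake_fetch` is a hypothetical stand-in for `requests.get` that only sleeps, and the `example.org` URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.05):
    """Stand-in for a real HTTP request: sleeps to imitate network latency."""
    time.sleep(delay)
    return f"<html>{url}</html>"


def fetch_sequential(urls):
    # One request at a time, like the general method.
    return [fake_fetch(u) for u in urls]


def fetch_concurrent(urls, max_workers=10):
    # Threads overlap the waiting time; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, urls))


urls = [f"https://example.org/item/{i}" for i in range(20)]

start = time.perf_counter()
seq = fetch_sequential(urls)
seq_time = time.perf_counter() - start

start = time.perf_counter()
con = fetch_concurrent(urls)
con_time = time.perf_counter() - start

print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

With real requests the speedup depends on the server, the network, and any rate limits, so the worker count here is only a starting point to tune.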
