Scraping tables from web pages with BeautifulSoup: four ways to scrape website data with Python, all worth knowing

Latest recommended article published on 2024-07-07 06:42:20

weixin_39725154

Article tags: scraping tables from web pages with beautifulsoup, python threadpoolexecutor, scraping website users' phone numbers with Python

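This article scrapes information about notable people from wikidata.org and compares four crawler approaches: synchronous (requests + BeautifulSoup), concurrent (concurrent.futures), asynchronous (aiohttp + asyncio), and the Scrapy framework. The synchronous approach is simple and easy to follow but slow. The concurrent approach is faster but pays a thread-switching overhead. The asynchronous approach is efficient but requires familiarity with asynchronous programming. Scrapy offers a mature solution that is fairly fast and can export to CSV automatically.

Summary generated automatically by CSDN.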

Preface

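First, let us walk through the crawler's approach. From the first page (https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0) we obtain the URLs of 500 notable people. We then scrape each person's name and description from those 500 pages, skipping any page that has no description. Below we introduce four ways to implement this crawler and analyze the strengths and weaknesses of each, in the hope of giving readers a better feel for web crawling. The four implementations are: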

Plain method (synchronous, requests + BeautifulSoup)
Concurrent (using the concurrent.futures module together with requests + BeautifulSoup)
Asynchronous (using aiohttp + asyncio + requests + BeautifulSoup)
Using the Scrapy framework
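
All four approaches share the same first step: pull one link per list entry out of the index page and turn it into an absolute URL. The sketch below illustrates that step only. It swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without extra packages, and it works on a tiny hand-written HTML sample rather than the live page. The class `FirstLinkPerItemParser`, the function `extract_person_urls`, and the sample markup are assumptions made for illustration; only the list id `mw-whatlinkshere-list` and the `https://www.wikidata.org` prefix come from the article.

```python
from html.parser import HTMLParser

BASE_URL = "https://www.wikidata.org"


class FirstLinkPerItemParser(HTMLParser):
    """Collect the first <a href> of each <li> inside the element with a given id."""

    def __init__(self, list_id):
        super().__init__()
        self.list_id = list_id
        self.list_tag = None    # tag name of the target list while we are inside it
        self.nesting = 0        # how many tags with that name are currently open
        self.want_link = False  # True between an <li> and its first <a href>
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.list_tag is None:
            # Outside the list: only look for the element carrying the target id.
            if attrs.get("id") == self.list_id:
                self.list_tag, self.nesting = tag, 1
            return
        if tag == self.list_tag:
            self.nesting += 1
        elif tag == "li":
            self.want_link = True
        elif tag == "a" and self.want_link and attrs.get("href"):
            self.hrefs.append(attrs["href"])
            self.want_link = False  # ignore further links in the same <li>

    def handle_endtag(self, tag):
        if tag == self.list_tag:
            self.nesting -= 1
            if self.nesting == 0:
                self.list_tag, self.want_link = None, False


def extract_person_urls(html, list_id="mw-whatlinkshere-list"):
    """Return absolute URLs for the first link of every list entry."""
    parser = FirstLinkPerItemParser(list_id)
    parser.feed(html)
    return [BASE_URL + href for href in parser.hrefs]


# Hand-written sample; the real page markup may differ.
SAMPLE_HTML = """
<ul id="mw-whatlinkshere-list">
  <li><a href="/wiki/Q23">George Washington</a>
      (<a href="/wiki/Special:WhatLinksHere/Q23">links</a>)</li>
  <li><a href="/wiki/Q42">Douglas Adams</a></li>
</ul>
<ul><li><a href="/wiki/Elsewhere">outside the list</a></li></ul>
"""

person_urls = extract_person_urls(SAMPLE_HTML)
print(person_urls)
```

Taking only the first link of each entry mirrors the `human.find('a')['href']` call used in the article's listings, which matters when an entry carries additional tool links.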

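Plain method

The plain method is the synchronous one: it mainly uses requests + BeautifulSoup and processes the pages one after another, in order. The complete Python code is as follows: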

import
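
The sequential flow described above can be sketched roughly as follows. This is not the original listing. It is a minimal stand-in that uses the standard library's `html.parser` instead of BeautifulSoup and takes the page-fetching function as a parameter, so the demo can run on in-memory pages instead of the network. The names `SpanTextParser`, `span_text`, `crawl_sequentially`, and `FAKE_PAGES` are hypothetical. The class name `wikibase-title-label` appears in the concurrent listing of this article, while `demo-description` is a made-up placeholder for the description element's class.

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Grab the text of the first <span> that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while we are inside the matching span
        self.found = False  # only the first matching span is used
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            self.depth += 1  # nested span inside the match
        elif not self.found:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth, self.found = 1, True

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def span_text(html, css_class):
    """Return the stripped text of the matching span, or None if absent or empty."""
    parser = SpanTextParser(css_class)
    parser.feed(html)
    return "".join(parser.parts).strip() or None


def crawl_sequentially(urls, fetch, name_class, desc_class):
    """Fetch and parse the pages one by one; skip pages without a description."""
    rows = []
    for url in urls:
        html = fetch(url)  # blocks until this page is available
        name = span_text(html, name_class)
        desc = span_text(html, desc_class)
        if name and desc:
            rows.append((name, desc))
    return rows


NAME_CLASS = "wikibase-title-label"  # from the article's concurrent listing
DESC_CLASS = "demo-description"      # hypothetical placeholder class

# In-memory stand-ins for fetched pages; the markup is invented for the demo.
FAKE_PAGES = {
    "page-1": '<h1><span class="wikibase-title-label">George Washington</span></h1>'
              '<span class="demo-description">first President of the United States</span>',
    "page-2": '<h1><span class="wikibase-title-label">No Description Here</span></h1>',
}

rows = crawl_sequentially(FAKE_PAGES, FAKE_PAGES.get, NAME_CLASS, DESC_CLASS)
for name, desc in rows:
    print(f"{name:<40},    {desc}")
```

With a real crawler, `fetch` would be a function that performs the HTTP request and returns the response text. Because each call blocks until its page arrives, the total time grows with the number of pages.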

The output is shown below (the middle of the output is omitted and replaced by ......):

##################################################
George Washington                       ,    first President of the United States
Douglas Adams                           ,    British author and humorist (1952–2001)
......
Willoughby Newton                       ,    Politician from Virginia, USA
Mack Wilberg                            ,    American conductor
Plain method, total time: 724.9654655456543
##################################################

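With the synchronous method the total time is about 725 seconds, a little over 12 minutes, or roughly 1.45 seconds per page across the 500 pages. The plain method is simple to reason about and easy to implement, but it is inefficient and slow. So let us see how concurrency fares.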
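
To make the cost of waiting on each request in turn more concrete, here is a small simulation. It does not touch the network: `fake_fetch` is a hypothetical stand-in that just sleeps for a made-up latency, and the `example.invalid` URLs are placeholders. It only illustrates how overlapping the waits with a thread pool shortens the wall-clock time for I/O-bound work; real speedups depend on the server, the network, and the machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, latency=0.05):
    """Stand-in for an HTTP request: wait for `latency` seconds, return a dummy body."""
    time.sleep(latency)
    return f"<html>{url}</html>"


def timed(func):
    """Run func() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


urls = [f"https://example.invalid/page/{i}" for i in range(20)]

# One request after another: the waits add up.
sequential_pages, sequential_time = timed(lambda: [fake_fetch(u) for u in urls])

# Up to 20 requests in flight at once: the waits overlap.
with ThreadPoolExecutor(max_workers=20) as pool:
    threaded_pages, threaded_time = timed(lambda: list(pool.map(fake_fetch, urls)))

print(f"sequential:  {sequential_time:.2f} s")
print(f"thread pool: {threaded_time:.2f} s")
```

`pool.map` returns results in the same order as the input URLs, so both runs produce identical page lists and differ only in elapsed time.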

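Concurrent method

The concurrent method uses multiple threads to speed up the plain method. We use the concurrent.futures module and set the number of threads to 20 (the actual number reached may be lower, depending on the machine). The complete Python code is as follows: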

import requests
from bs4 import BeautifulSoup
import time
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

# Start time
t1 = time.time()
print('#' * 50)

url = "https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
# Request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
# Send the HTTP request
req = requests.get(url, headers=headers)
# Parse the page
soup = BeautifulSoup(req.text, "lxml")
# Find the list entries whose pages hold the name and description
human_list = soup.find(id='mw-whatlinkshere-list').find_all('li')

urls = []
# Collect the URL of each person's page
for human in human_list:
    href = human.find('a')['href']
    urls.append('https://www.wikidata.org' + href)

# Get the name and description from each page
def parser(url):
    req = requests.get(url)
    # Parse the fetched text as HTML with BeautifulSoup
    soup = BeautifulSoup(req.text, "lxml")
    # Get the name and description
    name = soup.find('span', class_="wikibase-title-label")
    desc = soup.find('span', class_="wikibase-descri