A Comparison of Python Web Scraping Libraries

阿狸轰

Published 2023-11-19 18:01:24

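Requests, BeautifulSoup, and Scrapy are three libraries commonly used for web scraping in Python, and each serves a different purpose. Their main features and usage are compared below.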

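Requests:

Features:

Simple to use: Requests is a small, intuitive library that makes sending HTTP requests very easy.
Widely used for fetching pages: Its main job is to send HTTP requests to a site and return the response, so the returned HTML or other data can be processed.
No HTML parsing: Requests itself does not parse HTML, so it is usually combined with another library (such as BeautifulSoup) to process page content.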

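Usage examples:

pip install requests

Sending a GET request:

import requests

url = 'https://www.example.com'
# Set a timeout so the call cannot hang indefinitely
response = requests.get(url, timeout=10)

# Get the response body as text
content = response.text
print(content)

# Get the status code
status_code = response.status_code
print(f'Status Code: {status_code}')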

Sending a POST request:

import requests

url = 'https://www.example.com'
# Form fields sent in the request body
data = {'key1': 'value1', 'key2': 'value2'}
response = requests.post(url, data=data, timeout=10)

# Get the response body as text
content = response.text
print(content)

# Get the status code
status_code = response.status_code
print(f'Status Code: {status_code}')

Setting request headers:

import requests

url = 'https://www.example.com'
headers = {'User-Agent': 'MyApp/1.0'}
response = requests.get(url, headers=headers)

# Get the response body as text
content = response.text
print(content)
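
To check what Requests would actually send (the final URL and headers) without touching the network, a request can be prepared first and inspected. This is a minimal sketch; the /search path and the q and page parameters are made up for illustration and are not part of the original examples.

```python
import requests

# Hypothetical endpoint and query parameters, used only for illustration
url = 'https://www.example.com/search'
params = {'q': 'python', 'page': 2}
headers = {'User-Agent': 'MyApp/1.0'}

# Build the request without sending it
request = requests.Request('GET', url, params=params, headers=headers)
prepared = request.prepare()

# Inspect the URL and headers that would go over the wire
print(prepared.url)
print(prepared.headers['User-Agent'])

# To actually send it, hand the prepared request to a session:
# with requests.Session() as session:
#     response = session.send(prepared, timeout=10)
```

Preparing a request this way is mostly useful for debugging query-string encoding or header problems before any traffic is generated.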

Handling JSON data:

import requests

url = 'https://api.example.com/data'
response = requests.get(url)

# Decode the JSON response body into Python objects
json_data = response.json()
print(json_data)

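BeautifulSoup:

Features:

HTML parsing: BeautifulSoup focuses on parsing HTML and provides convenient methods for extracting the data you need.
Fault tolerant: It copes well with malformed HTML and can still extract useful information from it.
Used together with Requests: Typically Requests fetches the page and BeautifulSoup parses it.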
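
The fetch-then-parse division of labor described above can be sketched as two small functions. This is an illustrative sketch rather than a fixed recipe: the function names fetch_html and extract_links and the sample HTML are assumptions introduced here. The parsing step is demonstrated on a literal string so it runs without network access.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def fetch_html(url):
    """Download a page with Requests and return its HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx responses
    return response.text


def extract_links(html, base_url):
    """Parse HTML with BeautifulSoup and return (text, absolute URL) pairs."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for a in soup.find_all('a', href=True):
        links.append((a.get_text(strip=True), urljoin(base_url, a['href'])))
    return links


# Offline demonstration on a literal HTML string (made-up content)
sample_html = (
    '<html><body>'
    '<a href="/docs">Docs</a> <a href="https://example.org/">Other</a>'
    '</body></html>'
)
links = extract_links(sample_html, 'https://www.example.com')
print(links)

# With network access the two steps chain together:
# links = extract_links(fetch_html('https://www.example.com'), 'https://www.example.com')
```

Keeping the download and the parsing in separate functions makes the parsing logic easy to test against saved HTML.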

Usage examples:

pip install beautifulsoup4

from bs4 import BeautifulSoup

# Sample HTML
html_doc = """
<html>
  <head>
    <title>My Web Page</title>
  </head>
  <body>
    <p class="first-paragraph">This is the first paragraph.</p>
    <p class="second-paragraph">This is the second paragraph.</p>
    <p class="second-paragraph">Another second paragraph.</p>
  </body>
</html>
"""

# Create the BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the whole HTML structure, indented
print(soup.prettify())

Finding elements by tag name:

# Find the first p tag
first_p = soup.find('p')
print(first_p.text)

# Find all p tags
all_p = soup.find_all('p')
for p in all_p:
    print(p.text)

Finding elements by class name:

# Find all p tags whose class is "second-paragraph"
second_paragraphs = soup.find_all('p', class_='second-paragraph')
for p in second_paragraphs:
    print(p.text)

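Getting attribute values:

# Get the class attribute of the first p tag
# (class is a multi-valued attribute, so the value is a list)
class_value = first_p['class']
print(class_value)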

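Getting parent and child nodes:

# Get the parent node of the first p tag
parent = first_p.parent
print(parent.name)

# Get the direct child tags of the first p tag
# (this paragraph contains only text, so the loop prints nothing)
children = first_p.find_all(recursive=False)
for child in children:
    print(child.name)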

Using CSS selectors:

# Use a CSS selector to find all p tags
p_elements = soup.select('p')
for p in p_elements:
    print(p.text)
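
CSS selectors can do more than match a tag name. The sketch below shows a few common forms (class, id with descendant, exact attribute match) on a small made-up HTML snippet; the markup and the names in it are assumptions introduced for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html_doc = """
<div id="content">
  <p class="intro">Hello</p>
  <ul>
    <li><a href="/a">First</a></li>
    <li><a href="/b" class="external">Second</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# select_one returns the first match, or None if nothing matches
intro = soup.select_one('p.intro')
print(intro.get_text())

# Descendant selector: every <a> inside the element with id="content"
hrefs = [a['href'] for a in soup.select('#content a')]
print(hrefs)

# Class selector on a specific tag: only links with class "external"
external = [a.get_text() for a in soup.select('a.external')]
print(external)

# Attribute selector: the link whose href is exactly "/a"
first_link = soup.select_one('a[href="/a"]')
print(first_link.get_text())

# A selector with no match gives None rather than raising
missing = soup.select_one('table')
print(missing)
```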

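Scrapy:

Features:

Full-featured crawling framework: Scrapy covers the whole life cycle of a crawl, including sending requests, processing responses, and storing data.
Asynchronous processing: Requests are handled asynchronously, so large numbers of them can be processed efficiently.
Standard structure: Scrapy has a well-defined project layout that is easy to organize and maintain.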

Usage examples:

Installing Scrapy:

pip install scrapy

Creating a Scrapy project:

scrapy startproject myproject

Defining a spider:

# myproject/spiders/example_spider.py
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Define the parsing logic here
        pass

Writing the parsing logic:

# myproject/spiders/example_spider.py
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Extract the page title
        title = response.css('title::text').get()
        print(title)

Running the spider:

scrapy crawl example

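Storing data:

# myproject/spiders/example_spider.py
import scrapy
import json

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Extract the page title
        title = response.css('title::text').get()

        # Save the data to a JSON file
        # (mode 'w' overwrites the file on every call, which is fine for a single start URL)
        with open('output.json', 'w', encoding='utf-8') as f:
            json.dump({'title': title}, f, ensure_ascii=False)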

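More features:

Scrapy offers many other features, such as setting request headers, handling cookies, handling exceptions, and using middleware. More details are available in the official Scrapy documentation.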

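Comparison summary:

Requests: Suited to simple page requests and data retrieval; it does not parse HTML.
BeautifulSoup: Focused on HTML parsing; suited to extracting structured data from pages.
Scrapy: A full-featured crawling framework, suited to building large, efficient crawlers, with a standard project structure and life cycle.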

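Other options:

Selenium + WebDriver:

Features: Selenium drives a real browser, so it can simulate user actions and handle pages rendered with JavaScript.
Use cases: Dynamic pages that need JavaScript to run, for example pages that load their data with AJAX.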

pip install selenium

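Downloading WebDriver:

Selenium needs a WebDriver that matches the browser version. Recent Selenium releases (4.6 and later) include Selenium Manager, which can fetch a matching driver automatically. With older releases you need to download the driver for your browser version and pass its path to Selenium.

Chrome WebDriver download: ChromeDriver

Firefox WebDriver download: GeckoDriver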

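Writing the scraper:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Use Chrome WebDriver. Selenium 4 takes the driver path through a Service object;
# the executable_path argument and the find_element_by_* helpers were removed.
driver = webdriver.Chrome(service=Service('path/to/chromedriver'))

# Or use Firefox WebDriver
# from selenium.webdriver.firefox.service import Service as FirefoxService
# driver = webdriver.Firefox(service=FirefoxService('path/to/geckodriver'))

# Open the page
driver.get("http://www.example.com")

# Wait up to 10 seconds for the search box to appear
wait = WebDriverWait(driver, 10)
element = wait.until(EC.presence_of_element_located((By.NAME, "search")))

# Interact with the page, for example by typing text and pressing Enter
element.send_keys("Python")
element.send_keys(Keys.RETURN)

# Wait for the results to load
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".search-results-item")))

# Extract data from the page
results = driver.find_elements(By.CSS_SELECTOR, ".search-results-item")
for result in results:
    print(result.text)

# Close the browser
driver.quit()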

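Notes: Make sure the WebDriver version matches your installed browser version. When interacting with a page, give it time to finish loading, either with time.sleep or, more reliably, by waiting for specific elements to appear.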

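Splash:

Features: Splash is a lightweight browser based on the WebKit engine and controlled through an HTTP API. It can be used to render JavaScript.

Use cases: Pages that need JavaScript rendering, where a full browser would be heavier than necessary.

Installing Splash:

First install Docker, then use it to pull and run Splash with the following commands: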

docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash

Installing the Python dependencies:

Splash is driven over its HTTP API, so the example below needs only Requests:

pip install requests

Writing the scraper:

import requests

# Address of the Splash service
splash_url = 'http://localhost:8050'

# Target URL
url = 'http://www.example.com'

# Lua script run by Splash; here it scrolls down to trigger loading of more content
script = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(1)  -- wait 1 second
    splash:runjs("window.scrollTo(0, document.body.scrollHeight);")  -- scroll to the bottom
    splash:wait(1)  -- wait 1 second
    return splash:html()
end
"""

# Send the request to Splash
response = requests.post(
    f'{splash_url}/execute',
    json={
        'lua_source': script,
        'url': url,
    }
)

# Get the rendered page content
html_content = response.text
print(html_content)

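Notes:

Splash listens on port 8050 by default. If you change the port, update splash_url in the code accordingly.
When using Splash, add suitable waits so that JavaScript has time to finish running.
Splash also supports more advanced features, such as setting the viewport size, taking screenshots, and running specific JavaScript code. See the Splash documentation for details.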

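Pyppeteer or Puppeteer:

Features: Pyppeteer is an unofficial Python port of Puppeteer (a Node.js library). Either tool can be used to automate a browser.

Use cases: Pages rendered with JavaScript, including more complex operations and interactions.

Installing Pyppeteer:

Install Pyppeteer with pip: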

pip install pyppeteer

Writing the scraper:

import asyncio
from pyppeteer import launch

async def main():
    # Launch the browser
    browser = await launch()

    # Open a new page
    page = await browser.newPage()

    # Go to the URL
    await page.goto('http://www.example.com')

    # Wait until the title element is present in the DOM
    await page.waitForSelector('title')

    # Get the page title
    title = await page.title()
    print(title)

    # Run JavaScript in the page
    dimensions = await page.evaluate('''() => {
        return {
            width: document.documentElement.clientWidth,
            height: document.documentElement.clientHeight,
            deviceScaleFactor: window.devicePixelRatio,
        }
    }''')
    print(dimensions)

    # Take a screenshot
    await page.screenshot({'path': 'screenshot.png'})

    # Close the browser
    await browser.close()

# Run the coroutine in an event loop
asyncio.run(main())

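Notes:

Pyppeteer is built on asynchronous programming and has to run inside an event loop. This example uses asyncio to run it.
By default Pyppeteer downloads a compatible Chromium build the first time it runs. To use an existing Chrome or Chromium installation instead, pass its path through the executablePath option of launch.
Pyppeteer offers more features, such as setting request headers, simulating user input, and handling cookies. See the Pyppeteer documentation for details.

Pyppeteer is a practical way to scrape pages rendered with JavaScript. It drives a real Chromium browser, headless by default, which generally uses fewer resources than a browser running with a visible window.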

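Using an API:

Features: Some sites provide an API, so the data can be fetched directly with API requests and no HTML parsing is needed.
Use cases: When the target site offers a suitable API, this is a simple and well-defined way to get the data.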
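
As a rough illustration of the API approach, the sketch below collects items from a paginated JSON endpoint. Everything about the API is an assumption made for the example: the page query parameter, the items and has_more fields, and the helper name fetch_all_items are hypothetical. A stand-in session replaces the network so the loop can be run offline.

```python
import requests


def fetch_all_items(base_url, session=None, max_pages=10):
    """Collect items from a paginated JSON API.

    Assumes a hypothetical API that takes a `page` query parameter and
    returns {"items": [...], "has_more": true/false}.
    """
    session = session or requests.Session()
    items = []
    for page in range(1, max_pages + 1):
        response = session.get(base_url, params={'page': page}, timeout=10)
        response.raise_for_status()  # fail loudly on 4xx/5xx
        payload = response.json()
        items.extend(payload['items'])
        if not payload.get('has_more'):
            break
    return items


# Offline demonstration with a stand-in session, so no network is needed
class FakeResponse:
    def __init__(self, payload):
        self._payload = payload

    def raise_for_status(self):
        pass

    def json(self):
        return self._payload


class FakeSession:
    pages = {
        1: {'items': ['a', 'b'], 'has_more': True},
        2: {'items': ['c'], 'has_more': False},
    }

    def get(self, url, params=None, timeout=None):
        return FakeResponse(self.pages[params['page']])


all_items = fetch_all_items('https://api.example.com/data', session=FakeSession())
print(all_items)
```

Against a real API, the parameter names, response fields, and any authentication would come from that API's documentation.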

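In general, if you only need to fetch page content, Requests combined with BeautifulSoup is enough. If you need to build a more complex crawling system, with large-scale crawling and data processing, choose Scrapy. For cases that require simulating user actions or rendering JavaScript, choose Selenium, Splash, or Pyppeteer.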