What Are GuzzleHttp and DomCrawler Used For?

爬虫程序猿

Published on 2025-05-22 15:57:18

Article link: https://blog.csdn.net/wanbangAPI01/article/details/148143395

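GuzzleHttp and DomCrawler are two PHP libraries with different jobs: Guzzle sends HTTP requests, and DomCrawler parses the HTML that comes back. The sections below explain what each one is for and what it can do.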

1. GuzzleHttp

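Purpose: Guzzle (namespace GuzzleHttp) is an HTTP client library for PHP. Its API lets you send GET, POST, PUT, DELETE, and other HTTP requests and work with the responses.

Main features:

Sending HTTP requests: supports both synchronous and asynchronous (promise-based) requests.
Handling responses: gives access to the status code, response headers, and response body.
Flexible configuration: accepts options such as timeouts, request headers, and proxies, per client or per request.
Multiple request methods: supports GET, POST, PUT, DELETE, and the other HTTP methods.
Error handling: reports network failures and, by default, 4xx/5xx responses as exceptions that carry the request and, when one was received, the response.

Example code: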

php

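<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

// Fetch a page and return its body as a string.
// Guzzle throws an exception on network failures and, by default, on 4xx/5xx responses.
function get_html($url) {
    $client = new Client();
    $response = $client->request('GET', $url, [
        'headers' => [
            // Send a browser-like User-Agent header with the request.
            'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36'
        ]
    ]);
    // Casting the body stream to a string reads it from the beginning.
    return (string) $response->getBody();
}

$url = "https://example.com";
$html = get_html($url);
echo $html;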
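
The configuration and error-handling features listed above can be sketched as follows. This is a minimal sketch, not the article's original code: fetch_html is a hypothetical helper, the timeout values are arbitrary, and the responses come from Guzzle's MockHandler so the example runs without network access.

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\GuzzleException;
use GuzzleHttp\Exception\RequestException;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;

// Hypothetical helper: returns the body on success, or null when the request fails.
function fetch_html(Client $client, string $url): ?string {
    try {
        $response = $client->request('GET', $url, [
            'timeout' => 10,        // seconds allowed for the whole request (arbitrary value)
            'connect_timeout' => 5, // seconds allowed to open the connection (arbitrary value)
        ]);
        return (string) $response->getBody();
    } catch (GuzzleException $e) {
        // A 4xx/5xx response is attached to the exception; network failures have none.
        $status = ($e instanceof RequestException && $e->hasResponse())
            ? $e->getResponse()->getStatusCode()
            : 'no response';
        error_log("Request to $url failed ($status): " . $e->getMessage());
        return null;
    }
}

// Queue two canned responses instead of calling a real server.
$mock = new MockHandler([
    new Response(200, [], '<html><body>ok</body></html>'),
    new Response(404),
]);
$client = new Client(['handler' => HandlerStack::create($mock)]);

$ok = fetch_html($client, 'https://example.com/');
$missing = fetch_html($client, 'https://example.com/missing');

var_dump($ok);
var_dump($missing);
```

Catching the GuzzleException interface covers both connection errors and HTTP error responses. Against a real site, construct the client with new Client() and drop the mock handler.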

2. DomCrawler

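Purpose: DomCrawler is a Symfony component for navigating HTML and XML documents. It parses the markup and lets you extract data such as text, attributes, and child nodes.

Main features:

Parsing HTML: loads an HTML document so that you can pull out the data you need.
Selectors: supports XPath out of the box, and CSS selectors through filter(), which requires the symfony/css-selector package.
Working with the DOM: the component is designed for traversal and extraction rather than editing. To add, remove, or change tags and attributes, get the underlying DOMNode objects with getNode() and modify them with PHP's DOM API.
Extracting data: returns text, attribute values, and so on, and supports method chaining.

Example code: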

php

<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Extract the title, price, and link from every product block in the HTML.
// The selectors match the sample markup below; adjust them for a real page.
function parse_html($html) {
    $crawler = new Crawler($html);
    $products = [];
    $crawler->filter('div.product-item')->each(function (Crawler $node) use (&$products) {
        // text() and attr() throw InvalidArgumentException when the selector matches nothing.
        $title = $node->filter('h3.product-title')->text();
        $price = $node->filter('span.product-price')->text();
        $link = $node->filter('a.product-link')->attr('href');
        $products[] = [
            'title' => $title,
            'price' => $price,
            'link' => $link
        ];
    });
    return $products;
}

$html = '<div class="product-item"><h3 class="product-title">Product 1</h3><span class="product-price">$100</span><a class="product-link" href="/product1">Link</a></div>';
$products = parse_html($html);

foreach ($products as $product) {
    echo "Product name: " . $product['title'] . "\n";
    echo "Product price: " . $product['price'] . "\n";
    echo "Product link: " . $product['link'] . "\n";
    echo "----------------------\n";
}
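
Real pages often omit a field for some items, and text() or attr() on an empty match throws an exception. The sketch below shows one way to guard against that by checking count() first. The helpers first_text and first_attr are hypothetical names introduced for illustration, and the sample markup deliberately leaves out the price.

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical helper: text of the first match, or null when nothing matches.
function first_text(Crawler $node, string $selector): ?string {
    $match = $node->filter($selector);
    return $match->count() > 0 ? trim($match->text()) : null;
}

// Hypothetical helper: attribute of the first match, or null when nothing matches.
function first_attr(Crawler $node, string $selector, string $attribute): ?string {
    $match = $node->filter($selector);
    return $match->count() > 0 ? $match->attr($attribute) : null;
}

// Sample markup with no span.product-price element.
$html = '<div class="product-item"><h3 class="product-title">Product 1</h3>'
      . '<a class="product-link" href="/product1">Link</a></div>';

$crawler = new Crawler($html);

// each() collects the closure's return values into an array.
$products = $crawler->filter('div.product-item')->each(function (Crawler $node) {
    return [
        'title' => first_text($node, 'h3.product-title'),
        'price' => first_text($node, 'span.product-price'), // absent in the sample
        'link'  => first_attr($node, 'a.product-link', 'href'),
    ];
});

var_dump($products);
```

Returning null for a missing field keeps one incomplete item from aborting the whole parse. Whether to skip such items or keep them is a decision for the calling code.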

3. Using them together

In practice, Guzzle and DomCrawler are often combined: Guzzle downloads the page and DomCrawler extracts the data from it. A complete example follows:

php

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\GuzzleException;
use Symfony\Component\DomCrawler\Crawler;

// Fetch a page and return its body as a string.
function get_html($url) {
    $client = new Client();
    $response = $client->request('GET', $url, [
        'headers' => [
            'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36'
        ]
    ]);
    return (string) $response->getBody();
}

// Extract the title, price, and link from every product block in the HTML.
// The selectors are placeholders; adjust them to the target page's markup.
function parse_html($html) {
    $crawler = new Crawler($html);
    $products = [];
    $crawler->filter('div.product-item')->each(function (Crawler $node) use (&$products) {
        $title = $node->filter('h3.product-title')->text();
        $price = $node->filter('span.product-price')->text();
        $link = $node->filter('a.product-link')->attr('href');
        $products[] = [
            'title' => $title,
            'price' => $price,
            'link' => $link
        ];
    });
    return $products;
}

// Download one page of search results and parse it into a list of products.
function get_product_list($keyword, $page = 1) {
    $base_url = "https://example.com/search"; // Replace with the target site's product list URL
    $url = $base_url . "?keyword=" . urlencode($keyword) . "&page=" . $page;
    try {
        $html = get_html($url);
    } catch (GuzzleException $e) {
        // Guzzle reports network failures and 4xx/5xx responses as exceptions.
        error_log("Request failed: " . $e->getMessage());
        return [];
    }
    if ($html) {
        return parse_html($html);
    }
    return [];
}

$keyword = "headphones"; // Replace with the actual search keyword
$products = get_product_list($keyword);

foreach ($products as $product) {
    echo "Product name: " . $product['title'] . "\n";
    echo "Product price: " . $product['price'] . "\n";
    echo "Product link: " . $product['link'] . "\n";
    echo "----------------------\n";
}
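
The links extracted above are whatever the page puts in href, which is often a relative path such as /product1. One way to turn them into absolute URLs is to give the Crawler the page URL and call link()->getUri(). The sketch below assumes a page URL and sample markup invented for illustration.

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Assumed page URL; in the example above this would be the URL passed to get_html().
$page_url = 'https://example.com/search?keyword=headphones&page=1';
$html = '<div class="product-item"><a class="product-link" href="/product1">Link</a></div>';

// Passing the page URL as the second argument lets link() resolve relative hrefs.
$crawler = new Crawler($html, $page_url);

$links = $crawler->filter('a.product-link')->each(function (Crawler $node) {
    // link() requires the node to be an <a> or <area> element.
    return $node->link()->getUri();
});

print_r($links);
```

Resolving against the page URL rather than a hard-coded domain also handles hrefs that are relative to the current path.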