How to Write a Simple Baidu Search Crawler in Python

LIY若依

Last modified: 2024-07-13 19:05:03


First published: 2024-07-13 17:19:10

Article link: https://blog.csdn.net/m0_74972192/article/details/140402954


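In this post I show how to write a simple Baidu search crawler in Python. The crawler requests a Baidu results page for a keyword and extracts the title and link of each result. It uses the requests library to send the HTTP request and the BeautifulSoup library to parse the returned HTML.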

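Requirements

Before writing the crawler, we pin down what it has to do:

Build a Baidu search URL and send the search request.
Parse the results page and extract the title and link of each result.
Return the results as a list so they are easy to process and display later.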

Libraries

We need two third-party Python libraries:

requests: sends HTTP requests.
BeautifulSoup: parses HTML content (published on PyPI as beautifulsoup4).

Both can be installed with:

pip install requests beautifulsoup4

1. Import the libraries

   python
   import requests
   from bs4 import BeautifulSoup

We start by importing `requests` and the `BeautifulSoup` class from `bs4`.

2. Define the search function

   python
   def baidu_search(keyword):

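Define a function named `baidu_search` that takes the search keyword as its parameter. The fragments in the following steps form the body of this function.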

3. Set the request headers

   python
   headers = {
       'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'
   }

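Set the request headers to mimic a browser. A browser-like `User-Agent` lowers the chance that Baidu treats the request as coming from a crawler, although it does not guarantee it.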
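If several searches are sent in a row, the same headers can be attached once to a `requests.Session` instead of being passed on every call. The following is only a sketch: `make_session` is a hypothetical helper name, and the `Accept-Language` value is an assumption that the original code does not set.

```python
import requests

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36"
)

def make_session():
    """Return a Session that sends browser-like headers on every request."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": USER_AGENT,
        # Assumption: prefer Chinese-language pages; not in the original code.
        "Accept-Language": "zh-CN,zh;q=0.9",
    })
    return session

session = make_session()
print(session.headers["User-Agent"])
```

Calling `session.get(...)` then behaves like `requests.get(...)` with these headers already applied.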

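4. Build the search URL and send the request

   response = requests.get(
       "https://www.baidu.com/s",
       params={"wd": keyword},  # requests URL-encodes the keyword
       headers=headers,
       timeout=10,              # do not wait forever for a reply
   )

Build the search URL and send a GET request with the `requests` library. Passing the keyword through `params` instead of pasting it into the URL string keeps keywords that contain characters such as `&` or `#` from corrupting the query.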
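Keywords that contain spaces, `&`, or non-ASCII characters have to be percent-encoded before they go into the query string. A minimal sketch using only the standard library shows what the encoded URL looks like; `build_search_url` is a hypothetical helper, not part of the original code.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.baidu.com/s"

def build_search_url(keyword):
    """Return the search URL with the keyword percent-encoded as UTF-8."""
    return f"{BASE_URL}?{urlencode({'wd': keyword})}"

print(build_search_url("编程"))
print(build_search_url("C & C++"))
```

The second call shows why encoding matters: left unencoded, the `&` would end the `wd` parameter early.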

5. Check the status code and parse the response

   if response.status_code == 200:
       soup = BeautifulSoup(response.text, 'html.parser')

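Check whether the request succeeded (HTTP status 200). If it did, parse the response body with `BeautifulSoup`, using Python's built-in `html.parser`.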
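The check-then-parse step can be tried without touching the network. In the sketch below, `parse_response` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only imitate the two attributes of a real `requests.Response` that the helper reads.

```python
from types import SimpleNamespace

from bs4 import BeautifulSoup

def parse_response(response):
    """Return a BeautifulSoup tree for a 200 response, otherwise None."""
    if response.status_code != 200:
        return None
    return BeautifulSoup(response.text, "html.parser")

# Stand-ins for real responses (assumption: only status_code and text are used).
fake_ok = SimpleNamespace(
    status_code=200,
    text="<html><head><title>demo</title></head><body></body></html>",
)
fake_error = SimpleNamespace(status_code=503, text="")

print(parse_response(fake_ok).title.get_text())
print(parse_response(fake_error))
```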

6. Find the search results

   python
   search_results = soup.find_all('h3', class_='t')

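Find all the HTML elements that hold a search result. Here these are the `h3` headings with class `t`; the title and link are pulled out of them in the next step. The selector depends on Baidu's page markup and may need adjusting if the page changes.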
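To see what `find_all('h3', class_='t')` selects, it helps to run it on a small hand-written page. The HTML below is made up for illustration and is not real Baidu markup. Note that `class_='t'` matches any element that has `t` among its classes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not copied from Baidu.
SAMPLE_HTML = """
<div id="content_left">
  <h3 class="t"><a href="https://example.com/a">First result</a></h3>
  <h3 class="c-title t"><a href="https://example.com/b">Second result</a></h3>
  <h3 class="other"><a href="https://example.com/c">Not a result</a></h3>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matches = soup.find_all("h3", class_="t")
titles = [h3.get_text(strip=True) for h3 in matches]
print(titles)
```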

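7. Return the results

   results = []
   for result in search_results:
       title = result.get_text(strip=True)
       link_tag = result.find('a', href=True)
       if link_tag is None:
           continue  # skip headings that carry no link
       results.append({'title': title, 'link': link_tag['href']})

   return results

Store each title and link as a dictionary in a list and return the list. Headings that contain no `<a href>` are skipped, since indexing a missing tag would raise an error.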
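The extraction loop can be isolated into a function and checked against a static page. This is a sketch under assumptions: `extract_results` is a hypothetical helper name and the sample HTML is invented, with one heading deliberately left without a link.

```python
from bs4 import BeautifulSoup

def extract_results(soup):
    """Collect {'title', 'link'} dicts from h3.t headings that contain a link."""
    results = []
    for heading in soup.find_all("h3", class_="t"):
        anchor = heading.find("a", href=True)
        if anchor is None:
            continue  # a heading without <a href> has no link to report
        results.append({
            "title": heading.get_text(strip=True),
            "link": anchor["href"],
        })
    return results

# Hypothetical sample markup for illustration only.
sample = BeautifulSoup(
    '<h3 class="t"><a href="https://example.com/a"> Linked </a></h3>'
    '<h3 class="t">No link here</h3>',
    "html.parser",
)
print(extract_results(sample))
```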

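8. Test the crawler

   python
   keyword = "编程"  # "programming"
   search_results = baidu_search(keyword)
   if search_results:
       print(f"Search results for '{keyword}':")
       for idx, result in enumerate(search_results, 1):
           print(f"{idx}. {result['title']}")
           print(f"   Link: {result['link']}")
           print()
   else:
       print("No search results were retrieved.")

Call `baidu_search` with a test keyword and print the results. The `else` branch covers both a failed request (`None`) and an empty result list.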
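Printing inside the loop makes the output hard to reuse or test. One option is to build the report as a string first and print it afterwards. This is only a sketch; `format_results` is a hypothetical helper, not part of the original code.

```python
def format_results(keyword, results):
    """Return the printable report; None or an empty list gives a fallback line."""
    if not results:
        return "No search results were retrieved."
    lines = [f"Search results for '{keyword}':"]
    for idx, result in enumerate(results, 1):
        lines.append(f"{idx}. {result['title']}")
        lines.append(f"   Link: {result['link']}")
    return "\n".join(lines)

sample_results = [{"title": "Example", "link": "https://example.com"}]
print(format_results("demo", sample_results))
```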

Full code

The complete implementation:

import requests
from bs4 import BeautifulSoup

def baidu_search(keyword):
    # Set request headers to mimic a browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'
    }

    # Send the GET request; params lets requests URL-encode the keyword
    response = requests.get(
        "https://www.baidu.com/s",
        params={"wd": keyword},
        headers=headers,
        timeout=10,
    )

    # Check whether the request succeeded
    if response.status_code == 200:
        # Parse the response body
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the headings that hold the search results
        search_results = soup.find_all('h3', class_='t')

        # Extract the title and link of each result
        results = []
        for result in search_results:
            title = result.get_text(strip=True)
            link_tag = result.find('a', href=True)
            if link_tag is None:
                continue  # skip headings that carry no link
            results.append({'title': title, 'link': link_tag['href']})

        return results
    else:
        print("Request failed!")
        return None

# Test the crawler
keyword = "编程"  # "programming"
search_results = baidu_search(keyword)
if search_results:
    print(f"Search results for '{keyword}':")
    for idx, result in enumerate(search_results, 1):
        print(f"{idx}. {result['title']}")
        print(f"   Link: {result['link']}")
        print()
else:
    print("No search results were retrieved.")
