Reading a List of URLs from an Excel Sheet, Crawling Page Titles with Baiduspider, and Saving the Results to a New Excel File

懒员员

Last modified 2024-02-01 11:37:08


First published 2024-01-26 16:47:07

Article link: https://blog.csdn.net/weixin_44523517/article/details/135869663


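This article shows how to use a Python script, together with the xlrd, requests, BeautifulSoup, and openpyxl libraries, to read URLs from an Excel sheet, fetch the content of each page's <title> tag, identify Baiduspider by its UA information, and save the results to a new Excel file.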


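Baiduspider, also called the Baidu spider, is an automated program run by the Baidu search engine. It visits pages on the internet and builds an index database so that users can find a site's content through Baidu search.
There are two ways to identify Baiduspider.
Method 1: check the UA (User-Agent) information.
Method 2: verify with a two-way DNS lookup (a reverse lookup followed by a forward lookup).
This article uses method 1: checking the UA information.
If the UA information does not match, the visitor can be ruled out as a Baidu search spider right away. UAs currently fall into three scenarios: mobile, PC, and mini program.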
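
The two identification methods can be sketched roughly as follows. This is a minimal sketch, not Baidu's official procedure: the function names `looks_like_baiduspider` and `verify_baiduspider_ip` are hypothetical, and the assumption that genuine Baiduspider hostnames end in `.baidu.com` or `.baidu.jp` should be checked against Baidu's current webmaster documentation. The DNS lookups are passed in as parameters so the logic can be tried without network access.

```python
import socket

def looks_like_baiduspider(user_agent):
    """Method 1: cheap check that the UA string contains the Baiduspider token."""
    return "baiduspider" in user_agent.lower()

def verify_baiduspider_ip(ip, reverse_lookup=None, forward_lookup=None):
    """Method 2: reverse DNS on the IP, then forward DNS back to the same IP.

    Assumption: genuine hostnames end in .baidu.com or .baidu.jp.
    """
    # Default to real DNS lookups; callers may inject fakes for offline use.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip).lower().rstrip(".")
        if not hostname.endswith((".baidu.com", ".baidu.jp")):
            return False
        # The hostname must resolve back to the IP that made the request.
        return ip in forward_lookup(hostname)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return False
```

A UA string is trivial to forge, so method 1 can only rule visitors out; method 2 is the one that confirms a visitor.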

The Python sample code below crawls the URLs in a given Excel sheet, extracts the content of each page's <title> tag, and saves the results to a new Excel sheet.

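1. Install the required libraries:

xlrd: reads Excel files in .xls format.
requests: sends HTTP requests; widely used in web crawlers and web development.
BeautifulSoup: parses HTML and XML documents and offers a simple, effective way to navigate, search, and modify the document tree.
openpyxl: reads and writes Excel files in .xlsx format; it can create, modify, and save Excel files.
tqdm: adds progress bars to Python loops so that the progress of a task is visible at a glance.

All of these are third-party Python libraries that can be installed with pip. If you have not installed them yet, use the following command: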

pip install xlrd requests beautifulsoup4 openpyxl tqdm
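
To confirm the installation before running the crawler, something like the following can help. It is a small sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` mapping are hypothetical. Note that the pip package `beautifulsoup4` is imported under the module name `bs4`.

```python
import importlib.util

# Map of import name -> pip package name (they differ for BeautifulSoup).
REQUIRED = {
    "xlrd": "xlrd",
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "openpyxl": "openpyxl",
    "tqdm": "tqdm",
}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]

missing = missing_packages()
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```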

2. Complete code

Create a new file named xx.py:

import xlrd
import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook
from urllib.parse import urlparse
from tqdm import tqdm

# Mobile browser UA sent with every request (Chrome on Android).
# Note: this string does not contain the "Baiduspider" token.
baidu_mobile_ua = "Mozilla/5.0 (Linux; Android 10; SM-G960U) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.105 Mobile Safari/537.36"

def main():
    input_excel = "所有.xls"  # input Excel file name
    output_excel = "output.xlsx"  # output Excel file name
    output_workbook = Workbook()
    output_sheet = output_workbook.active
    workbook = xlrd.open_workbook(input_excel)
    sheet = workbook.sheet_by_index(0)
    total_cells = sheet.nrows * sheet.ncols
    progress_bar = tqdm(total=total_cells, desc="Processing URLs", unit="cell", ascii=True, bar_format="{desc}: {percentage:3.0f}%|{n}/{total}")
    # Walk every cell of the first sheet and treat its value as a URL
    for row_idx in range(sheet.nrows):
        for col_idx in range(sheet.ncols):
            # Cells may hold non-string values, so convert before parsing
            url = str(sheet.cell_value(row_idx, col_idx)).strip()
            if not url:
                progress_bar.update(1)  # count empty cells, but do not fetch them
                continue
            parsed_url = urlparse(url)
            if not parsed_url.scheme:
                url = "http://" + url
            try:
                # A timeout keeps one unresponsive site from stalling the whole run
                response = requests.get(url, headers={"User-Agent": baidu_mobile_ua}, timeout=10)
                response.raise_for_status()  # treat HTTP 4xx/5xx responses as errors
                # Use the declared charset if there is one; otherwise let BeautifulSoup detect it
                encoding = response.encoding if 'charset' in response.headers.get('content-type', '').lower() else None
                soup = BeautifulSoup(response.content, 'html.parser', from_encoding=encoding)
                title_tag = soup.find('title')
                title = title_tag.text.strip() if title_tag else 'No title'
            except Exception as e:
                # Covers requests.exceptions.RequestException and parsing errors alike
                title = f"Error: {e}"
            progress_bar.update(1)
            output_sheet.append([url, title])
    progress_bar.close()
    output_workbook.save(output_excel)
    print(f"Output file '{output_excel}' saved.")

if __name__ == "__main__":
    main()
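
The script prepends http:// whenever urlparse reports no scheme. Depending on the Python version, urlparse may treat the part before the colon in a value such as host:port as the scheme, so such a cell could be left without a prefix. A minimal alternative sketch checks the prefix directly; the helper name `normalize_url` is hypothetical and not part of the original script.

```python
def normalize_url(raw):
    """Return a fetchable URL for a cell value, or None for an empty cell."""
    url = str(raw).strip()
    if not url:
        return None
    # Only http and https are meaningful here, so test for them explicitly
    # instead of relying on scheme detection.
    if not url.lower().startswith(("http://", "https://")):
        url = "http://" + url
    return url
```

In the loop, a cell whose normalized value is None can simply be skipped.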

3. Run

Run the Python file with the command python xx.py; this produces an output.xlsx file.

python xx.py

4. Processing a single column only

Example: read the URL from column B of each row

import xlrd
import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook
from urllib.parse import urlparse
from tqdm import tqdm

# Mobile browser UA sent with every request (Chrome on Android).
# Note: this string does not contain the "Baiduspider" token.
baidu_mobile_ua = "Mozilla/5.0 (Linux; Android 10; SM-G960U) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.105 Mobile Safari/537.36"

def main():
    input_excel = "单列.xls"  # input Excel file name
    output_excel = "output.xlsx"  # output Excel file name
    output_workbook = Workbook()
    output_sheet = output_workbook.active
    workbook = xlrd.open_workbook(input_excel)
    sheet = workbook.sheet_by_index(0)
    total_cells = sheet.nrows
    progress_bar = tqdm(total=total_cells, desc="Processing URLs", unit="cell", ascii=True, bar_format="{desc}: {percentage:3.0f}%|{n}/{total}")

    # Walk every cell in column B, read the URL, and fetch the page
    for row_idx in range(sheet.nrows):
        # Column B has index 1; convert in case the cell holds a non-string value
        url = str(sheet.cell_value(row_idx, 1)).strip()
        if not url:
            progress_bar.update(1)  # count empty cells, but do not fetch them
            continue
        parsed_url = urlparse(url)
        if not parsed_url.scheme:
            url = "http://" + url
        try:
            # A timeout keeps one unresponsive site from stalling the whole run
            response = requests.get(url, headers={"User-Agent": baidu_mobile_ua}, timeout=10)
            response.raise_for_status()  # treat HTTP 4xx/5xx responses as errors
            # Use the declared charset if there is one; otherwise let BeautifulSoup detect it
            encoding = response.encoding if 'charset' in response.headers.get('content-type', '').lower() else None
            soup = BeautifulSoup(response.content, 'html.parser', from_encoding=encoding)
            title_tag = soup.find('title')
            title = title_tag.text.strip() if title_tag else 'No title'
        except Exception as e:
            # Covers requests.exceptions.RequestException and parsing errors alike
            title = f"Error: {e}"
        progress_bar.update(1)
        output_sheet.append([url, title])
    progress_bar.close()
    output_workbook.save(output_excel)
    print(f"Output file '{output_excel}' saved.")

if __name__ == "__main__":
    main()
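
The column is selected by a zero-based index (1 for column B). To pick a column by its Excel letter instead, a small conversion helper is enough. This is only a sketch; the name `column_index` is hypothetical and does not come from xlrd.

```python
def column_index(letter):
    """Convert an Excel column letter such as "B" or "AA" to a zero-based index."""
    index = 0
    for ch in letter.strip().upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column letter: {letter!r}")
        # Column letters are a base-26 number with digits A=1 .. Z=26.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    if index == 0:
        raise ValueError("empty column letter")
    return index - 1
```

With this helper, the cell lookup could be written as sheet.cell_value(row_idx, column_index("B")).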

5. Checking whether a <noscript> tag is present

The script outputs '×' if the tag is present and '√' if it is absent.

import xlrd
import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook
from urllib.parse import urlparse
from tqdm import tqdm

# Mobile browser UA sent with every request (Chrome on Android).
# Note: this string does not contain the "Baiduspider" token.
baidu_mobile_ua = "Mozilla/5.0 (Linux; Android 10; SM-G960U) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.105 Mobile Safari/537.36"

def main():
    input_excel = "noscript.xls"  # input Excel file name
    output_excel = "output.xlsx"  # output Excel file name
    output_workbook = Workbook()
    output_sheet = output_workbook.active
    workbook = xlrd.open_workbook(input_excel)
    sheet = workbook.sheet_by_index(0)
    total_cells = sheet.nrows
    progress_bar = tqdm(total=total_cells, desc="Processing URLs", unit="cell", ascii=True, bar_format="{desc}: {percentage:3.0f}%|{n}/{total}")

    # Walk every cell in column A, read the URL, and check whether <head> contains a <noscript> tag
    for row_idx in range(sheet.nrows):
        # Column A has index 0; convert in case the cell holds a non-string value
        url = str(sheet.cell_value(row_idx, 0)).strip()
        if not url:
            progress_bar.update(1)  # count empty cells, but do not fetch them
            continue
        parsed_url = urlparse(url)
        if not parsed_url.scheme:
            url = "http://" + url

        try:
            # A timeout keeps one unresponsive site from stalling the whole run
            response = requests.get(url, headers={"User-Agent": baidu_mobile_ua}, timeout=10)
            response.raise_for_status()  # treat HTTP 4xx/5xx responses as errors
            soup = BeautifulSoup(response.content, 'html.parser')
            head_tag = soup.find('head')
            has_noscript = False
            if head_tag:
                has_noscript = head_tag.find('noscript') is not None
            # '×' means a <noscript> tag was found inside <head>; '√' means none was found
            result = '×' if has_noscript else '√'
        except Exception as e:
            # Covers requests.exceptions.RequestException and parsing errors alike
            result = f"Error: {e}"
        progress_bar.update(1)
        output_sheet.append([url, result])
    progress_bar.close()
    output_workbook.save(output_excel)
    print(f"Output file '{output_excel}' saved.")

if __name__ == "__main__":
    main()
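
To try the <noscript> check on an HTML string without any network access, the same idea can be sketched with the standard-library html.parser module. This swaps BeautifulSoup for HTMLParser so the example has no third-party dependencies; the names `HeadNoscriptFinder` and `noscript_mark` are hypothetical, and treating a <body> start tag as the end of <head> is an assumption made for pages that omit the closing </head> tag.

```python
from html.parser import HTMLParser

class HeadNoscriptFinder(HTMLParser):
    """Record whether a <noscript> start tag appears while inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "body":
            # Assumption: <body> ends the head even if </head> is missing.
            self.in_head = False
        elif tag == "noscript" and self.in_head:
            self.found = True

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

def noscript_mark(html):
    """Return '×' if <head> contains a <noscript> tag, otherwise '√'."""
    finder = HeadNoscriptFinder()
    finder.feed(html)
    finder.close()
    return "×" if finder.found else "√"
```

A <noscript> tag that appears only inside <body> is reported as '√', which matches the script's behavior of searching within <head> only.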