Python web scraping: converting scraped content into a Word file and returning it

迷路的小鹿斑比_Perry

Tags: python, web scraping, word

Article link: https://blog.csdn.net/m0_63229258/article/details/132439189

from docx import Document
import requests
from bs4 import BeautifulSoup

def scrape_and_save_as_word(url):
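    response = requests.get(url, timeout=10)  # avoid hanging forever on a slow server
    response.raise_for_status()  # stop early on HTTP errors such as 404 or 500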
    soup = BeautifulSoup(response.text, 'html.parser')
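    # Scrape and parse the page here to extract the content you need
    # The extracted text is accumulated in the variable content

    content = ""
    # Example: collect the text of every h2 heading on the xixixixhahah.com page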
    headings = soup.find_all('h2')
    for heading in headings:
        content += heading.text + "\n"

    doc = Document()
    doc.add_paragraph(content)
    
    output_file = "scraped_document.docx"
    doc.save(output_file)  # save the Word document

    return output_file  # return the file name

# Example usage
url = 'https://xixixiixhahah.com/'  # URL of the target page
word_document = scrape_and_save_as_word(url)
print("Word document saved as:", word_document)

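Because output_file is a relative path, the Word file is written to the current working directory. That is the same directory as the main script when the script is run from there.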
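
If the script may be launched from a different directory, the save location can be made explicit rather than left to the working directory. The following is a minimal sketch, not part of the original code: the helper name resolve_output_path and its base_dir parameter are hypothetical, introduced only for illustration, and it uses only the standard-library pathlib module.

```python
from pathlib import Path


def resolve_output_path(filename, base_dir=None):
    """Return an absolute path for `filename` inside `base_dir`.

    Hypothetical helper: falls back to the current working directory
    when `base_dir` is None.
    """
    base = Path(base_dir).resolve() if base_dir is not None else Path.cwd()
    base.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    return base / filename


# Possible usage inside scrape_and_save_as_word (an assumption, not the original code):
#     output_file = resolve_output_path("scraped_document.docx", Path(__file__).parent)
#     doc.save(str(output_file))
```

Passing Path(__file__).parent as base_dir pins the output next to the script itself, so the function can also return an absolute path that remains valid regardless of where the caller runs.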