Extracting all text from an HTML web page with a Python script

Article link: https://blog.csdn.net/lycwhu/article/details/145643377

You can use the BeautifulSoup library to extract all the text from an HTML web page. Here is an example script:

Steps

Install beautifulsoup4 and requests (if they are not already installed):
```
pip install beautifulsoup4 requests
```

Python script:

```python
import requests
from bs4 import BeautifulSoup

def extract_text_from_url(url):
    # Send an HTTP request to fetch the page (the timeout avoids hanging forever)
    response = requests.get(url, timeout=10)

    # Make sure the request succeeded
    if response.status_code != 200:
        print(f"Failed to fetch page: HTTP {response.status_code}")
        return None

    # Let requests guess the encoding from the body to avoid garbled text
    response.encoding = response.apparent_encoding

    # Parse the HTML
    soup = BeautifulSoup(response.text, 'html.parser')

    # Remove JavaScript and CSS so only visible text remains
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Collect all text, one text node per line
    text = soup.get_text(separator="\n", strip=True)
    return text

if __name__ == "__main__":
    url = "https://example.com"  # Replace with the page you want to scrape
    text = extract_text_from_url(url)
    if text:
        print(text)
```
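
The script only checks the status code, so a connection error, a timeout, or a malformed URL would still raise an exception. A minimal sketch of a more defensive fetch step is shown here. The helper name `fetch_html` and the 10-second default timeout are assumptions made for illustration, not part of the original script.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page's HTML as a string, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs, and HTTP errors
        print(f"Failed to fetch {url}: {exc}")
        return None
    # Guess the encoding from the body to avoid garbled text
    response.encoding = response.apparent_encoding
    return response.text
```

Calling `fetch_html("https://example.com")` should return the HTML string on success. Any request-level failure is reported and turned into `None`, so the caller can keep the same `if text:` style of check used above.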

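How the script works

requests.get(url): sends an HTTP request to fetch the page content.
BeautifulSoup(response.text, 'html.parser'): parses the HTML.
soup.get_text(separator="\n", strip=True): extracts all text, strips the whitespace around each piece, and joins the pieces with newlines.
The <script> and <style> tags are removed so that code unrelated to the page content is not included.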
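
The parsing steps above can be tried without any network access by feeding BeautifulSoup an HTML string directly. The following is a small sketch; the function name `extract_text_from_html` and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

def extract_text_from_html(html):
    """Return the visible text of an HTML string, one text node per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop JavaScript and CSS before collecting text
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

# Hypothetical sample page used only for this demonstration
sample_html = """
<html><body>
<h1>Hello</h1>
<p>First <b>bold</b> text.</p>
<script>var x = 1;</script>
<style>p { color: red; }</style>
</body></html>
"""

text = extract_text_from_html(sample_html)
print(text)
```

Note that with `separator="\n"` every text node ends up on its own line, so the sentence containing `<b>bold</b>` is split across three lines. If that is not what you want, passing `separator=" "` keeps inline text on one line.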

Example output

For https://example.com, the output might look like this:

Example Domain
This domain is for use in illustrative examples in documents.
...

If you want to extract the text of specific elements, such as <p> tags, you can use:

```python
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
```
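
As a slightly fuller sketch of targeted extraction, the example below collects paragraph text, narrows the search with a CSS selector, and gathers link targets. The sample markup and the `#content` id are assumptions for illustration. It also normalizes whitespace with `" ".join(...split())`, because `get_text(strip=True)` joins the stripped pieces with no separator and can glue inline text together (for example "Secondparagraph.").

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup used only for this demonstration
sample_html = """
<div id="content">
  <h2>Title</h2>
  <p>First paragraph.</p>
  <p>Second <a href="/more">paragraph</a>.</p>
</div>
<p>Footer note.</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

def clean_text(tag):
    """Return the tag's text with runs of whitespace collapsed to one space."""
    return " ".join(tag.get_text().split())

# Text of every <p> element in the document
all_paragraphs = [clean_text(p) for p in soup.find_all("p")]

# Only the <p> elements inside the element with id="content"
content_paragraphs = [clean_text(p) for p in soup.select("#content p")]

# The href attribute of every link that has one
links = [a["href"] for a in soup.select("a[href]")]

print(all_paragraphs)
print(content_paragraphs)
print(links)
```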

To handle complex pages (for example, content rendered by JavaScript) or sites with anti-scraping measures, consider selenium or scrapy.
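
Before reaching for a heavier tool, it may be enough to send a browser-like User-Agent header and retry transient failures with a `requests.Session`. This is only a sketch under assumptions: the helper name `make_session`, the User-Agent string, and the retry settings are illustrative choices, and whether they help depends on the site.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(user_agent="Mozilla/5.0 (compatible; text-extractor/0.1)"):
    """Build a Session with a custom User-Agent and automatic retries."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})

    # Retry up to 3 times on common transient status codes, with backoff
    retry = Retry(
        total=3,
        backoff_factor=0.5,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

session = make_session()
# Use it in place of requests.get, for example:
# response = session.get("https://example.com", timeout=10)
```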