Extracting the Content of .htm Files in Python with NLTK's clean_html

Latest recommended article published on 2024-08-10 07:01:01

夜月xl

Category: Python. Tags: python, html, extracting htm content

Article link: https://blog.csdn.net/u013045749/article/details/49639809

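This article shows how to use Python's NLTK library, together with other tools, to strip irrelevant tags from HTML files and extract their text content, which is useful for data preprocessing and information extraction.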

Summary generated automatically by CSDN.
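
A side note for readers on newer NLTK releases: in NLTK 3.x, calling nltk.clean_html raises NotImplementedError and points to BeautifulSoup's get_text() instead. If you would rather avoid third-party dependencies, the same kind of cleaning (drop script/style content, comments and tags, then normalize whitespace) can be sketched with the standard library's html.parser. This is only a minimal sketch and not the NLTK implementation; the names TextExtractor and html_to_text are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._chunks = []      # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method, so they are dropped automatically.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # str.split() with no arguments splits on any whitespace run,
        # including the non-breaking space produced by &nbsp;.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (illustrative only)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{color:red}</style></head>"
        "<body><!-- a > b --><p>Hello&nbsp;<b>world</b></p></body></html>"
    )
    print(html_to_text(sample))
```

A real parser copes better with malformed or nested markup than regular expressions do, but for simple, well-formed pages a regex-based approach is usually enough.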

import os
import codecs
# import nltk
import re
from pdf_extract import extract_pattern

# clean_html is a library function from NLTK (copied here, so nltk itself is not imported)
def clean_html(html):
    """
    Copied from NLTK package.
    Remove HTML markup from the given string.

    :param html: the HTML string to be cleaned
    :type html: str
    :rtype: str
    """

    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    # Finally, we deal with whitespace
    cleaned = re.sub(r"&nbsp;", " ", cleaned)