Python: Cleaning Scraped HTML Content

苏寅

Article link: https://blog.csdn.net/qq_34562959/article/details/121631832

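Scenario

When writing a crawler in Python, you often need to clean the scraped data to filter out unwanted content. When the scraped result is plain text, a regular expression (re.sub()) is the usual tool. When the result is HTML, however, cleaning it with regular expressions alone tends to take twice the effort for half the result. So how should scraped HTML be cleaned?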

Code example

```
# -*- coding: utf-8 -*-
import scrapy
import htmlmin
from lxml import etree
from lxml import html
from html import unescape

# EXSLT namespace that lxml needs in order to resolve re:match() in XPath
RE_NS = {"re": "http://exslt.org/regular-expressions"}


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.gongkaoleida.com']
    start_urls = ['https://www.gongkaoleida.com/article/869186']
    # start_urls = ['https://www.gongkaoleida.com/article/869244']

    def parse(self, response):
        content = response.xpath('//article[@class="detail-info"]').get()
        if content is None:
            return  # the article container was not found on this page
        content = unescape(content.replace('\n', '').replace('\r', ''))
        tree = etree.HTML(content)
        # Tags whose text contains "公考雷达"
        str1 = tree.xpath('//p[contains(text(), "公考雷达")] | //a[contains(text(), "公考雷达")]/.. | //div[contains(text(), "公考雷达")]')
        # Tags whose text contains "附件：" or "附件:" or a common office file extension.
        # Raw strings keep \w intact; the dot is escaped so it only matches a literal "."
        str2 = tree.xpath(r'//a[re:match(text(), "附件(\w+)?(：|:)") or re:match(text(), "\.(doc|xls|ppt|pdf)")]/..', namespaces=RE_NS)
        str3 = tree.xpath(r'//p[re:match(text(), "^(附件)(\w+)?(：|:)") or re:match(text(), "\.(doc|xls|ppt|pdf)")]', namespaces=RE_NS)
        str4 = tree.xpath(r'//span[re:match(text(), "附件(\w+)?(：|:)") or re:match(text(), "\.(doc|xls|ppt|pdf)")]/../..', namespaces=RE_NS)
        str5 = tree.xpath(r'//em[re:match(text(), "附件(\w+)?(：|:)") or re:match(text(), "\.(doc|xls|ppt|pdf)")]/../..', namespaces=RE_NS)
        # Tags whose href contains "gongkaoleida"
        str6 = tree.xpath('//*[re:match(@href, "gongkaoleida")]', namespaces=RE_NS)
        # Clean: serialize each matched tag and remove it from the HTML string
        for i in str1 + str2 + str3 + str4 + str5 + str6:
            p1 = html.tostring(i)
            p2 = unescape(p1.decode('utf-8'))
            content = content.replace(p2, '')
        # Minify the remaining HTML
        content = htmlmin.minify(content, remove_empty_space=True, remove_comments=True)
        print(content)
```

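This approach combines XPath with regular expressions: XPath locates the unwanted tags, the EXSLT regex function re:match() filters them by text or attribute, and each matched tag is then serialized and removed from the HTML string.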
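The string replacement step only works when the serialized tag matches the original markup character for character. As a possible alternative, and not the method used in this article, the matched nodes can be removed directly from the parsed tree. The sketch below makes several assumptions: `UNWANTED`, `clean_fragment` and `SAMPLE` are hypothetical names and data, the rules are a shortened version of the patterns above, and it uses `re:test()` (the boolean EXSLT function) instead of `re:match()`.

```python
from lxml import html

RE_NS = {"re": "http://exslt.org/regular-expressions"}

# Hypothetical rule list, modelled on the patterns used in the spider.
UNWANTED = [
    '//p[contains(text(), "公考雷达")]',
    r'//a[re:test(text(), "\.(doc|xls|ppt|pdf)$")]/..',
    '//*[re:test(@href, "gongkaoleida")]',
]


def clean_fragment(fragment: str) -> str:
    """Remove every node matched by UNWANTED and return the remaining HTML."""
    # Wrap the fragment in a <div> so there is always a single root element.
    root = html.fragment_fromstring(fragment, create_parent='div')
    for rule in UNWANTED:
        for node in root.xpath(rule, namespaces=RE_NS):
            # Never drop the wrapper itself (it has no parent).
            if node.getparent() is not None:
                node.drop_tree()
    return html.tostring(root, encoding='unicode')


# Made-up sample markup for illustration only.
SAMPLE = (
    '<p>Announcement body</p>'
    '<p>来源：公考雷达</p>'
    '<p><a href="/files/1">positions.xls</a></p>'
    '<p><a href="https://www.gongkaoleida.com/x">more</a></p>'
)
result = clean_fragment(SAMPLE)
print(result)
```

Because `drop_tree()` edits the tree in place, this variant does not depend on how lxml escapes or re-serializes the removed tags.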

Notes

When you call the regex function re:match() in an lxml XPath expression, you must also pass the namespace mapping (namespaces={"re": "http://exslt.org/regular-expressions"}), for example:
```
str6 = tree.xpath('//*[re:match(@href, "gongkaoleida")]', namespaces={"re": "http://exslt.org/regular-expressions"})
```
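The following sketch shows what happens in lxml with and without the mapping. The sample markup and the names `expr`, `prefix_resolved` and `find_links` are assumptions for illustration, and the expression uses `re:test()`, the boolean counterpart of `re:match()`.

```python
from lxml import etree

RE_NS = {"re": "http://exslt.org/regular-expressions"}

# Made-up markup: one link to the site, one local link.
tree = etree.HTML(
    '<p><a href="https://www.gongkaoleida.com/a">x</a>'
    '<a href="/local">y</a></p>'
)
expr = '//a[re:test(@href, "gongkaoleida")]'

# Without the mapping, lxml cannot resolve the "re:" prefix and raises.
try:
    tree.xpath(expr)
    prefix_resolved = True
except etree.XPathEvalError:
    prefix_resolved = False

# With the mapping, the expression evaluates normally.
hrefs = [a.get("href") for a in tree.xpath(expr, namespaces=RE_NS)]

# A compiled XPath object carries the mapping, so it is only written once.
find_links = etree.XPath(expr, namespaces=RE_NS)
compiled_hrefs = [a.get("href") for a in find_links(tree)]

print(prefix_resolved, hrefs, compiled_hrefs)
```

When the same expression is evaluated many times, compiling it once with `etree.XPath` also avoids repeating the `namespaces=` argument on every call.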

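When you call re:match() through Scrapy's XPath selectors, no namespace needs to be passed, because Scrapy's selectors register the re prefix by default, for example: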

```
attachment_title_list = response.xpath(r'//a[re:match(text(), "\.(doc|xls|ppt|pdf)")]/text()').getall()
```
