Removing br tags with Python PyQuery: a detailed look at a format-cleaning tool based on XPath selectors, PyQuery, and regular expressions

Latest recommended article published on 2021-06-05 07:15:23

weixin_39722965

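Tags: python pyquery remove br

Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_39722965/article/details/113647539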

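This article shows how to clean HTML content in Python with XPath, PyQuery, and regular expressions. First, XPath removes unwanted elements such as iframe and form. Next, PyQuery processes tag attributes: it completes image URLs and deletes all other attributes. Finally, regular expressions clean up leftover spaces and line breaks. The article walks through the cleaning process and provides a complete cleaning tool class.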

Summary generated by CSDN using intelligent technology
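The last step named in the summary, cleaning spaces and line breaks with regular expressions, does not appear in the visible part of the article. As a rough illustration only, here is a minimal sketch of such a step. The function name `normalize_whitespace` and the exact rules (which characters count as spaces, collapsing blank lines to a single newline) are assumptions, not the author's implementation.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces and blank lines (hypothetical helper)."""
    # Normalize line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Treat non-breaking and ideographic (full-width) spaces as plain spaces.
    text = text.replace("\u00a0", " ").replace("\u3000", " ")
    # Runs of spaces or tabs become a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that sit directly before or after a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse consecutive newlines into one.
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()


cleaned = normalize_whitespace("  a \t b \r\n\r\n\n c\u3000d  ")
```

Whether blank lines should be collapsed entirely or kept as paragraph breaks depends on how the cleaned text is used downstream, so that rule in particular may need adjusting.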

1. Use XPath to remove unnecessary tag elements and tags with no content

import logging
import re

from lxml import etree

logger = logging.getLogger(__name__)


# The `self` parameter indicates a method of the cleaning tool class;
# the class definition itself is not part of this excerpt.
def xpath_clean(self, text: str, xpath_dict: dict) -> str:
    '''
    Remove unnecessary elements with XPath.
    :param text: html_content
    :param xpath_dict: XPath expressions of additional elements to remove
    :return: string type html_content
    '''
    # Copy the input so update() below does not mutate the caller's dict.
    remove_by_xpath = dict(xpath_dict) if xpath_dict else dict()
    # Elements removed by default; outside of extreme cases these always go.
    remove_by_xpath.update({
        '_remove_2': '//iframe',
        '_remove_4': '//button',
        '_remove_5': '//form',
        '_remove_6': '//input',
        '_remove_7': '//select',
        '_remove_8': '//option',
        '_remove_9': '//textarea',
        '_remove_10': '//figure',
        '_remove_11': '//figcaption',
        '_remove_12': '//frame',
        '_remove_13': '//video',
        '_remove_14': '//script',
        '_remove_15': '//style'
    })
    parser = etree.HTMLParser(remove_blank_text=True, remove_comments=True)
    selector = etree.HTML(text, parser=parser)
    # Routine removal: delete the unwanted tags.
    # Note: lxml's remove() also discards the removed element's tail text.
    for xpath in remove_by_xpath.values():
        for bad in selector.xpath(xpath):
            bad_string = etree.tostring(bad, encoding='utf-8',
                                        pretty_print=True).decode()
            logger.debug(f"clean article content : {bad_string}")
            bad.getparent().remove(bad)
    skip_tip = "name()='img' or name()='tr' or " \
               "name()='th' or name()='tbody' or " \
               "name()='thead' or name()='table'"
    # Check every element except the skipped tags (not only <p>) and
    # remove those that have no content.
    for p in selector.xpath(f"//*[not({skip_tip})]"):
        # Skip the root (no parent to remove it from), elements that contain
        # a skipped tag, and elements with any non-whitespace text.
        if p.getparent() is None or \
                p.xpath(f".//*[{skip_tip}]") or \
                bool(re.sub(r'\s', '', p.xpath('string(.)'))):
            continue
        bad_p = etree.tostring(p, encoding='utf-8',
                               pretty_print=True).decode()
        logger.debug(f"clean p tag : {bad_p}")
        p.getparent().remove(p)
    return etree.tostring(selector, encoding='utf-8',
                          pretty_print=True).decode()
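One caveat with the removal calls above: in lxml, `parent.remove(child)` also discards the child's tail text, meaning the text that follows the element inside its parent. For an empty inline tag such as `<br>`, that can silently delete the text after it. The sketch below shows the effect and one possible workaround. The helper name `remove_keep_tail` is hypothetical and not part of the original tool.

```python
from lxml import etree


def remove_keep_tail(node):
    """Remove `node` from the tree but keep the text that follows it."""
    parent = node.getparent()
    if parent is None:  # the root element has no parent to remove it from
        return
    tail = node.tail or ""
    prev = node.getprevious()
    if prev is not None:
        # Attach the tail to the previous sibling's tail text.
        prev.tail = (prev.tail or "") + tail
    else:
        # No previous sibling: attach it to the parent's leading text.
        parent.text = (parent.text or "") + tail
    parent.remove(node)


# Plain remove(): the text after <br> is lost together with the element.
plain = etree.HTML("<p>hello<br>world</p>")
br = plain.xpath("//br")[0]
br.getparent().remove(br)
plain_result = etree.tostring(plain, encoding="unicode")

# Tail-preserving removal: the text after <br> survives.
kept = etree.HTML("<p>hello<br>world</p>")
for br in kept.xpath("//br"):
    remove_keep_tail(br)
kept_result = etree.tostring(kept, encoding="unicode")
```

The helper joins the surrounding text directly. If a `<br>` should become a space or a newline instead, that separator would have to be added when the tail is reattached.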

2. Use PyQuery to clean tag attributes and return the processed source and plain text

#!/usr/bin/env python

# -*-coding:utf-8-*-

from pyquery import PyQue
