[Solved] Filtering HTML tags in Python (while keeping the content inside the tags)


weixin_39645019


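Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. Please include a link to the original source and this notice when reposting.

Article link: https://blog.csdn.net/weixin_39645019/article/details/111434796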


[Problem]

I have already obtained the corresponding soup with BeautifulSoup in Python:

LINE 253 : INFO foundDescription=

BAD CREDIT

NO CREDIT

NO PROBLEM!!!

CALL/TEXT DAVID FOR MORE INFO AT 210-473-9820

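Now I want to get the content of the description and filter out the br and other tags inside it.

[Solution process]

1. Of course, the crudest, clumsiest way is to strip the br tags by hand with regular expressions.

But I wanted to find a better way.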

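2. Later, from:

I discovered that BeautifulSoup actually has a renderContents method, so I consulted the official documentation:

After finding the relevant explanation, I tried it:

descContents = foundDescription.renderContents()
logging.info("descContents=%s", descContents)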

The result was:

LINE 257 : INFO descContents=

BAD CREDIT

NO CREDIT

NO PROBLEM!!!

CALL/TEXT DAVID FOR MORE INFO AT 210-473-9820

The br tags are still there.
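Since renderContents returns the markup inside the tag, br tags included, a different tool is needed to get text only. As a rough sketch that does not rely on BeautifulSoup at all, the standard library's html.parser can collect just the text nodes. The class name TextOnlyParser, the helper strip_all_tags, and the sample string are illustrative assumptions, not part of the original code.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect only the text nodes of a fragment, discarding every tag."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are never stored.
        self.parts.append(data)


def strip_all_tags(fragment, separator=""):
    """Return the text of an HTML fragment with all tags removed."""
    parser = TextOnlyParser()
    parser.feed(fragment)
    parser.close()
    return separator.join(parser.parts)


# Hypothetical input modelled on the description text in the log output.
sample = "BAD CREDIT<br/>NO CREDIT<br>NO PROBLEM!!!"
print(strip_all_tags(sample, separator="\n"))
```

This removes every tag rather than a chosen few, so it only fits cases where no markup needs to survive.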

3. So I gave up on that and, for now, fell back to handling it by hand with regular expressions.

Written as:

descContents = crifanLib.soupContentsToUnicode(foundDescription.contents)
#descContents = foundDescription.renderContents()
logging.info("descContents=%s", descContents)

descHtmlDecoded = crifanLib.decodeHtmlEntity(descContents)
logging.info("descHtmlDecoded=%s", descHtmlDecoded)

# remove both the <br> and the <br /> form
descHtmlFiltered = re.sub(r"<br\s*>", "", descHtmlDecoded)
descHtmlFiltered = re.sub(r"<br\s*/>", "", descHtmlFiltered)
logging.info("descHtmlFiltered=%s", descHtmlFiltered)

The result:

LINE 262 : INFO descHtmlFiltered=

BAD CREDIT

NO CREDIT

NO PROBLEM!!!

CALL/TEXT DAVID FOR MORE INFO AT 210-473-9820

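This basically meets the need here.

I'll leave it at that for now.

When something more complex comes up, I'll look for a better approach.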
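The decode-then-filter steps can be sketched with the standard library alone. This is only an approximation of the approach: html.unescape stands in for crifanLib.decodeHtmlEntity on the assumption that the latter decodes HTML entities, and the helper name clean_description and the single combined br pattern are hypothetical.

```python
import html
import re

# Matches <br>, <br/>, <br /> and <BR ...> in any letter case,
# but not other tags that merely start with "br".
BR_TAG = re.compile(r"<br\b[^<>]*>", flags=re.IGNORECASE)


def clean_description(raw_html):
    """Decode HTML entities, then drop br tags (hypothetical helper)."""
    decoded = html.unescape(raw_html)
    return BR_TAG.sub("", decoded)


print(clean_description("BAD CREDIT<br>NO CREDIT<BR />NO PROBLEM!!!"))
```

The order mirrors the steps described here. Stripping the tags before decoding would avoid treating an escaped `&lt;br&gt;` in the text as a real tag, if that distinction matters for the data at hand.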

[Summary]

For now, HTML tags still have to be handled with regular expressions.

[Postscript 2013-05-03]

1. Later I experimented further:

VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br', 'a']

soup = BeautifulSoup(origHtml)
for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.hidden = True

filteredHtml = soup.renderContents()
logging.info("processed, filteredHtml=%s", filteredHtml)

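As a result, this only filtered out the tags that are not on the whitelist; it did not remove the tags while keeping the content inside them.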
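For comparison, a whitelist filter that drops disallowed tags but keeps their text can be sketched with the standard library's html.parser. This is not the BeautifulSoup approach tried above; the class WhitelistFilter and the function filter_html are hypothetical names, and the sketch ignores comments and does not treat script or style content specially.

```python
from html.parser import HTMLParser

VALID_TAGS = {"strong", "em", "p", "ul", "li", "br", "a"}


class WhitelistFilter(HTMLParser):
    """Re-emit allowed tags verbatim; drop other tags but keep their text."""

    def __init__(self, valid_tags):
        # Keep entities undecoded so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.valid_tags = valid_tags
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.valid_tags:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>; emit it once, as written.
        if tag in self.valid_tags:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.valid_tags:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is always kept, whichever tag it was inside.
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def filter_html(orig_html, valid_tags=VALID_TAGS):
    parser = WhitelistFilter(valid_tags)
    parser.feed(orig_html)
    parser.close()
    return "".join(parser.out)


print(filter_html("<div><b>Brooklyn</b>, <strong>NY</strong><br/>11220</div>"))
```

Passing an empty set as valid_tags strips every tag and leaves only the text.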

2. In the end I had to remove the tags by hand while keeping their contents:

import re

def filterHtmlTag(origHtml):
    """
    filter html tag, but retain its contents
    eg:
    Brooklyn, NY 11220
    Brooklyn, NY 11220

    Bayridgenissan42@yahoo.com
    Bayridgenissan42@yahoo.com

    stores.ebay.com
    stores.ebay.com

    www.carfaxonline.com
    www.carfaxonline.com
    """
    #logging.info("html tag, origHtml=%s", origHtml)
    filteredHtml = origHtml

    #Method 1: auto remove tag use re
    #remove br
    filteredHtml = re.sub(r"<br\s*>", "", filteredHtml, flags=re.I)
    filteredHtml = re.sub(r"<br\s*/>", "", filteredHtml, flags=re.I)
    #logging.info("remove br, filteredHtml=%s", filteredHtml)

    #remove a
    filteredHtml = re.sub(r"<a\s+[^<>]+>(?P<aContent>[^<>]+?)</a>", r"\g<aContent>", filteredHtml, flags=re.I)
    #logging.info("remove a, filteredHtml=%s", filteredHtml)

    #remove b,strong (flags is passed by keyword: the fourth positional argument of re.sub is count)
    filteredHtml = re.sub(r"<b>(?P<bContent>[^<>]+?)</b>", r"\g<bContent>", filteredHtml, flags=re.I)
    filteredHtml = re.sub(r"<strong>(?P<strongContent>[^<>]+?)</strong>", r"\g<strongContent>", filteredHtml, flags=re.I)
    #logging.info("remove b,strong, filteredHtml=%s", filteredHtml)

    return filteredHtml
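One detail matters when calling re.sub with flags as in this function: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a flag passed as the fourth positional argument is read as count, and the match stays case-sensitive. The short sketch below illustrates this with a made-up input string; it writes count=int(re.I) explicitly to show what the positional form amounts to.

```python
import re

html_text = "one<BR>two<BR>three"

# re.I has the integer value 2, so re.sub(p, r, s, re.I) means count=2
# with no flags at all: the lowercase pattern never matches "<BR>".
as_count = re.sub("<br>", "", html_text, count=int(re.I))

# Passing the flag by keyword enables case-insensitive matching.
as_flags = re.sub("<br>", "", html_text, flags=re.I)

print(as_count)  # one<BR>two<BR>three
print(as_flags)  # onetwothree
```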

3. I will keep updating this function going forward.
