Getting the text of HTML tags with Scrapy: how do I extract all the plain text from a website?

Latest recommended article published on 2023-04-20 20:33:53

weixin_39633134


小编典典

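The simplest option is to extract //body//text() and join everything that is found:

''.join(sel.xpath("//body//text()").extract()).strip()

Here sel is a Selector instance; older Scrapy releases spelled the same call sel.select().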

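Another option is to use nltk's clean_html(). This applies to NLTK 2.x only: in NLTK 3 the function is no longer implemented and raises NotImplementedError, with a message recommending BeautifulSoup's get_text() instead.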

>>> import nltk
>>> html = """
...
...
...
... I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
... With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
...
...
"""
>>> nltk.clean_html(html)
"I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.\nWith xpath('//body//text()'...

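Where nltk.clean_html() is not available, a similar tag-stripping step can be sketched with the standard library alone. This is a swapped-in technique based on html.parser, not the nltk implementation; the class TextExtractor, the function strip_tags, and the sample string are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def strip_tags(html):
    """Return the text content of an HTML fragment as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<div><p>Only the <code>text</code> is kept.</p><script>x = 1</script></div>"
print(strip_tags(sample))
```

Because every text node is stripped and then joined with a single space, inline markup inside a word (for example foo<b>bar</b>) would come out as two words; that trade-off keeps the sketch short.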