Scrapy select HTML: how to extract raw HTML from a Scrapy selector?

Latest recommended article published on 2024-08-25 05:45:00

嘻嘻哈哈哦哦吧


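Short answer: the Scrapy/Parsel selector methods .re() and .re_first() replace HTML entities (except &lt; and &amp;).

Instead, use .extract() or .extract_first() to get the raw HTML (or the raw JavaScript instructions), then apply Python's re module to the extracted string.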
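As a minimal sketch of that advice: the string below stands in for what .extract_first() might return for the script's text node, and the regex and the helper name first_array_item are illustrative assumptions, not part of Scrapy or Parsel.

```python
import re

# Hypothetical raw text, standing in for the result of something like
#   raw_js = response.xpath('//div/script/text()').extract_first()
# inside a spider callback. Note that the entity &#39; is still present.
raw_js = "\n        var i = {a:['O&#39;Connor Park']}\n    "

# Illustrative pattern (an assumption): capture what sits between a:[' and '].
# The greedy .* tolerates apostrophes inside the value.
ITEM_RE = re.compile(r"a:\['(.*)'\]")


def first_array_item(js_text):
    """Return the quoted item of the `a` array, or None if it is absent."""
    match = ITEM_RE.search(js_text)
    return match.group(1) if match else None


park = first_array_item(raw_js)
print(park)  # O&#39;Connor Park
```

Because the regex runs on the plain extracted string, nothing decodes the entity along the way.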

Long answer:

Let's look at a sample input and the various ways of extracting JavaScript data from HTML.

Sample HTML:

<html lang="en">
<body>
<div>
    <script type="text/javascript">
        var i = {a:['O&#39;Connor Park']}
    </script>
</div>
</body>
</html>

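Using a Scrapy selector (which uses the parsel library underneath), you can extract the JavaScript snippet in several ways:

>>> import scrapy
>>> t = """<html lang="en">
... <body>
... <div>
...     <script type="text/javascript">
...         var i = {a:['O&#39;Connor Park']}
...     </script>
...
... </div>
... </body>
... </html>
... """
>>> selector = scrapy.Selector(text=t, type="html")
>>>
>>> # extracting the <script> element
>>> selector.xpath('//div/script').extract_first()
u'<script type="text/javascript">\n        var i = {a:[\'O&#39;Connor Park\']}\n    </script>'
>>>
>>> # only getting the text node inside the <script> element
>>> selector.xpath('//div/script/text()').extract_first()
u"\n        var i = {a:['O&#39;Connor Park']}\n    "
>>>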

Now, using .re (or .re_first) gives you a different result:

>>> # I'm using a very simple "catch-all" regex
>>> # you are probably using a regex to extract
>>> # that specific "O'Connor Park" string
>>> selector.xpath('//div/script/text()').re_first('.+')
u"        var i = {a:['O'Connor Park']}"
>>>
>>> # .re() on the element itself, one needs to handle newlines
>>> selector.xpath('//div/script').re_first('.+')
u'<script type="text/javascript">'
>>> import re
>>> selector.xpath('//div/script').re_first(re.compile('.+', re.DOTALL))
u'<script type="text/javascript">\n        var i = {a:[\'O\'Connor Park\']}\n    </script>'
>>>

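The HTML entity &#39; has been replaced by an apostrophe. This is due to the w3lib.html.replace_entities() call in the .re/re_first implementation (see the parsel source code, in the extract_regex function), which is not used when simply calling extract() or extract_first().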
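To make that mechanism concrete, here is a rough standard-library approximation of the behaviour: run the regex first, then decode every entity in the match except lt and amp. The names replace_entities_sketch and re_first_sketch are hypothetical, and this is not parsel's or w3lib's actual code, only a sketch assuming the behaviour described above.

```python
import html
import re

# Entities left untouched, mirroring the "except &lt; and &amp;" rule.
KEEP = {"lt", "amp"}

# Matches named (&gt;), decimal (&#39;) and hex (&#x27;) entity references.
_ENTITY_RE = re.compile(r"&(#?[A-Za-z0-9]+);")


def replace_entities_sketch(text):
    """Decode HTML entities in `text`, leaving the ones in KEEP as they are."""
    def _decode(match):
        if match.group(1).lower() in KEEP:
            return match.group(0)
        # html.unescape returns unknown references unchanged.
        return html.unescape(match.group(0))
    return _ENTITY_RE.sub(_decode, text)


def re_first_sketch(pattern, text):
    """Simplified stand-in for .re_first(): first match, entities decoded."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return replace_entities_sketch(match.group(0))


raw = "var i = {a:['O&#39;Connor Park']}"
print(re_first_sketch(".+", raw))  # var i = {a:['O'Connor Park']}
```

The decoding happens after matching, which is why the regex still sees the raw text but the returned string no longer contains the entity. Newer parsel releases also appear to accept replace_entities=False on .re() and .re_first(), which would skip this step; check the documentation for the version you have installed.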
