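Short answer: the Scrapy/Parsel selector methods .re() and .re_first() replace HTML entities (except &lt; and &amp;).
Instead, use .extract() or .extract_first() to get the raw HTML (or raw JavaScript instructions), and apply Python's re module to the extracted string.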
Long answer:
Let's look at a sample input and several ways of extracting JavaScript data from HTML.
Sample HTML:
<div>
    <script>
        var i = {a:['O&#39;Connor Park']}
    </script>
</div>
Using a Scrapy Selector (which uses the parsel library underneath), you have several ways to extract the JavaScript snippet:
>>> import scrapy
>>> t = """
... <div>
...     <script>
...         var i = {a:['O&#39;Connor Park']}
...     </script>
... </div>
... """
>>> selector = scrapy.Selector(text=t, type="html")
>>>
>>> # extracting the <script> element
>>> selector.xpath('//div/script').extract_first()
u"<script>\n        var i = {a:['O&#39;Connor Park']}\n    </script>"
>>>
>>> # only getting the text node inside the <script>
>>> selector.xpath('//div/script/text()').extract_first()
u"\n        var i = {a:['O&#39;Connor Park']}\n    "
>>>
Now, using .re() (or .re_first()) gives a different result:
>>> # I'm using a very simple "catch-all" regex
>>> # you are probably using a regex to extract
>>> # that specific "O'Connor Park" string
>>> selector.xpath('//div/script/text()').re_first('.+')
u"        var i = {a:['O'Connor Park']}"
>>>
>>> # .re() on the element itself, one needs to handle newlines
>>> selector.xpath('//div/script').re_first('.+')
u'<script>'
>>> import re
>>> selector.xpath('//div/script').re_first(re.compile('.+', re.DOTALL))
u"<script>\n        var i = {a:['O'Connor Park']}\n    </script>"
>>>
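The HTML entity &#39; has been replaced by an apostrophe. This is caused by an entity-replacement call inside the .re()/.re_first() implementation (see the parsel source code), which is not used when simply calling extract() or extract_first().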
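The workaround from the short answer could look roughly like the sketch below. It is only an illustration, with several assumptions: the raw string is hard-coded to stand in for what extract_first() would return on the text node (so the snippet runs without Scrapy), the apostrophe entity is assumed to be &#39;, and the names raw_js, pattern, park_raw and park_decoded are made up for this example.

```python
import html
import re

# Assumption: stand-in for the value returned by
# selector.xpath('//div/script/text()').extract_first()
# on the sample input; hard-coded so no Scrapy install is needed.
raw_js = "\n        var i = {a:['O&#39;Connor Park']}\n    "

# Hypothetical pattern: capture the text between the quotes of the
# single array item. Because the entity is still encoded, the value
# contains no bare apostrophe that could confuse the quoting.
pattern = re.compile(r"a:\['(.*?)'\]")

match = pattern.search(raw_js)
park_raw = match.group(1) if match else None

# Decode entities explicitly, and only when the decoded text is wanted.
park_decoded = html.unescape(park_raw) if park_raw is not None else None

print(park_raw)
print(park_decoded)
```

Running the regex on the extracted string keeps entity decoding as a separate, explicit step, so you decide whether and when it happens instead of getting it as a side effect of .re().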