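Problem description:
While writing a downloader middleware, my plan was to intercept each request that Scrapy sends, build a fake response myself with HtmlResponse, and hand it back to Scrapy's engine in place of a downloaded one. However, instantiating HtmlResponse raised the following error: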
Traceback (most recent call last):
  File "e:\anaconda3\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "e:\anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 37, in process_request
    response = yield method(request=request, spider=spider)
  File "E:\Scrapy\Jianshu\Jianshu\middlewares.py", line 128, in process_request
    response = HtmlResponse(url=self.browser.current_url, body=url_src, request=request)
  File "e:\anaconda3\lib\site-packages\scrapy\http\response\text.py", line 31, in __init__
    super(TextResponse, self).__init__(*args, **kwargs)
  File "e:\anaconda3\lib\site-packages\scrapy\http\response\__init__.py", line 22, in __init__
    self._set_body(body)
  File "e:\anaconda3\lib\site-packages\scrapy\http\response\text.py", line 47, in _set_body
    type(self).__name__)
TypeError: Cannot convert unicode body - HtmlResponse has no encoding
The code is as follows:
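import time

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import WebDriverException


class JianshuSeleniumDownloaderMiddleware(object):
    def __init__(self):
        # Raw string so the backslashes in the Windows path are taken literally.
        self.browser = webdriver.Chrome(r'E:\Scrapy\chromedriver.exe')

    def process_request(self, request, spider):
        self.browser.get(request.url)
        try:
            # Keep clicking "show more". find_element_by_class_name raises
            # once the button is gone, which ends the loop.
            while True:
                show_more = self.browser.find_element_by_class_name("show-more")
                time.sleep(1)
                show_more.click()
        except WebDriverException:
            pass
        url_src = self.browser.page_source  # a str, not bytes
        print(url_src)
        # No encoding is passed here, which triggers the TypeError above.
        response = HtmlResponse(url=self.browser.current_url, body=url_src, request=request)
        return response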
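Problem analysis:
HtmlResponse has no encoding.
The traceback points to the exact place in the Scrapy source that raised the error, so I opened it and traced the logic. The relevant source is: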
def _set_body(self, body):
    self._body = b''  # used by encoding detection
    if isinstance(body, six.text_type):
        if self._encoding is None:
            raise TypeError('Cannot convert unicode body - %s has no encoding' %
                            type(self).__name__)
        self._body = body.encode(self._encoding)
    else:
        super(TextResponse, self)._set_body(body)
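Error message: Cannot convert unicode body - HtmlResponse has no encoding
So in the check below, self._encoding is None: the middleware passes body as a str (Selenium's page_source) and never tells HtmlResponse which encoding to use.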
if self._encoding is None:
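The three paths through that check can be tried without Scrapy or a browser. The following is only a simplified stand-in for the logic quoted above: `MiniTextResponse`, `try_build`, and the sample `html` string are hypothetical names made up for this sketch, not part of Scrapy.

```python
class MiniTextResponse:
    """Simplified stand-in for the body handling quoted above (not Scrapy's class)."""

    def __init__(self, body, encoding=None):
        self._encoding = encoding
        self._set_body(body)

    def _set_body(self, body):
        if isinstance(body, str):
            # A str body must be encoded to bytes, which needs an encoding.
            if self._encoding is None:
                raise TypeError(
                    "Cannot convert unicode body - %s has no encoding"
                    % type(self).__name__
                )
            self._body = body.encode(self._encoding)
        else:
            # A bytes body is stored unchanged, so no encoding is needed here.
            self._body = body


def try_build(body, encoding=None):
    """Return the stored bytes, or the error message if construction fails."""
    try:
        return MiniTextResponse(body, encoding=encoding)._body
    except TypeError as exc:
        return str(exc)


html = "<h1>简书</h1>"  # made-up sample standing in for page_source
print(try_build(html))                  # str body, no encoding: error message
print(try_build(html, "utf-8"))         # str body plus encoding: bytes
print(try_build(html.encode("utf-8")))  # bytes body: stored as-is
```

Only the first call hits the `TypeError` branch, which matches what the middleware did: a str body with no encoding.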
So the fix is to pass encoding when creating the HtmlResponse object.
The modified code is as follows:
response = HtmlResponse(url=self.browser.current_url, body=url_src, request=request, encoding="utf-8")
After running it again, the error no longer occurs.
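To make the effect of `encoding="utf-8"` concrete, here is a small sketch of the `body.encode(self._encoding)` step from the Scrapy source, applied to a made-up page. The `encode_body` helper and the `page_source` string are assumptions for illustration only. It also hints at why UTF-8 is a reasonable choice for pages that contain Chinese text.

```python
def encode_body(body: str, encoding: str) -> bytes:
    """Hypothetical helper mirroring the body.encode(self._encoding) step."""
    return body.encode(encoding)


# Made-up stand-in for self.browser.page_source, which Selenium returns as str.
page_source = "<html><body><h1>简书</h1></body></html>"

# UTF-8 can represent every character, so the text survives a round trip.
utf8_bytes = encode_body(page_source, "utf-8")
round_trip_ok = utf8_bytes.decode("utf-8") == page_source

try:
    encode_body(page_source, "ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # ASCII cannot represent the Chinese characters in the page.
    ascii_ok = False

print(round_trip_ok, ascii_ok)
```

The encoding passed to HtmlResponse should therefore be one that can represent every character the browser returns.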