Fixing incorrectly nested HTML tags in a Scrapy response

翻不了身的咸鱼6

Last modified: 2024-04-30 13:38:55

Tags: python, web scraping, scrapy

First published: 2024-04-30 13:32:14

Article link: https://blog.csdn.net/m0_50072238/article/details/138340634

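This article shows how to handle incorrectly nested HTML tags in a Scrapy project by overriding the `process_response` method and normalizing the markup with BeautifulSoup, and how to enable the downloader middleware so that the spider parses pages correctly.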

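Example: some sites are written carelessly, so the response may contain markup with missing closing tags like the following, and the spider then fails to parse it correctly:

<html>
	<body>
	<p>Example

The goal is to complete the missing tags before parsing:

<html>
	<body>
	<p>Example</p>
	</body>
</html>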
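To check whether a page actually has this problem before repairing it, a rough diagnostic can be written with the standard library alone. This is only a sketch and not part of the original fix: `OpenTagTracker` and `unclosed_tags` are hypothetical names, and the check simply reports the tags that are still open when the input ends.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class OpenTagTracker(HTMLParser):
    """Hypothetical helper: tracks which tags are currently open."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def unclosed_tags(html):
    """Return the tags that are still open at the end of the input."""
    tracker = OpenTagTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.stack


print(unclosed_tags("<html><body><p>Example"))                    # ['html', 'body', 'p']
print(unclosed_tags("<html><body><p>Example</p></body></html>"))  # []
```

An empty list for the second input indicates that every opened tag was closed, which is the state the repair step is meant to reach.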

1. Override the process_response method in the middleware

In the Scrapy project, open middlewares.py and locate the process_response method:

def process_response(self, request, response, spider):
    # Called with the response returned from the downloader.

    # Must either;
    # - return a Response object
    # - return a Request object
    # - or raise IgnoreRequest
    return response

Change the method to the following:

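# At the top of middlewares.py
# (html5lib must be installed: pip install beautifulsoup4 html5lib):
from bs4 import BeautifulSoup
from scrapy.http import HtmlResponse

    # Inside the downloader middleware class:
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest

        # Pass through empty bodies and anything that is not HTML.
        if not response.body or b'text/html' not in response.headers.get(b'Content-Type', b''):
            return response
        try:
            # Normalize the HTML with BeautifulSoup; the html5lib parser
            # fills in missing closing tags the way a browser would.
            # Note that prettify() also re-indents the markup.
            soup = BeautifulSoup(response.body, 'html5lib')
            cleaned_html = soup.prettify()

            # Build a new HtmlResponse that carries the cleaned markup
            # in place of the original body.
            new_response = HtmlResponse(
                url=response.url,
                status=response.status,
                headers=response.headers,
                body=cleaned_html.encode(response.encoding),
                request=request,
            )
            return new_response
        except Exception as e:
            # If cleaning fails, log the error and fall back to the original response.
            spider.logger.error(f"Error cleaning HTML: {e}")
            return response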
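The guard at the top of the method decides whether a response is HTML by searching the raw Content-Type bytes for `text/html`. The sketch below isolates that decision as a standalone function so its behavior on typical header values is easy to see. `looks_like_html` is a hypothetical helper, and the lowercasing is an added assumption to tolerate servers that send the media type in mixed case; the original check is case-sensitive.

```python
def looks_like_html(content_type: bytes) -> bool:
    """Return True when a raw Content-Type header value denotes HTML."""
    # Header values may carry parameters ("text/html; charset=utf-8"),
    # so compare only the media type, lowercased.
    media_type = content_type.split(b";", 1)[0].strip().lower()
    return media_type == b"text/html"


for value in (b"text/html; charset=utf-8", b"Text/HTML", b"application/json", b""):
    print(value, looks_like_html(value))
```

In the middleware this could replace the substring test, for example as `looks_like_html(response.headers.get(b'Content-Type', b''))`, so that JSON, images and other non-HTML responses are still passed through untouched.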

2. Enable the project's downloader middleware

Open settings.py in the project and uncomment the following lines:

DOWNLOADER_MIDDLEWARES = {
   "myspider.middlewares.MyspiderDownloaderMiddleware": 543,
}
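The number in this setting is the middleware's order. Scrapy calls `process_response` on downloader middlewares in decreasing order, so a middleware at 543 sees the response after the built-in `HttpCompressionMiddleware` (registered at 590 by default) has already decompressed the body. The annotated fragment below restates the setting with that in mind; `myspider` and `MyspiderDownloaderMiddleware` are the names from this article's project and will differ in other projects.

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # Key: dotted path to the middleware class in this project.
    # Value: order. process_response() runs from high to low, so at 543
    # the body has already passed through the built-in middlewares
    # registered above it, such as HttpCompressionMiddleware at 590.
    "myspider.middlewares.MyspiderDownloaderMiddleware": 543,
}
```

Keeping the value below 590 matters here because BeautifulSoup needs the decompressed HTML, not the gzip-encoded bytes.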

3. Summary

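Because a Scrapy response object cannot be modified in place, I had been calling BeautifulSoup's prettify() method inside the spider to normalize the HTML. The project required XPath parsing, however, and this kept raising errors: when Scrapy's XPath parsing was called, the soup output was treated as a plain str and could not be parsed, since only a Scrapy response object can be parsed that way. In the end I moved the cleanup into the downloader middleware, which works on the response object directly, and that solved the problem.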