Notes on the pitfalls of scraping Jianshu: why some class content on a web page cannot be scraped

Article link: https://blog.csdn.net/qq_41801603/article/details/106174705

Pitfalls when scraping Jianshu with the Scrapy framework

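As the title says, while scraping Jianshu with the Scrapy framework I created the spider from the CrawlSpider template as usual and defined the crawl rule: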

    rules = (
        Rule(LinkExtractor(allow=r'.*/p/[0-9a-z]{12}.*'), callback='parse_detail', follow=True),
    )
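To see which links this rule would follow, the pattern can be tried on its own with the standard `re` module. This is only a sketch: the sample URLs are made up, `looks_like_article` is an illustrative helper, and it assumes the link extractor applies the pattern as a regex search against the absolute URL.

```python
import re

# Same pattern as the allow= argument in the rule above.
ARTICLE_PATTERN = re.compile(r'.*/p/[0-9a-z]{12}.*')


def looks_like_article(url: str) -> bool:
    """Return True if the URL contains /p/ followed by at least 12 characters from [0-9a-z]."""
    return ARTICLE_PATTERN.search(url) is not None


# Hypothetical URLs, made up for illustration only.
samples = [
    "https://www.jianshu.com/p/0123456789ab",
    "https://www.jianshu.com/p/0123456789ab?utm_source=desktop",
    "https://www.jianshu.com/u/0123456789ab",
    "https://www.jianshu.com/p/short",
]
for url in samples:
    print(url, looks_like_article(url))
```

The first two samples match and the last two do not. Under a regex search, the leading and trailing `.*` do not change which URLs match.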

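Next I wrote the page-parsing function the usual way: open the developer console, inspect the page elements, open Chrome's XPath extension, and locate the author, title, publication date, avatar, and article body. Once the XPath expressions checked out in the browser, I split the article id out of the URL. Part of the code: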

        author = response.xpath("//span[@class='_22gUMi']/text()").get()
        title = response.xpath("//h1[@class='_1RuRku']/text()").get()
        picture = response.xpath("//div[@class='_2mYfmT']/a/img/@src").get()
        pub_time = response.xpath("//div[@class='_2mYfmT']//time/@datetime").get()
        url = response.url
        # The article id is the last path segment, with any query string removed.
        article_id = url.split("?")[0].split("/")[-1]
        content = response.xpath("//article[@class='_2rhmJa']").get()
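The id extraction can also be written with the standard library's URL parser, which ignores fragments as well as the query string. This is a minimal sketch: `extract_article_id` is an illustrative name, and the URL is a made-up example.

```python
from urllib.parse import urlsplit


def extract_article_id(url: str) -> str:
    """Return the last path segment of the URL, ignoring query string and fragment."""
    path = urlsplit(url).path
    return path.rstrip("/").split("/")[-1]


# Hypothetical article URL, for illustration only.
print(extract_article_id("https://www.jianshu.com/p/0123456789ab?utm_source=desktop"))
```

For this URL the result is the same as with the `split("?")[0].split("/")[-1]` approach above.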

Then the pipeline writes the items to MySQL. To cut down the time spent on I/O, the writes are asynchronous:

import pymysql
from pymysql import cursors
from twisted.enterprise import adbapi


class JianShutwistedPipeline(object):
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': '123',
            'database': 'jianshu',
            'charset': 'utf8',
            'cursorclass': cursors.DictCursor
        }
        # adbapi runs the blocking pymysql calls in a thread pool.
        self.dbpool = adbapi.ConnectionPool('pymysql', **dbparams)
        self._sql = None

    @property
    def sql(self):
        # Build the statement once and reuse it afterwards.
        if not self._sql:
            self._sql = """
            insert into article(id,title,content,author,avatar,pub_time,origin_url,article_id)
            values(null,%s,%s,%s,%s,%s,%s,%s)
            """
        return self._sql

    def process_item(self, item, spider):
        defer = self.dbpool.runInteraction(self.insert_item, item)
        defer.addErrback(self.handle_error, item, spider)
        # Return the item so that any later pipeline still receives it.
        return item

    # Insert a single row
    def insert_item(self, cursor, item):
        cursor.execute(self.sql, (item['title'], item['content'], item['author'], item['picture'],
                                  item['pub_time'], item['origin_url'], item['article_id']))

    # Error handler
    def handle_error(self, error, item, spider):
        # error is a twisted Failure, so convert it to str before concatenating.
        print("=" * 10 + str(error) + "=" * 10)
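Every field of the item goes straight into the SQL parameters in `insert_item`, so a selector that matched nothing arrives there as `None`. A small check like the following sketch could make that visible before the insert runs. `REQUIRED_FIELDS` and `missing_fields` are illustrative names that are not part of the original pipeline, and the sample item is made up.

```python
# Field names taken from the insert statement's parameters in the pipeline.
REQUIRED_FIELDS = ("title", "content", "author", "picture",
                   "pub_time", "origin_url", "article_id")


def missing_fields(item) -> list:
    """Return the names of required fields that are absent or None."""
    return [name for name in REQUIRED_FIELDS if item.get(name) is None]


# Made-up item in which two selectors matched nothing.
item = {
    "title": "Some title",
    "content": "<article>...</article>",
    "author": None,
    "picture": None,
    "pub_time": "2020-05-17",
    "origin_url": "https://www.jianshu.com/p/0123456789ab",
    "article_id": "0123456789ab",
}
print(missing_fields(item))
```

A pipeline could log this list and skip the insert when it is not empty, which points directly at the selector that failed.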

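After adjusting the relevant settings I started the crawl and watched the messages in the console.
When writing to the database, the program also raised an error saying the tuple could not be empty, which means the XPath expressions did not select the expected results from the response.
At this point I noticed that some of the class names on the tags look like gibberish, as if they were machine-generated for obfuscation, so I tried absolute paths instead: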

        picture = response.xpath("/html/body/div/div/div/div/section/div/div/a/img/@src").get()
        pub_time = response.xpath("/html/body/div/div/div/div/section/div/div/div/div[2]//time/text()").get()

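Running it again still returned nothing. I then looked at the page source and found that it does not contain the information above. In other words, the XPath expressions I had verified in the browser were matching the already rendered page, not the raw HTML that Scrapy downloads. This data is requested asynchronously via AJAX. I searched under Network -> XHR in the developer tools but could not find the address the request is sent to.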
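The difference between the raw source and the rendered page can be checked mechanically by listing the class names each one contains. The sketch below uses only the standard library. The two HTML strings are made-up stand-ins, not Jianshu's real markup, and `ClassCollector` and `classes_in` are illustrative names.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every class name that appears in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in(html: str) -> set:
    """Return the set of class names found in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


# Made-up stand-ins: what a plain HTTP download might return, and what the
# browser shows after scripts have run.
raw_source = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
rendered_dom = '<html><body><div id="root"><h1 class="_1RuRku">Title</h1></div></body></html>'

print("_1RuRku" in classes_in(raw_source))
print("_1RuRku" in classes_in(rendered_dom))
```

If a class used in an XPath expression is missing from the downloaded source, no selector run against that response can match it, whether the path is relative or absolute.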
So I chose to use a downloader middleware that hands the rendered page back to the parsing function:

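import time

from scrapy.http.response.html import HtmlResponse
from selenium import webdriver


class JianshuDownloaderMiddleware(object):
    def __init__(self):
        driver_path = r"D:\Chromedriver\chromedriver.exe"
        # Note: Selenium 4 deprecates executable_path in favour of a Service object.
        self.driver = webdriver.Chrome(executable_path=driver_path)

    def process_request(self, request, spider):
        # Load the page in a real browser so that its JavaScript runs.
        self.driver.get(request.url)
        # Fixed pause to give the asynchronous content time to render.
        time.sleep(1)
        source = self.driver.page_source
        # Returning a response here makes Scrapy skip its own download
        # and pass the rendered HTML to the spider callback.
        response = HtmlResponse(url=self.driver.current_url, body=source, request=request, encoding='utf-8')
        return response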
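The fixed `time.sleep(1)` either wastes time on fast pages or is too short on slow ones. One possible refinement is to poll until the content is actually present. The sketch below shows the idea with the standard library only: `wait_until` and `FakeDriver` are illustrative names, and the fake driver merely simulates a page that finishes rendering on the third poll. Selenium ships its own `WebDriverWait` helper for this purpose, which would be the natural choice inside the middleware.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.2):
    """Call condition() until it returns a truthy value or timeout seconds pass.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


class FakeDriver:
    """Hypothetical stand-in for a browser driver; the page 'renders' on the third poll."""

    def __init__(self):
        self.polls = 0

    @property
    def page_source(self):
        self.polls += 1
        if self.polls < 3:
            return "<html><body></body></html>"
        return "<html><body><article>text</article></body></html>"


driver = FakeDriver()


def rendered_source():
    """Return the page source once it contains an <article> tag, else None."""
    source = driver.page_source
    return source if "<article" in source else None


source = wait_until(rendered_source, timeout=2.0, interval=0.01)
print(driver.polls, "<article" in source)
```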

After updating the settings file, I ran the crawl again.
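The settings change itself is not shown in this post. A minimal sketch of what it might look like follows, assuming the project package is called `jianshu` and the classes live in the default `middlewares.py` and `pipelines.py` modules. Both module paths are guesses and need to be adjusted to the real project layout.

```python
# settings.py (fragment). The module paths assume a project package named
# "jianshu", which is a guess; adjust them to the real package name.

# Route every request through the Selenium-backed downloader middleware.
DOWNLOADER_MIDDLEWARES = {
    "jianshu.middlewares.JianshuDownloaderMiddleware": 543,
}

# Enable the asynchronous MySQL pipeline.
ITEM_PIPELINES = {
    "jianshu.pipelines.JianShutwistedPipeline": 300,
}
```

The numbers are ordering priorities; they only matter relative to other middlewares or pipelines enabled in the same project.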
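This time, both the relative and the absolute XPath expressions select the data.
The rows were written to the database successfully.
The data was then exported to a file.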
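The export was only meant to check that the rows written to the database are correct, so I did not export to CSV and simply kept the default utf8 encoding. When exporting to CSV, the encoding also needs to be set to GBK, which Excel expects, to avoid garbled text when the file is opened.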
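The encoding point can be illustrated with the standard `csv` module. This is a sketch with a made-up sample row and an illustrative `write_csv` helper, not the export used for this post. For Scrapy's own feed exports, the encoding is controlled by the `FEED_EXPORT_ENCODING` setting.

```python
import csv
import os
import tempfile

# Made-up sample row containing Chinese text.
rows = [{"title": "简书文章", "author": "作者"}]


def write_csv(path, rows, encoding):
    """Write rows to a CSV file using the given text encoding."""
    with open(path, "w", newline="", encoding=encoding) as f:
        writer = csv.DictWriter(f, fieldnames=["title", "author"])
        writer.writeheader()
        writer.writerows(rows)


path = os.path.join(tempfile.mkdtemp(), "articles.csv")
write_csv(path, rows, "gbk")

# Read the bytes back and decode them the way a GBK-expecting program would.
with open(path, "rb") as f:
    raw = f.read()
print(raw.decode("gbk"))
```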

It works.
If anyone knows a better approach, please do share it.