Scrapy Framework (2): CSDN Scroll-Triggered (Dynamic) Loading

Hello-H

Article link: https://blog.csdn.net/m0_37897007/article/details/88942783

1. Source Code Analysis
2. Scrapy Example
- 2.1 Analyzing the Data
- 2.2 Code Example

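1. Source Code Analysis

This article is shared for learning purposes only; if it infringes on any rights, please contact me!

Target site for this article: https://ask.csdn.net/

step 1: Open the browser's developer tools and find the JavaScript that implements scroll-triggered loading (press Ctrl + F and search for 上滑, the text of the "scroll to load more" hint)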

 $(function() {
     var not_loading = true,pageNum = 3,oList = $("#album_detail_wrap");
     $(window).scroll(function() {
         var scrollTop = $(this).scrollTop();
         var scrollHeight = $(document).height() ;
         var windowHeight = $(this).height();
         if (scrollTop + windowHeight >= scrollHeight) {
             if(not_loading){
                 refreshMore();
                 setTimeout(function(){
                     getMore();
                 },100);
             }else{
                 $(".dl_more").remove();
             }
         }
     });

     function getMore(){
         //console.log("getMore......")
         $.ajax({
             type: 'get',
             url: '/questions/ajax_get_questions',
             data:{
                 sort_by:'',
                 page:pageNum,
                 type:''
             },
             //async: false,
             dataType: 'json',
             success: function (resobj) {
                 var totalNum = resobj.total_pages;
                 if(pageNum <=totalNum ){
                     $(".dl_more").remove();
                     oList.append(resobj.oHtml);
                     refreshMore();
                     not_loading = true;
                     pageNum++;
                 }else{
                     not_loading = false;
                     noMore();
                 }
             },
             error: function (err) {
                 console.log(err);
             }
         });
     }

     function noMore(){
         $(".dl_more").remove();
         if(oList.find(".dl_no_more").length ==0){
             oList.append('<div class="dl_no_more" style="font-size:14px; color:red; text-align:center;padding-top:10px; ">我们是很有底线的</div>');
         }
     }

     function refreshMore(){
         if(oList.find(".dl_more").length ==0){
             oList.append('<div class="dl_more" style="font-size:14px; color:red; text-align:center;padding-top:10px;">上滑加载更多</div>');
         }
     }
 });

step 2: Locate the url, data, and other parameters of the Ajax call and analyze the code

function getMore(){
   //console.log("getMore......")
    $.ajax({
        type: 'get',  // request method
        url: '/questions/ajax_get_questions',  // request URL (relative to the site root)
        data:{
            sort_by:'',  // sort order
            page:pageNum,  // page number to request
            type:''  // to be determined...
        },
        //async: false,
        dataType: 'json',  // response format
        success: function (resobj) {
            var totalNum = resobj.total_pages;
            if(pageNum <=totalNum ){
                $(".dl_more").remove();
                oList.append(resobj.oHtml);  // resobj.oHtml is the returned data
                refreshMore();
                not_loading = true;
                pageNum++;  // advance to the next page number
            }else{
                not_loading = false;
                noMore();
            }
        },
        error: function (err) {
            console.log(err);
        }
    });
}
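
The paging decision in the success callback can be restated in a few lines of Python. This is only a sketch of the logic read off the script above; the function name `next_page` is made up for illustration and does not exist in the site's code.

```python
def next_page(page_num, total_pages):
    """Mirror the success callback: decide which page to request next.

    Returns the next page number while page_num <= total_pages,
    or None once the server reports that there is nothing more to load.
    """
    if page_num <= total_pages:
        # Corresponds to oList.append(resobj.oHtml) followed by pageNum++
        return page_num + 1
    # Corresponds to not_loading = false; noMore();
    return None


first = next_page(3, 5)       # the page script starts at pageNum = 3
at_last = next_page(5, 5)     # the last real page still increments the counter
past_end = next_page(6, 5)    # one request beyond the end stops the loading
```

Read this way, the script appears to send one extra request past `total_pages` before it stops, which a crawler can avoid by comparing against `total_pages` itself.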

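step 3: Assemble the URL and test the hypothesis in the browser (or find the same request in the developer tools)

Scroll down with the mouse to trigger loading, then select the time slice in which the request appears
Click the request entry ajax_get_questions?sort_by=&page=3&type=
Select Headers to inspect the request headers
Select Response to inspect the returned data
Browser request
Haha, after several days of digging I finally found it!!!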
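
Assembling the URL by hand is easy to get wrong, so here is a minimal sketch that builds it from the Ajax parameters found in step 2. The helper name `build_page_url` and the constant `BASE_URL` are illustrative assumptions; the host, path, and parameter names are the ones seen in the request above.

```python
from urllib.parse import urlencode

# Host plus the relative url taken from the $.ajax call
BASE_URL = "https://ask.csdn.net/questions/ajax_get_questions"


def build_page_url(page, sort_by="", type_=""):
    """Build the GET URL for one page of questions.

    sort_by and type_ default to empty strings, as in the page script.
    """
    params = {"sort_by": sort_by, "page": page, "type": type_}
    return f"{BASE_URL}?{urlencode(params)}"


url = build_page_url(3)
print(url)
```

Pasting the printed URL into the browser address bar is one way to check that it returns the same JSON as the request captured in the developer tools.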

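2. Scrapy Example

2.1 Analyzing the Data

Paste the returned JSON into a JSON parsing tool to make it readable (any online JSON formatter will do)
Analyze the key-value pairs that come back. The returned data can take one of two forms:
1. The scraped data is returned entirely as JSON. I have not run into this form myself; see the tutorial "Scrapy实战一:GET方法爬取CSDN主页动态数据" for that case.
2. The scraped data is HTML code. This is the case I ran into: the HTML sits in the oHtml field of the response.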
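
To tell the two cases apart programmatically, one rough approach is to look for string values that start with a tag. The sketch below uses only the standard library; the function name `describe_payload` and the sample payload are invented for illustration, and only the key names `oHtml` and `total_pages` come from the page script.

```python
import json


def describe_payload(text):
    """List the keys of a JSON response and flag values that look like HTML."""
    payload = json.loads(text)
    html_fields = [
        key
        for key, value in payload.items()
        # Heuristic: a string value starting with "<" is treated as an HTML fragment
        if isinstance(value, str) and value.lstrip().startswith("<")
    ]
    return {"keys": sorted(payload), "html_fields": html_fields}


# Hypothetical response body, shaped like the fields the script reads
sample = json.dumps({
    "total_pages": 5,
    "oHtml": '<div class="questions_detail_con"></div>',
})
info = describe_payload(sample)
print(info)
```

If `html_fields` is empty, the response is plain JSON and can be read directly; otherwise each flagged field needs an HTML parser such as Scrapy's `Selector`.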

2.2 Code Example

Set up a bare Scrapy spider (omitted here; the kind you get running with three commands)
Rewrite csdn_ask.py

# -*- coding: utf-8 -*-
import scrapy
import json
from scrapy.selector import Selector


class CsdnAskSpider(scrapy.Spider):
    name = 'csdn_ask'
    allowed_domains = ['ask.csdn.net']
    # Target URL to crawl
    start_urls = ['https://ask.csdn.net/questions/ajax_get_questions?sort_by=&page=3&type=']

    def parse(self, response):
        # The oHtml field of the JSON response holds the HTML that contains the target data
        oHtml = json.loads(response.text)['oHtml']
        # General pattern:
        # result = Selector(text=<str extracted from the JSON>).xpath('<your XPath expression>').extract()
        # Run a Selector over oHtml to pick out the target data;
        # div[@class="questions_detail_con"] is the parent div of the data I want
        datums = Selector(text=oHtml).xpath('//div[@class="questions_detail_con"]')

        # Parse the detailed fields out of datums; [:1] shows a single record first to ease analysis
        for data in datums[:1]:
            question_url = data.xpath('.//dl/h1/a/@href').extract_first()
            question_title = data.xpath('.//dl/h1/a/text()').extract_first()
            question_sketch = data.xpath('.//dl/dd/text()').extract_first()
            tags = data.xpath('.//div[@class="tags"]/a/text()').extract()
            # answer_num is assumed to be the div's class, as with q_time
            answer_num = data.xpath('.//div[@class="answer_num"]/span/text()').extract_first()
            time = data.xpath('.//div[@class="q_time"]/span/text()').extract_first()
            print(time, question_title, question_sketch, question_url, tags, answer_num)

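Debug the code (you can also run it directly and skip this step). Tool: PyCharm:
1. Create the project entry file start_spider.py (see the Project pane in the result figure), add the code below, set a breakpoint, right-click the entry file, and run it in Debug mode.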

# -*- coding: utf-8 -*-

# Entry point for launching the Scrapy spider project

from scrapy import cmdline

cmdline.execute("scrapy crawl csdn_ask".split())

2. Result:

Check in the debugger pane that the data is correct.