Scraping Grades from the CUMT Academic Affairs System

Kingslayer_

Category: python. Tags: python, web scraping

Article link: https://blog.csdn.net/qq_33278884/article/details/80936714

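I have had some free time lately, so I rewrote the code from back when I was first learning Python and failed to scrape my grades from the academic affairs system. The rewrite took a while: I considered and tried quite a few approaches and picked up some knowledge along the way, so here is a summary.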

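First, open the login page of the CUMT (China University of Mining and Technology) academic affairs system and analyze the page.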

CUMT academic affairs system

Analyzing the page source

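Log in through the URL below with the browser's F12 developer tools open, and you can see the information shown in the figure below. The form contains a csrftoken field, and the password is encrypted before it is sent, so simply POSTing the username and password will not log you in.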

http://202.119.200.202/jwglxt/xtgl/login_slogin.html

[Figure: the login request as seen in the browser's developer tools]

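Right-clicking and viewing the page source shows that the password fields use autocomplete="off" to stop the browser from auto-filling the password. This may be why splinter threw an error whenever it tried to type into the password field. Or maybe not, since selenium can type into it without trouble.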

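The page also uses a csrftoken, which guards against CSRF attacks. This is why my first plan failed: log in on the home page, save the cookies, then jump straight to the grades page I had found by analysis. So the only option left was to request the grades page directly, get redirected to the login page, and scrape the grades once the login succeeds.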

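To find the grades page, step through the requests layer by layer in Chrome's built-in F12 tools; I will not go into the details here. The URL is as follows.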

http://202.119.206.62/jwglxt/cjcx/cjcx_cxDgXscj.html?doType=query&gnmkdm=N305005


<div class="row sl_log_bor4">
            <div class="col-sm-8 hidden-xs sl_log_lf">
                <img class="img-responsive" src="http://202.119.206.62:80/zftal-ui-v5-1.0.2/assets/images/login_bg_pic.jpg" />
            </div>
            <div class="col-sm-4 sl_log_rt">
                <form class="form-horizontal" role="form" action="/jwglxt/xtgl/login_slogin.html" method="post">
                    <!-- csrftoken is used to guard against CSRF -->
                <input type="hidden" id="csrftoken" name="csrftoken" value="d6f6b735-2438-4476-a520-a4a7a237d110,d6f6b73524384476a520a4a7a237d110"/>
                    <h5>用户登录</h5>
                    <!-- prevent the browser from auto-filling the password -->
                    <input type="text" style="display: none;" autocomplete="off"/>
                    <input type="password" style="display: none;" autocomplete="off"/>
                    <!-- prevent the browser from auto-filling the password: end -->


                        <p style="display: none;" id="tips" class="bg_danger sl_danger">
                        </p>
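
To make the role of the hidden csrftoken field more concrete, here is a minimal sketch of how its value could be read from the login page with Python's standard library. The class and function names are my own, and the sample HTML is trimmed from the snippet above; fetching the real page is deliberately left out.

```python
from html.parser import HTMLParser


class CsrfTokenParser(HTMLParser):
    """Collects the value of <input id="csrftoken" ...> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> are routed here as well.
        attributes = dict(attrs)
        if tag == "input" and attributes.get("id") == "csrftoken":
            self.token = attributes.get("value")


def extract_csrftoken(html):
    """Return the csrftoken value found in html, or None if there is none."""
    parser = CsrfTokenParser()
    parser.feed(html)
    return parser.token


# Sample trimmed from the login form shown above.
sample = (
    '<form action="/jwglxt/xtgl/login_slogin.html" method="post">'
    '<input type="hidden" id="csrftoken" name="csrftoken" '
    'value="d6f6b735-2438-4476-a520-a4a7a237d110,d6f6b73524384476a520a4a7a237d110"/>'
    '</form>'
)
print(extract_csrftoken(sample))
```

The token changes with every page load, so it would have to be read from the same session that later submits the login form.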

Analyzing the encryption algorithm

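Of the scripts the page loads, the first four are the JavaScript files used to encrypt the password, and login.js handles the login itself. Open it and you can see the algorithm used to encrypt the password. It first defines two variables, modulus and exponent, which are needed to build the public key for RSA encryption. Both values are fetched from the server with a request that carries the current time. The _t parameter in the login URL below is produced the same way by a JavaScript function: it is the number of milliseconds since 00:00 on 1970/1/1. As a result, the same password encrypts to a different ciphertext at different times.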

http://jwxt.cumt.edu.cn/jwglxt/xtgl/login_slogin.html?language=zh_CN&_t=1530780180937
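
As a small sketch of the timestamp parameter, the Python below builds the same kind of URL. The function names are my own, and the base path "/jwglxt" for the public-key endpoint is an assumption: login.js (quoted later in this post) builds that URL from a `_path` variable whose value is not shown.

```python
import time
from urllib.parse import urlencode

# Host taken from the login URL above; "/jwglxt" as the value of _path is assumed.
BASE = "http://jwxt.cumt.edu.cn/jwglxt"


def current_millis():
    """Milliseconds since 1970-01-01 00:00 UTC, like new Date().getTime() in JS."""
    return int(time.time() * 1000)


def login_url(timestamp=None):
    """Login page URL with the _t timestamp parameter."""
    timestamp = current_millis() if timestamp is None else timestamp
    query = urlencode({"language": "zh_CN", "_t": timestamp})
    return f"{BASE}/xtgl/login_slogin.html?{query}"


def public_key_url(timestamp=None):
    """URL that login.js queries for modulus and exponent (path assumed)."""
    timestamp = current_millis() if timestamp is None else timestamp
    return f"{BASE}/xtgl/login_getPublicKey.html?{urlencode({'time': timestamp})}"


print(login_url())
print(public_key_url())
```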

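I originally wanted to work through this JavaScript encryption and reimplement it in Python, so that I could log in by POSTing the username, the encrypted password, and the csrftoken taken from the page source. After some analysis I still could not get it to work, so I am leaving that as a to-do for later.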

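Roughly, the algorithm works like this: fetch modulus and exponent, convert them from base64 to hexadecimal with b64tohex, and build the RSA public key from them. The password is then encrypted with that public key, which yields the ciphertext as a hexadecimal string. Finally, hex2b64 converts that hexadecimal string to base64, and the result is the encrypted password that gets submitted.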

Encryption code

var modulus, exponent;
$.getJSON(_path + "/xtgl/login_getPublicKey.html?time=" + new Date().getTime(), function(data) {
    modulus = data["modulus"];
    exponent = data["exponent"];
});

var rsaKey = new RSAKey();
rsaKey.setPublic(b64tohex(modulus), b64tohex(exponent));
var enPassword = hex2b64(rsaKey.encrypt($("#mm").val()));
$("#mm").val(enPassword);
$("#hidMm").val(enPassword);

Below are the functions involved, which I extracted from those four files. When I get the chance I will implement them in Python.

// Set the public key fields N and e from hex strings
function RSASetPublic(N,E) {
    if(N != null && E != null && N.length > 0 && E.length > 0) {
        this.n = parseBigInt(N,16);
        this.e = parseInt(E,16);
    }
    else
        alert("Invalid RSA public key");
}

// Return the PKCS#1 RSA encryption of "text" as an even-length hex string
function RSAEncrypt(text) {
    var m = pkcs1pad2(text,(this.n.bitLength()+7)>>3);
    if(m == null) return null;
    var c = this.doPublic(m);
    if(c == null) return null;
    var h = c.toString(16);
    if((h.length & 1) == 0) return h; else return "0" + h;
}

function RSAKey() {
    this.n = null;
    this.e = 0;
    this.d = null;
    this.p = null;
    this.q = null;
    this.dmp1 = null;
    this.dmq1 = null;
    this.coeff = null;
}
// public
RSAKey.prototype.setPublic = RSASetPublic;
RSAKey.prototype.encrypt = RSAEncrypt;

var b64map="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
var b64pad="=";

function hex2b64(h) {
    var i;
    var c;
    var ret = "";
    for(i = 0; i+3 <= h.length; i+=3) {
        c = parseInt(h.substring(i,i+3),16);
        ret += b64map.charAt(c >> 6) + b64map.charAt(c & 63);
    }
    if(i+1 == h.length) {
        c = parseInt(h.substring(i,i+1),16);
        ret += b64map.charAt(c << 2);
    }
    else if(i+2 == h.length) {
        c = parseInt(h.substring(i,i+2),16);
        ret += b64map.charAt(c >> 2) + b64map.charAt((c & 3) << 4);
    }
    while((ret.length & 3) > 0) ret += b64pad;
    return ret;
}

// convert a base64 string to hex
function b64tohex(s) {
    var ret = ""
    var i;
    var k = 0; // b64 state, 0-3
    var slop;
    for(i = 0; i < s.length; ++i) {
        if(s.charAt(i) == b64pad) break;
        var v = b64map.indexOf(s.charAt(i));
        if(v < 0) continue;
        if(k == 0) {
            ret += int2char(v >> 2);
            slop = v & 3;
            k = 1;
        }
        else if(k == 1) {
            ret += int2char((slop << 2) | (v >> 4));
            slop = v & 0xf;
            k = 2;
        }
        else if(k == 2) {
            ret += int2char(slop);
            ret += int2char(v >> 2);
            slop = v & 3;
            k = 3;
        }
        else {
            ret += int2char((slop << 2) | (v >> 4));
            ret += int2char(v & 0xf);
            k = 0;
        }
    }
    if(k == 1)
        ret += int2char(slop << 2);
    return ret;
}
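
For comparison, here is a rough Python sketch of what the JavaScript helpers above appear to do, using only the standard library. It is untested against the real server and is not a finished login implementation. The function names are my own. It assumes that modulus and exponent arrive as standard base64 strings and that the pkcs1pad2 function called in RSAEncrypt (not reproduced in this post) applies PKCS#1 v1.5 type 2 padding to the UTF-8 password.

```python
import base64
import secrets


def b64_to_int(value):
    """Counterpart of parseBigInt(b64tohex(value), 16): base64 -> big integer."""
    return int.from_bytes(base64.b64decode(value), "big")


def pkcs1_v15_pad(message, key_len):
    """PKCS#1 v1.5 encryption padding: 00 02 <nonzero random bytes> 00 <message>."""
    pad_len = key_len - len(message) - 3
    if pad_len < 8:
        raise ValueError("message too long for this key size")
    padding = bytes(secrets.randbelow(255) + 1 for _ in range(pad_len))
    return b"\x00\x02" + padding + b"\x00" + message


def encrypt_password(password, modulus_b64, exponent_b64):
    """Counterpart of hex2b64(rsaKey.encrypt(password)), under the assumptions above."""
    n = b64_to_int(modulus_b64)
    e = b64_to_int(exponent_b64)
    key_len = (n.bit_length() + 7) // 8
    padded = pkcs1_v15_pad(password.encode("utf-8"), key_len)
    c = pow(int.from_bytes(padded, "big"), e, n)
    # RSAEncrypt returns the shortest even-length hex string, i.e. whole bytes
    # without leading zero bytes; base64 of those bytes matches hex2b64.
    cipher_bytes = c.to_bytes(max(1, (c.bit_length() + 7) // 8), "big")
    return base64.b64encode(cipher_bytes).decode("ascii")
```

Because the padding is random, two calls with the same password and key give different ciphertexts, so the output cannot be compared byte for byte with what the browser sends. A maintained RSA library would be a safer choice than hand-written padding for anything beyond experimenting.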

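After all this analysis I ended up choosing selenium, a browser automation testing tool. selenium + PhantomJS is said to be a powerful combination for scraping.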

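I went with selenium + Firefox, which requires downloading a geckodriver.exe version that matches the browser. The approach is still to simulate a browser login, save the cookies, and then scrape the grades.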

Scraper code

from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
import json

# Grades page; requesting it without a session redirects to the login page.
GRADES_URL = ('http://202.119.206.62/jwglxt/cjcx/cjcx_cxDgXscj.html'
              '?doType=query&gnmkdm=N305005&queryModel.showCount=200')

driver = webdriver.Firefox()
session = requests.session()


def get_cookie():
    driver.get(GRADES_URL)
    # find_element(By.ID, ...) replaces find_element_by_id, which newer
    # Selenium 4 releases no longer provide.
    driver.find_element(By.ID, 'yhm').send_keys('08133xxx')
    driver.find_element(By.ID, 'mm').send_keys('XXXXXXX')
    driver.find_element(By.ID, 'dl').click()
    # Join every cookie into one header value instead of keeping only the last one.
    return '; '.join(item['name'] + '=' + item['value']
                     for item in driver.get_cookies())


def get_score(cookie):
    headers = {
        'cookie': cookie,
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
    }
    r = session.get(GRADES_URL, headers=headers)
    return r.text


# I wrote this function long ago when I had just started with Python, and it is ugly.
# Once you have the grades, you can analyze them however you like.
def analyse(text):
    # json.loads no longer accepts an encoding argument in current Python 3.
    json_cj = json.loads(text)['items']
    a = 0
    for cj in json_cj:
        if int(cj['bfzcj']) >= 60:
            print('Course:', cj['kcmc'], ' ', 'Grade:', cj['bfzcj'])
            a += 1
    print('Total:', a, 'courses passed')
    print(' ')
    for cj in json_cj:
        if int(cj['bfzcj']) < 60:
            print('Failed course:', cj['kcmc'], 'Grade:', cj['bfzcj'])


if __name__ == '__main__':
    try:
        cookie = get_cookie()
        text = get_score(cookie)
        analyse(text)
    finally:
        # Close the browser even if login or parsing fails.
        driver.quit()

The output is shown in the figure below.
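
For readers without access to the system, the parsing step can be tried on a made-up payload. The field names items, kcmc (course name) and bfzcj (percentage grade) come from the code above; the sample courses and grades are invented purely for illustration, and split_grades is a hypothetical helper, not part of the original script.

```python
import json


def split_grades(text, pass_mark=60):
    """Split the grade records in a JSON response into passed and failed lists."""
    passed, failed = [], []
    for item in json.loads(text)["items"]:
        grade = int(item["bfzcj"])
        target = passed if grade >= pass_mark else failed
        target.append((item["kcmc"], grade))
    return passed, failed


# Invented sample payload with the same shape the scraper code expects.
sample = json.dumps({
    "items": [
        {"kcmc": "Course A", "bfzcj": "85"},
        {"kcmc": "Course B", "bfzcj": "52"},
        {"kcmc": "Course C", "bfzcj": "60"},
    ]
})

passed, failed = split_grades(sample)
print("Passed:", passed)
print("Failed:", failed)
```

Returning two lists instead of printing inside the loop keeps the parsing separate from the output, which makes it easier to reuse the data afterwards.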

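After rambling through a pile of code, the solution turned out to be this simple... but as long as the data gets scraped, that is fine, right? The underlying principle is still cookies (on the client side) and sessions (on the server side).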
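
To illustrate the cookie side of this, here is a minimal sketch of turning the list of cookie dictionaries that selenium's driver.get_cookies() returns into a single Cookie header value. The helper name and the sample cookie names and values are hypothetical.

```python
def cookie_header(cookies):
    """Join selenium-style cookie dicts into one 'name=value; name=value' string."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# Shape of driver.get_cookies() output, reduced to the two keys used here.
sample_cookies = [
    {"name": "JSESSIONID", "value": "abc123"},
    {"name": "route", "value": "xyz"},
]
print(cookie_header(sample_cookies))
```

Every cookie is included because the server may rely on more than one of them to recognize the session.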

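When you need to scrape content that is generated dynamically by JavaScript, you can trace the requests layer by layer to find the page that actually holds the data, and then scrape that.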
