Crawlers for Scraping Company Information (Part 2)

Latest recommended article published on 2024-12-08 10:30:50

法萌

Category: Crawlers. Tags: crawler, python, cookie, javascript

Article link: https://blog.csdn.net/qwdpoiguw/article/details/120572406

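The previous post, "Crawlers for Scraping Company Information (Part 1)", fetched data from Qichacha (企查查). This post analyzes the JavaScript that generates its cookies.

Qichacha's cookies mainly include the following: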

acw_tc=701ec49416327465587377184eb448e3cf457f2bbf56789e0313b461cd
QCCSESSID=42negcpgs96lali07famk9fsp2
qcc_did=7009749f-0fb0-4fb4-ad93-c0bb260c9a81
UM_distinctid=17c27475a7c266-092b008134d5d-513c1743-144000-17c27475a7d6d2
CNZZDATA1254842228=841623527-1632737666-%7C1632737666
zg_did=%7B%22did%22%3A%20%2217c27475d0bb15-0d46738956f8f1-513c1743-144000-17c27475d0c9cc%22%7D
zg_294c2ba1ecc244809c552f8f6fd2a440=%7B%22sid%22%3A%201632746560786%2C%22updated%22%3A%201632746560791%2C%22info%22%3A%201632746560790%2C%22superProperty%22%3A%20%22%7B%7D%22%2C%22platform%22%3A%20%22%7B%7D%22%2C%22utm%22%3A%20%22%7B%7D%22%2C%22referrerDomain%22%3A%20%22%22%7D
_uab_collina=163274656143041952478997
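
Several of these values are URL-encoded JSON, which is easier to read once decoded. The following is a minimal sketch, assuming the cookies arrive as one `name=value; name=value` string; the helper name `parse_cookie_header` is made up for illustration, and only two of the cookies listed above are included to keep it short.

```python
import json
from urllib.parse import unquote

# Two of the cookies listed above, joined the way a Cookie header carries them
raw_cookie = (
    "qcc_did=7009749f-0fb0-4fb4-ad93-c0bb260c9a81; "
    "zg_did=%7B%22did%22%3A%20%2217c27475d0bb15-0d46738956f8f1-"
    "513c1743-144000-17c27475d0c9cc%22%7D"
)


def parse_cookie_header(raw):
    """Split a Cookie header into a dict and URL-decode every value."""
    cookies = {}
    for pair in raw.split(";"):
        name, _, value = pair.strip().partition("=")
        cookies[name] = unquote(value)
    return cookies


cookies = parse_cookie_header(raw_cookie)
zg_did = json.loads(cookies["zg_did"])  # the decoded value is a JSON object
print(cookies["qcc_did"])
print(zg_did["did"])
```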

Analyzing these cookies requires the captured traffic, so the requests were captured with Fiddler.

1. acw_tc, QCCSESSID

The Fiddler capture shows that both of these cookies are set by the server in its response.
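
Because the server sets these two values, a crawler only has to read them back from the response. A minimal sketch using the standard library's `http.cookies` follows; the attributes after each value (`path`, `HttpOnly`, `Max-Age`) are illustrative assumptions, not copied from the real capture, and `collect_server_cookies` is a made-up helper name.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header lines; only the names and values come from the post
set_cookie_headers = [
    "acw_tc=701ec49416327465587377184eb448e3cf457f2bbf56789e0313b461cd; path=/; HttpOnly; Max-Age=1800",
    "QCCSESSID=42negcpgs96lali07famk9fsp2; path=/",
]


def collect_server_cookies(headers):
    """Parse Set-Cookie header lines into a simple name -> value dict."""
    jar = {}
    for line in headers:
        parsed = SimpleCookie()
        parsed.load(line)
        for name, morsel in parsed.items():
            jar[name] = morsel.value
    return jar


server_cookies = collect_server_cookies(set_cookie_headers)
print(server_cookies)
```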

2. CNZZDATAXXXXXX, UM_distinctid

Not required. CNZZDATA is the cookie of the CNZZ analytics service; CNZZ was acquired by Umeng (友盟), so UM_xxx is Umeng's cookie.

3. qcc_did

qcc_did=7009749f-0fb0-4fb4-ad93-c0bb260c9a81

Searching the JS files for this cookie name eventually turns up the relevant code in https://www.qcc.com/material/theme/chacha/cms/v2/js/zhuge.js.

function setDeviceId(){
    var deviceId = getCookie('qcc_did');
    // console.info(deviceId)
    if(!deviceId){
        var uuid = generateUUID();
        setCookie('qcc_did',uuid,24*365*3); // 3 years
    }
}

function generateUUID() {
    var d = new Date().getTime()
    if (window.performance && typeof window.performance.now === 'function') {
        d += performance.now()
    }
    var uuid = 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
    var r = (d + Math.random() * 16) % 16 | 0
    d = Math.floor(d / 16)
        return (c === 'x' ? r : (r & 0x3 | 0x8)).toString(16)
    })
    return uuid
}

So qcc_did is generated by the generateUUID function. With pyexecjs, this JS function can be run from Python.

import execjs

qcc_did_js = """function generateUUID() {
    var d = new Date().getTime()
    var window = {}
    if (window.performance && typeof window.performance.now === 'function') {
        d += performance.now()
    }
    var uuid = 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
    var r = (d + Math.random() * 16) % 16 | 0
    d = Math.floor(d / 16)
        return (c === 'x' ? r : (r & 0x3 | 0x8)).toString(16)
    })
    return uuid
}
"""

qcc_did = execjs.compile(qcc_did_js).call('generateUUID')
print(qcc_did)
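
For environments where pyexecjs is not available, the same function can be ported to plain Python. The sketch below is a direct translation under one assumption: the `performance.now()` branch is skipped, exactly as the stubbed `window = {}` does in the JS above. The name `generate_uuid` is mine, not the site's.

```python
import random
import re
import time


def generate_uuid():
    """Pure-Python port of generateUUID (the performance.now() branch is omitted)."""
    d = int(time.time() * 1000)  # new Date().getTime()

    def replace(match):
        nonlocal d
        r = int((d + random.random() * 16) % 16)  # (...) % 16 | 0
        d //= 16                                  # Math.floor(d / 16)
        value = r if match.group(0) == "x" else (r & 0x3 | 0x8)
        return format(value, "x")

    return re.sub("[xy]", replace, "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx")


qcc_did = generate_uuid()
print(qcc_did)
```

The result has the layout of a version 4 UUID, so `str(uuid.uuid4())` from the standard library would probably serve equally well; the port is shown only to mirror the site's code.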

In practice this is the most important cookie: setting it alone is enough to access Qichacha normally. The others can be left unset.

4. _uab_collina

_uab_collina=163274656143041952478997

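This turns out to be an Alibaba CDN cookie, defined in the JS file https://g.alicdn.com/sd/ncpc/nc.js?t=1520579483.

Reference article: "taobao去验证码文件cookie处理模块浅析" (LiveZingy), which is enough to work it out.

The conclusion first: a 13-digit current millisecond timestamp followed by an 11-digit random numeric string.

4.1 The cookie name is assigned to variable g

As before, search for the cookie name directly; it is assigned to g.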

var d, u, p, _ = window,
f = document,
g = "_uab_collina",
h = _.pointman && pointman._now ? pointman._now: (new Date).getTime();

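4.2 Searching for g shows that function r() calls o(g)

If e already has a value, it is returned directly; if e is empty, the expression after || is evaluated. The goal therefore becomes finding functions a and i.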

function r() {
    var e, t = /Firefox/.test(navigator.userAgent);
    if (t) try {
    e = localStorage.getItem(g)
    } catch(n) {}
    return e = e || o(g),
    e || (e = h + a(11), i(g, e, 3650)),
    e
}

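4.3 function i()

This function writes the cookie and sets its lifetime. Of the three comma-separated expressions in r's return statement, only the second one writes a value to the cookie, so the key to _uab_collina is h + a(11).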

function i(e, t, n) {
    n = n || 7;
    var o = new Date;
    o.setTime(o.getTime() + 864e5 * n),
    f.cookie = [encodeURIComponent(e), "=", encodeURIComponent("" + t), ";expires=", o.toGMTString()].join("")
}
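
The cookie string that i() assembles can be mirrored in Python to make the expiry arithmetic concrete (864e5 ms is one day). This is a rough sketch: `build_cookie_string` is a made-up name, and `quote(..., safe="!*'()")` is only my approximation of `encodeURIComponent`.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
from urllib.parse import quote


def build_cookie_string(name, value, days=7, now=None):
    """Mimic i(e, t, n): name=value;expires=<GMT date>, defaulting to 7 days."""
    now = now or datetime.now(timezone.utc)
    expires = now + timedelta(days=days)  # o.getTime() + 864e5 * n

    def encode(text):
        # Approximation of encodeURIComponent
        return quote(str(text), safe="!*'()")

    return "".join([
        encode(name), "=", encode(value),
        ";expires=", format_datetime(expires, usegmt=True),
    ])


# r() passes 3650 days, i.e. roughly ten years
print(build_cookie_string("_uab_collina", "163274656143041952478997", 3650))
```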

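4.4 Function a(e): returns a random numeric string of e digits

The function's code is shown below. The relevant basics are:

substring(start, stop): returns the substring from start up to stop-1;
substring(start): returns the characters from start to the end of the string;
substr(start, length): if start < 0, then start = string length + start; if length <= 0, an empty string is returned;
substr(start): if length is omitted, returns everything from start to the end of the string;
Math.random: returns a value greater than or equal to 0.0 and less than 1.0, typically with 15 to 18 digits after the decimal point.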

function a(e) {
    for (var t = ""; t.length < e;) t += Math.random().toString().substr(2);
    return t.substring(t.length - e)
}

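4.5 Variable h

getTime() return value: both Java and JavaScript have a Date type whose getTime() method returns the number of milliseconds since the epoch, which is currently a 13-digit number.

4.6 Implementation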

import execjs

_uab_collina_js = """function a(e) {
for (var t = ""; t.length < e;) t += Math.random().toString().substr(2);
return (new Date).getTime() + t.substring(t.length - e)
}"""

_uab_collina = execjs.compile(_uab_collina_js).call('a', 11)
print(_uab_collina)
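
The same value can be produced without a JS runtime. The sketch below follows the stated conclusion (13-digit millisecond timestamp plus 11 random digits) rather than porting a(e) character by character: it draws digits directly instead of slicing `Math.random()` output, which is a simplification on my part. Both function names are illustrative.

```python
import random
import time


def random_digits(length):
    """Return a string of `length` random decimal digits."""
    return "".join(random.choice("0123456789") for _ in range(length))


def generate_uab_collina():
    """13-digit millisecond timestamp followed by 11 random digits."""
    return str(int(time.time() * 1000)) + random_digits(11)


uab_collina = generate_uab_collina()
print(uab_collina)
```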

5. zg_did

{"did": "178b59f60733ad-089245814caeb-45410429-144000-178b59f607440b"}

It can be found by searching for zg_did in https://tongji.qichacha.com/zhuge.js.

5.1 Search for zg_did in zhuge.js

y.prototype._initDid = function(e) {
    var t = n.cookie.get("_zg"),
    i = "",
    r = n.hasMobileSdk();
    r.flag && (i = r.getDid()),
    e = e || this.config.did || i || n.UUID(),
    t && n.JSONDecode(t).uuid && (e = n.JSONDecode(t).uuid),
    n.cookie.get("zg_did") || n.cookie.remove("zg_" + this._key);
    var o = n.extend({},
    this.config);
    o.cookie_expire_days = this.config.did_cookie_expire_days,
    this.did = new u("zg_did", o),
    this.did.register_once({
        did: e
    },
    "")
},

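The directly relevant parts are this.did = new u("zg_did", o) and the registration call this.did.register_once({did: e}, ""). From the shape of zg_did, it is clear that e is the long value itself.

Apart from the override from the _zg cookie on the following line, e comes only from e = e || this.config.did || i || n.UUID(), so the next step is to search for UUID.

5.2 UUID appears only here in the file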

UUID: (n = function() {
    for (var e = 1 * new Date,
    t = 0; e == 1 * new Date;) t++;
    return e.toString(16) + t.toString(16)
},
function() {
    var e = (screen.height * screen.width).toString(16);
    return n() + "-" + Math.random().toString(16).replace(".", "") + "-" +
    function(e) {
        var t, i, n = m,
        r = [],
        o = 0;
        function a(e, t) {
            var i, n = 0;
            for (i = 0; i < t.length; i++) n |= r[i] << 8 * i;
            return e ^ n
        }
        for (t = 0; t < n.length; t++) i = n.charCodeAt(t),
        r.unshift(255 & i),
        r.length >= 4 && (o = a(o, r), r = []);
        return r.length > 0 && (o = a(o, r)),
        o.toString(16)
    } () + "-" + e + "-" + n()
})

5.3 The value of m, found by searching

g = window.navigator,
v = window.document,
m = g.userAgent,

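The code in 5.2 shows the shape of the return value: n()-random number-function(e)-e-n(), which matches the cookie field by field. Every part is rendered in hexadecimal, so first convert each one to decimal.

Field	Hexadecimal	Decimal
n()	178b59f60733ad	6627142960362413
random number	089245814caeb	0150789189585643
function(e)	45410429	1161888809
e	144000	1327104
n()	178b59f607440b	6627142960366603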
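
The conversions in this table can be checked with a few lines of Python. The sketch also splits the first n() field into its two parts, since n() concatenates `e.toString(16)` (the millisecond timestamp) with `t.toString(16)` (the loop counter); treating the first 11 hex digits as the timestamp is my reading of that code, and the field labels are my own.

```python
did = "178b59f60733ad-089245814caeb-45410429-144000-178b59f607440b"
labels = ["n() first", "random number", "function(e)", "e", "n() last"]

# Convert every dash-separated field from hexadecimal to decimal
decoded = {label: int(part, 16) for label, part in zip(labels, did.split("-"))}
for label, value in decoded.items():
    print(label, value)

# A current millisecond timestamp occupies 11 hex digits; the rest is the counter t
first = did.split("-")[0]
timestamp_ms = int(first[:11], 16)
counter = int(first[11:], 16)
print(timestamp_ms, counter)
```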

5.4 Implementation

import execjs

# n() and c(m) are declared as named functions so that pyexecjs can call c by name
zg_did_js = """
function n() {
    for (var e = 1 * new Date,
    t = 0; e == 1 * new Date;) t++;
    return e.toString(16) + t.toString(16)
}
function c(m) {
    // (screen.height * screen.width).toString(16), hard-coded from the captured cookie
    var e = "144000";
    return n() + "-" + Math.random().toString(16).replace(".", "") + "-" +
    function(e) {
        var t, i, n = m,
        r = [],
        o = 0;
        function a(e, t) {
            var i, n = 0;
            for (i = 0; i < t.length; i++) n |= r[i] << 8 * i;
            return e ^ n
        }
        for (t = 0; t < n.length; t++) i = n.charCodeAt(t),
        r.unshift(255 & i),
        r.length >= 4 && (o = a(o, r), r = []);
        return r.length > 0 && (o = a(o, r)),
        o.toString(16)
    } () + "-" + e + "-" + n()
}
"""
ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
zg_did = execjs.compile(zg_did_js).call('c', ua)
print(zg_did)

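Note: the n() produced here is not always 14 characters long, because t sometimes has only two hexadecimal digits and then needs to be left-padded with 0.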
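
The UUID logic can also be ported to pure Python, which avoids the pyexecjs dependency. This is a sketch under several assumptions: the user agent is plain ASCII, the 1536x864 screen size is just one resolution whose product is 0x144000, the random field is approximated with 52 random bits rather than by reproducing `Math.random().toString(16)` exactly, and all function names are mine.

```python
import random
import re
import time


def n():
    """Millisecond timestamp in hex followed by a busy-loop counter in hex."""
    start = int(time.time() * 1000)
    count = 0
    while int(time.time() * 1000) == start:
        count += 1
    # Left-pad the counter to three hex digits, as the note above suggests
    return format(start, "x") + format(count, "x").zfill(3)


def _to_int32(value):
    """Wrap to a signed 32-bit integer, like JS bitwise operators do."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def ua_hash(user_agent):
    """XOR-fold the user agent four characters at a time (inner function in 5.2)."""
    def fold(acc, chunk):
        packed = 0
        for i, byte in enumerate(chunk):
            packed |= byte << (8 * i)
        return _to_int32(acc ^ packed)

    acc, chunk = 0, []
    for ch in user_agent:
        chunk.insert(0, ord(ch) & 0xFF)  # r.unshift(255 & i)
        if len(chunk) >= 4:
            acc, chunk = fold(acc, chunk), []
    if chunk:
        acc = fold(acc, chunk)
    return format(acc, "x")


def generate_zg_did(user_agent, screen_width=1536, screen_height=864):
    screen_hex = format(screen_width * screen_height, "x")
    random_hex = "0" + format(random.getrandbits(52), "013x")  # approximation
    return "-".join([n(), random_hex, ua_hash(user_agent), screen_hex, n()])


ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
print(generate_zg_did(ua))
```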

6. zg_294c2ba1ecc2XXXXXXXXXXXX

{"sid": 1632746560786, "updated": 1632746560791, "info": 1632746560790, "superProperty": "{}", "platform": "{}", "utm": "{}", "referrerDomain": ""}

Four items matter:

key: 294c2ba1ecc2XXXXXXXXXX
"sid": 1632746560786
"updated": 1632746560791
"info": 1632746560790

The values show that sid, updated and info are all timestamps; searching https://tongji.qichacha.com/zhuge.js confirms this.

6.1 info

y.prototype._info = function(e) {
    var t = this.cookie.props.info,
    i = 1 * new Date;
    ……
    this._batchTrack(r),
    this.cookie.register({
        info: i
    },
    "")
}
},

6.2 updated, sid

y.prototype._session = function(e) {
    var t = !1,
    i = this.cookie.props.updated,
    r = this.cookie.props.sid,
    o = 1 * new Date,
    a = new Date;
    if (0 == r || o > i + 60 * this.config.session_interval_mins * 1e3) {
        ……
        r = e || o,
        r *= 1;
        ……
        this.cookie.register({
            sid: r
        },
        ""),
        t = !0
    }
    return this.cookie.register({
        updated: o
    },
    ""),
    t
},
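
The visible part of _session can be restated in Python: a new sid is issued when there is none yet or when the last update is older than the session interval, and updated is refreshed on every call. This sketch ignores the elided lines and the optional argument e; the 30-minute default for `session_interval_mins` is a placeholder assumption, since the real configuration value is not shown.

```python
def update_session(props, now_ms, session_interval_mins=30):
    """Refresh sid/updated in `props`; return True when a new session starts."""
    sid = props.get("sid", 0)
    updated = props.get("updated", 0)
    new_session = sid == 0 or now_ms > updated + 60 * session_interval_mins * 1000
    if new_session:
        props["sid"] = now_ms
    props["updated"] = now_ms  # always refreshed, as in the return statement
    return new_session


props = {}
print(update_session(props, 1632746560786), props)
print(update_session(props, 1632746560791), props)
```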

6.3 key

The key sits right in the JS at https://www.qcc.com/material/theme/chacha/cms/v2/js/zhuge.js and can be extracted directly with a regular expression.

window.zhuge.load('294c2ba1ecc244809c552f8f6fd2a440',{
    visualizer: false,
    // debug: true,
    autoTrack:false
});

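6.4 Implementation

import gzip
import re
import urllib.request

# Minimal headers for this snippet; the full script below builds a complete set
header = {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip'}
request = urllib.request.Request(url="https://www.qcc.com/material/theme/chacha/cms/v2/js/zhuge.js", headers=header)
html_1 = urllib.request.urlopen(request, timeout=10).read()

# Decompress only if the body really is gzip (magic bytes 1f 8b)
htmls = (gzip.decompress(html_1) if html_1[:2] == b'\x1f\x8b' else html_1).decode('utf-8')

key = re.findall(r"window\.zhuge\.load\('([^']+)'", htmls)[0]
print(key)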
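
Once the key is known, the complete cookie can be assembled. The sketch below serializes the properties as JSON and URL-encodes them; the separator choice (no space after commas, one space after colons) is inferred from the captured cookie at the top of this post, and `build_zg_session_cookie` is an illustrative name.

```python
import json
from urllib.parse import quote


def build_zg_session_cookie(key, sid, updated, info):
    """Return (name, value) for the zg_<key> cookie, URL-encoded like the capture."""
    props = {
        "sid": sid,
        "updated": updated,
        "info": info,
        "superProperty": "{}",
        "platform": "{}",
        "utm": "{}",
        "referrerDomain": "",
    }
    text = json.dumps(props, separators=(",", ": "))
    return "zg_" + key, quote(text, safe="")


name, value = build_zg_session_cookie(
    "294c2ba1ecc244809c552f8f6fd2a440",
    sid=1632746560786, updated=1632746560791, info=1632746560790,
)
print(name + "=" + value)
```

With the three timestamps from the capture, this reproduces the zg_294c2ba1ecc244809c552f8f6fd2a440 value listed at the start of the post.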

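7. Summary

That covers the JS code behind the cookies on the Qichacha site and how to reproduce them. However: (1) pyexecjs is not installed in my company's environment; (2) pyexecjs has since been taken down and cannot be installed via pip. That leaves converting everything to Python (ó﹏ò｡), but I am really not familiar with JS, so the conversion will have to wait until I have time.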

# coding:utf-8
import execjs
import time
import urllib.request
import http.cookiejar
import re
# Accept-Encoding is gzip, so responses need to be decompressed
from io import BytesIO
import gzip
from lxml import etree


def get_header():
    ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
    header = {
        'Connection': 'keep-alive',
        'User-Agent': ua,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip',
        'Accept-Language': 'zh-CN,zh;q=0.9',
    }

    qcc_did_js = """function generateUUID() {
        var d = new Date().getTime()
        var window = {}
        if (window.performance && typeof window.performance.now === 'function') {
            d += performance.now()
        }
        var uuid = 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
        var r = (d + Math.random() * 16) % 16 | 0
        d = Math.floor(d / 16)
            return (c === 'x' ? r : (r & 0x3 | 0x8)).toString(16)
        })
        return uuid
    }"""
    qcc_did = execjs.compile(qcc_did_js).call('generateUUID')
    qcc_did_str = 'qcc_did=' + qcc_did

    _uab_collina_js = """function a(e) {
        for (var t = ""; t.length < e;) t += Math.random().toString().substr(2);
        return (new Date).getTime() + t.substring(t.length - e)
    }"""
    _uab_collina = execjs.compile(_uab_collina_js).call('a', 11)
    _uab_collina_str = '_uab_collina=' + _uab_collina

    # n() and c(m) are declared as named functions so that pyexecjs can call c by name
    zg_did_js_1 = """
        function n() {
            for (var e = 1 * new Date,
            t = 0; e == 1 * new Date;) t++;
            return e.toString(16) + t.toString(16)
        }
        function c(m) {
            // (screen.height * screen.width).toString(16), hard-coded from the captured cookie
            var e = "144000";
            return n() + "-" + Math.random().toString(16).replace(".", "") + "-" +
            function(e) {
                var t, i, n = m,
                r = [],
                o = 0;
                function a(e, t) {
                    var i, n = 0;
                    for (i = 0; i < t.length; i++) n |= r[i] << 8 * i;
                    return e ^ n
                }
                for (t = 0; t < n.length; t++) i = n.charCodeAt(t),
                r.unshift(255 & i),
                r.length >= 4 && (o = a(o, r), r = []);
                return r.length > 0 && (o = a(o, r)),
                o.toString(16)
            } () + "-" + e + "-" + n()
        }"""
    zg_did_1 = execjs.compile(zg_did_js_1).call('c', ua)

    def n():
        e = int(time.time() * 1000)
        t = 0
        while e == int(time.time() * 1000):
            t = t + 1
        tt = hex(t)[2:]
        while len(tt) < 3:
            tt = '0' + tt
        return hex(e)[2:] + tt

    uuid_js = """function uuid(m) {
                var e = 144000;
                return "-" + Math.random().toString(16).replace(".", "") + "-" +
                function(e) {
                    var t, i, n = m,
                    r = [],
                    o = 0;
                    function a(e, t) {
                        var i, n = 0;
                        for (i = 0; i < t.length; i++) n |= r[i] << 8 * i;
                        return e ^ n
                    }
                    for (t = 0; t < n.length; t++) i = n.charCodeAt(t),
                    r.unshift(255 & i),
                    r.length >= 4 && (o = a(o, r), r = []);
                    return r.length > 0 && (o = a(o, r)),
                    o.toString(16)
                } () + "-" + e + "-"
            }"""
    zg_did_2 = n() + execjs.compile(uuid_js).call('uuid', ua) + n()

    zg_did_str = 'zg_did=' + '{"did": "' + zg_did_1 + '"}'

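    # Fetch zhuge.js to extract the key passed to window.zhuge.load(...)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar()))
    request = urllib.request.Request(url="https://www.qcc.com/material/theme/chacha/cms/v2/js/zhuge.js", headers=header)
    html_1 = opener.open(request, timeout=10).read()
    buff = BytesIO(html_1)
    f = gzip.GzipFile(fileobj=buff)
    htmls = f.read().decode('utf-8')
    zg_key = re.findall(r"window\.zhuge\.load\('([^']+)'", htmls)[0]
    # sid, info and updated are millisecond timestamps a few milliseconds apart
    sid = int(time.time() * 1000)
    info = sid + 4
    updated = info + 1
    zg_session_str = 'zg_' + zg_key + '=' + '{"sid": ' + str(sid) + ', "updated": ' + str(updated) + ', "info": ' + str(info) + ', "superProperty": "{}", "platform": "{}","utm": "{}", "referrerDomain": ""}'

    # Only qcc_did is required; the other cookies can be appended with '; ' if needed
    cookie = qcc_did_str  # + '; ' + _uab_collina_str + '; ' + zg_did_str + '; ' + zg_session_str
    header['Cookie'] = cookie
    return header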