Scraping Toutiao User Article Titles and Detail Pages

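This post covers the problems I ran into while scraping a Toutiao user's articles: generating the encrypted signature correctly, patching the missing browser environment, getting a working detail-page URL, and handling list pages that come back empty. Working through each one with breakpoints and debugging turned up the key fix, and I share what I learned along the way.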

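Here is a summary of the problems I ran into. I had scraped Toutiao user articles before, and the requests are protected by an encrypted signature. I read a few related posts and worked out how it is generated.

The road to victory was not all smooth sailing.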

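1. window.byted_acrawler.sign({url:url}) takes a URL. When I first set a breakpoint, I just filled in whichever URL I happened to see, and it did not work: the generated _signature was unusable.

Fix: I was too impatient. What I wanted to scrape was the Articles tab, but I hit the breakpoint right after refreshing the page, before that tab had loaded, and simply picked a URL that was there. I should have kept stepping through breakpoints until the moment the article list is loaded, and passed in the URL seen at that breakpoint.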
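To make the "which URL" point concrete, here is a minimal sketch that builds the unsigned feed URL for the Articles tab. The query parameters are the ones used by the script in this post; `build_feed_url` and its argument names are hypothetical helpers introduced only for illustration, and the resulting string is what would be handed to the signer before `_signature` is appended.

```python
from urllib.parse import urlencode

# Endpoint and parameter names are taken from the request used in this post.
FEED_ENDPOINT = "https://www.toutiao.com/api/pc/list/user/feed"


def build_feed_url(token, max_behot_time, category="pc_profile_article"):
    """Return the feed URL without _signature (hypothetical helper)."""
    params = {
        "category": category,              # the script in this post uses pc_profile_article
        "token": token,                    # user token from the profile URL
        "max_behot_time": max_behot_time,  # 0 for the first page
        "aid": 24,
        "app_name": "toutiao_web",
    }
    return FEED_ENDPOINT + "?" + urlencode(params)


print(build_feed_url("USER_TOKEN", 0))
```

Keeping the URL construction in one place makes it harder to sign one URL and then request a different one.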

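2. Patching the environment. There are plenty of ready-made environment patches online, but I did not know whether they work universally, so I patched the environment based on what the console printed.

As the console output shows, document, location, and navigator are all there; I copied the values from it.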
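One way to keep the copied console values manageable is to hold them in a Python dict and render the JavaScript assignments from it. The sketch below is only an illustration under that assumption: `build_env_prelude` is a hypothetical helper, and the few sample values are taken from the environment listing in this post. The generated text would be placed in front of the site's script before it is compiled.

```python
import json


def build_env_prelude(env):
    """Render browser objects as JS assignments on a Node-style global (hypothetical helper)."""
    lines = ["window = global;"]
    for name, value in env.items():
        # json.dumps produces text that is also a valid JavaScript object literal.
        lines.append("window.{} = {};".format(name, json.dumps(value)))
    return "\n".join(lines) + "\n"


env = {
    "document": {"referrer": "https://so.toutiao.com/"},
    "location": {"host": "www.toutiao.com", "protocol": "https:"},
    "navigator": {"language": "zh-CN", "platform": "Win32"},
}
print(build_env_prelude(env))
```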

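3. Detail pages. The list JSON I got back contains URLs, but they cannot be used as they are. I was careless about this.

Do not use the url or article_url fields from it directly: if you fetch one of them as the detail page, you get nothing back, because the host has no www in front. I suggest using https://www.toutiao.com/a......./ as the detail-page link instead; with that, the content is there.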
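A small sketch of that rewrite, assuming the list URL has the form http://toutiao.com/group/<id>/ (inferred from the string replacements in the script in this post, not from any documentation). `to_detail_url` is a hypothetical helper name.

```python
import re

# Matches the numeric article id after "group/" or after a leading "a".
_ID_PATTERN = re.compile(r"/(?:group/|a)(\d+)")


def to_detail_url(list_url):
    """Return https://www.toutiao.com/a<id>/ for a list URL, or None if no id is found."""
    match = _ID_PATTERN.search(list_url)
    if match is None:
        return None
    return "https://www.toutiao.com/a{}/".format(match.group(1))


print(to_detail_url("http://toutiao.com/group/7042200035093709342/"))
```

Extracting the id and rebuilding the link avoids depending on the exact host or scheme that the list JSON happens to return.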

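4. The article list page sometimes returns no data. Scraping Toutiao was very, very slow for me, and it had nothing to do with my network speed. Empty responses came up often, so I added a check: if the data is empty, call the function again.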
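Retrying without any limit can spin forever if the endpoint keeps returning nothing, so a capped retry is a safer variant of the same idea. The following is a minimal sketch, not the code used in this post: `fetch_until_data`, `max_attempts`, and `delay` are hypothetical names, and `fetch` stands for any zero-argument function that returns the list of items.

```python
import time


def fetch_until_data(fetch, max_attempts=5, delay=1.0):
    """Call fetch() until it returns a non-empty result or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        data = fetch()
        if data:
            return data
        if attempt < max_attempts:
            time.sleep(delay)  # brief pause before asking again
    return []


# Usage with a stand-in for the real request: two empty replies, then data.
responses = iter([[], [], [{"title": "demo"}]])
print(fetch_until_data(lambda: next(responses), delay=0))
```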

# -*- coding: utf-8 -*-
import json
import random
import time

import execjs
import requests
from lxml import etree

headers = {
    "accept": "application/json, text/plain, */*",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "zh-CN,zh;q=0.9",
    "cookie": "_S_IPAD=0; WIN_WH=375_812; FRM=new; PIXIEL_RATIO=3; _ga=GA1.2.116917834.1620437294; csrftoken=7f9d24dc3abc2aeab049c30b8d86ab5f; tt_webid=7023928361337013791; MONITOR_DEVICE_ID=f9c022a2-96f7-402b-a659-887a67bf9de2; tt_webid=7023928361337013791; _S_DPR=1; _S_WIN_WH=1920_937; _S_UA=Mozilla%2F5.0%20(Windows%20NT%2010.0%3B%20Win64%3B%20x64)%20AppleWebKit%2F537.36%20(KHTML%2C%20like%20Gecko)%20Chrome%2F96.0.4664.45%20Safari%2F537.36; passport_csrf_token_default=ef1b86b4eb117d3299bee8368d01ab81; passport_csrf_token=ef1b86b4eb117d3299bee8368d01ab81; __ac_signature=_02B4Z6wo00f010IVpCgAAIDCIR9kQxtOfytCMaCAALFKNgvOj28czTtzYxDw-ogM4jdjKXztAzAfWMK68gbX3B0jTjzqas8pyIqPhuRuZzM6ufZczrVdBY3EBepHCIYbinixJJsCvd.7NMII6d; sso_uid_tt_ss=275b0585fed5f121eb96bc20ab5d0f7f; toutiao_sso_user=89c87ee5448b210895bb6f776878f045; toutiao_sso_user_ss=89c87ee5448b210895bb6f776878f045; sid_ucp_sso_v1=1.0.0-KGY2Yzg2NjgzYjg2M2Y2MmE3ZDAyZGMwNGYxNzYzYmJhZjMxZWZiOGEKDwja8NLA2wIQrevrjQYYGBoCbGYiIDg5Yzg3ZWU1NDQ4YjIxMDg5NWJiNmY3NzY4NzhmMDQ1; ssid_ucp_sso_v1=1.0.0-KGY2Yzg2NjgzYjg2M2Y2MmE3ZDAyZGMwNGYxNzYzYmJhZjMxZWZiOGEKDwja8NLA2wIQrevrjQYYGBoCbGYiIDg5Yzg3ZWU1NDQ4YjIxMDg5NWJiNmY3NzY4NzhmMDQ1; sso_uid_tt=275b0585fed5f121eb96bc20ab5d0f7f; uid_tt=02d6a9f1037a25d0463f338f9d289ad3; uid_tt_ss=02d6a9f1037a25d0463f338f9d289ad3; sid_tt=570057f240cce2b4ad32a018872e5faf; sessionid=570057f240cce2b4ad32a018872e5faf; sessionid_ss=570057f240cce2b4ad32a018872e5faf; sid_ucp_v1=1.0.0-KDRiNjlmZmNiMjliZjYzMjUxZTEwNDFkM2U5MGVjODYwZjJlZDFiZjgKDwja8NLA2wIQrevrjQYYGBoCbGYiIDU3MDA1N2YyNDBjY2UyYjRhZDMyYTAxODg3MmU1ZmFm; ssid_ucp_v1=1.0.0-KDRiNjlmZmNiMjliZjYzMjUxZTEwNDFkM2U5MGVjODYwZjJlZDFiZjgKDwja8NLA2wIQrevrjQYYGBoCbGYiIDU3MDA1N2YyNDBjY2UyYjRhZDMyYTAxODg3MmU1ZmFm; sid_guard=570057f240cce2b4ad32a018872e5faf%7C1639642541%7C3024000%7CThu%2C+20-Jan-2022+08%3A15%3A41+GMT; MONITOR_WEB_ID=93282678874; s_v_web_id=verify_kx9p6yqz_XRzeZVXJ_RD5E_44BU_820w_smMAHTFIaw5i; "
              "tt_anti_token=cDSrB8zueOSwHGI-cb679e9b34cc9430a1ff05eb213fc0b8a0de7b46b3b92a4179f66a4f53d29416; ttwid=1%7C_KtNzqt5kI03bn7fBVhj-rQCaJxjAKSak6Iioy2mFYs%7C1639703837%7C0f0c312dd7b0efa93a9569861116940f73f56e64888b66871a8a6d5ce2a0e203; tt_scid=H-LKqsTsVysKe3N.at4iR.QdIlw895hLWBduQA.NjQ-g1h7Kw0jDEC4pDwrC6HyV0023",
    "referer": "https://www.toutiao.com/c/user/token/MS4wLjABAAAAq5QSjn8wn7A7Th30to72qb4CFuEAB8qNo2n0ux15Vhc/??tab=article?tab=article?tab=article",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
}

headers_detail = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "accept-encoding": "gzip, deflate",
    "accept-language": "zh-CN,zh;q=0.9",
    "cache-control": "max-age=0",
    "cookie": "_S_IPAD=0; WIN_WH=375_812; FRM=new; PIXIEL_RATIO=3; _ga=GA1.2.116917834.1620437294; csrftoken=7f9d24dc3abc2aeab049c30b8d86ab5f; tt_webid=7023928361337013791; MONITOR_DEVICE_ID=f9c022a2-96f7-402b-a659-887a67bf9de2; tt_webid=7023928361337013791; _S_DPR=1; _S_WIN_WH=1920_937; _S_UA=Mozilla%2F5.0%20(Windows%20NT%2010.0%3B%20Win64%3B%20x64)%20AppleWebKit%2F537.36%20(KHTML%2C%20like%20Gecko)%20Chrome%2F96.0.4664.45%20Safari%2F537.36; passport_csrf_token_default=ef1b86b4eb117d3299bee8368d01ab81; passport_csrf_token=ef1b86b4eb117d3299bee8368d01ab81; __ac_signature=_02B4Z6wo00f010IVpCgAAIDCIR9kQxtOfytCMaCAALFKNgvOj28czTtzYxDw-ogM4jdjKXztAzAfWMK68gbX3B0jTjzqas8pyIqPhuRuZzM6ufZczrVdBY3EBepHCIYbinixJJsCvd.7NMII6d; sso_uid_tt_ss=275b0585fed5f121eb96bc20ab5d0f7f; toutiao_sso_user=89c87ee5448b210895bb6f776878f045; toutiao_sso_user_ss=89c87ee5448b210895bb6f776878f045; sid_ucp_sso_v1=1.0.0-KGY2Yzg2NjgzYjg2M2Y2MmE3ZDAyZGMwNGYxNzYzYmJhZjMxZWZiOGEKDwja8NLA2wIQrevrjQYYGBoCbGYiIDg5Yzg3ZWU1NDQ4YjIxMDg5NWJiNmY3NzY4NzhmMDQ1; ssid_ucp_sso_v1=1.0.0-KGY2Yzg2NjgzYjg2M2Y2MmE3ZDAyZGMwNGYxNzYzYmJhZjMxZWZiOGEKDwja8NLA2wIQrevrjQYYGBoCbGYiIDg5Yzg3ZWU1NDQ4YjIxMDg5NWJiNmY3NzY4NzhmMDQ1; sso_uid_tt=275b0585fed5f121eb96bc20ab5d0f7f; uid_tt=02d6a9f1037a25d0463f338f9d289ad3; uid_tt_ss=02d6a9f1037a25d0463f338f9d289ad3; sid_tt=570057f240cce2b4ad32a018872e5faf; sessionid=570057f240cce2b4ad32a018872e5faf; sessionid_ss=570057f240cce2b4ad32a018872e5faf; sid_ucp_v1=1.0.0-KDRiNjlmZmNiMjliZjYzMjUxZTEwNDFkM2U5MGVjODYwZjJlZDFiZjgKDwja8NLA2wIQrevrjQYYGBoCbGYiIDU3MDA1N2YyNDBjY2UyYjRhZDMyYTAxODg3MmU1ZmFm; ssid_ucp_v1=1.0.0-KDRiNjlmZmNiMjliZjYzMjUxZTEwNDFkM2U5MGVjODYwZjJlZDFiZjgKDwja8NLA2wIQrevrjQYYGBoCbGYiIDU3MDA1N2YyNDBjY2UyYjRhZDMyYTAxODg3MmU1ZmFm; sid_guard=570057f240cce2b4ad32a018872e5faf%7C1639642541%7C3024000%7CThu%2C+20-Jan-2022+08%3A15%3A41+GMT; MONITOR_WEB_ID=93282678874; s_v_web_id=verify_kx9p6yqz_XRzeZVXJ_RD5E_44BU_820w_smMAHTFIaw5i; "
              "tt_anti_token=nSAlvifb-5d23fbe65f1e5b16e89dc18373d171efd78a4f366e7e426007e86ed2a75b7ba9; ttwid=1%7C_KtNzqt5kI03bn7fBVhj-rQCaJxjAKSak6Iioy2mFYs%7C1639703742%7Cf86a130fd0e0ab9fa0c00e44d6d99ca0dac86881c80929fa300ccf9dab0abca7; tt_scid=1jGPvYY-oOkiHriVys13Wgx50YecYX.wb7eP.BYr20LVZ4MMQvRktJFnndDtXxroe227",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
}

master_url = "https://www.toutiao.com/api/pc/list/user/feed?category=pc_profile_article&token=MS4wLjABAAAAq5QSjn8wn7A7Th30to72qb4CFuEAB8qNo2n0ux15Vhc&max_behot_time={max_behot_time}&aid=24&app_name=toutiao_web&_signature={_signature}"

infos_list = []

def get_signature(max_behot_time):
    """Run get_signature() from answer.js on the unsigned feed URL and return the _signature."""
    with open("answer.js", encoding='utf-8') as f:
        jscode = f.read()
    # The URL handed to the signer is the feed URL without the _signature parameter.
    unsigned_url = "https://www.toutiao.com/api/pc/list/user/feed?category=pc_profile_article&token=MS4wLjABAAAAq5QSjn8wn7A7Th30to72qb4CFuEAB8qNo2n0ux15Vhc&max_behot_time={max_behot_time}&aid=24&app_name=toutiao_web".format(max_behot_time=max_behot_time)
    signature = execjs.compile(jscode).call("get_signature", unsigned_url)
    print(signature)
    return signature
# https://www.toutiao.com/api/pc/list/user/feed?category=profile_all&token=MS4wLjABAAAAq5QSjn8wn7A7Th30to72qb4CFuEAB8qNo2n0ux15Vhc&max_behot_time=0&aid=24&app_name=toutiao_web
# https://www.toutiao.com/api/pc/list/user/feed?category=profile_all&token=MS4wLjABAAAAq5QSjn8wn7A7Th30to72qb4CFuEAB8qNo2n0ux15Vhc&max_behot_time=1639381182067&aid=24&app_name=toutiao_web

def get_detail(url, retries=3):
    """Fetch an article page and return its body text, or "" if nothing could be extracted."""
    # retries is an arbitrary cap; the original version retried without any limit.
    for _ in range(retries):
        time.sleep(random.randint(1, 2))
        print("Entering detail page.....")
        response = requests.get(url=url, headers=headers_detail)
        # Set the encoding before reading response.text, otherwise it has no effect.
        response.encoding = 'utf-8'
        html = etree.HTML(response.text)
        # etree.HTML returns None for an empty document
        infos = html.xpath("//div[@class='article-content']/article/*") if html is not None else []
        print(len(infos))
        if infos:
            # As in the original: when there are three or more child nodes, skip the last three.
            if len(infos) >= 3:
                infos = infos[:-3]
            return "".join("".join(info.xpath(".//text()")) + "\n" for info in infos)
    return ""



def get_master(max_behot_time, _signature):
    """Walk the article feed page by page, appending each title and body to infos_list."""
    # A loop replaces the original recursion, which could exceed Python's recursion limit.
    while True:
        time.sleep(random.randint(1, 5))
        response = requests.get(url=master_url.format(max_behot_time=max_behot_time, _signature=_signature))
        payload = json.loads(response.text)
        infos = payload['data']
        if not infos:
            # The feed often comes back empty; request the same page again.
            continue
        for info in infos:
            # log_from: 13 random characters plus a millisecond timestamp
            log_from = "".join(random.sample('abcdefghijklmnopqrstuvwxyz1234567890', 13)) + "_" + str(int(time.time() * 1000))
            # Swap "group/" for "a" and force the https://www. host so the detail page has content.
            article_url = (info['url'].replace("group/", "a") + "?log_from=" + log_from).replace("http://toutiao.com", "https://www.toutiao.com")
            title = info['title']
            print(article_url)
            print(title)
            content = get_detail(article_url)
            infos_list.append({'title': title, 'content': content})
        print(infos_list)
        if not payload['has_more']:
            break
        # Each page needs a fresh signature for its own max_behot_time.
        max_behot_time = payload['next']['max_behot_time']
        _signature = get_signature(max_behot_time)

if __name__ == "__main__":
    _signature = get_signature(0)
    get_master(0,_signature)
    # for a_ in a:
    #     print(a_['url'].split("?")[0])
    # content = get_detail("https://www.toutiao.com/a7042200035093709342/?log_from=3b88f64a8e013_1639703732731")
    # content = get_detail("https://www.toutiao.com/a7038850085706318350/?log_from=ekrfb3hcd9xsq_1639705126626")
    # print(content)
    #     infos_list.append({'title':a_['title'],'content':content})
    #     print(infos_list)
// Environment patch (added content)
window = global;
window.document = {
    referrer: "https://so.toutiao.com/"
};
window.location = {
    hash: "",
    host: "www.toutiao.com",
    hostname: "www.toutiao.com",
    href: "https://www.toutiao.com/c/user/token/MS4wLjABAAAAq5QSjn8wn7A7Th30to72qb4CFuEAB8qNo2n0ux15Vhc/??tab=article?tab=article?tab=article",
    origin: "https://www.toutiao.com",
    pathname: "/c/user/token/MS4wLjABAAAAq5QSjn8wn7A7Th30to72qb4CFuEAB8qNo2n0ux15Vhc/",
    port: "",
    protocol: "https:"
}
window.navigator = {
    appCodeName: "Mozilla",
    appName: "Netscape",
    appVersion: "5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36",
    cookieEnabled: true,
    deviceMemory: 8,
    doNotTrack: null,
    hardwareConcurrency: 16,
    language: "zh-CN",
    languages: ["zh-CN","zh"],
    maxTouchPoints: 0,
    onLine: true,
    platform: "Win32",
    product: "Gecko",
    productSub: "20030107",
    userAgent:"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36",
    vendor: "Google Inc.",
    vendorSub: ""
}

var glb;
(glb = "undefined" == typeof window ? global : window)._$jsvmprt = function(b, e, f) {
    function a() {
        if ("undefined" == typeof Reflect || !Reflect.construct)
            return !1;
        if (Reflect.construct.sham)
            return !1;
        if ("function" == typeof Proxy)
            return !0;
        try {
            return Date.prototype.toString.call(Reflect.construct(Date, [], (function() {}
            ))),
                !0
        } catch (b) {
            return !1
        }
    }
    function d(b, e, f) {
        return (d = a() ? Reflect.construct : function(b, e, f) {
                var a = [null];
                a.push.apply(a, e);
                var d = new (Function.bind.apply(b, a));
                return f && c(d, f.prototype),
                    d
            }
        ).apply(null, arguments)
    }
    function c(b, e) {
        return (c = Object.setPrototypeOf || function(b, e) {
                return b.__proto__ = e,
                    b
            }
        )(b, e)
    }
    function n(b) {
        return function(b) {
            if (Array.isArray(b)) {
                for (var e = 0, f = new Array(b.length); e < b.length; e++)
                    f[e] = b[e];
                return f
            }
        }(b) || function(b) {
            if (Symbol.iterator in Object(b) || "[object Arguments]" === Object.prototype.toString.call(b))
                return Array.from(b)
        }(b) || function() {
            throw new TypeError("Invalid attempt to spread non-iterable instance")
        }()
    }
    for (var i = [], r = 0, t = [], o = 0, l = function(b, e) {
        var f = b[e++]
            , a = b[e]
            , d = parseInt("" + f + a, 16);
        if (d >> 7 == 0)
            return [1, d];
        if (d >> 6 == 2) {
            var c = parseInt("" + b[++e] + b[++e], 16);
            return d &= 63,
                [2, c = (d <<= 8) + c]
        }
        if (d >> 6 == 3) {
            var n = parseInt("" + b[++e] + b[++e], 16)
                , i = parseInt("" + b[++e] + b[++e], 16);
            return d &= 63,
                [3, i = (d <<= 16) + (n <<= 8) + i]
        }
    }, u = function(b, e) {
        var f = parseInt("" + b[e] + b[e + 1], 16);
        return f = f > 127 ? -256 + f : f
    }, s = function(b, e) {
        var f = parseInt("" + b[e] + b[e + 1] + b[e + 2] + b[e + 3], 16);
        return f = f > 32767 ? -65536 + f : f
    }, p = function(b, e) {
        var f = parseInt("" + b[e] + b[e + 1] + b[e + 2] + b[e + 3] + b[e + 4] + b[e + 5] + b[e + 6] + b[e + 7], 16);
        return f = f > 2147483647 ? 0 + f : f
    }, y = function(b, e) {
        return parseInt("" + b[e] + b[e + 1], 16)
    }, v = function(b, e) {
        return parseInt("" + b[e] + b[e + 1] + b[e + 2] + b[e + 3], 16)
    }, g = g || this || window, h = Object.keys || function(b) {
        var e = {}
            , f = 0;
        for (var a in b)
            e[f++] = a;
        return e.length = f,
            e
    }
             , m = (b.length,
            0), I = "", C = m; C < m + 16; C++) {
        var q = "" + b[C++] + b[C];
        q = parseInt(q, 16),
            I += String.fromCharCode(q)
    }
    if ("HNOJ@?RC" != I)
        throw new Error("error magic number " + I);
    m += 16;
    parseInt("" + b[m] + b[m + 1], 16);
    m += 8,
        r = 0;
    for (var w = 0; w < 4; w++) {
        var S = m + 2 * w
            , R = "" + b[S++] + b[S]
            , x = parseInt(R, 16);
        r += (3 & x) << 2 * w
    }
    m += 16,
        m += 8;
    var z = parseInt("" + b[m] + b[m + 1] + b[m + 2] + b[m + 3] + b[m + 4] + b[m + 5] + b[m + 6] + b[m + 7], 16)
        , O = z
        , E = m += 8
        , j = v(b, m += z);
    j[1];
    m += 4,
        i = {
            p: [],
            q: []
        };
    for (var A = 0; A < j; A++) {
        for (var D = l(b, m), T = m += 2 * D[0], $ = i.p.length, P = 0; P < D[1]; P++) {
            var U = l(b, T);
            i.p.push(U[1]),
                T += 2 * U[0]
        }
        m = T,
            i.q.push([$, i.p.length])
    }
    var _ = {
        5: 1,
        6: 1,
        70: 1,
        22: 1,
        23: 1,
        37: 1,
        73: 1
    }
        , k = {
        72: 1
    }
        , M = {
        74: 1
    }
        , H = {
        11: 1,
        12: 1,
        24: 1,
        26: 1,
        27: 1,
        31: 1
    }
        , J = {
        10: 1
    }
        , N = {
        2: 1,
        29: 1,
        30: 1,
        20: 1
    }
        , B = []
        , W = [];
    function F(b, e, f) {
        for (var a = e; a < e + f; ) {
            var d = y(b, a);
            B[a] = d,
                a += 2;
            k[d] ? (W[a] = u(b, a),
                a += 2) : _[d] ? (W[a] = s(b, a),
                a += 4) : M[d] ? (W[a] = p(b, a),
                a += 8) : H[d] ? (W[a] = y(b, a),
                a += 2) : J[d] ? (W[a] = v(b, a),
                a += 4) : N[d] && (W[a] = v(b, a),
                a += 4)
        }
    }
    return K(b, E, O / 2, [], e, f);
    function G(b, e, f, a, c, l, m, I) {
        null == l && (l = this);
        var C, q, w, S = [], R = 0;
        m && (C = m);
        var x, z, O = e, E = O + 2 * f;
        if (!I)
            for (; O < E; ) {
                var j = parseInt("" + b[O] + b[O + 1], 16);
                O += 2;
                var A = 3 & (x = 13 * j % 241);
                if (x >>= 2,
                A < 1) {
                    A = 3 & x;
                    if (x >>= 2,
                    A > 2)
                        (A = x) > 10 ? S[++R] = void 0 : A > 1 ? (C = S[R--],
                            S[R] = S[R] >= C) : A > -1 && (S[++R] = null);
                    else if (A > 1) {
                        if ((A = x) > 11)
                            throw S[R--];
                        if (A > 7) {
                            for (C = S[R--],
                                     z = v(b, O),
                                     A = "",
                                     P = i.q[z][0]; P < i.q[z][1]; P++)
                                A += String.fromCharCode(r ^ i.p[P]);
                            O += 4,
                                S[R--][A] = C
                        } else
                            A > 5 && (S[R] = h(S[R]))
                    } else if (A > 0) {
                        (A = x) > 8 ? (C = S[R--],
                            S[R] = typeof C) : A > 6 ? S[R] = --S[R] : A > 4 ? S[R -= 1] = S[R][S[R + 1]] : A > 2 && (q = S[R--],
                            (A = S[R]).x === G ? A.y >= 1 ? S[R] = K(b, A.c, A.l, [q], A.z, w, null, 1) : (S[R] = K(b, A.c, A.l, [q], A.z, w, null, 0),
                                A.y++) : S[R] = A(q))
                    } else {
                        if ((A = x) > 14)
                            z = s(b, O),
                                (U = function e() {
                                        var f = arguments;
                                        return e.y > 0 ? K(b, e.c, e.l, f, e.z, this, null, 0) : (e.y++,
                                            K(b, e.c, e.l, f, e.z, this, null, 0))
                                    }
                                ).c = O + 4,
                                U.l = z - 2,
                                U.x = G,
                                U.y = 0,
                                U.z = c,
                                S[R] = U,
                                O += 2 * z - 2;
 