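Fetching paginated data concurrently with Python's threading module and saving the combined result to a local file

Scraping data from a web API

This article uses Python's threading module to fetch paginated data concurrently, then merges the results and saves them to a local file.

Background

Take the expert page of NetEase Hongcai (网易红彩) as an example: how many football/basketball experts are listed in total, and can we list all of their names?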

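How to find the API that serves the data

  1. Open the test page https://hongcai.163.com/expert.html
  2. Open the browser developer tools (in Chrome, press F12)
  3. Refresh the page and look at the list of requests under the XHR tab
  4. Click each request name in turn and inspect its Headers (pay attention to the Request URL) and its Preview
    These observations show that:
    1. The request named "20" (the last segment of its URL) returns the expert list
    2. Each call returns 20 expert records
    3. Scrolling down to the bottom of the expert list triggers 8 calls to the "20" request in total
    With this information in hand, we can start writing code.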
    [Figure: capturing the web API requests]
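
The observations above (8 calls, 20 records each) can be turned into a list of request URLs. The following is a minimal sketch, assuming the Request URL has the form https://hongcai.163.com/api/web/expert/list/1/0/20 and that the second-to-last path segment is the starting offset while the last one is the page size. That reading of the segments is an inference from the captured requests, not documented behavior, and `build_page_urls` is a hypothetical helper name.

```python
def build_page_urls(total_pages=8, page_size=20,
                    host='hongcai.163.com', method='expert/list'):
    """Return one URL per page, following the pattern seen in the XHR list."""
    urls = []
    for page in range(total_pages):
        # Assumed meaning: offset of the first record on this page (0, 20, 40, ...)
        offset = page * page_size
        urls.append('https://%s/api/web/%s/1/%s/%s'
                    % (host, method, offset, page_size))
    return urls


urls = build_page_urls()
print(urls[0])    # URL of the first page
print(urls[-1])   # URL of the eighth page
```

Nothing is sent over the network here; the function only builds strings, so it is safe to run while checking the pattern against what the developer tools show.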

Python environment

Python 3.5

Simulating a single API request

import requests


def get_experts():
    host = 'hongcai.163.com'
    # requests decodes gzip and deflate on its own; 'sdch' and 'br' are left out
    # because decoding them needs extra support that may not be installed
    accept_encoding = 'gzip, deflate'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
    referer = 'https://hongcai.163.com/expert.html'
    cookie = ''
    headers = {'Accept-Encoding': accept_encoding,
               'User-Agent': user_agent,
               'Referer': referer,
               'Cookie': cookie,
               }
    method = 'expert/list'
    page_num = 0  # 0 selects the first page
    # https://hongcai.163.com/api/web/expert/list/1/0/20
    url = 'https://%s/api/web/%s/1/%s/20' % (host, method, page_num)

    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()  # fail early on HTTP errors
    resp_json = resp.json()  # parse the JSON body

    e_list = resp_json['data']  # list of expert dictionaries

    print('There are %s experts in first page:\n %s' % (len(e_list), e_list))


if __name__ == '__main__':
    get_experts()

Analyzing the response data

Running the script above produces the following output:

There are 20 experts in first page:
 [{'hasFollowed': False, 'avatar': 'https://nos.netease.com/relottery/user/20170317/WRQrA9.jpg', 'maxWin': 12, 'slogan': '足球评论员', 'threadCount': 1, 'nickname': '张路', 'hitRate': 1.0, 'bestWin': '近3场中3场', 'userId': 1091}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20171031/TGrvFX.jpg', 'maxWin': 21, 'slogan': '前国脚', 'threadCount': 1, 'nickname': '彭伟国', 'hitRate': 1.0, 'bestWin': '近2场中2场', 'userId': 407882}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20180905/udAlhO.jpg', 'maxWin': 11, 'slogan': '分析师', 'threadCount': 2, 'nickname': '农宗燊', 'hitRate': 0.71, 'bestWin': '近7场中5场', 'userId': 1857716}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20180808/MTYxwd.jpg', 'maxWin': 17, 'slogan': '前国足队长', 'threadCount': 0, 'nickname': '马明宇', 'hitRate': 1.0, 'bestWin': '近3场中3场', 'userId': 1810118}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20180521/g5ikAG.png', 'maxWin': 9, 'slogan': '媒体从业者', 'threadCount': 0, 'nickname': '林伟洲', 'hitRate': 0.71, 'bestWin': '近7场中5场', 'userId': 585186}, {'hasFollowed': False, 'avatar': 'https://nos.netease.com/relottery/user/20170711/ksPHVg.jpg', 'maxWin': 23, 'slogan': '媒体从业者', 'threadCount': 2, 'nickname': '李胜', 'hitRate': 0.8, 'bestWin': '近5场中4场', 'userId': 201514}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20170922/EcjKAW.jpg', 'maxWin': 18, 'slogan': '足彩分析师', 'threadCount': 1, 'nickname': '大白亚盘', 'hitRate': 1.0, 'bestWin': '近3场中3场', 'userId': 343108}, {'hasFollowed': False, 'avatar': 'https://nos.netease.com/relottery/user/20170320/oGb4fz.jpg', 'maxWin': 24, 'slogan': '足彩分析师', 'threadCount': 1, 'nickname': '戴维', 'hitRate': 1.0, 'bestWin': '近3场中3场', 'userId': 1879}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20180524/2gBwn1.jpg', 'maxWin': 17, 'slogan': '足彩分析师', 'threadCount': 1, 'nickname': '邓福', 'hitRate': 0.6, 'bestWin': '近5场中3场', 'userId': 579731}, {'hasFollowed': False, 
'avatar': 'https://relottery.nosdn.127.net/user/20171221/2Dy0RI.jpg', 'maxWin': 14, 'slogan': '足彩分析师', 'threadCount': 4, 'nickname': '川洋解盘', 'hitRate': 0.67, 'bestWin': '近3场中2场', 'userId': 1019}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20180803/Xmmc37.jpg', 'maxWin': 10, 'slogan': '媒体从业者', 'threadCount': 0, 'nickname': '张恺意大利', 'hitRate': 0.8, 'bestWin': '近5场中4场', 'userId': 1802551}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20180521/5uOCd8.png', 'maxWin': 9, 'slogan': '知名解说', 'threadCount': 3, 'nickname': '江忠德', 'hitRate': 1.0, 'bestWin': '近2场中2场', 'userId': 585164}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20171120/FepTPm.jpg', 'maxWin': 41, 'slogan': '足球记者', 'threadCount': 3, 'nickname': '王涛', 'hitRate': 1.0, 'bestWin': '近3场中3场', 'userId': 424247}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20170926/f5SEH1.jpg', 'maxWin': 29, 'slogan': '媒体从业者', 'threadCount': 0, 'nickname': '王勤伯', 'hitRate': 1.0, 'bestWin': '近7场中7场', 'userId': 354416}, {'hasFollowed': False, 'avatar': 'https://nos.netease.com/relottery/user/20170704/tC9T47.jpg', 'maxWin': 23, 'slogan': '足彩分析师', 'threadCount': 0, 'nickname': '丹牛', 'hitRate': 1.0, 'bestWin': '近5场中5场', 'userId': 192068}, {'hasFollowed': False, 'avatar': 'https://nos.netease.com/relottery/user/20170612/WsTfSx.jpg', 'maxWin': 14, 'slogan': '足彩分析师', 'threadCount': 1, 'nickname': 'Rafa', 'hitRate': 0.5, 'bestWin': '近2场中1场', 'userId': 164244}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20180710/kk2hSD.png', 'maxWin': 18, 'slogan': '外籍分析师', 'threadCount': 1, 'nickname': 'Jarkko', 'hitRate': 0.5, 'bestWin': '近2场中1场', 'userId': 117515}, {'hasFollowed': False, 'avatar': 'https://nos.netease.com/relottery/user/20170406/IznzMe.jpg', 'maxWin': 17, 'slogan': '足彩分析师', 'threadCount': 0, 'nickname': '足彩磐石', 'hitRate': 1.0, 'bestWin': '近2场中2场', 'userId': 37395}, {'hasFollowed': False, 'avatar': 
'https://nos.netease.com/relottery/user/20170405/niJthR.jpg', 'maxWin': 24, 'slogan': '足球解说员', 'threadCount': 5, 'nickname': '陈宁', 'hitRate': 0.67, 'bestWin': '近3场中2场', 'userId': 42287}, {'hasFollowed': False, 'avatar': 'https://nos.netease.com/relottery/user/20170401/RZPXxi.jpg', 'maxWin': 12, 'slogan': '外籍分析师', 'threadCount': 1, 'nickname': 'KD Shark', 'hitRate': 0.57, 'bestWin': '近7场中4场', 'userId': 35189}]
[Finished in 0.6s]

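The output shows that each dictionary in the list describes one expert.
The list above is only a single page. To get the full data set, all 8 pages have to be fetched. For efficiency, the threading module is used to request the pages concurrently, and the single request is wrapped in a class.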
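
For comparison, the same "one request per page, run in parallel" idea can also be written with the standard library's `concurrent.futures.ThreadPoolExecutor`. This is not the approach used in this article, which subclasses `threading.Thread`; it is only a sketch. The names `fetch_all_pages` and `fake_fetch_page` are hypothetical, and the fake fetcher stands in for the real HTTP call so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all_pages(fetch_page, offsets, max_workers=8):
    """Call fetch_page(offset) for every offset in parallel and flatten the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the order of `offsets`,
        # not in the order in which the requests finish
        pages = list(pool.map(fetch_page, offsets))
    # A page may be None or empty; treat both as "no experts"
    return [expert for page in pages for expert in (page or [])]


def fake_fetch_page(offset):
    """Stand-in for the real request: returns two fake records per page."""
    return [{'nickname': 'expert-%d' % (offset + k)} for k in range(2)]


experts = fetch_all_pages(fake_fetch_page, offsets=range(0, 160, 20))
names = [e['nickname'] for e in experts]
print(len(names), names[:3])
```

One practical difference from appending to a shared list inside each thread is that the combined result keeps the page order, which makes the saved file reproducible between runs.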

Complete Python code

import threading
import requests
import json
import os


expert_list = []  # collects the expert names from every page


class Mythread(threading.Thread):

    def __init__(self, obj, arg):
        super(Mythread, self).__init__()
        self.obj = obj
        self.arg = arg

    def run(self):
        ret = self.obj.get_experts(self.arg)
        if ret is not None:
            # list.append is atomic in CPython, so no explicit lock is used here
            for expert in ret:
                expert_list.append(expert['nickname'])


class Hongcai(object):
    """Thin client for the Hongcai expert-list web API."""

    def __init__(self, user):
        self.user = user
        self.host = 'hongcai.163.com'
        # only encodings that requests always decodes; 'sdch' and 'br' need extra support
        self.accept_encoding = 'gzip, deflate'
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
        self.referer = 'https://hongcai.163.com/expert.html'
        self.cookie = r''

    def get_response(self, url):
        headers = {'Accept-Encoding': self.accept_encoding,
                   'User-Agent': self.user_agent,
                   'Referer': self.referer,
                   'Cookie': self.cookie,
                   }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # surface HTTP errors instead of parsing an error page

        return response.content.decode()

    def get_experts(self, page_num):
        method = 'expert/list'
        # https://hongcai.163.com/api/web/expert/list/1/80/20
        url = 'https://%s/api/web/%s/1/%s/20' % (self.host, method, page_num)

        resp = self.get_response(url=url)
        resp_json = json.loads(resp)
        # print(resp_json)

        e_list = resp_json['data']

        return e_list


if __name__ == '__main__':
    obj = Hongcai('user')
    # The thread count equals the number of pages; 8 comes from the
    # 8 requests observed in the browser developer tools.
    thd_num = 8
    threads = []
    for i in range(thd_num):
        page_num = i * 20  # 20 experts per page: 0, 20, 40, ...
        thd = Mythread(obj, page_num)  # the second argument selects the page
        threads.append(thd)

    for thd in threads:
        thd.start()

    for thd in threads:
        thd.join()

    os.makedirs('./result', exist_ok=True)

    # utf-8 keeps the Chinese nicknames intact regardless of the platform default
    with open('./result/data_experts_163.txt', 'w', encoding='utf-8') as fp:
        fp.write(str(expert_list))

    print('There are %s experts in total.' % len(expert_list))
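
The script hard-codes `thd_num = 8`, a number read off the browser developer tools. If the page count is not known in advance, one possible approach is to request pages one after another until a page comes back empty or shorter than the page size. The sketch below assumes the API returns an empty or short `data` list at the end of the listing, which has not been verified against the real service. `collect_until_empty` and `fake_fetch_page` are hypothetical names, and the fake fetcher replaces `Hongcai.get_experts` so the example runs offline.

```python
def collect_until_empty(fetch_page, page_size=20, max_pages=100):
    """Request consecutive pages until one comes back empty or short."""
    names = []
    for page in range(max_pages):  # max_pages is a safety cap against endless loops
        experts = fetch_page(page * page_size)
        if not experts:
            break  # assumed: no data means we are past the last page
        names.extend(e['nickname'] for e in experts)
        if len(experts) < page_size:
            break  # assumed: a short page is the last one
    return names


def fake_fetch_page(offset, total=45):
    """Stand-in for Hongcai.get_experts: pretends the site lists 45 experts."""
    return [{'nickname': 'expert-%d' % n}
            for n in range(offset, min(offset + 20, total))]


names = collect_until_empty(fake_fetch_page)
print(len(names))
```

This probing is sequential, so it trades speed for not having to know the page count. A hybrid is also possible: probe once to learn the count, then fetch the pages concurrently as the script above does.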

Key points

  1. API analysis
  2. Simulating the request in code
  3. Choosing the thread count: one thread per page
  4. Saving the data
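
On the last point, the script writes `str(expert_list)` to a text file, which is easy to read but awkward to load back. A possible variation is to store the names as JSON. This is only a sketch: `save_names` and `load_names` are hypothetical helpers, the `.json` file name is chosen for illustration, and the example writes into a temporary directory so it does not touch `./result`.

```python
import json
import os
import tempfile


def save_names(names, path):
    """Write the list of names as UTF-8 JSON, creating the folder if needed."""
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'w', encoding='utf-8') as fp:
        # ensure_ascii=False keeps Chinese nicknames readable in the file
        json.dump(names, fp, ensure_ascii=False, indent=2)


def load_names(path):
    """Read the list of names back from the JSON file."""
    with open(path, encoding='utf-8') as fp:
        return json.load(fp)


out_path = os.path.join(tempfile.mkdtemp(), 'result', 'data_experts_163.json')
save_names(['张路', '彭伟国'], out_path)
loaded = load_names(out_path)
print(loaded)
```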