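Scraping Data from a Web API
This article uses Python's threading module to fetch paginated data concurrently, then merges the results and saves them to a local file.
Background
Take the expert page of NetEase Hongcai (网易红彩) as an example: how many football and basketball experts are listed in total, and can we list all of their names?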
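How to find the API
- Open the test URL https://hongcai.163.com/expert.html
- Open the browser developer tools (in Chrome, press F12)
- Refresh the page and look at the request list under the XHR tab
- Click each request name in turn and inspect its Headers (pay attention to the Request URL) and its Preview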
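From these observations:
1. The request named "20" returns the expert list (the developer tools label a request by the last segment of its URL)
2. Each call returns 20 experts
3. Scrolling down to the bottom of the expert list triggers 8 calls to the "20" request in total
With this information, we can start writing code.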
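The observations above can be turned into a list of request URLs. This is a minimal sketch, assuming the URL pattern `/api/web/expert/list/1/<offset>/20` that appears in the Request URL, where the middle number is a record offset that advances by 20 per call. `build_page_urls` is a hypothetical helper name, not something the site provides.

```python
HOST = 'hongcai.163.com'
PAGE_SIZE = 20   # each call returns 20 experts
PAGE_COUNT = 8   # number of calls observed while scrolling


def build_page_urls(page_count=PAGE_COUNT, page_size=PAGE_SIZE):
    """Return one URL per page; the offset grows by page_size each time."""
    template = 'https://%s/api/web/expert/list/1/%d/%d'
    return [template % (HOST, page * page_size, page_size)
            for page in range(page_count)]


for url in build_page_urls():
    print(url)
```

The first URL ends in `/1/0/20` and the eighth in `/1/140/20`, which matches the offsets used by the scripts later in this article.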
Python environment
Python 3.5
Simulating a single API request
import json

import requests


def get_experts():
    host = 'hongcai.163.com'
    # Advertise only encodings that requests can decode without extra packages
    accept_encoding = 'gzip, deflate'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
    referer = 'https://hongcai.163.com/expert.html'
    cookie = r''
    headers = {'Accept-Encoding': accept_encoding,
               'User-Agent': user_agent,
               'Referer': referer,
               'Cookie': cookie,
               }
    method = 'expert/list'
    page_num = 0  # record offset; 0 means the first page
    # https://hongcai.163.com/api/web/expert/list/1/0/20
    url = 'https://%s/api/web/%s/1/%s/20' % (host, method, page_num)
    resp = requests.get(url, headers=headers).content.decode()
    resp_json = json.loads(resp)
    # print(resp_json)
    e_list = resp_json['data']  # list of expert dicts
    print('There are %s experts in first page:\n %s' % (len(e_list), e_list))


if __name__ == '__main__':
    get_experts()
Analyzing the response
Running the script above produces the following output:
There are 20 experts in first page:
[{'hasFollowed': False, 'avatar': 'https://nos.netease.com/relottery/user/20170317/WRQrA9.jpg', 'maxWin': 12, 'slogan': '足球评论员', 'threadCount': 1, 'nickname': '张路', 'hitRate': 1.0, 'bestWin': '近3场中3场', 'userId': 1091}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20171031/TGrvFX.jpg', 'maxWin': 21, 'slogan': '前国脚', 'threadCount': 1, 'nickname': '彭伟国', 'hitRate': 1.0, 'bestWin': '近2场中2场', 'userId': 407882}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20180905/udAlhO.jpg', 'maxWin': 11, 'slogan': '分析师', 'threadCount': 2, 'nickname': '农宗燊', 'hitRate': 0.71, 'bestWin': '近7场中5场', 'userId': 1857716}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20180808/MTYxwd.jpg', 'maxWin': 17, 'slogan': '前国足队长', 'threadCount': 0, 'nickname': '马明宇', 'hitRate': 1.0, 'bestWin': '近3场中3场', 'userId': 1810118}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20180521/g5ikAG.png', 'maxWin': 9, 'slogan': '媒体从业者', 'threadCount': 0, 'nickname': '林伟洲', 'hitRate': 0.71, 'bestWin': '近7场中5场', 'userId': 585186}, {'hasFollowed': False, 'avatar': 'https://nos.netease.com/relottery/user/20170711/ksPHVg.jpg', 'maxWin': 23, 'slogan': '媒体从业者', 'threadCount': 2, 'nickname': '李胜', 'hitRate': 0.8, 'bestWin': '近5场中4场', 'userId': 201514}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20170922/EcjKAW.jpg', 'maxWin': 18, 'slogan': '足彩分析师', 'threadCount': 1, 'nickname': '大白亚盘', 'hitRate': 1.0, 'bestWin': '近3场中3场', 'userId': 343108}, {'hasFollowed': False, 'avatar': 'https://nos.netease.com/relottery/user/20170320/oGb4fz.jpg', 'maxWin': 24, 'slogan': '足彩分析师', 'threadCount': 1, 'nickname': '戴维', 'hitRate': 1.0, 'bestWin': '近3场中3场', 'userId': 1879}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20180524/2gBwn1.jpg', 'maxWin': 17, 'slogan': '足彩分析师', 'threadCount': 1, 'nickname': '邓福', 'hitRate': 0.6, 'bestWin': '近5场中3场', 'userId': 579731}, {'hasFollowed': False, 
'avatar': 'https://relottery.nosdn.127.net/user/20171221/2Dy0RI.jpg', 'maxWin': 14, 'slogan': '足彩分析师', 'threadCount': 4, 'nickname': '川洋解盘', 'hitRate': 0.67, 'bestWin': '近3场中2场', 'userId': 1019}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20180803/Xmmc37.jpg', 'maxWin': 10, 'slogan': '媒体从业者', 'threadCount': 0, 'nickname': '张恺意大利', 'hitRate': 0.8, 'bestWin': '近5场中4场', 'userId': 1802551}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20180521/5uOCd8.png', 'maxWin': 9, 'slogan': '知名解说', 'threadCount': 3, 'nickname': '江忠德', 'hitRate': 1.0, 'bestWin': '近2场中2场', 'userId': 585164}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20171120/FepTPm.jpg', 'maxWin': 41, 'slogan': '足球记者', 'threadCount': 3, 'nickname': '王涛', 'hitRate': 1.0, 'bestWin': '近3场中3场', 'userId': 424247}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20170926/f5SEH1.jpg', 'maxWin': 29, 'slogan': '媒体从业者', 'threadCount': 0, 'nickname': '王勤伯', 'hitRate': 1.0, 'bestWin': '近7场中7场', 'userId': 354416}, {'hasFollowed': False, 'avatar': 'https://nos.netease.com/relottery/user/20170704/tC9T47.jpg', 'maxWin': 23, 'slogan': '足彩分析师', 'threadCount': 0, 'nickname': '丹牛', 'hitRate': 1.0, 'bestWin': '近5场中5场', 'userId': 192068}, {'hasFollowed': False, 'avatar': 'https://nos.netease.com/relottery/user/20170612/WsTfSx.jpg', 'maxWin': 14, 'slogan': '足彩分析师', 'threadCount': 1, 'nickname': 'Rafa', 'hitRate': 0.5, 'bestWin': '近2场中1场', 'userId': 164244}, {'hasFollowed': False, 'avatar': 'https://relottery.nosdn.127.net/user/20180710/kk2hSD.png', 'maxWin': 18, 'slogan': '外籍分析师', 'threadCount': 1, 'nickname': 'Jarkko', 'hitRate': 0.5, 'bestWin': '近2场中1场', 'userId': 117515}, {'hasFollowed': False, 'avatar': 'https://nos.netease.com/relottery/user/20170406/IznzMe.jpg', 'maxWin': 17, 'slogan': '足彩分析师', 'threadCount': 0, 'nickname': '足彩磐石', 'hitRate': 1.0, 'bestWin': '近2场中2场', 'userId': 37395}, {'hasFollowed': False, 'avatar': 
'https://nos.netease.com/relottery/user/20170405/niJthR.jpg', 'maxWin': 24, 'slogan': '足球解说员', 'threadCount': 5, 'nickname': '陈宁', 'hitRate': 0.67, 'bestWin': '近3场中2场', 'userId': 42287}, {'hasFollowed': False, 'avatar': 'https://nos.netease.com/relottery/user/20170401/RZPXxi.jpg', 'maxWin': 12, 'slogan': '外籍分析师', 'threadCount': 1, 'nickname': 'KD Shark', 'hitRate': 0.57, 'bestWin': '近7场中4场', 'userId': 35189}]
[Finished in 0.6s]
The output shows that each dictionary in the list holds the information for one expert.
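To make that structure concrete, here is a small sketch that pulls a few fields out of such a list. The two sample records are copied from the output above and trimmed to the fields used here. `summarize` is a hypothetical helper name, and reading `hitRate` as a fraction between 0 and 1 is an assumption based on the sample values.

```python
# Two records copied from the response above, trimmed to the fields used here.
sample = [
    {'nickname': '张路', 'slogan': '足球评论员', 'hitRate': 1.0, 'userId': 1091},
    {'nickname': '农宗燊', 'slogan': '分析师', 'hitRate': 0.71, 'userId': 1857716},
]


def summarize(experts):
    """Return (nickname, hit rate as a whole percentage) for each expert dict."""
    return [(e['nickname'], round(e['hitRate'] * 100)) for e in experts]


names = [e['nickname'] for e in sample]
print(names)
print(summarize(sample))
```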
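The list above is only one page of data, so getting everything means walking through all 8 pages. For efficiency, the requests are issued concurrently with the threading module, and the single-request code is wrapped in a class.
Complete Python code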
import json
import os
import threading

import requests

expert_list = []  # nicknames collected from all pages


class Mythread(threading.Thread):
    def __init__(self, obj, arg):
        super(Mythread, self).__init__()
        self.obj = obj
        self.arg = arg

    def run(self):
        ret = self.obj.get_experts(self.arg)
        # print(ret)
        if ret is not None:
            for expert in ret:
                # list.append is atomic in CPython, so no lock is used here
                expert_list.append(expert['nickname'])


class Hongcai(object):
    """Thin client for the Hongcai expert-list API."""

    def __init__(self, user):
        self.user = user
        self.host = 'hongcai.163.com'
        # Advertise only encodings that requests can decode without extra packages
        self.accept_encoding = 'gzip, deflate'
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
        self.referer = 'https://hongcai.163.com/expert.html'
        self.cookie = r''

    def get_response(self, url):
        headers = {'Accept-Encoding': self.accept_encoding,
                   'User-Agent': self.user_agent,
                   'Referer': self.referer,
                   'Cookie': self.cookie,
                   }
        response = requests.get(url, headers=headers)
        return response.content.decode()

    def get_experts(self, page_num):
        """Return the expert dicts starting at record offset page_num."""
        method = 'expert/list'
        # e.g. https://hongcai.163.com/api/web/expert/list/1/80/20
        url = 'https://%s/api/web/%s/1/%s/20' % (self.host, method, page_num)
        resp = self.get_response(url=url)
        resp_json = json.loads(resp)
        # print(resp_json)
        e_list = resp_json['data']
        return e_list


if __name__ == '__main__':
    obj = Hongcai('user')
    # How many threads, i.e. pages? 8 calls were observed in the browser.
    thd_num = 8
    threads = []
    for i in range(thd_num):
        page_num = i * 20  # record offset of page i
        thd = Mythread(obj, page_num)  # second argument is the page offset
        threads.append(thd)
    for thd in threads:
        thd.start()
    for thd in threads:
        thd.join()
    os.makedirs('./result', exist_ok=True)
    # Write as UTF-8 so Chinese nicknames survive on any platform
    with open('./result/data_experts_163.txt', 'w', encoding='utf-8') as fp:
        fp.write(str(expert_list))
    print('There are %s experts in total.' % len(expert_list))
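The same fan-out can also be written with the standard library's `concurrent.futures.ThreadPoolExecutor`, which avoids the shared global list and returns the pages in request order. This is an alternative sketch, not the script above: `collect_nicknames` and `fake_fetch_page` are hypothetical names, and the fetch function is passed in as a parameter so the logic can be tried without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_nicknames(fetch_page, offsets, max_workers=8):
    """Fetch every offset concurrently and return nicknames in page order.

    fetch_page(offset) must return a list of expert dicts, or None on failure.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of `offsets`, not completion order.
        pages = list(pool.map(fetch_page, offsets))
    return [e['nickname'] for page in pages if page for e in page]


def fake_fetch_page(offset):
    """Stand-in for a real request: two made-up experts per page."""
    return [{'nickname': 'expert-%d' % (offset + i)} for i in range(2)]


result = collect_nicknames(fake_fetch_page, [0, 20, 40])
print(result)
```

With the class from the script, `collect_nicknames(obj.get_experts, range(0, 160, 20))` would take the place of the explicit thread loop.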
Key points
- API analysis
- Simulating the requests in code
- Thread count: one thread per page
- Saving the data
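The script hard-codes 8 pages, a number read off the browser. One way to avoid that is to keep requesting until a page comes back empty or short. The sketch below rests on an assumption that has not been verified against the site: that the API returns an empty or shorter `data` list once the offset passes the last expert. `fetch_all` and `fake_fetch_page` are hypothetical names, and the fetch function is again passed in as a parameter.

```python
def fetch_all(fetch_page, page_size=20, max_pages=100):
    """Request pages one by one until a page is empty or shorter than page_size."""
    experts = []
    for page in range(max_pages):  # max_pages is a safety cap against endless loops
        batch = fetch_page(page * page_size)
        if not batch:
            break
        experts.extend(batch)
        if len(batch) < page_size:
            break  # a short page means this was the last one
    return experts


def fake_fetch_page(offset, total=45):
    """Stand-in for the real request: pretends the site has `total` experts."""
    return [{'nickname': 'expert-%d' % i}
            for i in range(offset, min(offset + 20, total))]


all_experts = fetch_all(fake_fetch_page)
print(len(all_experts))
```

This loop is sequential, so it gives up the concurrency of the threaded version. It trades speed for not having to know the page count in advance.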