Python: Scraping Data from the Zhihu Hot List API Endpoint
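A few days ago I wrote a post about scraping the hot list from the Zhihu web page. After thinking it over, I found that approach too cumbersome, so this time I request the Zhihu hot list API endpoint and parse its JSON response instead. Sure enough, it was a much more pleasant experience.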
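When the results are printed to the console, a warning about SSL certificate verification also appears. It comes from passing verify=False in the request and can be ignored here. It looks like this: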
InsecureRequestWarning: Unverified HTTPS request is being made to host 'api.zhihu.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
warnings.warn
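If the warning clutters the output, it can be filtered out. Below is a minimal sketch using only the standard library warnings module; the helper name silence_insecure_request_warning is my own and not part of requests or urllib3. It hides only warnings whose text starts with "Unverified HTTPS request" and leaves every other warning visible.

```python
import warnings


def silence_insecure_request_warning():
    """Hide only the 'Unverified HTTPS request' warning (hypothetical helper)."""
    # The message argument is a regular expression that must match
    # the beginning of the warning text.
    warnings.filterwarnings("ignore", message="Unverified HTTPS request")


# Call this once, before sending any request with verify=False.
silence_insecure_request_warning()
```

Keeping certificate verification on (the default, verify=True) removes the warning at its source, so the filter is only useful when verify=False is a deliberate choice.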
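The project code is below. As for how to find the API endpoint in the first place, you can look that up yourselves (a quick Baidu search will do); it is fairly simple.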
import requests

# Build the request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}
# Zhihu hot list API URL
url = 'https://api.zhihu.com/topstory/hot-list?limit=10&reverse_order=0'
# Send the request; verify=False skips certificate verification and causes the warning shown above
resp = requests.get(url=url, headers=headers, verify=False)
# Decode the JSON body once instead of on every field access
data = resp.json()['data']
# Extract the data; loop over whatever entries were returned instead of assuming there are 50
for item in data:
    hot = item['detail_text']  # heat value
    title = item['target']['title']  # title
    link_url = item['target']['url']  # link address
    # Convert to the web URL so it can be opened in a browser
    link_url = str(link_url).replace('api', 'www').replace('questions', 'question')
    excerpt = item['target']['excerpt']  # short summary
    print('{} {} {} \n{}\n\n'.format(hot, title, link_url, excerpt))  # formatted output
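The parsing step can also be separated from the network call, which makes it easy to try out without hitting the API. The following is only a sketch: the field names (data, detail_text, target, title, url, excerpt) are the ones the script above reads, while the function names to_web_url and extract_hot_items and the sample payload are made up for illustration. Missing fields fall back to empty strings instead of raising KeyError.

```python
def to_web_url(api_url):
    """Turn an API question URL into its web-page counterpart (hypothetical helper)."""
    return (api_url.replace("api.zhihu.com", "www.zhihu.com")
                   .replace("/questions/", "/question/"))


def extract_hot_items(payload):
    """Return (hot, title, link, excerpt) tuples from a decoded JSON response."""
    items = []
    for entry in payload.get("data", []):
        target = entry.get("target", {})
        items.append((
            entry.get("detail_text", ""),
            target.get("title", ""),
            to_web_url(target.get("url", "")),
            target.get("excerpt", ""),
        ))
    return items


# Made-up sample that only mimics the fields the script reads.
sample = {
    "data": [
        {
            "detail_text": "1234 hot",
            "target": {
                "title": "Example question",
                "url": "https://api.zhihu.com/questions/123456",
                "excerpt": "Short summary.",
            },
        },
        {"detail_text": "10 hot", "target": {}},  # entry with missing fields
    ]
}

for hot, title, link, excerpt in extract_hot_items(sample):
    print('{} {} {} \n{}\n\n'.format(hot, title, link, excerpt))
```

With a real response, extract_hot_items(resp.json()) would take the place of the loop body in the script. Replacing the full host name and the /questions/ path segment, rather than the bare substrings api and questions, avoids touching other parts of a URL by accident.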