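This article is only an exercise in writing a crawler; no data is saved, and the site URLs have been masked.
By inspecting the network requests, we can see that the page loads these two JSON files: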
https://xxx.cn/www/2.0/schoolprovinceindex/2018/318/12/1/1.json
https://xxx.cn/www/2.0/schoolspecialindex/2018/31/11/1/1.json
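In the first URL, 318 is the school id and 12 is the province id (12 stands for Tianjin).
The two files hold the school's score lines by province and its score lines by major, respectively.
So the code for the current page is: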
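import requests

# Browser-like request headers; the Referer points at the (masked) school search page.
HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0",
    "Referer": "https://xxx.cn/school/search",
}

# Score lines by province for school id 1217 in province 12 (Tianjin).
url = "https://xxx.cn/www/2.0/schoolprovinceindex/2018/1217/12/1/1.json"
response = requests.get(url, headers=HEADERS, timeout=10)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
print(response.json())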
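Since the two JSON URLs differ only in the endpoint name, the school id, and the province id, the pattern can be wrapped in a small helper. The following is a minimal sketch, not the author's code: the function name build_score_url and the constant BASE are hypothetical, the 2018 segment is assumed to be the year, and the trailing 1/1 segments are copied verbatim from the captured URLs because their meaning is not stated.

```python
# Masked host and path prefix, as they appear in the captured URLs.
BASE = "https://xxx.cn/www/2.0"

# The two endpoint names seen in the network requests.
KINDS = {"schoolprovinceindex", "schoolspecialindex"}


def build_score_url(kind, year, school_id, province_id):
    """Build a score-line URL for one school and one province (hypothetical helper)."""
    if kind not in KINDS:
        raise ValueError(f"unknown endpoint: {kind!r}")
    # Assumption: the trailing "1/1" is kept exactly as captured.
    return f"{BASE}/{kind}/{year}/{school_id}/{province_id}/1/1.json"


# School 318, province 12 (Tianjin), as in the first captured URL.
print(build_score_url("schoolprovinceindex", 2018, 318, 12))
```

With a helper like this, the only missing input is the list of school ids.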
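Next, we need a way to get the school ids. Inspecting the requests again, we find this endpoint:
https://xxxl.cn/gkcx/api/?uri=apigkcx/api/school/hotlists
It is called by POSTing the following data: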
data = {"access_token":"","admissions":"","central":"","department":"","dual_class":"","f211":"","f985":"","is_dual_class":"","keyword":"","page":2,"province_id":"","request_type":1,"school_type":"","size":20,"sort":"view_total","type":"","uri":"apigkcx/api/school/hotlists"}
One of the parameters is page, which is the page number.
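Because only page changes between requests, the payloads for a range of pages can be generated from one template. This is a rough sketch under assumptions: the names BASE_PAYLOAD, payload_for_page, and iter_payloads are hypothetical, and the page range is an arbitrary example, since the total number of pages is not given here.

```python
# Payload copied from the captured request; only "page" varies.
BASE_PAYLOAD = {
    "access_token": "", "admissions": "", "central": "", "department": "",
    "dual_class": "", "f211": "", "f985": "", "is_dual_class": "",
    "keyword": "", "page": 2, "province_id": "", "request_type": 1,
    "school_type": "", "size": 20, "sort": "view_total", "type": "",
    "uri": "apigkcx/api/school/hotlists",
}


def payload_for_page(page):
    """Return a copy of the template with the page number replaced."""
    payload = dict(BASE_PAYLOAD)  # shallow copy is enough: all values are str/int
    payload["page"] = page
    return payload


def iter_payloads(first_page, last_page):
    """Yield one payload per page, from first_page to last_page inclusive."""
    for page in range(first_page, last_page + 1):
        yield payload_for_page(page)


# Example: the payloads for pages 1 to 3.
for payload in iter_payloads(1, 3):
    print(payload["page"], payload["size"])
```

Each payload would then be sent with requests.post; whether the endpoint expects it as form data (data=) or as JSON (json=) is not stated here and would need to be checked against the captured request.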
So the code for this part is:
import requests
HEADERS &