Results (only part of the output is shown to save space; there are more than two thousand universities nationwide):
![609b666e7a19fac9daf7bae91f3a646a.png](https://img-blog.csdnimg.cn/img_convert/609b666e7a19fac9daf7bae91f3a646a.png)
This is the data on the original web page:
![f2d7670cf0586393508a35a0ee86c0fc.png](https://img-blog.csdnimg.cn/img_convert/f2d7670cf0586393508a35a0ee86c0fc.png)
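Approach:
Viewing the page source shows that the data is static HTML rather than loaded through asynchronous requests, so the page URLs can simply be constructed directly.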
![13729ebe51ce371afc33826fbac62390.png](https://img-blog.csdnimg.cn/img_convert/13729ebe51ce371afc33826fbac62390.png)
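Comparing the URLs of successive pages shows that the only part that has to be constructed is the part in the red box (the `start` value), which increases by 20 from one page to the next.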
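As a minimal sketch of that URL construction: the template and the upper bound of 2820 come from the script in this post, while the names `URL_TEMPLATE`, `PAGE_SIZE`, `TOTAL_ROWS` and `build_page_urls` are made up for illustration. It assumes each page lists 20 schools, which is what the step of 20 suggests.

```python
# URL pattern used by the script; only the `start` offset changes per page.
URL_TEMPLATE = (
    "https://gaokao.chsi.com.cn/sch/search--ss-on,"
    "searchType-1,option-qg,start-{}.dhtml"
)
PAGE_SIZE = 20     # assumed number of schools per page (the offset step)
TOTAL_ROWS = 2820  # upper bound used in the script


def build_page_urls(total=TOTAL_ROWS, step=PAGE_SIZE):
    """Return one URL per results page, with start = 0, 20, 40, ..."""
    return [URL_TEMPLATE.format(start) for start in range(0, total, step)]


urls = build_page_urls()
print(len(urls))   # number of pages to fetch
print(urls[0])     # first page, start-0
print(urls[-1])    # last page
```

Building the list up front also makes it easy to check the first and last offsets before sending any requests.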
XPath is a convenient way to extract table data.
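To illustrate how a path such as `//div/table/tr/td[1]/a` picks one column out of every row, here is a small sketch. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, instead of `lxml`, so that it runs without extra packages. The `SAMPLE` markup and its contents are invented for the example and are not the real page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the results table on the page.
SAMPLE = """
<div>
  <table>
    <tr><td><a>School A</a></td><td>Beijing</td></tr>
    <tr><td><a>School B</a></td><td>Shanghai</td></tr>
  </table>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# td[1] / td[2] select the first / second cell of every row,
# so each query returns one whole column.
names = [a.text for a in root.findall(".//table/tr/td[1]/a")]
places = [td.text for td in root.findall(".//table/tr/td[2]")]

# Pair the columns back up into rows.
rows = list(zip(names, places))
print(rows)
```

With `lxml`, the same idea is written as `html.xpath('//div/table/tr/td[1]/a/text()')`, which returns the text strings directly.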
Source code:

    import openpyxl
    import requests
    from lxml import etree

    # Column headers: school name, location, supervising department,
    # school type, education level, satisfaction score
    title = ['院校名称', '院校所在地', '教育主管部门', '院校类型', '学历层次', '满意度']
    workbook = openpyxl.Workbook()
    sheet = workbook.worksheets[0]
    sheet.append(title)


    def writefile(school, destination, party, schooltype, floattype, score):
        """Append one worksheet row per school."""
        for i in range(len(school)):
            sheet.append([school[i], destination[i], party[i],
                          schooltype[i], floattype[i], score[i]])


    def replacet(who):
        """Strip spaces and line breaks from every string in the list."""
        for i in range(len(who)):
            who[i] = ''.join(who[i].split())
        return who


    def get(url):
        """Download one results page and write its table rows to the sheet."""
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/78.0.3904.97 Safari/537.36",
        }
        response = requests.get(url, headers=headers, timeout=10).text
        html = etree.HTML(response, etree.HTMLParser())

        # Each XPath query returns one table column for the whole page.
        school = html.xpath('//div/table/tr/td[1]/a/text()')
        destination = html.xpath('//div/table/tr/td[2]/text()')
        party = html.xpath('//div/table/tr/td[3]/text()')
        schooltype = html.xpath('//div/table/tr/td[4]/text()')
        floattype = html.xpath('//div/table/tr/td[5]/text()')
        score = html.xpath('//div/table/tr/td[9]/a/text()')

        school = replacet(school)
        destination = replacet(destination)
        party = replacet(party)
        schooltype = replacet(schooltype)
        floattype = replacet(floattype)
        score = replacet(score)
        writefile(school, destination, party, schooltype, floattype, score)


    if __name__ == '__main__':
        # The start offset goes from 0 to 2800 in steps of 20.
        for p in range(0, 2820, 20):
            print('Starting offset {}'.format(p))
            try:
                get('https://gaokao.chsi.com.cn/sch/search--ss-on,searchType-1,option-qg,start-{}.dhtml'.format(p))
                print('Offset {} saved'.format(p))
            except Exception as exc:
                print('Offset {} failed: {}'.format(p, exc))
        # Output file name: "2020 Gaokao university database"
        workbook.save('2020高考高校信息库.xlsx')
        workbook.close()
Done!