Scraping the Sunshine Gaokao (阳光高考) College Database
Target URL: https://gaokao.chsi.com.cn/sch/search--ss-on,searchType-1,option-qg,start-0.dhtml
The fields to scrape are shown in the screenshot below:
Code:
Comparing the first and second pages shows that their URLs differ only in the start parameter, which increases by 20 per page. After fetching each page, we parse it and use select to pull out the fields we need. Fill in your own cookie; I have replaced mine with 123 below.
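The pagination rule described above can be sketched as a small helper (the parameter order inside the URL is copied from the address observed on page two; treat the exact ordering as an assumption):

```python
def page_url(page: int) -> str:
    # Each page lists 20 schools, so page N starts at offset N * 20.
    start = page * 20
    return (f'https://gaokao.chsi.com.cn/sch/'
            f'search--ss-on,option-qg,searchType-1,start-{start}.dhtml')
```

For example, `page_url(0)` ends in `start-0.dhtml` and `page_url(2)` ends in `start-40.dhtml`, matching the 20-per-page stride.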
import csv

import requests
from bs4 import BeautifulSoup

head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4557.4 Safari/537.36',
        'cookie': '123'
        }
# The with statement closes the file automatically; no explicit fp.close() is needed.
with open('阳光院校库.csv', "a", newline="", encoding='utf-8') as fp:
    writer = csv.writer(fp)
    header = ['院系名称', '院系所在地', '教育行政主管部门', '院校类型', '学历层次', '一流大学建设高校', '一流学科建设高校', '研究生院', '满意度']
    writer.writerow(header)
for dex in range(0, 2780, 20):
    url = f'https://gaokao.chsi.com.cn/sch/search--ss-on,option-qg,searchType-1,start-{dex}.dhtml'
    html = requests.get(url, headers=head)
    soup = BeautifulSoup(html.text, 'lxml')
    name = soup.select('td')
Then store the text of each selected cell in the list a, stripping surrounding whitespace:
    a = []
    for cell in name:          # avoid shadowing the list itself with the loop variable
        a.append(cell.get_text())
    a = [x.strip() for x in a]
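The extraction step can be checked in isolation on a tiny hand-made table (the sample HTML and its values are invented for illustration; `html.parser` is used here so the demo does not require lxml):

```python
from bs4 import BeautifulSoup

# A minimal table mimicking a few cells of the school list (invented sample data).
sample = '''
<table>
  <tr><td> 北京大学 </td><td>北京</td><td>教育部</td></tr>
</table>
'''
soup = BeautifulSoup(sample, 'html.parser')
# select('td') returns every cell; get_text().strip() removes the padding whitespace.
cells = [td.get_text().strip() for td in soup.select('td')]
```

Here `cells` becomes `['北京大学', '北京', '教育部']`, one clean string per table cell.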
Finally, write the data:
    # Open the file once per page rather than once per row.
    with open('阳光院校库.csv', "a", newline="", encoding='utf-8') as fp:
        writer = csv.writer(fp)
        for i in range(0, len(a), 9):
            writer.writerow(a[i:i + 9])
Because of an encoding issue, the data in the file comes out garbled, like this:
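Assuming the garbling appears when the CSV is opened in Excel (the original screenshot is not included here), a common fix is to write with encoding='utf-8-sig', which prepends a UTF-8 byte order mark that Excel uses to detect the encoding:

```python
import csv
import os
import tempfile

# Write a tiny CSV with a BOM so spreadsheet software detects UTF-8 correctly.
path = os.path.join(tempfile.gettempdir(), 'demo_bom.csv')
with open(path, 'w', newline='', encoding='utf-8-sig') as fp:
    csv.writer(fp).writerow(['院系名称', '院系所在地'])

with open(path, 'rb') as fp:
    raw = fp.read()
# The file now begins with the UTF-8 byte order mark EF BB BF.
starts_with_bom = raw.startswith(b'\xef\xbb\xbf')
```

Switching the two `open(..., encoding='utf-8')` calls in the script above to `encoding='utf-8-sig'` applies the same fix to the scraped file.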