Scraping the Chinese University Rankings Website

Latest recommended article published on 2023-07-18 19:59:22

45T

Category: Web scraping. Tags: python, xpath, lists, web scraping, data analysis

Article link: https://blog.csdn.net/weixin_47818398/article/details/112545009

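Basic steps of a scraper

Get the URL of the site you want to scrape and set a user-agent header. Some sites also require cookies or proxies, so add them when needed (this setup is routine).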

import csv
import requests
from lxml import etree
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
}
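The optional cookies and proxies mentioned above can be gathered in one place before the request is sent. The following is a minimal sketch, not the author's code: `build_request_kwargs` is a hypothetical helper, the `timeout` value is an assumption, and the proxy address is a placeholder.

```python
from typing import Optional

DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
)


def build_request_kwargs(
    user_agent: str = DEFAULT_USER_AGENT,
    cookies: Optional[dict] = None,
    proxies: Optional[dict] = None,
    timeout: float = 10.0,  # assumed value, not from the article
) -> dict:
    """Collect keyword arguments for requests.get() (hypothetical helper)."""
    kwargs = {"headers": {"user-agent": user_agent}, "timeout": timeout}
    # Only include cookies/proxies when the target site actually needs them.
    if cookies:
        kwargs["cookies"] = cookies
    if proxies:
        kwargs["proxies"] = proxies
    return kwargs


# Placeholder proxy address, for illustration only.
kwargs = build_request_kwargs(proxies={"https": "http://127.0.0.1:8080"})
print(kwargs)
```

With the `requests` library installed, the result could be passed along as `requests.get(url, **kwargs)`.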

Get the page response

def China_university(url, type1, year):
    response = requests.get(url, headers=headers)
    # print(response.text)     # text is a str
    # print(response.content)  # content is raw bytes
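The difference between `response.text` (a `str`) and `response.content` (bytes) comes down to decoding. The sketch below illustrates this with plain byte strings, so it runs without a network call. `decode_body` is a hypothetical helper, and the fallback policy is an assumption rather than anything the article prescribes.

```python
def decode_body(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode response bytes, substituting U+FFFD for undecodable input."""
    try:
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers an unknown encoding name.
        return raw.decode("utf-8", errors="replace")


raw = "中国大学排名".encode("utf-8")  # stands in for response.content
text = decode_body(raw)              # plays the role of response.text
print(type(raw).__name__, type(text).__name__)
```

Decoding explicitly, as the article does with `response.content.decode('utf-8')`, avoids relying on whatever encoding the library guesses.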

Extract the data

    # Parse the HTML source into an lxml element tree.
    # etree.HTML() returns the root Element (not a list); html.xpath() returns a list,
    # and for text() queries the items in that list behave like strings.
    html = etree.HTML(response.content.decode('utf-8'))  # decode the bytes as UTF-8
    # Note: the tbody step only matches if the served HTML really contains <tbody>.
    row_xpath = '//*[@id="content-box"]/div[2]/table/tbody/tr'
    # Query each column once instead of re-running every XPath for every row.
    ranks = html.xpath(row_xpath + '/td[1]/text()')
    names = html.xpath(row_xpath + '/td[2]/a/text()')
    provinces = html.xpath(row_xpath + '/td[3]/text()')
    kinds = html.xpath(row_xpath + '/td[4]/text()')
    scores = html.xpath(row_xpath + '/td[5]/text()')
    universities_list = []
    for rank, name, province, kind, score in zip(ranks, names, provinces, kinds, scores):
        university = {
            "年份": year,                # year
            "排名": rank.strip(),        # rank
            "大学名": name.strip(),      # university name
            "省市": province.strip(),    # province or city
            "种类": kind.strip(),        # category
            "总分": score.strip(),       # total score
        }
        universities_list.append(university)
    return universities_list
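The extraction step can also be organised row by row, so that every field of a record comes from the same `<tr>`. The sketch below shows the idea on a small invented table. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml (so only a subset of XPath is available and the markup must be well-formed). The sample data, the English field names, and the helper names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample that mimics the table layout targeted by the XPath above.
SAMPLE = """
<div id="content-box">
  <div>filters</div>
  <div>
    <table>
      <tbody>
        <tr>
          <td>
          1
          </td>
          <td><a href="#">
          Example University A
          </a></td>
          <td>Beijing</td><td>Comprehensive</td><td>95.5</td>
        </tr>
        <tr>
          <td>2</td>
          <td><a href="#">Example University B</a></td>
          <td>Shanghai</td><td>Science</td><td>90.1</td>
        </tr>
      </tbody>
    </table>
  </div>
</div>
"""

FIELDS = ["rank", "name", "province", "kind", "score"]
PATHS = ["td[1]", "td[2]/a", "td[3]", "td[4]", "td[5]"]


def cell_text(row, path):
    """Return the stripped text under `path`, or '' if the cell is missing."""
    node = row.find(path)
    return "".join(node.itertext()).strip() if node is not None else ""


def parse_rows(markup, year):
    root = ET.fromstring(markup)  # root is the div with id="content-box"
    records = []
    for row in root.findall("./div[2]/table/tbody/tr"):
        record = {"year": year}
        for field, path in zip(FIELDS, PATHS):
            record[field] = cell_text(row, path)
        records.append(record)
    return records


records = parse_rows(SAMPLE, 2020)
for record in records:
    print(record)
```

Working per row means a missing cell produces an empty string in that record rather than shifting every later value by one position.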

Save the data

def download(data, type1):
    # Append to one CSV file per ranking type; the 03_中国大学排行榜 directory must already exist.
    with open(f'03_中国大学排行榜/{type1}.csv', "a", encoding="utf-8", newline="") as file:
        writerCsv = csv.writer(file)
        for item in data:  # each item is a dict
            writerCsv.writerow([item["年份"], item["排名"], item["大学名"], item["省市"], item["种类"], item["总分"]])
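Because the file is opened in append mode, it never gets a header row, and the write fails if the output directory is missing. One possible way to handle both is sketched below with `csv.DictWriter`. `save_rows`, the English field names, and the temporary output directory are assumptions for illustration, not part of the original script.

```python
import csv
import os
import tempfile

FIELDS = ["year", "rank", "name", "province", "kind", "score"]


def save_rows(rows, directory, ranking_name):
    """Append rows to <directory>/<ranking_name>.csv, writing a header once."""
    os.makedirs(directory, exist_ok=True)  # create the folder if needed
    path = os.path.join(directory, f"{ranking_name}.csv")
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", encoding="utf-8", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
    return path


# Demo in a throwaway directory; the row is invented sample data.
out_dir = os.path.join(tempfile.mkdtemp(), "rankings")
row = {"year": 2020, "rank": "1", "name": "Example University A",
       "province": "Beijing", "kind": "Comprehensive", "score": "95.5"}
path = save_rows([row], out_dir, "demo")
save_rows([row], out_dir, "demo")  # second call appends without a new header

with open(path, encoding="utf-8", newline="") as file:
    lines = list(csv.reader(file))
print(lines)
```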

Main routine

years = [2016, 2017, 2018, 2019, 2020]
search_list = [11, 21, 22, 23, 25, 24, 26, 30, 14, 13, 12, 10]
type_list = ["中国大学排名（主榜)", "中国医药类大学排名", "中国财经类大学排名", "中国语言类大学排名", "中国政法类大学排名", "中国民族类大学排名", "中国体育类大学排名", "中国艺术类高校名单",
             "中国合作办学大学排名", "中国独立学院排名", "中国民办高校排名", "中国大学排名（总榜)"]

for year in years:
    # Pair each ranking name with its numeric code instead of calling type_list.index().
    for type1, code in zip(type_list, search_list):
        url = f'https://www.shanghairanking.cn/rankings/bcur/{year}{code}'
        data = China_university(url, type1, year)
        download(data, type1)
    print(f'{year} finished loading!')
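The URL pattern used in the loop is the base path followed by the year and a two-digit ranking code. The sketch below builds the list of URLs up front so it can be inspected before any request is made. `build_urls` is a hypothetical helper, and the English labels `main` and `medical` are assumed names for codes 11 and 21 from the lists above.

```python
BASE_URL = "https://www.shanghairanking.cn/rankings/bcur/"


def build_urls(years, codes_by_name):
    """Return (year, ranking name, url) tuples for every year/ranking pair."""
    jobs = []
    for year in years:
        for name, code in codes_by_name.items():
            jobs.append((year, name, f"{BASE_URL}{year}{code}"))
    return jobs


# Hypothetical English labels for two of the ranking codes.
codes = {"main": 11, "medical": 21}
jobs = build_urls([2016, 2017], codes)
for job in jobs:
    print(job)
```

Keeping names and codes in a single dict removes the need for two parallel lists that must stay in the same order.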
