Scraping University QS Rankings with a Python Crawler and Interactive pygal Visualization

Author: 搬金砖的程序员

Last modified: 2022-01-28 21:57:08

First published: 2022-01-26 14:22:26

Article link: https://blog.csdn.net/m0_61168705/article/details/122699759

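Preface

I will be giving a university outreach talk soon and wanted to brush up on web-scraping syntax along the way, so I wrote this crawler. I am recording it here so that I do not forget it again.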

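I. Python libraries used

requests (the basic library for crawling)

re (the regular-expression library; in this article the built-in str.find method could be used instead)

pygal (an interactive charting library)

requests and pygal can be installed with "pip install name" (where name is the library name). re is part of the Python standard library and needs no installation.

If you manage your environments with Anaconda, replace pip with conda.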

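II. Preparation

1. Find a site that has the QS rankings for recent years

I used QS's official China site: World University Rankings | QSChina (though, honestly, the data is sparse: it covers only four years).

2. Locate the data

Open the browser's developer tools (Inspect), switch to the Network tab, then click the full ranking list on the page so that the data request is captured.

Looking through the captured requests shows which URL serves the data; copy that URL. (You also need to send your own browser's User-Agent so that the request looks like it comes from a browser.)

Note: the browser reports the request method as GET, but calling requests.get() from the crawler raised an error for me; requests.post() should be used instead.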
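As a rough illustration of the step above, here is a minimal sketch of a POST request that carries a browser User-Agent. It swaps in the standard library's urllib for requests so that it runs without extra packages. The helper names build_request and fetch_json, the empty request body, and the UTF-8 decoding are assumptions made for this sketch, not something the site documents.

```python
import json
import urllib.request

def build_request(url, user_agent):
    """Build (but do not send) a POST request carrying a browser User-Agent."""
    return urllib.request.Request(
        url,
        data=b"",  # assumption: the endpoint needs no request body
        headers={"User-Agent": user_agent},
        method="POST",
    )

def fetch_json(url, user_agent, timeout=10):
    """Send the request and decode the JSON body (performs network I/O)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        # Assumption: the response body is UTF-8 encoded JSON.
        return json.loads(resp.read().decode("utf-8"))

URL = "https://www.qschina.cn/sites/default/files/qs-rankings-data/cn/2122636_indicators.txt"
req = build_request(URL, "PASTE-YOUR-USER-AGENT-HERE")
print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes it easy to inspect the method and headers before touching the network; calling fetch_json(URL, your_user_agent) would then perform the actual download.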

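3. Analyze the JSON structure

The fields boxed in red in the screenshot are the target data (the code below reads uni and overall_rank). Within uni, only the university name is in Chinese; everything else consists of English characters.

So the name can be extracted with a regular expression:

reg=re.compile('[\u4E00-\u9FA5]+')

This pattern matches a run of Chinese characters.

(However, some universities' names appear in English or another language, so it is best to query well-known overseas universities and Chinese universities, whose names are given in Chinese.)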
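To make the behavior of this pattern concrete, here is a small sketch. The two sample uni strings are hypothetical (the real field's markup may differ), and extract_chinese_name is a helper name invented for this example.

```python
import re

# Same pattern as in the article: one or more characters in U+4E00..U+9FA5.
CHINESE_RUN = re.compile('[\u4E00-\u9FA5]+')

def extract_chinese_name(uni_field):
    """Return the first run of Chinese characters in uni_field, or None."""
    match = CHINESE_RUN.search(uni_field)
    return match.group() if match else None

# Hypothetical `uni` values, made up for illustration.
sample_cn = '<a href="/universities/example" class="uni-link">清华大学</a>'
sample_en = '<a href="/universities/example" class="uni-link">Example University</a>'

print(extract_chinese_name(sample_cn))
print(extract_chinese_name(sample_en))
```

Because search() stops at the first non-Chinese character, a name that is interrupted by punctuation or Latin letters would be returned only up to that point, and a name with no Chinese characters yields None.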

III. Writing the code

# Crawl the QS rankings of the universities you choose and plot them with pygal.
import requests
import re
import pygal

def getJson(url):
    # requests.get() raised an error on this endpoint for the author, so POST is used.
    r=requests.post(url,headers=headers).json()
    return r

def parse(data):
    # Return {name: [rank]} for every requested university found in one year's JSON.
    D={}
    reg=re.compile('[\u4E00-\u9FA5]+')
    # The pattern matches Chinese characters only, so university names must be entered in Chinese.
    for d in data['data']:
        try:
            name=reg.search(d['uni']).group()
            # If the name contains no Chinese characters, search() returns None and .group() raises AttributeError.
            if name in uName:
                D[name]=[int(d['overall_rank'])]
            if len(D)==len(uName):
                break
        except (AttributeError,KeyError,TypeError,ValueError):
            # Skip entries with no Chinese name, a missing field, or a non-numeric rank.
            pass
    return D
                
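def getuName():
    s=input("Enter the Chinese names of the universities to look up, separated by the full-width comma '，': ")
    # '，' is the Chinese (full-width) comma, not the ASCII ','.
    L=s.split('，')
    return L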

def drawLine():
    line=pygal.Line()
    for i in uName:
        line.add(i,dataDict[i])
    line.x_labels=year
    line.y_title="QS rank"
    line.x_title="Year"
    line.title="QS rankings over the last four years"
    line.legend_at_bottom=True
    line.render_to_file('learning scrapy\\crawl the ranks of QS\\查询结果.svg')
    # Before running this file, change the path above to one that exists on your machine.

def main():
    global uName
    global dataDict
    uName=getuName()
    dataDict={name:[] for name in uName}
    # Walk the URLs from last to first so the ranks are collected in the same order as `year`.
    for i in range(3,-1,-1):
        D=parse(getJson(url[i]))
        for name in uName:
            # None leaves a gap in the line if a university is missing from that year's list.
            dataDict[name].append(D.get(name,[None])[0])
    drawLine()

if __name__ == '__main__':
    url=['https://www.qschina.cn/sites/default/files/qs-rankings-data/cn/2122636_indicators.txt',
         'https://www.qschina.cn/sites/default/files/qs-rankings-data/cn/2057712_indicators.txt',
         'https://www.qschina.cn/sites/default/files/qs-rankings-data/cn/914824_indicators.txt',
         'https://www.qschina.cn/sites/default/files/qs-rankings-data/cn/397863_indicators.txt']
    headers={
        # Replace the placeholder with your own browser's User-Agent string.
        'user-agent': 'YOUR_USER_AGENT'
    }
    year=[2019,2020,2021,2022]
    main()
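The per-year bookkeeping in main() can be hard to follow, so here is a standalone sketch of the same idea: combining one rank lookup per year into one series per university. The function name merge_years and the sample ranks are made up for illustration; missing years are recorded as None, which pygal draws as a gap in the line.

```python
def merge_years(per_year, names):
    """Combine one {name: rank} dict per year into {name: [rank, ...]}."""
    series = {name: [] for name in names}
    for ranks in per_year:
        for name in names:
            # dict.get returns None when the university is absent that year.
            series[name].append(ranks.get(name))
    return series

# Made-up ranks for illustration only, oldest year first.
per_year = [
    {"甲大学": 20, "乙大学": 35},
    {"甲大学": 18},
    {"甲大学": 17, "乙大学": 30},
]
series = merge_years(per_year, ["甲大学", "乙大学"])
print(series)
```

Keeping every series the same length as the list of years is what lets each value line up with its x-axis label.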