Python: A Simple Scraper for University Rankings

Most recently recommended article published on 2022-06-12 02:16:47

Caicaptain

Category: python; tags: python

This is an original article by the author. Do not repost it without the author's permission.

Article link: https://blog.csdn.net/qq_33053671/article/details/106644349


1. Functionality

(For educational purposes only.) Fetch the university ranking from http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html and print it.

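2. Inspecting and parsing the page

Open the URL in a browser, right-click the page, and view its source to see how the data is laid out. There are several ways to parse it: one is the BeautifulSoup library, another is matching the values directly with regular expressions. Here we use BeautifulSoup.
The main content sits under the tbody tag, and each td holds one value.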
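
The regular-expression route mentioned above could look roughly like this. It is a minimal sketch, not the approach used in this article: `SAMPLE_ROW`, `ROW_PATTERN`, and `parse_rows_with_regex` are hypothetical names, and the sample row is trimmed to four cells on the assumption that every row opens with a rank cell, as the `<tr><td>2</td>` fragment in the page source suggests.

```python
import re

# Assumption: a trimmed copy of one table row, modeled on the page markup.
SAMPLE_ROW = (
    '<tr><td>1</td>'
    '<td><div align="left">清华大学</div></td>'
    '<td>北京市</td><td>95.9</td></tr>'
)

# One capture group per cell: rank, school name, region, total score.
ROW_PATTERN = re.compile(
    r'<td>(\d+)</td>\s*'
    r'<td><div align="left">(.*?)</div></td>\s*'
    r'<td>(.*?)</td>\s*<td>([\d.]+)</td>'
)

def parse_rows_with_regex(html):
    """Return a list of (rank, name, region, score) tuples found in html."""
    return ROW_PATTERN.findall(html)

rows = parse_rows_with_regex(SAMPLE_ROW)
print(rows)  # [('1', '清华大学', '北京市', '95.9')]
```

A pattern like this is tied to the exact markup, so it stops matching as soon as an attribute or the cell order changes. That fragility is one reason to prefer BeautifulSoup here.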

				<td><div align="left">清华大学</div></td>
				<td>北京市</td><td>95.9</td><td class="hidden-xs need-hidden indicator5">100.0</td><td class="hidden-xs need-hidden indicator6"  style="display:none;">97.90%</td><td class="hidden-xs need-hidden indicator7"  style="display:none;">37342</td><td class="hidden-xs need-hidden indicator8"  style="display:none;">1.298</td><td class="hidden-xs need-hidden indicator9"  style="display:none;">1177</td><td class="hidden-xs need-hidden indicator10"  style="display:none;">109</td><td class="hidden-xs need-hidden indicator11"  style="display:none;">1137711</td><td class="hidden-xs need-hidden indicator12"  style="display:none;">1187</td><td class="hidden-xs need-hidden indicator13"  style="display:none;">593522</td></tr><tr><td>2</td>
				<td><div align="left">北京大学</div></td>
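
To see how the tbody/td structure maps onto BeautifulSoup calls, here is a small sketch that parses an inline sample instead of the live page. `SAMPLE_TABLE` is a hypothetical, trimmed copy of the first row (the rank cell and the enclosing table tags are assumed), so it only illustrates the traversal.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: a cut-down version of the page's table with a single row.
SAMPLE_TABLE = """
<table><tbody>
<tr><td>1</td>
    <td><div align="left">清华大学</div></td>
    <td>北京市</td><td>95.9</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_TABLE, "html.parser")

# .children yields the <tr> tags and also the whitespace strings between them.
children = list(soup.find("tbody").children)
rows = [child for child in children if isinstance(child, Tag)]

# Calling a Tag is shorthand for find_all(), so this returns the row's <td> cells.
cells = rows[0]("td")

# .string reaches through the single nested <div> to the school name.
record = [cells[0].string, cells[1].string, cells[3].string]
print(len(children), len(rows))
print(record)
```

The isinstance check matters because the newlines between rows show up as text nodes, and those have no td children to index.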

3. Source code

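import requests
import bs4  # pip3 install beautifulsoup4
from bs4 import BeautifulSoup

def getHTMLText(url):
    """Download the page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding  # guess the encoding from the content so Chinese text is not garbled
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(uList, html):
    """Append [rank, school, total score] for each table row to uList."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')               # 1. locate the tbody tag
    if tbody is None:                        # empty download or changed page layout
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):  # skip the whitespace strings between rows
            tds = tr('td')                   # 2. cells: 0 = rank, 1 = school, 2 = region, 3 = total score
            if len(tds) < 4:
                continue
            uList.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(uList, num):
    # chr(12288) is the full-width space; using it as the fill character
    # keeps the Chinese school names aligned
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for u in uList[:num]:                    # slicing avoids an IndexError on short lists
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

def main():
    uInfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uInfo, html)
    printUnivList(uInfo, 20)
    print("-------------------")

if __name__ == "__main__":
    main()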
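
The alignment trick with chr(12288) in the print template can be tried on its own. The sketch below compares the default ASCII-space fill with the full-width space fill; `FULLWIDTH_SPACE` and `center_cjk` are hypothetical names introduced for illustration.

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000, a space as wide as one Chinese character

def center_cjk(text, width):
    """Center text in a field of `width` characters, padding with U+3000."""
    # The nested fields {1} and {2} supply the fill character and the width.
    return "{0:{1}^{2}}".format(text, FULLWIDTH_SPACE, width)

padded_ascii = "{:^10}".format("清华大学")  # padded with narrow ASCII spaces
padded_cjk = center_cjk("清华大学", 10)     # padded with full-width spaces

print("[" + padded_ascii + "]")
print("[" + padded_cjk + "]")
```

Both strings are 10 characters long, but only the second keeps a constant on-screen width in a typical monospaced terminal, because its padding is as wide as the characters it replaces.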