Example 1: A Targeted Crawler for the 2019 Best Chinese Universities Ranking

dadabaobaoren

Tags: python, crawler tutorial, example

Article link: https://blog.csdn.net/dadabaobaoren/article/details/92803498

Learning Web Crawlers as a Beginner

This example started from the following blog post: https://www.cnblogs.com/Jerry-Dong/p/7647850.html
A few simple features were added on top of it.

The year can be chosen:

Year = '2018'
url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming' + Year + '.html'
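The year-based URL above can be wrapped in a small helper. The following is a minimal sketch: the name `build_ranking_url` and the four-digit check are assumptions added for illustration and are not part of the original script; the URL pattern is simply the one used above.

```python
BASE_URL = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming'


def build_ranking_url(year):
    """Return the ranking page URL for the given year (hypothetical helper)."""
    year = str(year)
    # Reject anything that is not a four-digit number so a typo fails early
    # instead of producing a URL that silently returns an error page.
    if not (year.isdigit() and len(year) == 4):
        raise ValueError('year must be a four-digit number, got {!r}'.format(year))
    return BASE_URL + year + '.html'


print(build_ranking_url(2018))
# http://www.zuihaodaxue.cn/zuihaodaxuepaiming2018.html
```

Accepting both `2018` and `'2018'` keeps the call site flexible, while the check catches inputs such as `'20x8'` before any request is made.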

The page title can be scraped:

try:
    titleNode = soup.find('h3', class_='post-title')
    utitle = titleNode.getText()
except AttributeError:  # find() returned None: the page has no matching <h3>
    utitle = '没有标题'  # "No title"
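The title lookup can be tried without any network access. The sketch below assumes Beautiful Soup is installed and uses two made-up HTML fragments in place of a downloaded page; the helper name `extract_title` and its `default` parameter are hypothetical. It checks for `None` explicitly rather than catching an exception, which is one possible alternative to the try/except above.

```python
from bs4 import BeautifulSoup


def extract_title(html, default='No title'):
    """Return the text of <h3 class="post-title">, or `default` if it is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    title_node = soup.find('h3', class_='post-title')
    # find() returns None when nothing matches, so test for that directly.
    if title_node is None:
        return default
    return title_node.get_text(strip=True)


# Made-up fragments standing in for a real page.
with_title = '<div><h3 class="post-title"> Sample Ranking 2018 </h3></div>'
without_title = '<div><p>no heading here</p></div>'

print(extract_title(with_title))     # Sample Ranking 2018
print(extract_title(without_title))  # No title
```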

The complete code below scrapes the top 20 universities in the ranking:

# Targeted crawler for the 2019 Best Chinese Universities Ranking
"""
Created on 2019-06-19
@author: DaDaBaoBaoRen
"""
import bs4
import requests
from bs4 import BeautifulSoup


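def getHTMLText(url):
    """Download the page at url and return its text, or '' if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise requests.HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # use the encoding detected from the response body
        return r.text
    except requests.RequestException:  # connection errors, timeouts, HTTP errors
        return ''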


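def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, 'html.parser')  # use Python's built-in 'html.parser'
    try:
        titleNode = soup.find('h3', class_='post-title')
        utitle = titleNode.getText()
    except AttributeError:  # find() returned None: the page has no matching <h3>
        utitle = '没有标题'  # "No title"
    print('{0:^40}'.format(utitle))
    tbody = soup.find('tbody')
    if tbody is None:  # empty or unexpected page: nothing to parse
        return
    for tr in tbody.children:  # iterates over direct children only
        if isinstance(tr, bs4.element.Tag):  # skip the text nodes between <tr> tags
            tds = tr('td')  # shorthand for tr.find_all('td')
            if len(tds) >= 4:  # ignore rows that lack the four expected columns
                ulist.append([tds[0].string, tds[1].string, tds[2].string, tds[3].string])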


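def printUnivList(ulist, num):
    # In {0:^10}, 0 is the index of the format() argument that fills the slot;
    # 10 is the minimum field width in characters;
    # ^ centers the value; a value longer than the width is printed in full.
    tplt = '{0:^9}\t\t{1:^9}\t\t{2:^10}\t{3:^9}'
    # Header: rank, university name, province/city, total score.
    # The header slot widths were tuned by hand so the columns line up with the rows.
    print(tplt.format('排名', '学校名称', '省市', '总分'))
    for u in ulist[:num]:  # slicing avoids an IndexError when fewer than num rows exist
        # chr(12288) is the full-width (CJK) space, used as the fill character for the name
        print('{0:^10}\t{1:{4}^10}\t{2:^10}\t{3:^10}'.format(u[0], u[1], u[2], u[3], chr(12288)))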


def main():
    uinfo = []
    Year = '2018'
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming' + Year + '.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)


if __name__ == '__main__':
    main()
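The alignment in printUnivList depends on passing the fill character as a nested replacement field (`{1:{4}^10}`). The following is a small sketch of that technique in isolation; the helper name `center_cjk` is hypothetical and the sample name is used only as test data.

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000, the full-width space used in CJK text


def center_cjk(text, width):
    """Center text in a field of `width` characters, padding with full-width spaces."""
    # {1} supplies the fill character and {2} the width inside the format spec.
    return '{0:{1}^{2}}'.format(text, FULLWIDTH_SPACE, width)


padded = center_cjk('清华大学', 10)
print(repr(padded))
print(len(padded))  # 10: four characters of text plus six of padding
```

Padding with the full-width space keeps each pad character as wide as the Chinese characters it surrounds, which an ASCII space would not do in most terminal fonts.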
