[Crawler in Practice] 5. Python Web Crawler: A Focused Crawler for China University Rankings

Latest recommended article published on 2024-05-24 21:22:08

Yang SiCheng

Category: [Crawler]   Tags: python, crawler, mooc

Article link: https://blog.csdn.net/qq_41897800/article/details/112756768

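A focused crawler for China university rankings

1. Introduction to the "China university ranking focused crawler" example
2. Writing the "China university ranking focused crawler" example

Content adapted from the Beijing Institute of Technology MOOC: Python Web Crawling and Information Extraction (Python网络爬虫与信息提取)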

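1. Introduction to the "China university ranking focused crawler" example

Background: ShanghaiRanking (上海软科) publishes higher-education evaluations every year, including rankings of the best universities and the best subjects.

Functional description:

Input: the URL of the university ranking page
Output: the ranking information printed to the screen (rank, university name, total score)
Technical route: requests + bs4
Focused crawler: crawl only the given URL, without following further links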

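Feasibility of the focused crawler

Pressing F12 (the browser developer tools) shows that the ranking data is present in the page's HTML. The site's robots.txt can be checked at:
https://www.shanghairanking.cn/robots.txt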
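
The robots.txt check can also be done in code rather than by eye. The following is a minimal sketch using the standard library's urllib.robotparser. The rules in SAMPLE_ROBOTS_TXT and the helper name is_allowed are hypothetical and only illustrate the mechanism; they are not the real rules served by the site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# It is NOT the real file served by www.shanghairanking.cn.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse rules from text, no network access
    return parser.can_fetch(user_agent, url)


base = "https://www.shanghairanking.cn"
print(is_allowed(SAMPLE_ROBOTS_TXT, base + "/rankings/bcur/2020"))
print(is_allowed(SAMPLE_ROBOTS_TXT, base + "/private/data"))
```

To check the live file instead, RobotFileParser also offers set_url() and read(), which download and parse the robots.txt at the given address.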

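Program structure:

Step 1: fetch the ranking page from the web: getHTMLText()
Step 2: extract the information from the page into a suitable data structure: fillUnivList()
Step 3: use the data structure to display and print the results: printUnivList()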

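2. Writing the "China university ranking focused crawler" example

Each tr under tbody corresponds to one university.
Each td inside a tr corresponds to one attribute of that university.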
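
To make the tbody / tr / td structure concrete, here is a small offline sketch. SAMPLE_HTML is a hypothetical, simplified table with only three columns (the real page has more columns and extra attributes); the names and scores are taken from the results shown later in this post.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, simplified table for illustration; not the real page markup.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>1</td><td><a href="/institution/tsinghua-university">清华大学</a></td><td>852.5</td></tr>
    <tr><td>2</td><td>北京大学</td><td>746.7</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for tr in soup.find("tbody").children:
    if isinstance(tr, Tag):          # skip the whitespace strings between rows
        tds = tr("td")               # calling a tag is shorthand for find_all
        rows.append([td.text.strip() for td in tds])

print(rows)
```

Each inner list holds the cells of one row, which is the shape that fillUnivList() builds for the real page.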

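Issue 1: the isinstance function

isinstance() is a built-in function that checks whether an object is an instance of a given type. It is similar to comparing type(), but it also accepts subclasses and a tuple of types. For example:

a = 2
isinstance(a, int)                  # True
isinstance(a, str)                  # False
isinstance(a, (str, int, list))     # True if a matches any type in the tuple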

Issue 2: the output format. Fields such as {:^10} in str.format() center the value in a column 10 characters wide.
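
As a quick illustration of the format specification, the sketch below shows the alignment, width and fill options on a short ASCII string. The variable names are only for this example.

```python
# A replacement field has the shape {index:fill align width}.
centered = "{:^10}".format("abc")         # '^' centers within 10 characters, padded with spaces
left = "{:<10}".format("abc")             # '<' aligns left
right = "{:>10}".format("abc")            # '>' aligns right
filled = "{:*^10}".format("abc")          # a fill character may precede the alignment
nested = "{0:{1}^10}".format("abc", "-")  # the fill character can come from another argument

for s in (centered, left, right, filled, nested):
    print("[" + s + "]")
```

The nested form, where the fill character is passed as a separate argument, is the one used later to pad Chinese names.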
Issue 3: a runtime error:

TypeError: unsupported format string passed to NoneType.__format__

An example makes the cause clear:

from bs4 import BeautifulSoup

# html: the td contains only the <a>; html2 and html3 add an empty <p> or a comment after it
html = '<td data-v-2a8fd7e4="" class="align-left"><a data-v-2a8fd7e4="" href="/institution/tsinghua-university" class="">清华大学</a></td>'
html2 = '<td data-v-2a8fd7e4="" class="align-left"><a data-v-2a8fd7e4="" href="/institution/tsinghua-university" class="">清华大学</a><p></p></td>'
html3 = '<td data-v-2a8fd7e4="" class="align-left"><a data-v-2a8fd7e4="" href="/institution/tsinghua-university" class="">清华大学</a><!----></td>'
soup = BeautifulSoup(html,'html.parser')
soup2 = BeautifulSoup(html2,'html.parser')
soup3 = BeautifulSoup(html3,'html.parser')
# .string for each of the three documents
print(soup.string)
print(soup2.string)
print(soup3.string)
# .text for each of the three documents
print(soup.text)
print(soup2.text)
print(soup3.text)

Result:

清华大学
None
None
清华大学
清华大学
清华大学

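Analysis:

With the extra <p></p> or <!----> after the link, the td has more than one child, so .string cannot pick a single string and returns None, whereas .text concatenates all the text it finds. For details, see the articles "bs4 中 string 属性和 text 属性的区别及背后的原理" (the difference between the string and text attributes in bs4 and the reasoning behind it) and "问题解决：TypeError: unsupported format string passed to NoneType.format". In short, replacing .string with .text fixes the problem.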
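
The error itself does not depend on bs4: it appears whenever None is formatted with a non-empty format specification. The sketch below reproduces it and shows one defensive option. The helper fmt_cell is a hypothetical name introduced here, not part of the original program.

```python
def fmt_cell(value, width=10):
    """Center value in a field of the given width, treating None as an empty string."""
    text = "" if value is None else str(value)
    return "{:^{w}}".format(text, w=width)


# Formatting None with a format spec raises the TypeError seen above.
try:
    "{:^10}".format(None)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

print("[" + fmt_cell(None) + "]")
print("[" + fmt_cell("ab", 4) + "]")
```

Switching from .string to .text, as described above, avoids the None in the first place; a guard like fmt_cell is only an extra safety net.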

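Issue 4: whitespace in the output
The extracted text contains many spaces and newlines; .strip() removes them from both ends of a string, as shown below:

str1 = '     \n  abc     \n    d   \n'
print(str1)
print(str1.strip())   # only leading and trailing whitespace is removed; the whitespace between 'abc' and 'd' stays

With these fixes the program runs fine. The full code is as follows: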

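import bs4
import requests
from bs4 import BeautifulSoup


uinfo = []


def getHTMLText(url):
    """Fetch the page and return its text, or '' if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        print('Error')
        return ''


def fillUnivList(ulist, html):
    """Append [rank, name, total score] for every table row to ulist."""
    soup = BeautifulSoup(html, 'html.parser')
    tbody = soup.find('tbody')
    if tbody is None:       # empty page or no table in the HTML
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):     # skip anything that is not a Tag, e.g. whitespace strings
            tds = tr('td')      # all td cells (attributes) of one university
            ulist.append([tds[0].text.strip(), tds[1].text.strip(), tds[4].text.strip()])


def printUnivList(ulist, num):
    """Print the first num universities."""
    # Header strings: 排名 = rank, 学校名称 = university name, 总分 = total score
    print("{:^10}\t{:^6}\t{:^10}".format("排名", "学校名称", "总分"))
    for u in ulist[:num]:       # slicing avoids an IndexError if fewer rows were found
        print("{:^10}\t{:^6}\t{:^10}".format(u[0], u[1], u[2]))


def main():
    url = 'https://www.shanghairanking.cn/rankings/bcur/2020'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)  # 20 univs


main()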

Result:

    排名    	 学校名称 	    总分    
    1     	 清华大学 	  852.5   
    2     	 北京大学 	  746.7   
    3     	 浙江大学 	  649.2   
    4     	上海交通大学	  625.9   
    5     	 南京大学 	  566.1   
    6     	 复旦大学 	  556.7   
    7     	中国科学技术大学	  526.4   
    8     	华中科技大学	  497.7   
    9     	 武汉大学 	   488    
    10    	 中山大学 	  457.2   
    11    	西安交通大学	  452.5   
    12    	哈尔滨工业大学	  450.2   
    13    	北京航空航天大学	  445.1   
    14    	北京师范大学	  440.9   
    15    	 同济大学 	   439    
    16    	 四川大学 	  435.7   
    17    	 东南大学 	  432.7   
    18    	中国人民大学	  409.7   
    19    	 南开大学 	  402.1   
    20    	北京理工大学	  395.6

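The columns are still slightly misaligned, so let's fix that.

Cause of the misalignment: when a Chinese string is shorter than the field width, it is padded with Western (half-width) spaces, and Chinese and Western characters have different display widths. Fix: pad with the full-width Chinese space, chr(12288).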
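
A quick check of what chr(12288) is and how it behaves as a fill character. This is a small sketch; the constant name FULLWIDTH_SPACE is only for this example.

```python
FULLWIDTH_SPACE = chr(12288)    # U+3000, the full-width (ideographic) space

# The fill character is passed as a second argument and referenced as {1}.
cell = "{0:{1}^10}".format("清华大学", FULLWIDTH_SPACE)

print(repr(FULLWIDTH_SPACE))
print("[" + cell + "]")
print(len(cell))
```

Because the padding is now as wide as a Chinese character, every padded name occupies the same number of full-width cells.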

Change printUnivList to the following form; the indices 0, 1, 2 and 3 in the template refer to the four arguments passed to format().

def printUnivList(ulist, num):
    # {3} supplies the fill character for the name column: chr(12288), the full-width space
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名","学校名称","总分",chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0],u[1],u[2],chr(12288)))

Final result:

   排名    	　　　学校名称　　　	    总分    
    1     	　　　清华大学　　　	  852.5   
    2     	　　　北京大学　　　	  746.7   
    3     	　　　浙江大学　　　	  649.2   
    4     	　　上海交通大学　　	  625.9   
    5     	　　　南京大学　　　	  566.1   
    6     	　　　复旦大学　　　	  556.7   
    7     	　中国科学技术大学　	  526.4   
    8     	　　华中科技大学　　	  497.7   
    9     	　　　武汉大学　　　	   488    
    10    	　　　中山大学　　　	  457.2   
    11    	　　西安交通大学　　	  452.5   
    12    	　哈尔滨工业大学　　	  450.2   
    13    	　北京航空航天大学　	  445.1   
    14    	　　北京师范大学　　	  440.9   
    15    	　　　同济大学　　　	   439    
    16    	　　　四川大学　　　	  435.7   
    17    	　　　东南大学　　　	  432.7   
    18    	　　中国人民大学　　	  409.7   
    19    	　　　南开大学　　　	  402.1   
    20    	　　北京理工大学　　	  395.6   

Process finished with exit code 0

The total-score column still looks a bit ugly...
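
One possible way to tidy this further, which is not part of the original program, is to pad by display width instead of character count. The sketch below assumes a terminal that renders wide East Asian characters in two columns; the helpers display_width and center_by_width are hypothetical names introduced here.

```python
import unicodedata


def display_width(text):
    """Approximate terminal columns: wide and full-width characters count as 2."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)


def center_by_width(text, width):
    """Center text within the given number of terminal columns using ASCII spaces."""
    pad = max(width - display_width(text), 0)
    left = pad // 2
    return " " * left + text + " " * (pad - left)


for name in ["清华大学", "中国科学技术大学", "MIT"]:
    print("|" + center_by_width(name, 20) + "|")
```

With this approach mixed Chinese and Western text can share one column, since each cell is padded to the same number of terminal columns rather than the same number of characters.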
