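Notes on a problem with the 最好大学网 (Best Chinese Universities Ranking) crawler, with the original code: Prof. Song's ranking code no longer produces the ranking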

Article link: https://blog.csdn.net/m0_73874976/article/details/139129565

This post mainly records the problems I ran into while writing this crawler.

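The exercise comes from week 2, part 6 of the MOOC course "Python网络爬虫与信息提取" (Python Web Crawling and Information Extraction), taught by Prof. Song of Beijing Institute of Technology. The example crawls the university ranking on 最好大学网; here I use the link for the 2020 ranking, which appears as `url` in `main()` below.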

The original code is as follows:

# week2part6
import requests
import re
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # Network errors and bad HTTP statuses both yield an empty page.
        return ''

def fillUnivlist(html, ulist):
    soup = BeautifulSoup(html, 'html.parser')
    tbody = soup.find('tbody')
    if tbody is None:
        return  # empty or unexpected page: nothing to extract
    for tr in tbody.children:
        # Original course version, kept for reference:
        # if isinstance(tr, bs4.element.Tag):
        #     tds = tr('td')
        #     ulist.append([tds[0].string, tds[1].find('a').string, tds[2].string, tds[3].string])
        if not isinstance(tr, bs4.element.Tag):
            continue  # skip text nodes (e.g. whitespace) between rows
        tds = tr.find_all('td')
        if len(tds) >= 5:  # tds[4] is read below, so five cells are required
            rank = tds[0].string if tds[0].string is not None else 'N/A'
            name = tds[1].find('a').string if tds[1].find('a') and tds[1].find('a').string is not None else 'N/A'
            city = tds[2].string if tds[2].string is not None else 'N/A'
            type_ = tds[3].string if tds[3].string is not None else 'N/A'
            score_ = tds[4].string if tds[4].string is not None else 'N/A'
            ulist.append([rank, name, city, type_, score_])

def printUnivlist(ulist, num):
    print("{:^5}\t{:^10}\t{:^10}\t{:^10}\t{:^6}".format("排名", "学校名称", '城市', '类型', '分数'))
    for u in ulist[:num]:  # slicing avoids an IndexError when fewer than num rows were parsed
        print("{:^5}\t{:^10}\t{:^10}\t{:^10}\t{:^6}".format(u[0].strip(), u[1].strip(), u[2].strip(), u[3].strip(), u[4].strip()))


def main():
    url = 'https://www.shanghairanking.cn/rankings/bcur/2020'
    html = getHTMLText(url)
    uinfo = []
    fillUnivlist(html, uinfo)
    printUnivlist(uinfo,20)

main()

Screenshot of the run result: [image]

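Question

In this example, the information to crawl is in the page's HTML source, mainly in the tr child nodes of tbody; each tr node holds the information for one university. [image]
Each tr node has 6 child nodes, the td nodes in the screenshot. My question is this: on the page, the NavigableString parts of the td nodes at indices 2 and 3 (`tds[2]` and `tds[3]`) are clearly the city and the type, yet the crawl result for both is 'None', while the other three fields are crawled correctly. I have seen similar questions in the comment sections of other blog posts on the same exercise.
Error messages reported there: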

TypeError: unsupported format string passed to NoneType.__format__

From another blog post (link):

a = u[0].strip() # remove whitespace from both ends of the string
AttributeError: 'NoneType' object has no attribute 'strip'
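
Both errors can be reproduced without touching the network, which suggests they share one cause: a `None` value reaching the formatting code. The snippet below is only a minimal sketch; `value` is a hypothetical stand-in for a cell whose `.string` came back as `None`.

```python
# `value` is a hypothetical stand-in for tds[i].string returning None.
value = None

errors = []

# A format spec with alignment cannot be applied to None.
try:
    "{:^10}".format(value)
except TypeError as exc:
    errors.append(type(exc).__name__)

# None has no string methods such as strip().
try:
    value.strip()
except AttributeError as exc:
    errors.append(type(exc).__name__)

print(errors)
```

Replacing `None` with a placeholder such as 'N/A' before formatting, as the code above does, avoids both exceptions but does not recover the missing text.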

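A reply in the comments suggests something worth trying:

Because the places previously read with string returned None, read them with text instead and then use strip to remove the whitespace.

I changed two lines of the code as follows: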

            city = tds[2].text if tds[2].string is not None else 'N/A'
            type_ = tds[3].text if tds[3].string is not None else 'N/A'

The result is still None.

I am hoping for other solutions.

Other notes
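
One possible explanation, offered as a sketch rather than a confirmed diagnosis: in Beautiful Soup, `.string` returns `None` whenever a tag has more than one child, while `get_text()` joins all the text inside the tag. The two modified lines still test `tds[2].string is not None`, so they fall through to 'N/A' before `.text` is ever read. The HTML below is made up for illustration (I have not confirmed the real page's markup), the cell values are placeholders, and `cell_text` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up row with placeholder values: the city and type cells hold text
# plus an empty child tag, so each <td> has two children.
# The real page's markup may differ.
html = """
<table><tbody>
<tr><td>1</td><td><a>大学A</a></td><td>
  城市A <span></span></td><td>
  类型A <span></span></td><td>100.0</td></tr>
</tbody></table>
"""

def cell_text(td, default='N/A'):
    """Hypothetical helper: stripped text of a cell, or a default if it is empty."""
    text = td.get_text(strip=True)
    return text if text else default

soup = BeautifulSoup(html, 'html.parser')
tds = soup.find('tr').find_all('td')

print(tds[2].string)      # .string gives up because the cell has two children
print(cell_text(tds[2]))  # get_text() still returns the visible text
```

If the real cells look like this, calling something like `cell_text(tds[2])` without the `.string` check would be one way to fill in the city and type columns.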

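In this exercise I used **strip()** to get the output format I wanted. It is a built-in string method that removes the given characters from both ends of a string; called with no argument, it removes leading and trailing whitespace (spaces, tabs, and newlines).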
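
A short sketch of that behavior, using made-up strings:

```python
raw = "\n  北京 \t"

# With no argument, all leading and trailing whitespace is removed.
default_stripped = raw.strip()

# With an argument, any of the listed characters are removed from both ends.
dashes_stripped = "--N/A--".strip("-")

# Whitespace inside the string is left alone.
inner_kept = "  a b  ".strip()

print(repr(default_stripped), repr(dashes_stripped), repr(inner_kept))
```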