Chinese University Rankings (Focused Crawler Example Code)

Latest recommended article published on 2024-08-15 14:07:32

qq_40723809

Article link: https://blog.csdn.net/qq_40723809/article/details/87781417

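This article walks through a concrete example of using a focused crawler to fetch and analyze Chinese university ranking data, covering the key steps: setting up the crawl, parsing the HTML, and storing the data.

Summary generated by CSDN using automated tooling.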
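The summary lists data storage as one of the key steps. As a minimal sketch of what that step might look like, the snippet below writes rows of (rank, name, province, score), the four fields the parser further down is meant to extract, to CSV using only the standard library. The names `FIELDS`, `write_rankings`, and `read_rankings` are hypothetical, and the sample rows are placeholders rather than real ranking data.

```python
import csv
import io

# Hypothetical column order, mirroring the four fields the parser extracts.
FIELDS = ["rank", "name", "province", "score"]


def write_rankings(rows, stream):
    """Write (rank, name, province, score) tuples to an open text stream as CSV."""
    writer = csv.writer(stream)
    writer.writerow(FIELDS)
    writer.writerows(rows)


def read_rankings(stream):
    """Read the CSV back into a list of dicts keyed by the header row."""
    return list(csv.DictReader(stream))


# Placeholder rows for illustration only; these are not real ranking data.
sample_rows = [
    (1, "University A", "Province X", 95.0),
    (2, "University B", "Province Y", 90.5),
]

# An in-memory buffer stands in for a file so the sketch has no side effects.
buffer = io.StringIO()
write_rankings(sample_rows, buffer)
buffer.seek(0)
records = read_rankings(buffer)
print(records[0]["name"])
```

To write to disk instead, the same functions accept a file opened with `open(path, "w", newline="", encoding="utf-8")`; the `csv` module documentation recommends `newline=""` so that row endings are not translated twice. Values read back through `csv.DictReader` are strings, so numeric fields need converting before any sorting or arithmetic.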

import requests
from bs4 import BeautifulSoup
import bs4



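def get_content(url):
    """Fetch a page and return its decoded HTML text, or None if the request fails."""
    try:
        user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36"
        response = requests.get(url, headers={'User-Agent': user_agent})
        response.raise_for_status()   # Raise an exception for 4xx/5xx status codes.
        # Guess the encoding from the page body so that response.text decodes correctly.
        response.encoding = response.apparent_encoding
    except requests.RequestException as e:
        print("Crawl failed:", e)
        return None
    else:
        print(response.url)
        print("Crawl succeeded!")
        # Return the decoded text; response.content would ignore the encoding set above.
        return response.text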



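def getUnivList(html):
    """Parse the page and extract: university rank, university name, province, and total score."""
    soup = BeautifulSoup(html, 'lxml')
    # The page contains a single table, and therefore a single tbody tag.
    # All child nodes of tbody, returned as an iterator: soup.find('tbody').children
    # All child nodes of tbody, returned as a list:      soup.find(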