Python crawler: scraping Codeforces ratings

Keep--Silent

Category: py. Tags: python, crawler

Article link: https://blog.csdn.net/qq_45543594/article/details/120175048

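Scraping process:

Have Python pose as a browser and fetch the full source of the page
Parse the HTML with bs (BeautifulSoup)
Locate the data you need
Extract the data

1. First, getData fetches the source of the target page. To pose as a browser, the request needs a User-Agent header; otherwise the request plainly announces "I am Python", which certainly won't do.
2. bs parses the HTML into a tree structure, which makes the data easier to look up in the next steps.
3. bs.select filters out the part we need.
4. Finally, the needed part is extracted. A regular expression is the usual tool here; I didn't know regular expressions, so I wrote my own helper, myre.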
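
To make step 1 concrete, here is a minimal offline sketch of what the User-Agent header changes. It only builds the request objects and never contacts the site. The helper name `build_request` and the shortened `BROWSER_UA` string are assumptions made for illustration; they are not part of the original code.

```python
import urllib.request

# A browser-like User-Agent string; the exact value is only an example.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request carrying a custom User-Agent (no network access happens here)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# What urllib sends when no User-Agent is supplied: "Python-urllib/<version>".
default_ua = dict(urllib.request.build_opener().addheaders)["User-agent"]

req = build_request("https://codeforces.com/profile/tourist")
# Request normalises header names with str.capitalize(), hence "User-agent".
custom_ua = req.get_header("User-agent")

print("default:", default_ua)
print("custom: ", custom_ua)
```

The default value starts with `Python-urllib/`, which is the "I am Python" announcement mentioned in step 1; passing `headers=` replaces it for that request.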

Appendix: how to look up elements with bs.select

[Image: screenshot from the original post]
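
As a rough illustration of how bs.select works, the sketch below runs a CSS selector against a small hand-written HTML snippet. The markup is hypothetical and only mimics the general shape of a profile page; the real Codeforces HTML may differ and can change at any time. Selectors of this kind look like what a browser's developer tools produce with "Copy selector", though that is an assumption about how the selector in this post was obtained.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written by hand for this example only.
SAMPLE_HTML = """
<div id="pageContent">
  <div class="info">
    <ul>
      <li>Contest rating: <span class="user-gray" style="font-weight:bold;">1234</span></li>
      <li>Contribution: <span>0</span></li>
    </ul>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() returns a list of every match; select_one() returns the first match or None.
items = soup.select("#pageContent > div.info > ul > li")
first_span = soup.select_one("#pageContent > div.info > ul > li:nth-child(1) > span")

# get_text() reads the element's text directly, so no string slicing is needed.
rating_text = first_span.get_text(strip=True) if first_span is not None else None
print(len(items), rating_text)
```

Here `>` means "direct child" and `li:nth-child(1)` picks the first `<li>` among its siblings, which is why a small layout change on the page can break a long selector.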

Code

Version 1.0:

import urllib.error
import urllib.request  # fetch the page at a given URL
from urllib.parse import quote

from bs4 import BeautifulSoup


def getData(baseurl):
    # Fetch the page source, sending a browser-like User-Agent header
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.6241 SLBChan/103'
    }
    req = urllib.request.Request(baseurl, headers=headers)
    try:
        with urllib.request.urlopen(req) as response:
            data = response.read().decode("utf-8")
        # print(data)
        return data
    except urllib.error.URLError as e:
        # HTTPError is a subclass of URLError and also carries a status code
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        return 'Error'


def myre(s):
    # Hand-written stand-in for a regex: return the text between the first
    # '">' and the next '</', or None if that pattern never appears
    flag = 0
    ans = ""
    for i in range(1, len(s) - 1):
        if s[i] == '>' and s[i - 1] == '"':
            flag = 1
        elif flag == 1:
            if s[i] == '<' and s[i + 1] == '/':
                return ans
            else:
                ans += s[i]
    return None


def get_rating(name):
    # quote() keeps unusual characters in the handle from breaking the URL
    baseurl = "https://codeforces.com/profile/" + quote(name)
    data = getData(baseurl)
    bs = BeautifulSoup(data, "html.parser")
    # print(bs)
    temp = bs.select('#pageContent > div:nth-child(3) > div.userbox > div.info > ul > li:nth-child(1)')
    s = str(temp)
    # print(s)
    rating = myre(s)
    if rating is None:
        return "None"
    return rating


if __name__ == '__main__':
    name = "tourist"
    while True:
        rating = get_rating(name)
        print(rating)
        name = input()
#   get_rating returns a str:
#   the rating if the username exists, otherwise "None"
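
For comparison, the character-by-character loop in myre can be written as a single regular expression. This is only a sketch: the name `extract_tag_text` and the sample string are made up for illustration, and the pattern is roughly, not exactly, equivalent to the loop above.

```python
import re

# Matches the text between the first '">' and the following '</',
# which is what the hand-written loop looks for.
TAG_TEXT = re.compile(r'">(.*?)</', re.DOTALL)


def extract_tag_text(s):
    """Return the first tag's inner text, or None when nothing matches."""
    m = TAG_TEXT.search(s)
    return m.group(1) if m else None


# Hypothetical input shaped like str() of a bs.select() result.
sample = '[<li>Contest rating: <span class="user-gray" style="font-weight:bold;">1234</span></li>]'
print(extract_tag_text(sample))  # 1234
print(extract_tag_text("[]"))    # None
```

The lazy `.*?` stops at the first `</`, matching the loop's behaviour of returning as soon as a closing tag starts.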

Version 2.0

import re
import urllib.error
import urllib.request  # fetch the data at a given URL
from urllib.parse import quote


def getData(baseurl):
    # Fetch the response body, sending a browser-like User-Agent header
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.6241 SLBChan/103'
    }
    req = urllib.request.Request(baseurl, headers=headers)
    try:
        with urllib.request.urlopen(req) as response:
            data = response.read().decode("utf-8")
        # print(data)
        return data
    except urllib.error.URLError as e:
        # HTTPError is a subclass of URLError and also carries a status code
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        return 'Error'


def myre(s):
    # Pull the number after "rating": out of the JSON text. Matching digits
    # (not "anything up to a comma") also works when rating is the last field.
    m = re.search(r'"rating":\s*(-?\d+)', s)
    if m is None:
        return None
    return m.group(1)


def get_rating(name):
    # quote() keeps unusual characters in the handle from breaking the URL
    baseurl = "https://codeforces.com/api/user.info?handles=" + quote(name)
    data = getData(baseurl)
    s = str(data)
    # print(s)
    rating = myre(s)
    if rating is None:
        return "None"
    return rating


if __name__ == '__main__':
    name = "jiangly"
    while True:
        rating = get_rating(name)
        print(rating)
        name = input()
#   get_rating returns a str:
#   the rating if the username exists, otherwise "None"
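
Since the API returns JSON, an alternative to the regex is to parse the body with the standard json module. The sketch below is one possible way to do that, not the method used in this post. The function name `rating_from_response` and both sample bodies are made up for illustration; the values in them are not real data.

```python
import json


def rating_from_response(body):
    """Return the first user's rating as an int, or None if it is unavailable."""
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return None  # e.g. the 'Error' string that getData returns on failure
    if not isinstance(payload, dict) or payload.get("status") != "OK":
        return None
    users = payload.get("result") or []
    # Unrated accounts have no "rating" key, so .get() yields None for them.
    return users[0].get("rating") if users else None


# Made-up sample bodies in the general shape of the API's responses.
ok_body = '{"status":"OK","result":[{"handle":"example_user","maxRating":1600,"rating":1500}]}'
failed_body = '{"status":"FAILED","comment":"example failure message"}'

print(rating_from_response(ok_body))      # 1500
print(rating_from_response(failed_body))  # None
print(rating_from_response("Error"))      # None
```

Parsing the JSON does not depend on the order of the keys or on what character follows the rating, and it returns an int rather than a str, so a caller that needs the "None" string would have to convert the result.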
