Learning Python Web Scraping (1): Collecting Students' Offer Information from a Forum

Article link: https://blog.csdn.net/OAOiii/article/details/102486286

Goal: scrape each student's offer information from the forum threads and save it to an Excel workbook.

Scraping results

[Image: screenshot of the scraping results]

Process

1. Scrape the students' offer information from each thread

1.1 Inspect the HTML and find the links to all threads on the forum
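
A minimal sketch of this step, assuming the thread list uses the markup that the code later in this post relies on: a container with id `forum_<fid>`, one `tbody` per thread whose id starts with `normalthread_`, and a title link with class `xst`. The sample HTML and the function name `extract_thread_links` are made up for illustration and are not taken from the real forum.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page; the real forum HTML is not reproduced here.
SAMPLE_HTML = """
<table id="forum_811">
  <tbody id="stickthread_1"><tr><th><a class="xst" href="thread-1.html">Pinned</a></th></tr></tbody>
  <tbody id="normalthread_2"><tr><th><a class="xst" href="thread-2.html">Offer A</a></th></tr></tbody>
  <tbody id="normalthread_3"><tr><th><a class="xst" href="thread-3.html">Offer B</a></th></tr></tbody>
</table>
"""

def extract_thread_links(html, fid):
    """Return the href of every ordinary (non-pinned) thread on one board page."""
    soup = BeautifulSoup(html, 'html.parser')
    board = soup.find(id='forum_' + str(fid))
    if board is None:
        # The board container is missing, so there is nothing to collect.
        return []
    links = []
    for tbody in board.find_all('tbody', id=re.compile(r'^normalthread_')):
        a = tbody.find('a', class_='xst')
        if a is not None and a.get('href'):
            links.append(a['href'])
    return links

print(extract_thread_links(SAMPLE_HTML, 811))  # ['thread-2.html', 'thread-3.html']
```

Checking for `None` before reading `href` avoids an exception when a row has no title link, so no `try`/`except` is needed around the lookup.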

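1.2 Iterate over the links and scrape the personal information and offer information from each thread

The threads contain no names or other data that identify a person, so the poster's (unique) id is used to tie the personal information and the offer information together. It is stored in the first column of the array.
One person may have several offers.
Offer information: always contains "申请学校" (school applied to), "学位" (degree), "专业" (major), "申请结果" (application result), "入学年份" (enrollment year), "入学学期" (enrollment term) and "通知时间" (notification date), always in that order. The last field of every offer is the notification date, so when it is reached, the current offer is complete and collection of the next offer begins.
Personal information: the optional fields "TOEFL", "IELTS", "本科学校档次" (undergraduate school tier), "本科专业" (undergraduate major) and "本科成绩和算法、排名" (undergraduate grades, grading scheme and ranking). They appear in a fixed order, but none of them is required, so each one is stored at a fixed position in the array.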
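
The grouping rules above can be sketched as a small pure function that works on (key, value) pairs, independent of any HTML parsing. The function name `group_rows` and the sample values are assumptions for illustration; the key strings (with a trailing colon) and the slot layout follow the code later in this post, including its choice to store TOEFL and IELTS in the same slot.

```python
# Fixed slot for each personal-info field; slot 0 is reserved for the user id.
INFO_SLOTS = {
    'TOEFL:': 1,
    'IELTS:': 1,  # shares a slot with TOEFL, as in the post's code
    '本科学校档次:': 2,
    '本科专业:': 3,
    '本科成绩和算法、排名:': 4,
}
OFFER_END_KEY = '通知时间:'  # the notification date closes one offer

def group_rows(uid, rows):
    """Split (key, value) rows into one info list and a list of offers."""
    infos = [None] * 5
    infos[0] = uid
    offers, offer = [], []
    in_info = False  # becomes True at the first personal-info key
    for key, value in rows:
        if key in INFO_SLOTS:
            in_info = True
        if in_info:
            slot = INFO_SLOTS.get(key)
            if slot is not None:
                infos[slot] = ''.join(value.split())  # drop all whitespace
        else:
            offer.append(value.strip())
            if key == OFFER_END_KEY:
                offers.append(offer)
                offer = []  # start collecting the next offer
    return infos, offers

# Made-up sample data, not taken from the forum.
rows = [
    ('申请学校:', ' School A '), ('学位:', 'MSc'), ('专业:', 'CS'),
    ('申请结果:', 'Offer'), ('入学年份:', '2020'), ('入学学期:', 'Fall'),
    ('通知时间:', '2019-10-01'),
    ('TOEFL:', ' 100 '), ('本科专业:', 'CS'),
]
infos, offers = group_rows('12345', rows)
print(infos)   # ['12345', '100', None, 'CS', None]
print(offers)  # [['School A', 'MSc', 'CS', 'Offer', '2020', 'Fall', '2019-10-01']]
```

Because unfilled optional fields simply leave `None` in their slot, every student ends up with the same number of columns regardless of what they filled in.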

2. Save to Excel

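To keep the stored data tidy and easy to aggregate later, each offer is stored together with one copy of the student's personal information.

Row format: id + personal information + one offer

The rows are appended to the Excel file.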
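
A minimal sketch of the append step with openpyxl, assuming we also want the workbook to be created on first use (the code in this post instead expects the file to exist already). The helper names `student_rows` and `append_rows`, the header text and the sample data are hypothetical.

```python
import os
import tempfile
import openpyxl

def student_rows(infos, offers):
    """Build one row per offer: id + personal info + that offer."""
    return [infos + offer for offer in offers]

def append_rows(path, rows, header=None):
    """Append rows to the workbook at path, creating it if it is missing."""
    if os.path.exists(path):
        wb = openpyxl.load_workbook(path)
        ws = wb.active
    else:
        wb = openpyxl.Workbook()
        ws = wb.active
        if header:
            ws.append(header)  # header is written only for a new file
    for row in rows:
        ws.append(row)  # each call adds one row below the existing data
    wb.save(path)

# Made-up sample data, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'offers.xlsx')
rows = student_rows(['12345', '100', None, 'CS', None],
                    [['School A', 'MSc'], ['School B', 'PhD']])
append_rows(path, rows, header=['id', 'test', 'tier', 'major', 'grades', 'school', 'degree'])
append_rows(path, rows)  # a second call appends below the first batch
print(openpyxl.load_workbook(path).active.max_row)  # 5
```

Opening and saving the workbook once per batch of rows, rather than once per thread, also reduces file I/O when many threads are scraped.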

Code

import re
from time import sleep
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import openpyxl

xl = '6.xlsx'  # name of the .xlsx file; it must already exist, because load_workbook() opens it

# Scrape the information from a single thread
def getInfo(href):
    resp0 = urlopen(href)
    t_data = resp0.read().decode('GBK')
    tSoup = bs(t_data, 'html.parser')

    wb = openpyxl.load_workbook(xl)
    ws = wb.active  # use the active worksheet

    flag = 0    # 0 = reading offer rows; 1 = reading personal-info rows
    offer = []  # fields of the offer currently being read
    infos = [None] * 5  # student info: [id, TOEFL/IELTS, school tier, major, grades]
    offers = []  # all offers of this student

    try:
        uid = tSoup.find('span', {'class': 'uid'}).text
        infos[0] = uid[4:]  # drop the first 4 characters and store the user id in slot 0
    except Exception as e:
        print("Exception: " + str(e))

    tInfo = tSoup.find('div', {'class': 'typeoption'}).findAll('tr')
    for i in tInfo:
        try:
            tKey = i.th.text    # field label, e.g. 'TOEFL:' or '本科学校档次:'
            tValue = i.td.text  # the value the user entered for tKey
        except Exception as e:
            print("Exception: " + str(e))
            continue  # skip rows without th/td instead of reusing the previous row's values

        if tKey in ('TOEFL:', 'IELTS:', '本科学校档次:', '本科专业:', '本科成绩和算法、排名:'):
            flag = 1  # from here on, the rows are personal information

        if flag:  # personal information or offer information?
            # Remove all whitespace and store the value in its fixed slot
            if tKey in ('TOEFL:', 'IELTS:'):
                infos[1] = ''.join(tValue.split())
            if tKey == '本科学校档次:':  # undergraduate school tier
                infos[2] = ''.join(tValue.split())
            if tKey == '本科专业:':  # undergraduate major
                infos[3] = ''.join(tValue.split())
            if tKey == '本科成绩和算法、排名:':  # grades, grading scheme, ranking
                infos[4] = ''.join(tValue.split())
        else:
            offer.append(tValue.strip())  # one field of the current offer
            if tKey == '通知时间:':  # the notification date marks the end of an offer
                offers.append(offer)  # add the finished offer to offers
                offer = []  # start a new offer

    # Normalize the data: one row per offer, each with the student's info
    for o in offers:
        student = infos + o
        ws.append(student)
    wb.save(xl)

# Page through the board automatically
for page in range(1, 3):  # pages 1 and 2
    sleep(0.5)
    url = '<forum URL>' + str(page)  # placeholder: the forum address is not given here
    # UK: fid=486, US: fid=49, Australia/New Zealand: fid=128, Hong Kong/Macau/Taiwan: fid=811
    resp = urlopen(url)
    html_data = resp.read()

    soup = bs(html_data, "html.parser")

    linkElems = []  # links to all threads on this page
    table0 = soup.find(id='forum_811').findAll('tbody', {'id': re.compile(r'^normalthread_')})
    for tbody in table0:
        try:
            link = tbody.find('tr').find('th').find('a', {'class': 'xst'}).get('href')  # thread link
            linkElems.append(link)
        except Exception as e:
            print("Exception: " + str(e))

    # Visit every thread link
    for link in linkElems:
        # print(link)
        try:
            getInfo(link)  # scrape the information from this thread
        except Exception as e:
            print("Exception: " + str(e))

Shortcomings

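Selecting threads by posting time is not implemented; the page range to scrape is simply chosen by hand, based on the posting times shown in the forum.
The fid in the URL identifies the study-abroad board for each country or region, and all thread links on a page sit inside the HTML element with id='forum_' + fid (for example, the Hong Kong board has fid 811, so id='forum_811'). Passing fid into the code as a parameter would probably make the script more automated.
The errors that occur while scraping have not been resolved yet (for example, how to keep the error messages from appearing in the first place).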
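
As a possible direction for the fid idea above, here is a minimal sketch that derives both the page URL and the container id from one region name. The fid values come from the comment in the code; the URL template, the domain `forum.example` and the function names are hypothetical, since the real forum address is not given in this post.

```python
# fid values as listed in the comment in the code above.
BOARD_FIDS = {
    'UK': 486,
    'US': 49,
    'Australia/New Zealand': 128,
    'Hong Kong/Macau/Taiwan': 811,
}

# Hypothetical URL template; replace it with the real forum's URL pattern.
URL_TEMPLATE = 'https://forum.example/forum.php?fid={fid}&page={page}'

def board_element_id(fid):
    """Id of the HTML element that holds the thread list of a board."""
    return 'forum_' + str(fid)

def board_pages(region, first_page, last_page, template=URL_TEMPLATE):
    """Return (url, element_id) pairs for the requested pages of one board."""
    fid = BOARD_FIDS[region]
    return [(template.format(fid=fid, page=p), board_element_id(fid))
            for p in range(first_page, last_page + 1)]

for url, element_id in board_pages('Hong Kong/Macau/Taiwan', 1, 2):
    print(url, element_id)
```

With this in place, the hard-coded `'forum_811'` in the paging loop could be replaced by the `element_id` value, so switching boards would only require changing the region name.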