Web Scraper - Scraping the Thread List of the 52 Forum

Latest recommended article published on 2024-05-01 17:16:41

云舒轻寒

Category: Python  Tags: web scraping, python, programming languages

Article link: https://blog.csdn.net/a397802230/article/details/125991135

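1. Preface

I've had a bit of free time over the past two weeks and I'm quite interested in web scraping. Python and PyCharm were already installed, so I just got started.
Starting from what I actually wanted: at first I planned a program to scrape images, download some nice anime art and the like, very practical. I had the site and the image URLs worked out, then found that downloading requires logging in. Handling login is still beyond me for now, so I shelved that and picked something simpler: just scraping some text.
Incidentally, the requests and BeautifulSoup libraries seem to give me only the page source rather than the elements, and the source often looks different from the elements (the F12 Elements panel shows the DOM after the browser has run the page's scripts). I'd really like to know how to extract those elements, the ones F12 displays. For now I just go by the source.
I browse the 52 forum now and then to see whether any handy tools have been posted, so I decided to batch-scrape thread titles and links into a spreadsheet. I looked at the page source and the structure is fairly clean.

The home page has a few main boards: 新鲜出炉 (Fresh Posts), 技术分享 (Tech Sharing), 人气热门 (Hot), and 精华采撷 (Digest Picks).
I won't post screenshots; they would get the post flagged.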

2. Code

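# Environment: Python 3.8.5 + PyCharm 2022.1.4 (Community Edition)
# Remember to change the site URL: replace every "xxxxxxx" in the code!

import requests # HTTP requests (the scraping itself)
import time     # delays between requests
import re       # regular expressions
import openpyxl # write the results to an .xlsx file
import os       # check for and open the output file
from openpyxl.styles import Font,Alignment    # cell formatting

# Things you may need to change before running: the Cookie, the output file path, and the 52 forum URL.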

# Fetch the listing pages. Args: board name / board base URL / number of pages.
# Returns a 2D list whose rows are [thread title, thread ID, thread URL].
def RequestWeb_52(section_name, main_url, pages):
    print('---------- Start scraping board {}! ----------'.format(section_name))
    time.sleep(1)
    # One slot per page, so the page count is not capped at a fixed size
    post_url = [[] for _ in range(pages)]
    post_info = [[] for _ in range(pages)]
    for page in range(0, pages):
        time.sleep(1)
        web_url = main_url + '&page=' + str(page+1) + '.html'
        headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                 'AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/103.0.5060.134 Safari/537.36 Edg/103.0.1264.71',
                   'Cookie': 'htVC_2132_saltkey=c5l7HvVS; htVC_2132_lastvisit=1658727423; '
                             'wzws_cid=b9aba2f4673e085e665772733d04f8f46c133784f5725df85be854e0b7fe8e61873a747a11841693c1c14ee847ff3c2db5a45228b783f31332c723dde5db273580fd7af6f83df369943e1c6bc3be579b; '
                             'htVC_2132_lastact=1658742319%09forum.php%09; '
                             'Hm_lvt_46d556462595ed05e05f009cdafff31a=1658453541,1658454628,1658734624,1658744492; '
                             'Hm_lpvt_46d556462595ed05e05f009cdafff31a=1658744492'
                   }
        r = requests.get(web_url, headers=headers)
        # r.encoding = r.apparent_encoding
        r.encoding = 'gbk'
        web_code = r.text

        # Capture (href, title) pairs for every thread link on the page
        search_tag = r'<a href="(.*?)" target="_blank" class="xst".*?>(.*?)</a>'
        post_info[page] = re.findall(search_tag, web_code)

        # Derive each thread's title, ID and URL; store them in post_info and post_url
        for i, item in enumerate(post_info[page]):
            item = list(reversed(list(item)))   # now [title, href]
            if '&amp;' in item[0]:
                item[0] = item[0].replace('&amp;', '&')
            post_url[page].append('https://www.52xxxxxxx.cn/' + item[1][:item[1].find('.html') + 5])
            item[1] = item[1][item[1].find('thread-') + 7:item[1].find('.html') - 4]
            post_info[page][i] = item
        print('Fetched page {}: {} threads on this page.'.format(page+1, len(post_info[page])))
    # Flatten the per-page lists into two flat lists
    post_info_tmp, post_url_tmp = [], []
    for i in range(len(post_info)):
        post_info_tmp = post_info_tmp + post_info[i]
        post_url_tmp = post_url_tmp + post_url[i]
    post_info, post_url = post_info_tmp, post_url_tmp
    # Gather everything into post_info
    for i in range(len(post_info)):
        post_info[i].append(post_url[i])

    # Deduplicate: keep the first occurrence of each row
    repeat = 0
    list_tmp = []
    for i in post_info:
        if i not in list_tmp:
            list_tmp.append(i)
        else:
            repeat += 1
    post_info = list_tmp
    # Sanity check: drop rows whose ID does not appear in their URL.
    # A new list is built here, because deleting from post_info while looping
    # over range(len(post_info)) can raise "IndexError: list index out of range".
    checked = []
    for row in post_info:
        if row[1] in row[2]:
            checked.append(row)
        else:
            repeat += 1
    post_info = checked
    print('Removed {} duplicates; board {} yielded {} threads in total.'.format(repeat, section_name,  len(post_info)))

    return post_info

# Save the rows to a spreadsheet. Args: sheet name / the 2D list returned by RequestWeb_52 / output file path. No return value.
def OutputExcel(sheet_title, post_info, output_doc):
    if not os.path.exists(output_doc):
        wb = openpyxl.Workbook()
        ws = wb.create_sheet(sheet_title)
        del wb['Sheet']
    else:
        wb = openpyxl.load_workbook(output_doc)
        # On a rerun, replace the old sheet; otherwise openpyxl would add a renamed copy
        if sheet_title in wb.sheetnames:
            del wb[sheet_title]
        ws = wb.create_sheet(sheet_title)
    # Header row
    ws.cell(1, 1).value, ws.cell(1, 2).value = 'Title', 'Link'
    ws.cell(1, 1).font = ws.cell(1, 2).font = Font(name='微软雅黑', size=18, color='000000', bold=True)
    ws.cell(1, 1).alignment = ws.cell(1, 2).alignment = Alignment(horizontal='center', vertical='center')
    # Data rows start at row 2: column A holds the title, column B the thread ID
    for row_idx, i in enumerate(post_info, start=2):
        ws.cell(row_idx, 1, value=i[0])
        ws.cell(row_idx, 2, value=i[1])
    for row in ws.rows:
        if row[0].coordinate == 'A1':
            continue
        for cell in row:
            cell.font = Font(name='微软雅黑', size=12, color='000000')
            cell.alignment = Alignment(horizontal='left', vertical='center')
    wb.save(output_doc)

    # Adjust the column widths so the sheet looks nicer
    # Reference: https://blog.csdn.net/qq_33704787/article/details/124722917
    wb = openpyxl.load_workbook(output_doc)
    ws = wb[sheet_title]
    dims = {}  # column number -> widest cell text seen so far
    for row in ws.rows:  # walk the sheet to collect the auto-fit widths
        for cell in row:
            if cell.value:
                # Compare the text length of every cell in a column and keep the longest.
                # len() counts each character as 1; every Chinese character adds another 1.1
                # on top of that because it renders wider than a Latin letter.
                # The pattern [\u4e00-\u9fa5] matches most Chinese characters.
                cell_len = 1.1 * len(re.findall('([\u4e00-\u9fa5])', str(cell.value))) + len(str(cell.value))
                dims[cell.column] = max((dims.get(cell.column, 0), cell_len))
    for col, value in dims.items():
        # get_column_letter converts a column number to its letter; the +5 is padding tuned by eye
        ws.column_dimensions[openpyxl.utils.get_column_letter(col)].width = value + 5

    # Turn column B into hyperlinks: each cell shows the thread ID and points to the thread URL
    i = 0
    for cell in tuple(ws.columns)[1]:
        if cell.coordinate == 'B1':
            continue
        cell.value = '=HYPERLINK("{}", "{}")'.format(post_info[i][2], cell.value)
        cell.font = Font(name='微软雅黑', size=12, underline='single', color='0000ff')
        cell.alignment = Alignment(horizontal='center', vertical='center')
        i += 1
    wb.save(output_doc)

if __name__ == "__main__":
    # Optional: append the scrape time to the file name
    # import time     # current time (already imported at the top)
    # current_time = time.strftime('%Y%m%d_%H%M%S', time.localtime(time.time()))
    # output_doc = 'D:\\52_' + current_time + '.xlsx'
    output_doc = 'D:\\52.xlsx'
    # Boards to scrape: board name -> [base URL, number of pages]
    section_52 = {
        '人气热门': ['https://www.52xxxxxxx.cn/forum.php?mod=guide&view=hot', 3],
        '技术分享': ['https://www.52xxxxxxx.cn/forum.php?mod=guide&view=tech', 5],
        '新鲜出炉': ['https://www.52xxxxxxx.cn/forum.php?mod=guide&view=newthread', 8],
        '精华采撷': ['https://www.52xxxxxxx.cn/forum.php?mod=guide&view=digest', 2],
    }
    for item in section_52.keys():
        post_info = RequestWeb_52(item, section_52[item][0], section_52[item][1])
        OutputExcel(item, post_info, output_doc)
    print('---------- Done. The output file has been written. ----------')
    os.startfile(output_doc)  # os.startfile is Windows-only

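At first I only scraped the "人气热门" (Hot) board, and extended it later.
Let me walk through the approach step by step.
First, requests fetches the page source, and a regular expression passed to re.findall pulls out the useful information, i.e. the thread titles and thread links.
I also tried the BeautifulSoup library. What it gets is the page source as well; with tag lookups like bf.find_all('div', class_ = 'bookname') I got several lines of text around the useful part and then still had to clean that up with regex. So why not just use re.findall directly, which also picks up less junk.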
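At the early stage post_info holds something like [('thread-1662751-1-1.html', '硬盘xxxxxx'), ('thread-1660647-1-1.html', '深度清理xxxxxx')]. As you can see, [0] of each tuple is a partial thread link: add a prefix and you have the full link, strip the extras and you have the thread ID. [1] is the thread title.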
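To make the shape of those tuples concrete, here is a minimal sketch that runs the same pattern over a made-up HTML fragment. SAMPLE_HTML is an assumption for illustration only; the real listing page is longer and may differ in detail.

```python
import re

# Made-up sample of the listing markup (an assumption, not copied from the site).
SAMPLE_HTML = '''
<th><a href="thread-1662751-1-1.html" target="_blank" class="xst" >Disk tool xxxxxx</a></th>
<th><a href="thread-1660647-1-1.html" target="_blank" class="xst" >Deep clean xxxxxx</a></th>
'''

# Same pattern as in the script: group 1 is the href, group 2 is the title.
SEARCH_TAG = r'<a href="(.*?)" target="_blank" class="xst".*?>(.*?)</a>'

# With two capture groups, re.findall returns a list of 2-tuples.
pairs = re.findall(SEARCH_TAG, SAMPLE_HTML)
for href, title in pairs:
    print(href, '|', title)
```

Because the pattern has two capture groups, each match comes back as an (href, title) tuple, which is why the link sits at [0] and the title at [1].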
After processing, post_info looks like this: [['硬盘xxxxxx', '1662751', 'https://www.52.cn/thread-1662751-1-1.html'], ['深度清理xxxxxx', '1660647', 'https://www.52.cn/thread-1660647-1-1.html']].
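The script gets from the first shape to the second by slicing with str.find. As an alternative sketch (not the method used in the script), a capture group can pull the ID out directly and skip links that are not in the thread-ID form. The names normalize, THREAD_RE and BASE_URL are hypothetical, and BASE_URL is the same placeholder used in the script.

```python
import html
import re

BASE_URL = 'https://www.52xxxxxxx.cn/'  # placeholder, as in the script

# Matches hrefs such as thread-1662751-1-1.html and captures the thread ID.
THREAD_RE = re.compile(r'thread-(\d+)-\d+-\d+\.html')

def normalize(pair, base_url=BASE_URL):
    """Turn (href, title) into [title, thread_id, url]; return None for other links."""
    href, title = pair
    match = THREAD_RE.search(href)
    if match is None:
        return None  # assumption: links in any other form are simply skipped
    # html.unescape turns entities such as &amp; back into plain characters.
    return [html.unescape(title), match.group(1), base_url + match.group(0)]

raw = [
    ('thread-1662751-1-1.html', 'A &amp; B xxxxxx'),
    ('forum.php?mod=viewthread&amp;tid=1', 'not a static thread link'),
]
rows = [row for row in map(normalize, raw) if row is not None]
print(rows)
```

Skipping unmatched links up front means no malformed row ever reaches the later consistency check between ID and URL.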
Each board spans several pages, and after scraping there are duplicate threads. I don't know why, but either way a deduplication step is needed.
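A minimal sketch of one way to deduplicate, assuming rows shaped like [title, thread_id, url] and treating two rows with the same thread ID as the same thread. The script itself compares whole rows; keying on the ID is a swapped-in variant, and dedupe_by_id is a hypothetical helper name.

```python
def dedupe_by_id(rows):
    """Keep the first row for each thread ID; return (unique_rows, removed_count)."""
    seen = set()
    unique = []
    for row in rows:
        thread_id = row[1]
        if thread_id in seen:
            continue  # a row with this ID was already kept
        seen.add(thread_id)
        unique.append(row)
    return unique, len(rows) - len(unique)

sample = [
    ['first', '1662751', 'u1'],
    ['second', '1660647', 'u2'],
    ['first', '1662751', 'u1'],
]
unique_rows, removed = dedupe_by_id(sample)
print(removed, [row[1] for row in unique_rows])
```

A set lookup keeps this linear in the number of rows, whereas checking membership in a growing list gets slower as the list grows; for a few hundred threads either approach is fine.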
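With titles, IDs and links all processed, the list is handed to the spreadsheet step for output.
The spreadsheet step is nothing special: add a header row, write the rows in one by one, turn the URLs into hyperlinks, and adjust the styles, formatting and column widths so the result looks nice and is practical. Thanks to the author of the column-width code; it works well.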
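The two less obvious pieces of that step, the width estimate and the hyperlink formula, can be sketched as plain functions with no spreadsheet library involved. display_width and hyperlink_formula are hypothetical helper names; the 1.1 weighting and the padding of 5 are taken from the script, and doubling the quotes in the label is an extra precaution the script does not take.

```python
import re

# Matches most Chinese characters, same range as in the script.
CJK_RE = re.compile('[\u4e00-\u9fa5]')

def display_width(text, cjk_extra=1.1, padding=5):
    """Estimate a column width: 1 per character, plus cjk_extra per Chinese character."""
    text = str(text)
    return len(text) + cjk_extra * len(CJK_RE.findall(text)) + padding

def hyperlink_formula(url, label):
    """Build an Excel HYPERLINK formula; quotes in the label are escaped by doubling."""
    safe_label = str(label).replace('"', '""')
    return '=HYPERLINK("{}", "{}")'.format(url, safe_label)

print(display_width('abc'))
print(hyperlink_formula('https://example.invalid/thread-1-1-1.html', '1'))
```

The width is only an estimate, since the real rendered width depends on the font, which is why a fixed padding is added on top.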

Run log:

[Image: run log]

Generated spreadsheet:

[Image: generated spreadsheet]

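3. Afterword

The whole thing took about three days, of which half a day went to studying the pages and a full day to fixing bugs. Fair enough.
While I was running it, the code would sometimes fail with a "list index out of range" error and sometimes not; I have no idea why.
When it doesn't error, though, the saved spreadsheet is fine and the links are all correct, so overall the goal was met. Hooray.
Although I wrote in four boards, I really only read the "人气热门" (Hot) board, because I can't understand the others.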
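On the intermittent "list index out of range" error: one common cause of this kind of now-and-then failure is deleting items from a list while looping over its original index range. I can't confirm it was the cause here, so treat the following as a sketch of the pitfall, with made-up rows and hypothetical function names.

```python
def drop_mismatched_unsafe(rows):
    """Risky: the index range is fixed up front, but the list shrinks on every delete."""
    for i in range(len(rows)):
        if rows[i][1] not in rows[i][2]:
            del rows[i]
    return rows

def drop_mismatched_safe(rows):
    """Safe: build a new list containing only the rows whose ID appears in the URL."""
    return [row for row in rows if row[1] in row[2]]

good = ['t1', '111', 'https://example.invalid/thread-111-1-1.html']
bad = ['t2', '222', 'https://example.invalid/thread-999-1-1.html']

try:
    drop_mismatched_unsafe([good, bad, good])
except IndexError as exc:
    print('unsafe version failed:', exc)

print(len(drop_mismatched_safe([good, bad, good])))
```

The unsafe version only fails when a mismatched row is followed by at least one more row, so whether it crashes depends on the data of that particular run, which would fit an error that comes and goes.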