Getting Started with Crawlers: A Semi-Automatic Crawler for Scraping Tieba

Latest recommended article published on 2024-09-07 22:15:35

努力找实习ing

Category: Crawlers. Tags: python, regular expressions, crawler

Article link: https://blog.csdn.net/weixin_45968555/article/details/109170707

Contents

I. Semi-automatic crawler
Summary

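I. Semi-automatic crawler

1. What a semi-automatic crawler is

A semi-automatic crawler, as the name suggests, is half manual and half automatic. The manual half is copying the page's source code by hand; the automatic half is using regular expressions to extract the useful information from that source.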
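
To make the automatic half concrete, here is a minimal sketch of regex extraction on a hand-copied page. The `source` string and its `author`/`body` class names are made up for illustration and are not Tieba's real markup; the point is only how a non-greedy group `(.*?)` combined with `re.S` pulls fields out of raw HTML.

```python
import re

# Hypothetical page source, standing in for text copied by hand from a browser.
source = '''<div class="post">
<span class="author">alice</span>
<p class="body">first
reply</p>
</div>
<div class="post">
<span class="author">bob</span>
<p class="body">second reply</p>
</div>'''

# (.*?) is non-greedy, so each match stops at the nearest closing tag.
# re.S lets "." match newlines, so a body spanning several lines still matches.
authors = re.findall(r'class="author">(.*?)</span>', source, re.S)
bodies = re.findall(r'class="body">(.*?)</p>', source, re.S)

for author, body in zip(authors, bodies):
    print(author, '->', body.replace('\n', ' '))
```

Without `re.S`, the first body would not match at all, because it contains a line break.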

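2. The target

The target I chose is my own university's Tieba forum. In one thread, a graduate student says he has exam questions from previous years and will share them for free with anyone who leaves their QQ number. (I suspect this is an attempt at social engineering.)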

3. The result

[Image: screenshot of the scraped results]

4. Source code

import re
import csv

with open('source.txt', 'r', encoding='UTF-8') as f:
    source = f.read()

result_list = []

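# Split the page into one block per reply (the double spaces are part of the class names in the page source)
every_reply = re.findall(r'd_post_content_main">(.*?)icon_wrap  icon_wrap_theme1 d_pb_icons', source, re.S)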

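# From each block, extract the poster's username, the reply text, and the reply time
for each in every_reply:
    result = {}
    result['username'] = re.findall(r'username="(.*?)"', each, re.S)[0]
    # Take the last match, then drop the run of 12 spaces that pads the text
    result['content'] = re.findall(r'j_d_post_content  clearfix" style="display:;">(.*?)<', each, re.S)[-1].replace('            ','')
    # This pattern only matches timestamps that start with 2020
    result['reply_time'] = re.findall(r'&quot;date&quot;:&quot;(2020.*?)&quot', each, re.S)[0]
    result_list.append(result)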

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('tieba.csv', 'w', encoding='UTF-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['username', 'content', 'reply_time'])
    writer.writeheader()
    writer.writerows(result_list)
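
One weak spot in the script above is the bare `[0]` and `[-1]` indexing: a block with no username or timestamp raises `IndexError` and stops the run. The following is a sketch of a more tolerant variant, not the original approach. The helper names `pick` and `parse_replies` and the `sample` string are hypothetical, the sample only imitates the markers the regexes look for, and the date pattern here drops the hard-coded `2020` prefix and uses `strip()` for the padding, both of which are my own changes.

```python
import re

def pick(pattern, text, index=0, default=''):
    """Return the match at `index` for `pattern`, or `default` if none is found."""
    matches = re.findall(pattern, text, re.S)
    return matches[index] if matches else default

def parse_replies(source):
    """Split the page source into reply blocks and extract three fields from each."""
    blocks = re.findall(
        r'd_post_content_main">(.*?)icon_wrap  icon_wrap_theme1 d_pb_icons',
        source, re.S)
    rows = []
    for block in blocks:
        rows.append({
            'username': pick(r'username="(.*?)"', block),
            # index=-1 mirrors the original script, which takes the last match
            'content': pick(
                r'j_d_post_content  clearfix" style="display:;">(.*?)<',
                block, index=-1).strip(),
            'reply_time': pick(r'&quot;date&quot;:&quot;(.*?)&quot', block),
        })
    return rows

# Made-up sample: the first reply is complete, the second lacks a username and a date.
sample = (
    '<div class="d_post_content_main">'
    '<img username="alice">'
    '<div class="d_post_content j_d_post_content  clearfix" style="display:;">'
    '            hello</div>'
    '<span data="{&quot;date&quot;:&quot;2020-10-19 20:00&quot;}"></span>'
    '<div class="icon_wrap  icon_wrap_theme1 d_pb_icons"></div>'
    '<div class="d_post_content_main">'
    '<div class="d_post_content j_d_post_content  clearfix" style="display:;">no name</div>'
    '<div class="icon_wrap  icon_wrap_theme1 d_pb_icons"></div>'
)

rows = parse_replies(sample)
for row in rows:
    print(row)
```

Missing fields come back as empty strings, so incomplete replies still end up as rows in the CSV rather than aborting the whole run.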