[Python] Scraping TapTap Reviews of Genshin Impact and Generating a Word Cloud Analysis

Latest recommended article published on 2024-05-10 10:51:51

includei



Category: Python. Tags: Python, TapTap, Genshin Impact review word cloud

Article link: https://blog.csdn.net/includei/article/details/111666995


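This article shows how to use Python to scrape Genshin Impact reviews from TapTap, then segment the text and generate a word cloud, so that word frequencies can be analyzed and visualized to gain insight into player feedback.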


Preface

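I originally wanted to scrape Bilibili, but the comments in Bilibili's game section appear to be loaded dynamically, and after some analysis I still could not work out how to fetch them. So I switched to TapTap. Its review pages are addressed by URL, which makes it easy to build the URL for any page you want, so TapTap is the target this time.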

Goal

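Scrape players' reviews of Genshin Impact from the TapTap community, then compute word frequencies and generate a word cloud to visualize the keywords.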

Steps

Scraper

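The goal is to collect four fields for each review: user name, score, time, and review text. The first step is to fetch the review list from the page: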

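import random
import time
import requests
from lxml import etree

response = requests.get(self.comments_url % page, headers=self.headers)  # comments_url has a placeholder for the page number
print('Fetched page', page, 'with status', response.status_code)
time.sleep(random.random())  # pause for up to 1 s between requests
html = etree.HTML(response.text)
contents = html.xpath('//ul[contains(@class, "taptap-review-list")]/li')  # one <li> per review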
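The expression `self.comments_url % page` above depends on a URL template with a placeholder for the page number. As a minimal sketch of that idea, assuming a made-up template (the real TapTap review URL is not given in this article, and `COMMENTS_URL` and `page_urls` are hypothetical names), the page URLs could be built like this:

```python
# Hypothetical URL template; the real TapTap review URL is not shown in the article.
COMMENTS_URL = 'https://example.com/app/0000/review?page=%d'


def page_urls(template, first_page, last_page):
    """Return the URL of every page from first_page to last_page, inclusive."""
    return [template % page for page in range(first_page, last_page + 1)]


urls = page_urls(COMMENTS_URL, 1, 3)
for url in urls:
    print(url)
```

Building the list up front keeps URL construction separate from the request loop, so the range of pages can be checked before any request is sent.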

Then iterate over the list and parse each field:

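# Runs once per review, i.e. for each content in contents
names = content.xpath('.//a[@class="taptap-user-name"]/text()')
user = (names[0] if names else '') or '无名氏'  # '无名氏' ("Anonymous") when the name is missing or empty
# Characters [7:9] of the style attribute hold the star width in px; 14 px corresponds to one star
score = content.xpath('.//div[@class="item-text-score"]/i[@class="colored"]/@style')[0][7:9]
score = int(score) / 14
comment_time = content.xpath('(.//span)[4]/text()')[0]
comment = content.xpath('(.//div[@class="item-text-body"])[1]/p/text()')
comment = '\n'.join(comment)  # join the paragraphs, one per line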
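The fixed slice `[7:9]` only works while the width has exactly two digits at that position. A slightly more tolerant variant is sketched below, assuming the style attribute looks like `width: 70px` (an assumption inferred from the slice and the division by 14, not confirmed by the article). The helper names `stars_from_style` and `first_or` are hypothetical.

```python
import re

PX_PER_STAR = 14  # taken from the division by 14 in the snippet above


def stars_from_style(style, px_per_star=PX_PER_STAR):
    """Convert a style string such as 'width: 70px' to a star count.

    Returns None when no pixel width is found.
    """
    match = re.search(r'width:\s*(\d+)px', style)
    if match is None:
        return None
    return int(match.group(1)) / px_per_star


def first_or(items, default):
    """Return items[0] if the list is non-empty and that item is truthy, else default."""
    return items[0] if items and items[0] else default


print(stars_from_style('width: 70px'))
print(first_or([], '无名氏'))
```

`first_or` covers the case where an XPath query returns an empty list, which would otherwise raise an `IndexError` on `[0]`.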

Finally, save the data to files for later use:

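import pandas as pd

# users, scores, times and comments are the lists filled while parsing
comment_dir = {
    'user': users, 'score': scores, 'time': times, 'comment': comments}
comment_df = pd.DataFrame(comment_dir)
comment_df.to_csv('./tables/taptap_comments.csv')  # full table
comment_df['comment'].to_csv('./tables/comments.csv', index=False)  # review text only, for segmentation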
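One detail worth noting is that `to_csv` does not create missing directories, so `./tables` has to exist before the calls above run. The following is a minimal sketch of the save step with that handled, using made-up sample rows and a temporary directory. The function name `save_comments` is hypothetical, and unlike the snippet above it writes the full table without the index column.

```python
import os
import tempfile

import pandas as pd


def save_comments(users, scores, times, comments, out_dir):
    """Write the full table and a comment-only file; return both paths."""
    os.makedirs(out_dir, exist_ok=True)  # to_csv does not create directories
    df = pd.DataFrame(
        {'user': users, 'score': scores, 'time': times, 'comment': comments})
    full_path = os.path.join(out_dir, 'taptap_comments.csv')
    text_path = os.path.join(out_dir, 'comments.csv')
    df.to_csv(full_path, index=False, encoding='utf-8')
    df['comment'].to_csv(text_path, index=False, encoding='utf-8')
    return full_path, text_path


# Made-up sample rows, written to a temporary directory
demo_dir = os.path.join(tempfile.mkdtemp(), 'tables')
full_path, text_path = save_comments(
    ['player_a', 'player_b'],
    [5.0, 2.0],
    ['2020-12-01', '2020-12-02'],
    ['Great art style', 'Too much grinding'],
    demo_dir,
)
restored = pd.read_csv(full_path)
print(restored)
```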

Word segmentation

With the data scraped, the next step is to segment the text into words. This uses the jieba library:

jieba.load_userdict('./dictionary/my_dict.txt')
with open('./tables/comments.csv', 'r'
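The word-frequency step named in the goal can be illustrated separately from the article's own code. The sketch below counts tokens with `collections.Counter`. It assumes the comments are already segmented into lists of strings (with jieba, each list could come from `jieba.lcut(text)`); the sample tokens, the stopword set and the function name `word_frequencies` are all made up for illustration.

```python
from collections import Counter


def word_frequencies(token_lists, stopwords=frozenset(), min_length=2):
    """Count tokens across all comments, skipping stopwords and short tokens."""
    counts = Counter()
    for tokens in token_lists:
        for token in tokens:
            token = token.strip()
            if len(token) >= min_length and token not in stopwords:
                counts[token] += 1
    return counts


# Made-up, already-segmented comments (hypothetical data)
segmented = [
    ['画面', '很', '好看', '，', '剧情', '也', '不错'],
    ['抽卡', '太', '难', '，', '画面', '不错'],
]
freq = word_frequencies(segmented, stopwords={'不错'})
top = freq.most_common(1)
print(top)
```

Dropping single-character tokens is a common way to filter out punctuation and particles in Chinese text, though it also discards meaningful one-character words, so the threshold is a judgment call.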
