Monitoring, Collecting, and Analyzing Xiaohongshu Comments with Yingdao RPA (影刀RPA)

Looper0331

Last modified: 2024-08-08 16:56:28


Tags: rpa, pandas, web crawler, automation

First published: 2024-08-08 15:43:42

Article link: https://blog.csdn.net/weixin_43741408/article/details/141026480


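1. Notes:

When checking several notes in sequence, leave roughly 60 s between checks; shorter intervals tend to trigger the platform's risk controls.
In the comments Excel file, set the cell format of every sheet to "Text".
Image URLs returned by the WeCom (企微) image upload endpoint are permanently valid, but uploads are capped at 3,000 images per month.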
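
The 60 s spacing from the first note could be sketched as a small helper. This is a minimal sketch, not the Yingdao flow itself: `check_notes`, `check_one`, and the `jitter` parameter are hypothetical names introduced for illustration, and the sleep function is injectable so the loop can be tried without actually waiting.

```python
import random
import time


def check_notes(note_urls, check_one, interval=60.0, jitter=5.0, sleep_fn=time.sleep):
    """Call check_one(url) for each note, pausing between consecutive notes.

    interval and jitter are in seconds; each pause is interval plus a random
    value in [0, jitter]. All names here are hypothetical.
    """
    results = []
    for index, url in enumerate(note_urls):
        if index > 0:
            # Pause before every note except the first one
            sleep_fn(interval + random.uniform(0, jitter))
        results.append(check_one(url))
    return results
```

A small random jitter is an assumption on top of the original advice; set `jitter=0` to get a fixed 60 s pause.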

2. Application flow screenshots:

Part 1:

Part 2:

Part 3:

Part 4:

Part 5:

Part 6:

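JS for removing a comment element:

function (element, input) {
    // Remove the matched comment element from the page; nothing is returned to the flow
    element.remove();
    return null;
}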

3. Scripts for the called modules

Image upload module (pic_upload):

# Upload an image to the WeCom uploadimg endpoint, which returns an image URL
import os
import requests

def upload_file(access_token, file_path):
    # The access token travels in the query string, so no Authorization header is needed
    url = f"https://qyapi.weixin.qq.com/cgi-bin/media/uploadimg?access_token={access_token}&type=file"

    with open(file_path, "rb") as f:
        files = {"filename": (os.path.basename(file_path), f.read())}

    # Do not set Content-Type by hand: when `files` is passed, requests builds
    # the multipart header itself, including the required boundary
    res = requests.post(url, files=files)

    return res.json()
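
A caller still has to inspect the JSON that `upload_file` returns. The sketch below assumes the WeCom response shape for `uploadimg` (an `errcode` of 0 and a `url` field on success); `extract_image_url` is a hypothetical helper, not part of the original module.

```python
def extract_image_url(response):
    """Return the image URL from an uploadimg response dict, or raise on error.

    Assumes the response carries `errcode`, `errmsg`, and (on success) `url`.
    """
    errcode = response.get("errcode", -1)
    if errcode != 0:
        raise RuntimeError(
            f"upload failed: errcode={errcode}, errmsg={response.get('errmsg')}"
        )
    url = response.get("url")
    if not url:
        raise RuntimeError("upload reported success but returned no url")
    return url
```

It would typically be used as `extract_image_url(upload_file(access_token, file_path))`, so that an expired token or a rejected file surfaces as an exception rather than as a missing image later in the flow.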

Comment analysis module (comment):

Adapted from: 【淘宝天猫】Python数据可视化淘宝天猫店铺评论分析 - Heywhale.com (a Python data-visualization analysis of Taobao/Tmall shop reviews)

import xbot
from xbot import print, sleep
from . import package
from .package import variables as glv

import os
import pandas as pd
import numpy as np
import plotly.express as px

def excel_data(excel_name, sheet_name, output_dir):

    df = pd.read_excel(excel_name, sheet_name=sheet_name)

    df['comment_time'] = pd.to_datetime(df['comment_time'])
    df['comment_date'] = df['comment_time'].dt.date
    comment_num = df['comment_date'].value_counts().sort_index()

    def judge_comment(df, result):
        # Build an all-zero frame with one column per category (5 categories):
        # medical, price, safety, and service enquiries, plus "other"
        judges = pd.DataFrame(np.zeros((len(df), 5)),
                              columns=['医疗咨询', '价格咨询', '安全顾虑', '业务咨询', '其它评论'])

        for i in range(len(result)):
            word = str(result[i])

            # Does the comment contain any keyword at all?
            contains_keywords = (
                '0度' in word or '度数' in word or '手术' in word or '术后' in word or '效果' in word or
                '价格' in word or '多少钱' in word or '优惠' in word or '费用' in word or
                '影响' in word or '副作用' in word or '疼吗' in word or '后遗症' in word or '痛吗' in word or
                '预约' in word or '挂号' in word or '哪里' in word or '检查' in word
                )

            # No keyword matched: flag the comment as "other".
            # judges.loc[i, col] writes into the frame directly; the chained
            # form judges.iloc[i][col] = 1 may assign to a temporary copy.
            if not contains_keywords:
                judges.loc[i, '其它评论'] = 1

            # Set the matching category columns
            # Medical enquiries
            if '0度' in word or '度数' in word or '手术' in word or '术后' in word or '效果' in word:
                judges.loc[i, '医疗咨询'] = 1

            # Price enquiries
            if '价格' in word or '多少钱' in word or '优惠' in word or '费用' in word:
                judges.loc[i, '价格咨询'] = 1

            # Safety concerns
            if '影响' in word or '副作用' in word or '疼吗' in word or '后遗症' in word or '痛吗' in word:
                judges.loc[i, '安全顾虑'] = 1

            # Service enquiries
            if '预约' in word or '挂号' in word or '哪里' in word or '检查' in word:
                judges.loc[i, '业务咨询'] = 1

        final_result = pd.concat([df, judges], axis=1)

        return final_result


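    # Build the flagged frame ("comment" is the comment-text column in the sheet)
    judge = judge_comment(df, result=df.comment)
    print(judge)

    # Important: iloc[:, 6:] starts at the 7th column, which must be the first
    # category column, and sums each flag column to get the mentions per category.
    # Index 6 fits a frame with six columns before the flags (the sheet's own
    # columns plus comment_date); adjust it if your sheet is laid out differently.
    rank = judge.iloc[:, 6:].sum().reset_index().sort_values(0, ascending=False)
    rank.columns = ['分类', '提及次数']
    rank['占比'] = rank['提及次数'] / rank['提及次数'].sum()
    # Drop the last two characters of each label, e.g. '医疗咨询' -> '医疗'
    rank['高级分类'] = rank['分类'].str[:-2]
    print(rank)

    rank_num = rank.groupby('高级分类')['提及次数'].sum().sort_values(ascending=False).reset_index()
    print(rank_num)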

    # Create the chart
    fig = px.pie(
        rank_num,
        names="高级分类",
        values="提及次数",
        color="提及次数",
        hole=0.3,
        template='plotly_dark'
    )

    # Adjust the layout
    fig.update_layout(
        title={
            "text": "评 论 者 关 注 点 占 比 分 布",
            "y": 0.96,
            "x": 0.5,
            "xanchor": "center",
            "yanchor": "top",
            "font": {"size": 50}
        },
        
        legend=dict(
            font=dict(size=20),
            yanchor="top",
            y=0.9,
            xanchor="left",
            x=0.8,
            orientation="v"
        ),

        margin=dict(
            l=100,
            r=100, 
            t=100, 
            b=100
        ),

        annotations=[
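            {
            "text": "【医疗咨询】：0度、度数、手术、术后、效果<br><br>【价格咨询】：价格、多少钱、优惠、费用<br><br>【安全顾虑】：影响、副作用、疼吗、后遗症、痛吗<br><br>【业务咨询】：预约、挂号、哪里、检查",    # line breaks need <br>; a plain \n is not rendered as a break
            "x": 0.5,
            "y": -0.1,  # a y value below 0 pushes the text further toward the bottom
            "font": {"size": 22}
            }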
        ]
    )

    # Custom color scheme
    colors = ['#FF9999', '#66B2FF', '#99FF99', '#FFCC99']
    fig.update_traces(textposition='inside',
                      textinfo='percent+value+label',
                      textfont_size=20, marker=dict(colors=colors))
    # Export the image (write_image needs an image-export engine such as kaleido);
    # the format matches the .jpg extension of the output file
    output_path = os.path.join(output_dir, 'comment_ca.jpg')
    fig.write_image(output_path, format='jpg', width=1920, height=1080, scale=1)
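
The keyword rules in `judge_comment` could also be kept in a single table, so that the category lists and the matching logic cannot drift apart. The following is a minimal sketch in plain Python that reuses the same keywords; `KEYWORDS`, `classify`, and `count_mentions` are hypothetical names, and unlike the module it works on a plain list of strings rather than a pandas DataFrame.

```python
from collections import Counter

# Category -> keywords, copied from the rules in judge_comment
KEYWORDS = {
    "医疗咨询": ["0度", "度数", "手术", "术后", "效果"],
    "价格咨询": ["价格", "多少钱", "优惠", "费用"],
    "安全顾虑": ["影响", "副作用", "疼吗", "后遗症", "痛吗"],
    "业务咨询": ["预约", "挂号", "哪里", "检查"],
}
OTHER = "其它评论"


def classify(comment):
    """Return the categories whose keywords occur in the comment."""
    text = str(comment)
    hits = [cat for cat, words in KEYWORDS.items() if any(w in text for w in words)]
    # A comment that matches no keyword falls into the "other" category
    return hits or [OTHER]


def count_mentions(comments):
    """Count how many comments mention each category."""
    counts = Counter()
    for comment in comments:
        counts.update(classify(comment))
    return counts


counts = count_mentions(["手术多少钱", "术后会有后遗症吗", "好看"])
total = sum(counts.values())
# Share of all mentions per category, analogous to the 占比 column
shares = {cat: n / total for cat, n in counts.items()}
```

As in the module, one comment can count toward several categories, so the shares are fractions of all mentions rather than of all comments.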

Uploading the comments Excel file to a remote server: