Alibaba Cloud Tianchi Competition Notes

qq_41560285

Article link: https://blog.csdn.net/qq_41560285/article/details/123064830

1. [Teaching Competition] Data Analysis Master Competition 1: User Sentiment Visualization Analysis

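# earphone_sentiment is assumed to be a pandas DataFrame that has already been loaded
# Count fully duplicated rows
print(earphone_sentiment.duplicated().sum())

print("——————————————————")

# Count missing values in each column
print(earphone_sentiment.isnull().sum())

print("——————————————————")
# Show column names, non-null counts, and dtypes
earphone_sentiment.info()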
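
To see what these checks report, here is a minimal sketch on a toy DataFrame. The data is invented for illustration and is not the competition data; only `duplicated()` and `isnull()` come from the code above.

```python
import pandas as pd

# Toy frame invented for illustration; it is not the competition data.
toy = pd.DataFrame({
    "content": ["good sound", "good sound", "too pricey", None],
    "subject": ["sound", "sound", "price", "price"],
})

# duplicated() marks every row that is identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# isnull().sum() counts missing values column by column
missing = toy.isnull().sum()

print(n_duplicates)
print(missing)
```

Here the second row repeats the first, so it is the only duplicate, and the single `None` shows up as one missing value in `content`.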

# Pivot the data: count sentiment words for each subject (rows) and sentiment value (columns)
earphone_sentiment.pivot_table(columns='sentiment_value', index='subject', values='sentiment_word', aggfunc="count")

If pivot tables are unfamiliar, see this explanation: https://www.cnblogs.com/Yanjy-OnlyOne/p/11195621.html
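
As a rough illustration of what the pivot produces, the sketch below runs the same `pivot_table` call on a few invented rows. The column names follow the post; the values are assumptions made up for the example.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names follow the post.
toy = pd.DataFrame({
    "subject": ["sound", "sound", "price", "price", "price"],
    "sentiment_value": [1, 1, -1, 1, -1],
    "sentiment_word": ["good", "clear", "expensive", "cheap", "pricey"],
})

# Rows are subjects, columns are sentiment values, cells count sentiment words
counts = toy.pivot_table(
    index="subject",
    columns="sentiment_value",
    values="sentiment_word",
    aggfunc="count",
)
print(counts)
```

A subject/sentiment combination with no rows (here `sound` with `-1`) comes out as NaN rather than 0; passing `fill_value=0` to `pivot_table` would fill those cells.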

Data preprocessing (the key part)

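# Load the stop-word list; the first comma-separated field on each line is the word.
# Pass encoding=... to open() if the file is not in the platform's default encoding.
stop_words = []
with open(r'./my_stop_words.txt', 'r') as f:
    for line in f:
        stop_words.append(line.strip('\n').split(',')[0])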

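import jieba

# Tokenize
df = earphone_sentiment.copy()

def cut_content(text):
    # Tokenize in search-engine mode, then drop one-character tokens and stop words
    return [w for w in jieba.cut_for_search(text) if len(w) > 1 and w not in stop_words]

# Build the column in one step; assigning lists cell by cell via df.cutwords[i] is
# chained assignment, which pandas warns about and may not write back to df
df['cutwords'] = df['content'].apply(cut_content)

# Show all tokenization results
df.cutwords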
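
The stop-word parsing and filtering steps can be tried in isolation with the sketch below. It makes two assumptions: the file content is a made-up stand-in (the real `my_stop_words.txt` is not shown in the post), and the token list is written by hand in place of jieba output so that the snippet runs without jieba installed. `filter_tokens` is a hypothetical helper name.

```python
import io

# Hypothetical stop-word file content; the real my_stop_words.txt is not shown.
sample_file = io.StringIO("的,1\n我们,2\n非常\n")

# Same parsing as above: keep the first comma-separated field of each line
stop_words = []
for line in sample_file:
    stop_words.append(line.strip("\n").split(",")[0])

def filter_tokens(tokens, stop_words):
    """Keep tokens longer than one character that are not stop words."""
    return [t for t in tokens if len(t) > 1 and t not in stop_words]

# Hand-written tokens standing in for jieba.cut_for_search output (an assumption)
tokens = ["我们", "觉得", "音质", "非常", "好", "的"]
kept = filter_tokens(tokens, stop_words)
print(kept)
```

For a large stop-word list, converting `stop_words` to a `set` before filtering makes each membership test much cheaper than scanning a list.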