📣 Preface
- 👓 Visualization: mainly plotly
- 🔎 Data processing: mainly pandas
- 🕷️ Data scraping: mainly requests
- 👉 This article is my original work on the Heywhale (和鲸社区) community
This article uses Python to analyze and visualize Hupu (虎扑) users' ratings and candid reviews of Chinese universities.
Step 1. Import modules
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import jieba
from stylecloud import gen_stylecloud
from snownlp import SnowNLP
from IPython.display import Image  # used to display local image files in JupyterLab
Step 2. Data analysis and visualization
2.1 Overview of the university rating data
Data download: follow the WeChat official account 【布鲁的Python之旅】 and reply with the keyword 【虎扑高校数据】 to get it.
df = pd.read_excel(r"/home/mw/input/hupu7901/高校评分-1_231129_1701226820.xlsx")
df.head(10)
Output:
print("——" * 10)
print('数据集存在重复值个数:')
print(df.duplicated().sum())
print("——" * 10)
print('数据集缺失值情况:')
print(df.isna().sum())
print("——" * 10)
print('数据集各字段类型:')
print(df.dtypes)
print("——" * 10)
print('数据总体概览:')
print(df.info())
Output:
————————————————————
数据集存在重复值个数:
0
————————————————————
数据集缺失值情况:
school_type 0
nodeId 0
bizId 0
name 0
scorePersonCount 0
totalScore 0
scoreAvg 0
scoreDistribution 0
hottestComments 0
爬取时间 0
dtype: int64
————————————————————
数据集各字段类型:
school_type object
nodeId int64
bizId int64
name object
scorePersonCount int64
totalScore int64
scoreAvg float64
scoreDistribution object
hottestComments object
爬取时间 object
dtype: object
————————————————————
数据总体概览:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8586 entries, 0 to 8585
Data columns (total 10 columns):
school_type 8586 non-null object
nodeId 8586 non-null int64
bizId 8586 non-null int64
name 8586 non-null object
scorePersonCount 8586 non-null int64
totalScore 8586 non-null int64
scoreAvg 8586 non-null float64
scoreDistribution 8586 non-null object
hottestComments 8586 non-null object
爬取时间 8586 non-null object
dtypes: float64(1), int64(4), object(5)
memory usage: 670.9+ KB
None
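Before moving on to the charts, a quick numeric summary of the rating columns gives a feel for the score ranges. A small optional check, using only the columns shown above:
# Summary statistics for the numeric rating columns
df[["scorePersonCount", "totalScore", "scoreAvg"]].describe()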
2.2 Top 20 universities by number of Hupu raters
df_sorted = df.sort_values(by='scorePersonCount', ascending=False).head(20)
from plotly.subplots import make_subplots

fig2 = make_subplots(specs=[[{"secondary_y": True}]])
# Line: average score, plotted on the secondary y-axis
fig2.add_trace(
    go.Scatter(x=df_sorted["name"], y=df_sorted["scoreAvg"], name="平均分"), secondary_y=True
)
# Bar: number of raters, plotted on the primary y-axis
fig2.add_trace(
    go.Bar(x=df_sorted["name"], y=df_sorted["scorePersonCount"], name="评分人数")
)
fig2.update_yaxes(tickformat="", showgrid=False, secondary_y=True)
fig2.update_layout(title_text="虎扑评分人数Top20", template="plotly_white")
Output:
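If the long school names overlap on the x-axis, tilting the tick labels is an easy optional tweak:
fig2.update_xaxes(tickangle=45)  # optional: tilt x-axis labels for readability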
2.3 Word cloud of hot Hupu comments on universities
import ast

# hottestComments stores list-like strings; parse them back into Python lists
hottestComments = df["hottestComments"].tolist()
hots = []
for hottestComment in hottestComments:
    hottestComment = ast.literal_eval(hottestComment)
    for h in hottestComment:
        hots.append(h)
title_content = ','.join(hots)
cut_text = jieba.cut(title_content)  # word segmentation with jieba
result = ' '.join(cut_text)
# Load the stopword list
exclude = []
with open(r"/home/mw/中文停用词库.txt", 'r', encoding='gbk') as f:
    lines = f.readlines()
    for line in lines:
        exclude.append(line.strip())
# Add extra custom stopwords
exclude.extend([""])
gen_stylecloud(
text=result, size=(1000, 800), max_words=500, max_font_size=80, font_path='simhei.ttf',
icon_name='fas fa-smile', output_name='school1.png',
# background_color='#05243F',
custom_stopwords=exclude,
)
Image(filename='school1.png')
Output:
2.4 Overview of the university comment data
Data download: follow the WeChat official account 【布鲁的Python之旅】 and reply with the keyword 【虎扑高校数据】 to get it.
df1 = pd.read_excel(r"/home/mw/input/hupu7901/高校评论_231128_1701151794.xlsx")
df1.head(10)
Output:
print("——" * 10)
print('数据集存在重复值个数:')
print(df1.duplicated().sum())
print("——" * 10)
print('数据集缺失值情况:')
print(df1.isna().sum())
print("——" * 10)
print('数据集各字段类型:')
print(df1.dtypes)
print("——" * 10)
print('数据总体概览:')
print(df1.info())
Output:
————————————————————
数据集存在重复值个数:
0
————————————————————
数据集缺失值情况:
school_type 0
nodeId 0
bizId 0
school_name 0
commentId 0
commentUserId 0
commentUserName 0
commentContent 0
lightCount 0
blackCount 0
score 0
commentDate 0
publishTime 0
爬取时间 0
dtype: int64
————————————————————
数据集各字段类型:
school_type object
nodeId int64
bizId int64
school_name object
commentId int64
commentUserId int64
commentUserName object
commentContent object
lightCount int64
blackCount int64
score int64
commentDate object
publishTime object
爬取时间 object
dtype: object
————————————————————
数据总体概览:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131409 entries, 0 to 131408
Data columns (total 14 columns):
school_type 131409 non-null object
nodeId 131409 non-null int64
bizId 131409 non-null int64
school_name 131409 non-null object
commentId 131409 non-null int64
commentUserId 131409 non-null int64
commentUserName 131409 non-null object
commentContent 131409 non-null object
lightCount 131409 non-null int64
blackCount 131409 non-null int64
score 131409 non-null int64
commentDate 131409 non-null object
publishTime 131409 non-null object
爬取时间 131409 non-null object
dtypes: int64(7), object(7)
memory usage: 14.0+ MB
None
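publishTime and commentDate are read in as object (strings). If you want to analyze comment activity over time, they can be converted first; a minimal sketch, assuming the columns hold standard date/time strings:
# Convert the timestamp-like columns to datetime; unparseable values become NaT
df1["publishTime"] = pd.to_datetime(df1["publishTime"], errors="coerce")
df1["commentDate"] = pd.to_datetime(df1["commentDate"], errors="coerce")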
2.5 Hupu: comment counts by region in China
# Count comments per region and rename the columns to 区域 / 评论数
school_type_counts = df1['school_type'].value_counts().reset_index()
school_type_counts.rename(columns={"index": "区域", "school_type": "评论数"}, inplace=True)
school_type_counts
Output:
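The same counts are easier to compare as a chart; a minimal plotly sketch reusing school_type_counts from above (fig3 is just an illustrative name):
# Bar chart of comment counts per region
fig3 = px.bar(school_type_counts, x="区域", y="评论数",
              title="虎扑-国内各区域评论数分布", template="plotly_white")
fig3.show()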
2.6 Hupu: comment counts per university within each region
school_names_size = df1.groupby(['school_type', 'school_name']).size().sort_values(ascending=False).reset_index()
school_names_size.rename(columns={0: "评论数"}, inplace=True)
school_names_size.head(20)
Output:
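The top-20 slice can likewise be plotted directly; a minimal sketch reusing school_names_size from above (fig4 is just an illustrative name):
# Bar chart of the 20 most-commented universities, colored by region
top20 = school_names_size.head(20)
fig4 = px.bar(top20, x="school_name", y="评论数", color="school_type",
              title="虎扑高校评论数Top20", template="plotly_white")
fig4.show()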
2.7 Word cloud of all comment data
comment_texts = df1['commentContent']
title_content = ','.join([str(til).replace(' ', '') for til in comment_texts.to_list()])
cut_text = jieba.cut(title_content)  # word segmentation with jieba
result = ' '.join(cut_text)
# Load the stopword list (same GBK-encoded file as in Section 2.3)
exclude = []
with open(r"/home/mw/中文停用词库.txt", 'r', encoding='gbk') as f:
    lines = f.readlines()
    for line in lines:
        exclude.append(line.strip())
# Add extra custom stopwords
exclude.extend([""])
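The snippet above stops after building the stopword list; generating the cloud itself would mirror Section 2.3. A minimal sketch (the icon and the output file name 'school2.png' are assumed, not from the original):
gen_stylecloud(
    text=result, size=(1000, 800), max_words=500, max_font_size=80, font_path='simhei.ttf',
    icon_name='fas fa-comment',  # assumed icon, analogous to Section 2.3
    output_name='school2.png',   # assumed output file name
    custom_stopwords=exclude,
)
Image(filename='school2.png')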
Output:
Full code 👇
https://www.heywhale.com/mw/project/6566d7434355fb41e6fccdca
P.S. The notebook can be run online directly, so there is no need to worry about environment setup.
Dataset download
Follow the official account and reply with the keyword 【虎扑高校数据】 to get it.
- END -
👆 Follow **「布鲁的Python之旅」** to get updates as soon as they are published