📣 Preface
- 👓 Visualization: mainly plotly
- 🔎 Data processing: mainly pandas
- 🕷️ Data scraping: mainly requests
- 👉 This article is my original work on the Heywhale (和鲸社区) community
This article uses Python to analyze and visualize Hupu (虎扑) users' ratings and candid reviews of Chinese universities.
Step 1. Import modules
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import jieba
from stylecloud import gen_stylecloud
from snownlp import SnowNLP
from IPython.display import Image  # used to display local image files in JupyterLab
Step 2. Data analysis and visualization
2.1 Overview of the university rating data
Data download: follow the WeChat official account 【布鲁的Python之旅】 and reply with the keyword 【虎扑高校数据】 to get it.
df = pd.read_excel(r"/home/mw/input/hupu7901/高校评分-1_231129_1701226820.xlsx")
df.head(10)
Output:
print("——" * 10)
print('数据集存在重复值个数:')
print(df.duplicated().sum())
print("——" * 10)
print('数据集缺失值情况:')
print(df.isna().sum())
print("——" * 10)
print('数据集各字段类型:')
print(df.dtypes)
print("——" * 10)
print('数据总体概览:')
print(df.info())
Output:
————————————————————
数据集存在重复值个数:
0
————————————————————
数据集缺失值情况:
school_type 0
nodeId 0
bizId 0
name 0
scorePersonCount 0
totalScore 0
scoreAvg 0
scoreDistribution 0
hottestComments 0
爬取时间 0
dtype: int64
————————————————————
数据集各字段类型:
school_type object
nodeId int64
bizId int64
name object
scorePersonCount int64
totalScore int64
scoreAvg float64
scoreDistribution object
hottestComments object
爬取时间 object
dtype: object
————————————————————
数据总体概览:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8586 entries, 0 to 8585
Data columns (total 10 columns):
school_type 8586 non-null object
nodeId 8586 non-null int64
bizId 8586 non-null int64
name 8586 non-null object
scorePersonCount 8586 non-null int64
totalScore 8586 non-null int64
scoreAvg 8586 non-null float64
scoreDistribution 8586 non-null object
hottestComments 8586 non-null object
爬取时间 8586 non-null object
dtypes: float64(1), int64(4), object(5)
memory usage: 670.9+ KB
None
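Before moving on to the charts, a quick numeric summary of the rating columns gives a feel for the score ranges. A small optional check, using only the columns shown above:
# Summary statistics for the numeric rating columns
df[["scorePersonCount", "totalScore", "scoreAvg"]].describe()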
2.2 Top 20 universities by number of Hupu raters
df_sorted = df.sort_values(by='scorePersonCount', ascending=False).head(20)
from plotly.subplots import make_subplots

fig2 = make_subplots(specs=[[{"secondary_y": True}]])
# Line: average score, plotted on the secondary y-axis
fig2.add_trace(
    go.Scatter(x=df_sorted["name"], y=df_sorted["scoreAvg"], name="平均分"), secondary_y=True
)
# Bar: number of raters, plotted on the primary y-axis
fig2.add_trace(
    go.Bar(x=df_sorted["name"], y=df_sorted["scorePersonCount"], name="评分人数")
)
fig2.update_yaxes(tickformat="", showgrid=False, secondary_y=True)
fig2.update_layout(title_text="虎扑评分人数Top20", template="plotly_white")
Output:
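If the long school names overlap on the x-axis, tilting the tick labels is an easy optional tweak:
fig2.update_xaxes(tickangle=45)  # optional: tilt x-axis labels for readability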
2.3 Word cloud of hot Hupu comments on universities
import ast

# hottestComments stores list-like strings; parse them back into Python lists
hottestComments = df["hottestComments"].tolist()
hots = []
for hottestComment in hottestComments:
    hottestComment = ast.literal_eval(hottestComment)
    for h in hottestComment:
        hots.append(h)
title_content = ','.join(hots)
cut_text = jieba.cut(title_content)  # word segmentation with jieba
result = ' '.join(cut_text)
# Load the stopword list
exclude = []
with open(r"/home/mw/中文停用词库.txt", 'r', encoding='gbk') as f:
    lines = f.readlines()
    for line in lines:
        exclude.append(line.strip())
# Add extra custom stopwords
exclude.extend([""])
gen_stylecloud(
text=result, size=(1000, 800), max_words=500, max_font_size=80, font_path='simhei.ttf',
icon_name='fas fa-smile', output_name='school1.png',
# background_color='#05243F',
custom_stopwords=exclude,
)
Image(filename='school1.png')
Output:
2.4 Overview of the university comment data
Data download: follow the WeChat official account 【布鲁的Python之旅】 and reply with the keyword 【虎扑高校数据】 to get it.
df1 = pd.read_excel(r"/home/mw/input/hupu7901/高校评论_231128_1701151794.xlsx")
df1.head(10)
Output:
print("——" * 10)
print('数据集存在重复值个数:')
print(df1.duplicated().sum())
print("——" * 10)
print('数据集缺失值情况:')
print(df1.isna().sum())
print("——" * 10)
print('数据集各字段类型:')
print(df1.dtypes)
print("——" * 10)
print('数据总体概览:')
print(df1.info())
Output:
————————————————————
数据集存在重复值个数:
0
————————————————————
数据集缺失值情况:
school_type 0
nodeId 0
bizId 0
school_name 0
commentId 0
commentUserId 0
commentUserName 0
commentContent 0
lightCount 0
blackCount 0
score 0
commentDate 0
publishTime 0
爬取时间 0
dtype: int64
————————————————————
数据集各字段类型:
school_type object
nodeId int64
bizId int64
school_name object
commentId int64
commentUserId int64
commentUserName object
commentContent object
lightCount int64
blackCount int64
score int64
commentDate object
publishTime object
爬取时间 object
dtype: object
————————————————————
数据总体概览:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131409 entries, 0 to 131408
Data columns (total 14 columns):
school_type 131409 non-null object
nodeId 131409 non-null int64
bizId 131409 non-null int64
school_name 131409 non-null object
commentId 131409 non-null int64
commentUserId 131409 non-null int64
commentUserName 131409 non-null object
commentContent 131409 non-null object
lightCount 131409 non-null int64
blackCount 131409 non-null int64
score 131409 non-null int64
commentDate 131409 non-null object
publishTime 131409 non-null object
爬取时间 131409 non-null object
dtypes: int64(7), object(7)
memory usage: 14.0+ MB
None
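publishTime and commentDate are read in as object (strings). If you want to analyze comment activity over time, they can be converted first; a minimal sketch, assuming the columns hold standard date/time strings:
# Convert the timestamp-like columns to datetime; unparseable values become NaT
df1["publishTime"] = pd.to_datetime(df1["publishTime"], errors="coerce")
df1["commentDate"] = pd.to_datetime(df1["commentDate"], errors="coerce")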
2.5 Hupu: comment counts by region in China
# Count comments per region and rename the columns to 区域 / 评论数
school_type_counts = df1['school_type'].value_counts().reset_index()
school_type_counts.rename(columns={"index": "区域", "school_type": "评论数"}, inplace=True)
school_type_counts
Output:
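The same counts are easier to compare as a chart; a minimal plotly sketch reusing school_type_counts from above (fig3 is just an illustrative name):
# Bar chart of comment counts per region
fig3 = px.bar(school_type_counts, x="区域", y="评论数",
              title="虎扑-国内各区域评论数分布", template="plotly_white")
fig3.show()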
2.6 Hupu: comment counts per university within each region
school_names_size = df1.groupby(['school_type', 'school_name']).size().sort_values(ascending=False).reset_index()
school_names_size.rename(columns={0: "评论数"}, inplace=True)
school_names_size.head(20)
Output:
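The top-20 slice can likewise be plotted directly; a minimal sketch reusing school_names_size from above (fig4 is just an illustrative name):
# Bar chart of the 20 most-commented universities, colored by region
top20 = school_names_size.head(20)
fig4 = px.bar(top20, x="school_name", y="评论数", color="school_type",
              title="虎扑高校评论数Top20", template="plotly_white")
fig4.show()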
2.7 Word cloud of all comment data
comment_texts = df1['commentContent']
title_content = ','.join([str(til).replace(' ', '') for til in comment_texts.to_list()])
cut_text = jieba.cut(title_content)  # word segmentation with jieba
result = ' '.join(cut_text)
# Load the stopword list (same GBK-encoded file as in Section 2.3)
exclude = []
with open(r"/home/mw/中文停用词库.txt", 'r', encoding='gbk') as f:
    lines = f.readlines()
    for line in lines:
        exclude.append(line.strip())
# Add extra custom stopwords
exclude.extend([""])
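The snippet above stops after building the stopword list; generating the cloud itself would mirror Section 2.3. A minimal sketch (the icon and the output file name 'school2.png' are assumed, not from the original):
gen_stylecloud(
    text=result, size=(1000, 800), max_words=500, max_font_size=80, font_path='simhei.ttf',
    icon_name='fas fa-comment',  # assumed icon, analogous to Section 2.3
    output_name='school2.png',   # assumed output file name
    custom_stopwords=exclude,
)
Image(filename='school2.png')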
Output:
Full code 👇
https://www.heywhale.com/mw/project/6566d7434355fb41e6fccdca
P.S. The notebook can be run online directly, so there is no need to worry about environment setup.
Dataset download
Follow the official account and reply with the keyword 【虎扑高校数据】 to get it.
- END -
👆 Follow **「布鲁的Python之旅」** to get updates as soon as they are published