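Video Game Sales: Data Visualization, Analysis, and Prediction

This notebook visualizes global video game sales data and uses support vector regression (SVR) to predict sales.

Data visualization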

In [1]:

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [20]:

df = pd.read_csv("/home/mw/input/games6014/games_utf8.csv")

In [9]:

df.head(5)

Out[9]:

   index                      Name Platform Year_of_Release         Genre Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  \
0      0                Wii Sports      Wii            2006        Sports  Nintendo     41.36     28.96      3.77         8.45         82.53
1      1         Super Mario Bros.      NES            1985      Platform  Nintendo     29.08      3.58      6.81         0.77         40.24
2      2            Mario Kart Wii      Wii            2008        Racing  Nintendo     15.68     12.76      3.79         3.29         35.52
3      3         Wii Sports Resort      Wii            2009        Sports  Nintendo     15.61     10.93      3.28         2.95         32.77
4      4  Pokemon Red/Pokemon Blue       GB            1996  Role-Playing  Nintendo     11.27      8.89     10.22         1.00         31.37

   Critic_Score  Critic_Count User_Score  User_Count Developer Rating
0          76.0          51.0          8       322.0  Nintendo      E
1           NaN           NaN        NaN         NaN       NaN    NaN
2          82.0          73.0        8.3       709.0  Nintendo      E
3          80.0          73.0          8       192.0  Nintendo      E
4           NaN           NaN        NaN         NaN       NaN    NaN

In [10]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16928 entries, 0 to 16927
Data columns (total 17 columns):
index              16928 non-null int64
Name               16926 non-null object
Platform           16928 non-null object
Year_of_Release    16655 non-null object
Genre              16926 non-null object
Publisher          16873 non-null object
NA_Sales           16928 non-null float64
EU_Sales           16928 non-null float64
JP_Sales           16928 non-null float64
Other_Sales        16928 non-null float64
Global_Sales       16926 non-null float64
Critic_Score       8260 non-null float64
Critic_Count       8260 non-null float64
User_Score         10159 non-null object
User_Count         7718 non-null float64
Developer          10240 non-null object
Rating             10092 non-null object
dtypes: float64(8), int64(1), object(8)
memory usage: 2.2+ MB
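
The `df.info()` output lists `Year_of_Release` and `User_Score` as `object`, which usually means some entries are not plain numbers. Below is a minimal sketch of one way to find and coerce such entries with `pd.to_numeric`. The toy frame and its non-numeric values (`'N/A'`, `'tbd'`) are assumptions for illustration, not confirmed contents of this dataset.

```python
import pandas as pd

# Toy frame standing in for the real data; the non-numeric entries are hypothetical.
toy = pd.DataFrame({
    "Year_of_Release": ["2006", "1985", "N/A", None],
    "User_Score": ["8", "tbd", "8.3", None],
})

# errors="coerce" turns anything unparseable into NaN instead of raising.
year_num = pd.to_numeric(toy["Year_of_Release"], errors="coerce")
score_num = pd.to_numeric(toy["User_Score"], errors="coerce")

# Entries that were present but failed to parse: these explain the object dtype.
bad_years = toy.loc[year_num.isna() & toy["Year_of_Release"].notna(), "Year_of_Release"]
print(bad_years.tolist())
print(year_num.dtype, score_num.dtype)
```

Inspecting `bad_years` before deciding how to handle those rows is safer than casting directly with `astype`, which fails on the first unparseable value.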

Count the distinct values in each column

The counts show:
16,928 sales records
11,562 distinct games
33 distinct game platforms

In [11]:

df.nunique()

Out[11]:

index              16928
Name               11562
Platform              33
Year_of_Release       40
Genre                 14
Publisher            582
NA_Sales             402
EU_Sales             307
JP_Sales             244
Other_Sales          155
Global_Sales         629
Critic_Score          82
Critic_Count         106
User_Score            96
User_Count           888
Developer           1696
Rating                 8
dtype: int64

Best-selling games

After grouping the same game across platforms, these are the top 5 best-selling games and their sales in millions of units.

In [13]:

df.groupby('Name')['Global_Sales'].sum().sort_values(ascending=False).head()

Out[13]:

Name
Wii Sports            82.53
Grand Theft Auto V    56.57
Super Mario Bros.     45.31
Tetris                35.84
Mario Kart Wii        35.52
Name: Global_Sales, dtype: float64

In [14]:

df.groupby('Name')['Global_Sales'].sum().describe().T

Out[14]:

count    11562.000000
mean         0.789696
std          2.284268
min          0.000000
25%          0.060000
50%          0.190000
75%          0.610000
max         82.530000
Name: Global_Sales, dtype: float64

Data cleaning

Keep only the columns that are needed and drop rows with missing values

In [21]:

df = df.iloc[:,0:11] 
df = df.dropna().sort_values(by='Year_of_Release')
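
Selecting columns by position with `iloc[:, 0:11]` depends on the column order of the CSV. A hedged alternative is to name the columns explicitly and report how many rows `dropna` removes. The column names come from the `df.info()` output; the toy rows are made up for illustration.

```python
import pandas as pd

# Columns kept by the cleaning step, listed by name (taken from the df.info() output).
keep = ["index", "Name", "Platform", "Year_of_Release", "Genre", "Publisher",
        "NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales", "Global_Sales"]

# Toy data: two complete rows and one with a missing year and publisher.
toy = pd.DataFrame({
    "index": [0, 1, 2],
    "Name": ["A", "B", "C"],
    "Platform": ["Wii", "NES", "GB"],
    "Year_of_Release": ["2006", None, "1996"],
    "Genre": ["Sports", "Platform", "Role-Playing"],
    "Publisher": ["Nintendo", None, "Nintendo"],
    "NA_Sales": [1.0, 2.0, 3.0],
    "EU_Sales": [0.5, 0.2, 0.1],
    "JP_Sales": [0.1, 0.3, 0.4],
    "Other_Sales": [0.0, 0.1, 0.0],
    "Global_Sales": [1.6, 2.6, 3.5],
    "Critic_Score": [76.0, None, None],  # dropped by the column selection
})

cleaned = toy[keep].dropna().sort_values(by="Year_of_Release")
print(f"dropped {len(toy) - len(cleaned)} of {len(toy)} rows")
```

Printing the number of dropped rows makes it easy to notice when cleaning discards more data than expected.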

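Plot line charts showing how game sales change over time.

First aggregate the data by release year so that it is easier to visualize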

In [26]:

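# numeric_only=True sums only the numeric columns; recent pandas versions no longer drop text columns silently
df_year = df.groupby('Year_of_Release').sum(numeric_only=True)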
df_year['count'] = df.groupby('Year_of_Release').count()['index']
df_year = df_year.reset_index()
df_year['sales_to_count'] = df_year['Global_Sales']/df_year['count']
df_year['Year_of_Release'] = df_year['Year_of_Release'].astype('int32')
df_year

Out[26]:

    Year_of_Release     index  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  count  sales_to_count
 0             1980     29746     10.59      0.67      0.00         0.12         11.38      9        1.264444
 1             1981    189885     33.40      1.96      0.00         0.32         35.77     46        0.777609
 2             1982    199233     28.66      1.75      0.00         0.34         30.73     39        0.787949
 3             1983     56570      7.76      0.80      8.10         0.14         16.79     17        0.987647
 4             1984     22869     33.28      2.10     14.27         0.70         50.36     14        3.597143
 5             1985     55624     33.73      4.74     14.56         0.92         53.94     14        3.852857
 6             1986     35906     12.50      2.84     19.81         1.93         37.07     21        1.765238
 7             1987     54624      8.46      1.41     11.63         0.20         21.74     16        1.358750
 8             1988     37250     23.87      6.59     15.76         0.99         47.22     15        3.148000
 9             1989     40131     45.15      8.44     18.36         1.50         73.45     17        4.320588
10             1990     25062     25.46      7.63     14.88         1.40         49.39     16        3.086875
11             1991    229258     13.08      4.00     15.03         0.75         32.86     42        0.782381
12             1992    208574     33.89     11.71     28.91         1.65         76.17     43        1.771395
13             1993    395706     15.42      4.65     25.65         0.89         46.60     61        0.763934
14             1994    868973     28.16     14.88     33.99         2.20         79.18    121        0.654380
15             1995   1931422     24.84     14.90     46.99         2.64         89.36    222        0.402523
16             1996   1979410     90.34     49.85     59.62         8.03        207.83    267        0.778390
17             1997   1964191     95.79     49.03     49.46         9.29        203.47    293        0.694437
18             1998   2742972    137.91     71.32     52.86        11.64        273.86    386        0.709482
19             1999   2271203    127.94     63.50     52.57        10.18        254.34    343        0.741516
20             2000   2458768     96.14     53.51     42.77        11.75        204.10    354        0.576554
21             2001   3523887    180.42     96.74     41.92        23.15        342.25    490        0.698469
22             2002   7040098    222.19    114.10     42.60        28.52        407.95    841        0.485077
23             2003   6270393    200.59    106.57     35.82        26.42        369.66    784        0.471505
24             2004   5853176    225.21    110.89     41.73        48.68        426.83    754        0.566088
25             2005   7836046    245.33    122.02     54.49        40.71        463.04    944        0.490508
26             2006   9712227    267.34    131.03     75.11        55.15        529.11   1014        0.521805
27             2007  10522058    311.97    159.01     61.37        77.36        610.33   1207        0.505659
28             2008  12393789    356.60    184.03     60.67        83.17        684.76   1441        0.475198
29             2009  12809114    345.41    194.51     62.33        76.12        678.44   1448        0.468536
30             2010  11308061    305.09    173.47     60.66        59.32        599.05   1266        0.473183
31             2011  10270400    244.76    167.68     55.12        54.42        522.17   1146        0.455646
32             2012   5714648    159.17    117.52     53.26        37.05        367.00    662        0.554381
33             2013   4537392    154.62    122.69     47.69        38.72        363.72    549        0.662514
34             2014   5175494    137.61    127.50     41.08        38.08        344.23    595        0.578538
35             2015   6019607    110.64    100.81     34.45        31.75        277.71    612        0.453775
36             2016   5670596     48.21     52.96     22.17        14.98        138.48    505        0.274218
37             2017     46693      0.00      0.00      0.06         0.00          0.06      3        0.020000
38             2020      5936      0.27      0.00      0.00         0.02          0.29      1        0.290000

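The left panel shows total global sales, i.e. the sum of the global sales of all games released in that year.
The right panel shows global sales per game, i.e. the average sales of a game released in that year.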

In [32]:

fig, (ax0,ax1) = plt.subplots(nrows=1,ncols=2,figsize=(15,5))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.lineplot(x=df_year['Year_of_Release'], y=df_year['Global_Sales'], ax=ax0)
sns.lineplot(x=df_year['Year_of_Release'], y=df_year['sales_to_count'], ax=ax1)
plt.show()
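
The two panels are easier to tell apart with titles and axis labels. A minimal sketch using plain matplotlib on four rows copied from the yearly table (2006 to 2009); the frame name `df_year_demo` and the label wording are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Four rows copied from the yearly table (2006-2009).
df_year_demo = pd.DataFrame({
    "Year_of_Release": [2006, 2007, 2008, 2009],
    "Global_Sales": [529.11, 610.33, 684.76, 678.44],
    "count": [1014, 1207, 1441, 1448],
})
df_year_demo["sales_to_count"] = df_year_demo["Global_Sales"] / df_year_demo["count"]

fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))

# Left panel: total sales of all games released in each year.
ax0.plot(df_year_demo["Year_of_Release"], df_year_demo["Global_Sales"])
ax0.set_title("Total global sales per release year")
ax0.set_xlabel("Year of release")
ax0.set_ylabel("Global sales (millions)")

# Right panel: total sales divided by the number of games released.
ax1.plot(df_year_demo["Year_of_Release"], df_year_demo["sales_to_count"])
ax1.set_title("Average global sales per game")
ax1.set_xlabel("Year of release")
ax1.set_ylabel("Global sales per game (millions)")

fig.tight_layout()
```

In a notebook the `matplotlib.use("Agg")` line can be dropped and `plt.show()` added at the end.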

Compare sales across game genres

In [27]:

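# numeric_only=True sums only the numeric columns; recent pandas versions no longer drop text columns silently
df_genre = df.groupby('Genre').sum(numeric_only=True)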
df_genre['count'] = df.groupby('Genre').count()['index']
df_genre = df_genre.reset_index()
df_genre['genre_to_count'] = df_genre['Global_Sales']/df_genre['count']
df_genre

Out[27]:

           Genre     index  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  count  genre_to_count
 0        Action  27483058    877.67    517.89    162.21       184.62       1743.46   3346        0.521058
 1     Adventure  15170655    102.19     63.73     53.02        16.60        235.65   1299        0.181409
 2      Fighting   6635454    231.01    103.54     91.63        36.93        463.06    851        0.544136
 3          Misc  15076819    408.60    216.54    108.27        76.14        810.20   1720        0.471047
 4      Platform   6335987    455.94    206.14    133.71        52.38        848.51    894        0.949116
 5        Puzzle   5558196    121.14     49.78     57.30        12.30        240.96    571        0.421996
 6        Racing   9985753    370.47    244.40     59.10        77.92        752.08   1240        0.606516
 7  Role-Playing  12300382    338.04    194.21    360.91        61.59        954.49   1501        0.635903
 8       Shooter   9912089    600.66    326.38     40.56       107.48       1075.56   1320        0.814818
 9    Simulation   7489053    183.35    114.01     63.64        30.78        391.92    861        0.455192
10        Sports  17690030    683.73    375.80    135.20       133.54       1329.00   2338        0.568435
11      Strategy   6869516     69.00     45.39     50.20        10.92        175.80    677        0.259675

In [33]:

fig, (ax0,ax1) = plt.subplots(nrows=1,ncols=2,figsize=(15,5))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.barplot(x=df_genre['Genre'], y=df_genre['Global_Sales'], ax=ax0)
ax0.set_xticklabels(ax0.get_xticklabels(), rotation=45)
sns.barplot(x=df_genre['Genre'], y=df_genre['genre_to_count'], ax=ax1)
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45)
plt.show()

Sales prediction

Encoding

get_dummies is the pandas function for one-hot encoding categorical columns

In [36]:

df.head(1)

Out[36]:

      index     Name Platform Year_of_Release   Genre   Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
5360   5360  Freeway     2600            1980  Action  Activision      0.32      0.02       0.0          0.0          0.34

In [38]:

df_code = pd.get_dummies(df.iloc[:,2:11], drop_first=True)
df_code.head(1)

Out[38]:

      NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  Platform_3DO  Platform_3DS  Platform_DC  Platform_DS  Platform_GB  ...  \
5360      0.32      0.02       0.0          0.0          0.34             0             0            0            0            0  ...

      Publisher_Zushi Games  Publisher_bitComposer Games  Publisher_dramatic create  Publisher_fonfun  Publisher_iWin  \
5360                      0                            0                          0                 0               0

      Publisher_id Software  Publisher_imageepoch Inc.  Publisher_inXile Entertainment  Publisher_mixi, Inc  Publisher_responDESIGN
5360                      0                          0                               0                    0                       0

1 rows × 662 columns
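
To see what `get_dummies` with `drop_first=True` does on a small scale, here is a sketch on a three-row toy frame. The values are taken from the `df.head(5)` output, but the frame itself is only an illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Platform": ["Wii", "NES", "GB"],
    "Genre": ["Sports", "Platform", "Sports"],
    "Global_Sales": [82.53, 40.24, 31.37],
})

# Numeric columns pass through unchanged; each text column becomes one
# indicator column per category, minus the first category (drop_first=True).
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

With `drop_first=True`, the first category of each column (here `GB` and `Platform`) has no indicator of its own: a row belongs to it when all indicators for that column are zero. This avoids one redundant column per encoded field.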

Train/test split

Split the data into training and test sets and standardize the features

In [39]:

from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler

X = df_code.drop('Global_Sales', axis=1).values
y = df_code['Global_Sales'].values

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 43)
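
One thing worth checking before modelling: the feature matrix built from `df_code` still contains the four regional sales columns, and in the rows shown earlier these add up almost exactly to `Global_Sales` (for Wii Sports, 41.36 + 28.96 + 3.77 + 8.45 = 82.54 against a reported 82.53). A model given those columns is largely learning a sum. The sketch below, on three rows copied from the `df.head(5)` output, shows the check and one possible way to exclude the regional columns. Whether to exclude them depends on what the prediction is meant to demonstrate, so this is an option to consider and not the notebook's method.

```python
import pandas as pd

region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Three rows copied from the df.head(5) output.
demo = pd.DataFrame({
    "Platform": ["Wii", "NES", "Wii"],
    "NA_Sales": [41.36, 29.08, 15.68],
    "EU_Sales": [28.96, 3.58, 12.76],
    "JP_Sales": [3.77, 6.81, 3.79],
    "Other_Sales": [8.45, 0.77, 3.29],
    "Global_Sales": [82.53, 40.24, 35.52],
})

# Largest gap between the regional sum and the reported global figure.
gap = (demo[region_cols].sum(axis=1) - demo["Global_Sales"]).abs().max()
print(f"max |regional sum - global| = {gap:.2f}")

# Optional: a feature frame without the regional columns or the target.
X_no_regions = demo.drop(columns=region_cols + ["Global_Sales"])
```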

In [40]:

# Standardize the features (fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
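
The scaler is fitted on the training split only, and the test split is transformed with the training mean and standard deviation, so no test-set statistics leak into training. A minimal NumPy sketch of the arithmetic that `StandardScaler` performs with its default settings (per-column mean and population standard deviation); the arrays are made up.

```python
import numpy as np

X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_demo = np.array([[4.0, 40.0]])

# Statistics come from the training split only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # ddof=0, i.e. population standard deviation

train_scaled = (X_train_demo - mean) / std
test_scaled = (X_test_demo - mean) / std  # reuses the training statistics

print(train_scaled.mean(axis=0))  # close to 0 in every column
print(test_scaled)
```

The scaled test rows do not have zero mean or unit variance in general, which is expected: they are measured against the training distribution.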

In [41]:

X_train_scaled

Out[41]:

array([[ 0.58116251, -0.11136746, -0.26848048, ..., -0.00867338,
        -0.00867338, -0.01226648],
       [-0.16340037, -0.09217906, -0.26848048, ..., -0.00867338,
        -0.00867338, -0.01226648],
       [-0.2106742 , -0.16893264, -0.26848048, ..., -0.00867338,
        -0.00867338, -0.01226648],
       ...,
       [-0.31704033, -0.28406301, -0.09774392, ..., -0.00867338,
        -0.00867338, -0.01226648],
       [-0.31704033, -0.28406301, -0.23433317, ..., -0.00867338,
        -0.00867338, -0.01226648],
       [ 0.13206109,  0.48347278, -0.23433317, ..., -0.00867338,
        -0.00867338, -0.01226648]])

In [42]:

X_test_scaled

Out[42]:

array([[-0.31704033, -0.28406301,  0.82423352, ..., -0.00867338,
        -0.00867338, -0.01226648],
       [-0.31704033, -0.20730943, -0.26848048, ..., -0.00867338,
        -0.00867338, -0.01226648],
       [-0.00976041,  0.27240044, -0.20018586, ..., -0.00867338,
        -0.00867338, -0.01226648],
       ...,
       [-0.31704033, -0.16893264, -0.26848048, ..., -0.00867338,
        -0.00867338, -0.01226648],
       [-0.31704033, -0.28406301, -0.16603855, ..., -0.00867338,
        -0.00867338, -0.01226648],
       [-0.04521578, -0.16893264, -0.26848048, ..., -0.00867338,
        -0.00867338, -0.01226648]])

Prediction with the SVR model

Train an SVR model, predict on the test set, and print the mean squared error of the predictions

In [44]:

from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Create the SVR model
svr_model = SVR()

# Train the model on X_train_scaled and y_train
svr_model.fit(X_train_scaled, y_train)

# Predict on X_test_scaled with the trained model
y_pred = svr_model.predict(X_test_scaled)

# Compute the mean squared error of the predictions
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error:", mse)
/opt/conda/lib/python3.6/site-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Mean squared error: 0.44936424854549994
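
Two small refinements can make this step easier to interpret. Setting `gamma` explicitly removes the `FutureWarning`, and comparing the MSE with a trivial baseline (always predicting the training mean) shows whether the model has learned anything. The sketch below runs on synthetic data generated only for illustration; the names `X_demo`, `y_demo`, and the split sizes are assumptions, and the numbers it prints say nothing about the game sales data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

# Synthetic regression data, only to make the sketch runnable.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = X_demo[:160], X_demo[160:], y_demo[:160], y_demo[160:]

# Setting gamma explicitly avoids the FutureWarning about the changing default.
model = SVR(kernel="rbf", gamma="scale")
model.fit(X_tr, y_tr)
mse_model = mean_squared_error(y_te, model.predict(X_te))

# Baseline: always predict the training mean.
mse_baseline = mean_squared_error(y_te, np.full_like(y_te, y_tr.mean()))

print("SVR MSE:", mse_model)
print("Baseline MSE:", mse_baseline)
print("SVR RMSE:", np.sqrt(mse_model))  # same units as the target
```

The RMSE is in the same units as the target (millions of units sold in the notebook), which is often easier to read than the squared error.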

Visualizing the results

Black points are the true values; the blue line is the SVR prediction

In [45]:

import matplotlib.pyplot as plt

# Plot the true values
plt.scatter(range(len(y_test)), y_test, color='black', label='True values')

# Plot the predicted values
plt.plot(range(len(y_test)), y_pred, color='blue', linewidth=3, label='Predicted values')

plt.xlabel('Sample')
plt.ylabel('Value')
plt.title('SVR Prediction Results')
plt.legend()
plt.show()
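
Because the test samples have no natural order, a line drawn through the predictions by sample index can be hard to read. A hedged alternative is to plot predicted against true values with a diagonal reference line. The arrays below are made-up stand-ins for `y_test` and `y_pred`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up values standing in for y_test and y_pred.
y_true_demo = np.array([0.1, 0.5, 1.2, 3.0, 8.0])
y_pred_demo = np.array([0.2, 0.4, 1.0, 2.5, 5.0])

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_true_demo, y_pred_demo, color="blue", label="Predictions")

# Points on the diagonal are perfect predictions.
lim = max(y_true_demo.max(), y_pred_demo.max())
ax.plot([0, lim], [0, lim], color="black", linewidth=1, label="y = x")

ax.set_xlabel("True Global_Sales")
ax.set_ylabel("Predicted Global_Sales")
ax.set_title("SVR: predicted vs. true")
ax.legend()
```

Points below the diagonal are under-predictions and points above it are over-predictions, so systematic bias on high-selling games shows up directly.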
