Visualizing global video game sales data and predicting sales with support vector regression
Data visualization
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [20]:
df = pd.read_csv("/home/mw/input/games6014/games_utf8.csv")
In [9]:
df.head(5)
Out[9]:
   index                      Name  Platform  Year_of_Release         Genre  Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  Critic_Score  Critic_Count  User_Score  User_Count  Developer  Rating
0      0                Wii Sports       Wii             2006        Sports   Nintendo     41.36     28.96      3.77         8.45         82.53          76.0          51.0           8       322.0   Nintendo       E
1      1         Super Mario Bros.       NES             1985      Platform   Nintendo     29.08      3.58      6.81         0.77         40.24           NaN           NaN         NaN         NaN        NaN     NaN
2      2            Mario Kart Wii       Wii             2008        Racing   Nintendo     15.68     12.76      3.79         3.29         35.52          82.0          73.0         8.3       709.0   Nintendo       E
3      3         Wii Sports Resort       Wii             2009        Sports   Nintendo     15.61     10.93      3.28         2.95         32.77          80.0          73.0           8       192.0   Nintendo       E
4      4  Pokemon Red/Pokemon Blue        GB             1996  Role-Playing   Nintendo     11.27      8.89     10.22         1.00         31.37           NaN           NaN         NaN         NaN        NaN     NaN
In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16928 entries, 0 to 16927
Data columns (total 17 columns):
index              16928 non-null int64
Name               16926 non-null object
Platform           16928 non-null object
Year_of_Release    16655 non-null object
Genre              16926 non-null object
Publisher          16873 non-null object
NA_Sales           16928 non-null float64
EU_Sales           16928 non-null float64
JP_Sales           16928 non-null float64
Other_Sales        16928 non-null float64
Global_Sales       16926 non-null float64
Critic_Score       8260 non-null float64
Critic_Count       8260 non-null float64
User_Score         10159 non-null object
User_Count         7718 non-null float64
Developer          10240 non-null object
Rating             10092 non-null object
dtypes: float64(8), int64(1), object(8)
memory usage: 2.2+ MB
Number of distinct values per field
The data contains:
16928 sales records
11562 distinct game titles
33 distinct platforms
In [11]:
df.nunique()
Out[11]:
index              16928
Name               11562
Platform              33
Year_of_Release       40
Genre                 14
Publisher            582
NA_Sales             402
EU_Sales             307
JP_Sales             244
Other_Sales          155
Global_Sales         629
Critic_Score          82
Critic_Count         106
User_Score            96
User_Count           888
Developer           1696
Rating                 8
dtype: int64
Best-selling games
After combining the entries for the same game across platforms, these are the 5 best-selling games, with sales in millions of copies.
In [13]:
df.groupby('Name')['Global_Sales'].sum().sort_values(ascending=False).head()
Out[13]:
Name
Wii Sports            82.53
Grand Theft Auto V    56.57
Super Mario Bros.     45.31
Tetris                35.84
Mario Kart Wii        35.52
Name: Global_Sales, dtype: float64
In [14]:
df.groupby('Name')['Global_Sales'].sum().describe().T
Out[14]:
count    11562.000000
mean         0.789696
std          2.284268
min          0.000000
25%          0.060000
50%          0.190000
75%          0.610000
max         82.530000
Name: Global_Sales, dtype: float64
Data cleaning
Keep only the fields that are needed and drop rows with missing values.
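Selecting columns by position depends on the column order in the CSV file. As a hedged alternative, the sketch below selects the same kind of range by label on a small made-up frame. The rows and the names `toy`, `kept`, and `cleaned` are illustrative and do not come from the dataset. Note that a label slice with `.loc` includes both endpoints.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up rows) with a few of the dataset's column names,
# plus one trailing column that should be discarded.
toy = pd.DataFrame({
    "index": [0, 1, 2],
    "Name": ["A", None, "C"],
    "Year_of_Release": ["2006", "1985", "2008"],
    "Global_Sales": [1.0, 2.0, np.nan],
    "Critic_Score": [76.0, np.nan, 82.0],
})

# Label-based slice: keeps every column from "index" through
# "Global_Sales", inclusive of both endpoints.
kept = toy.loc[:, "index":"Global_Sales"]

# Drop rows that still contain a missing value, then sort by release year.
cleaned = kept.dropna().sort_values(by="Year_of_Release")

print(list(cleaned.columns))
print(len(toy) - len(cleaned), "rows dropped")
```

Applied to the notebook's frame, the equivalent call would be `df.loc[:, 'index':'Global_Sales']`, assuming the columns appear in the order shown by `df.info()`.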
In [21]:
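# Keep the first 11 columns (index through Global_Sales)
df = df.iloc[:, 0:11]
# Drop rows with missing values and sort by release year
df = df.dropna().sort_values(by='Year_of_Release')
Line charts of game sales over time
First, aggregate the data so that it is easier to plot.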
In [26]:
# Sum the numeric columns per release year (numeric_only=True keeps the
# text columns out of the sum on newer pandas versions)
df_year = df.groupby('Year_of_Release').sum(numeric_only=True)
# Number of games released per year
df_year['count'] = df.groupby('Year_of_Release').count()['index']
df_year = df_year.reset_index()
# Average global sales per game for each year
df_year['sales_to_count'] = df_year['Global_Sales'] / df_year['count']
df_year['Year_of_Release'] = df_year['Year_of_Release'].astype('int32')
df_year
Out[26]:
    Year_of_Release     index  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  count  sales_to_count
 0             1980     29746     10.59      0.67      0.00         0.12         11.38      9        1.264444
 1             1981    189885     33.40      1.96      0.00         0.32         35.77     46        0.777609
 2             1982    199233     28.66      1.75      0.00         0.34         30.73     39        0.787949
 3             1983     56570      7.76      0.80      8.10         0.14         16.79     17        0.987647
 4             1984     22869     33.28      2.10     14.27         0.70         50.36     14        3.597143
 5             1985     55624     33.73      4.74     14.56         0.92         53.94     14        3.852857
 6             1986     35906     12.50      2.84     19.81         1.93         37.07     21        1.765238
 7             1987     54624      8.46      1.41     11.63         0.20         21.74     16        1.358750
 8             1988     37250     23.87      6.59     15.76         0.99         47.22     15        3.148000
 9             1989     40131     45.15      8.44     18.36         1.50         73.45     17        4.320588
10             1990     25062     25.46      7.63     14.88         1.40         49.39     16        3.086875
11             1991    229258     13.08      4.00     15.03         0.75         32.86     42        0.782381
12             1992    208574     33.89     11.71     28.91         1.65         76.17     43        1.771395
13             1993    395706     15.42      4.65     25.65         0.89         46.60     61        0.763934
14             1994    868973     28.16     14.88     33.99         2.20         79.18    121        0.654380
15             1995   1931422     24.84     14.90     46.99         2.64         89.36    222        0.402523
16             1996   1979410     90.34     49.85     59.62         8.03        207.83    267        0.778390
17             1997   1964191     95.79     49.03     49.46         9.29        203.47    293        0.694437
18             1998   2742972    137.91     71.32     52.86        11.64        273.86    386        0.709482
19             1999   2271203    127.94     63.50     52.57        10.18        254.34    343        0.741516
20             2000   2458768     96.14     53.51     42.77        11.75        204.10    354        0.576554
21             2001   3523887    180.42     96.74     41.92        23.15        342.25    490        0.698469
22             2002   7040098    222.19    114.10     42.60        28.52        407.95    841        0.485077
23             2003   6270393    200.59    106.57     35.82        26.42        369.66    784        0.471505
24             2004   5853176    225.21    110.89     41.73        48.68        426.83    754        0.566088
25             2005   7836046    245.33    122.02     54.49        40.71        463.04    944        0.490508
26             2006   9712227    267.34    131.03     75.11        55.15        529.11   1014        0.521805
27             2007  10522058    311.97    159.01     61.37        77.36        610.33   1207        0.505659
28             2008  12393789    356.60    184.03     60.67        83.17        684.76   1441        0.475198
29             2009  12809114    345.41    194.51     62.33        76.12        678.44   1448        0.468536
30             2010  11308061    305.09    173.47     60.66        59.32        599.05   1266        0.473183
31             2011  10270400    244.76    167.68     55.12        54.42        522.17   1146        0.455646
32             2012   5714648    159.17    117.52     53.26        37.05        367.00    662        0.554381
33             2013   4537392    154.62    122.69     47.69        38.72        363.72    549        0.662514
34             2014   5175494    137.61    127.50     41.08        38.08        344.23    595        0.578538
35             2015   6019607    110.64    100.81     34.45        31.75        277.71    612        0.453775
36             2016   5670596     48.21     52.96     22.17        14.98        138.48    505        0.274218
37             2017     46693      0.00      0.00      0.06         0.00          0.06      3        0.020000
38             2020      5936      0.27      0.00      0.00         0.02          0.29      1        0.290000
The left panel shows total global sales, that is, the sum of the global sales of all games released in that year.
The right panel shows global sales per game, that is, the average sales of a game released in that year.
In [32]:
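fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
# Left: total global sales per release year
sns.lineplot(x=df_year['Year_of_Release'], y=df_year['Global_Sales'], ax=ax0)
# Right: average global sales per game
sns.lineplot(x=df_year['Year_of_Release'], y=df_year['sales_to_count'], ax=ax1)
plt.show()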
Comparing sales across game genres
In [27]:
# Sum the numeric columns per genre
df_genre = df.groupby('Genre').sum(numeric_only=True)
# Number of games per genre
df_genre['count'] = df.groupby('Genre').count()['index']
df_genre = df_genre.reset_index()
# Average global sales per game for each genre
df_genre['genre_to_count'] = df_genre['Global_Sales'] / df_genre['count']
df_genre
Out[27]:
           Genre     index  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  count  genre_to_count
 0        Action  27483058    877.67    517.89    162.21       184.62       1743.46   3346        0.521058
 1     Adventure  15170655    102.19     63.73     53.02        16.60        235.65   1299        0.181409
 2      Fighting   6635454    231.01    103.54     91.63        36.93        463.06    851        0.544136
 3          Misc  15076819    408.60    216.54    108.27        76.14        810.20   1720        0.471047
 4      Platform   6335987    455.94    206.14    133.71        52.38        848.51    894        0.949116
 5        Puzzle   5558196    121.14     49.78     57.30        12.30        240.96    571        0.421996
 6        Racing   9985753    370.47    244.40     59.10        77.92        752.08   1240        0.606516
 7  Role-Playing  12300382    338.04    194.21    360.91        61.59        954.49   1501        0.635903
 8       Shooter   9912089    600.66    326.38     40.56       107.48       1075.56   1320        0.814818
 9    Simulation   7489053    183.35    114.01     63.64        30.78        391.92    861        0.455192
10        Sports  17690030    683.73    375.80    135.20       133.54       1329.00   2338        0.568435
11      Strategy   6869516     69.00     45.39     50.20        10.92        175.80    677        0.259675
In [33]:
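fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
# Left: total global sales per genre
sns.barplot(x=df_genre['Genre'], y=df_genre['Global_Sales'], ax=ax0)
ax0.set_xticklabels(ax0.get_xticklabels(), rotation=45)
# Right: average global sales per game in each genre
sns.barplot(x=df_genre['Genre'], y=df_genre['genre_to_count'], ax=ax1)
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45)
plt.show()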
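The per-year and per-genre summaries above follow the same pattern: group, sum the global sales, count the titles, then divide. One possible way to write that once is sketched below on made-up rows. It uses named aggregation, which is available in pandas 0.25 and later, so only the needed columns are computed and the sum of the `index` column (which carries no meaning) is avoided. The function name `summarize_sales`, the column name `sales_per_title`, and the toy data are assumptions for illustration.

```python
import pandas as pd

def summarize_sales(frame, key):
    """Total and per-title average of Global_Sales for each value of `key`."""
    summary = (
        frame.groupby(key)
        .agg(Global_Sales=("Global_Sales", "sum"),
             count=("Global_Sales", "count"))
        .reset_index()
    )
    summary["sales_per_title"] = summary["Global_Sales"] / summary["count"]
    return summary

# Made-up rows, only to show the shape of the result.
toy = pd.DataFrame({
    "Genre": ["Sports", "Sports", "Racing"],
    "Global_Sales": [4.0, 2.0, 3.0],
})
by_genre = summarize_sales(toy, "Genre")
print(by_genre)
```

With the notebook's frame, `summarize_sales(df, 'Genre')` and `summarize_sales(df, 'Year_of_Release')` would give the totals and averages plotted above.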
Sales prediction
Encoding
get_dummies is the pandas function for one-hot encoding.
In [36]:
df.head(1)
Out[36]:
      index     Name  Platform  Year_of_Release   Genre   Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
5360   5360  Freeway      2600             1980  Action  Activision      0.32      0.02       0.0          0.0          0.34
In [38]:
# One-hot encode the text columns among Platform through Global_Sales
df_code = pd.get_dummies(df.iloc[:, 2:11], drop_first=True)
df_code.head(1)
Out[38]:
      NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  Platform_3DO  Platform_3DS  Platform_DC  Platform_DS  Platform_GB  ...  Publisher_Zushi Games  Publisher_bitComposer Games  Publisher_dramatic create  Publisher_fonfun  Publisher_iWin  Publisher_id Software  Publisher_imageepoch Inc.  Publisher_inXile Entertainment  Publisher_mixi, Inc  Publisher_responDESIGN
5360      0.32      0.02       0.0          0.0          0.34             0             0            0            0            0  ...                      0                            0                          0                 0               0                      0                          0                               0                    0                       0
1 rows × 662 columns
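To see what `drop_first=True` does, here is a small sketch on a made-up three-row frame (the rows are illustrative, not taken from the dataset). Only the non-numeric columns are expanded. For each of them the first level in sorted order is dropped, so a column with k distinct values yields k-1 indicator columns.

```python
import pandas as pd

# Made-up rows: one text column and one numeric column.
toy = pd.DataFrame({
    "Platform": ["Wii", "NES", "Wii"],
    "Global_Sales": [82.53, 40.24, 35.52],
})

# "Platform" has two levels (NES, Wii); the first one, NES, is dropped,
# so a single indicator column for Wii remains.
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded["Platform_Wii"].astype(int).tolist())
```

A row with all indicator columns equal to zero then stands for the dropped level.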
Train/test split
Split the data into a training set and a test set, then standardize the features.
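Standardization rescales each feature to zero mean and unit variance. The statistics are estimated on the training set only and then reused for the test set, which is why the notebook calls `fit_transform` on the training data and `transform` on the test data. A NumPy sketch of that computation on made-up numbers follows; the arrays `toy_train` and `toy_test` are illustrative, and the population standard deviation (`ddof=0`) is used, as StandardScaler does.

```python
import numpy as np

# Made-up single-feature data.
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[2.0], [5.0]])

# Statistics are estimated on the training set only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # ddof=0, the population standard deviation

toy_train_scaled = (toy_train - mean) / std
toy_test_scaled = (toy_test - mean) / std  # reuse the training statistics

print(toy_train_scaled.ravel())
print(toy_test_scaled.ravel())
```

The scaled test set generally does not have exactly zero mean or unit variance, since its own statistics are never used.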
In [39]:
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler

X = df_code.drop('Global_Sales', axis=1).values
y = df_code['Global_Sales'].values
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)
In [40]:
# Standardize the features: fit on the training set, apply to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
In [41]:
X_train_scaled
Out[41]:
array([[ 0.58116251, -0.11136746, -0.26848048, ..., -0.00867338, -0.00867338, -0.01226648],
       [-0.16340037, -0.09217906, -0.26848048, ..., -0.00867338, -0.00867338, -0.01226648],
       [-0.2106742 , -0.16893264, -0.26848048, ..., -0.00867338, -0.00867338, -0.01226648],
       ...,
       [-0.31704033, -0.28406301, -0.09774392, ..., -0.00867338, -0.00867338, -0.01226648],
       [-0.31704033, -0.28406301, -0.23433317, ..., -0.00867338, -0.00867338, -0.01226648],
       [ 0.13206109,  0.48347278, -0.23433317, ..., -0.00867338, -0.00867338, -0.01226648]])
In [42]:
X_test_scaled
Out[42]:
array([[-0.31704033, -0.28406301,  0.82423352, ..., -0.00867338, -0.00867338, -0.01226648],
       [-0.31704033, -0.20730943, -0.26848048, ..., -0.00867338, -0.00867338, -0.01226648],
       [-0.00976041,  0.27240044, -0.20018586, ..., -0.00867338, -0.00867338, -0.01226648],
       ...,
       [-0.31704033, -0.16893264, -0.26848048, ..., -0.00867338, -0.00867338, -0.01226648],
       [-0.31704033, -0.28406301, -0.16603855, ..., -0.00867338, -0.00867338, -0.01226648],
       [-0.04521578, -0.16893264, -0.26848048, ..., -0.00867338, -0.00867338, -0.01226648]])
SVR model prediction
Create an SVR model, train it on the training set, predict on the test set, and print the mean squared error of the predictions.
In [44]:
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Create the SVR model
svr_model = SVR()
# Train the model on X_train_scaled and y_train
svr_model.fit(X_train_scaled, y_train)
# Predict on X_test_scaled with the trained model
y_pred = svr_model.predict(X_test_scaled)
# Compute the mean squared error of the predictions
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error:", mse)
/opt/conda/lib/python3.6/site-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Mean squared error: 0.44936424854549994
Visualizing the results
The black points are the true values and the blue line is the SVR prediction.
In [45]:
import matplotlib.pyplot as plt

# Plot the true values
plt.scatter(range(len(y_test)), y_test, color='black', label='True values')
# Plot the predicted values
plt.plot(range(len(y_test)), y_pred, color='blue', linewidth=3, label='Predicted values')
plt.xlabel('Sample')
plt.ylabel('Value')
plt.title('SVR Prediction Results')
plt.legend()
plt.show()
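A mean squared error is easier to judge next to a trivial reference. One common choice, sketched here with NumPy on made-up numbers, is the error obtained by always predicting the training-set mean; a useful model should score below it. The helper name `mean_baseline_mse` is hypothetical, and no baseline value for this dataset is claimed here.

```python
import numpy as np

def mean_baseline_mse(y_train, y_test):
    """MSE obtained by predicting the training mean for every test sample."""
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    constant_prediction = y_train.mean()
    return float(np.mean((y_test - constant_prediction) ** 2))

# Made-up targets: the training mean is 2.0, so the squared errors
# on the test targets 2.0 and 4.0 are 0.0 and 4.0.
baseline = mean_baseline_mse([1.0, 2.0, 3.0], [2.0, 4.0])
print(baseline)
```

With the notebook's variables this would be called as `mean_baseline_mse(y_train, y_test)` and compared with `mse`.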