Google Play Store Game Data Analysis


德德德真的是我


Category: Data Analysis. Tags: visualization, games, data analysis, python

Article link: https://blog.csdn.net/EvaHoo/article/details/107851724


Dataset: Google Play Store Apps. URL: https://www.kaggle.com/lava18/google-play-store-apps?select=googleplaystore.csv

The dataset contains two CSV files: one with overall data on Google Play Store apps, and one with user reviews.
The user review data is highly subjective and sparse, so this analysis uses the overall app data.
The Google Play Store file has 13 fields:
App: Application name
Category: Category the app belongs to
Rating: Overall user rating of the app (as when scraped)
Reviews: Number of user reviews for the app (as when scraped)
Size: Size of the app (as when scraped)
Installs: Number of user downloads/installs for the app (as when scraped)
Type: Paid or Free
Price: Price of the app (as when scraped)
Content Rating: Age group the app is targeted at - Children / Mature 21+ / Adult
Genres: Secondary classification. An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.
Last Updated: Date when the app was last updated on Play Store (as when scraped)
Current Ver: Current version of the app available on Play Store (as when scraped)
Android Ver: Min required Android version (as when scraped)

I. Importing the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('D:/Textbooks/Kaggle/Google Play Store Apps/googleplaystore.csv')

The data has 10841 rows and 13 columns.

data.shape
Out[291]: (10841, 13)

The Rating column has by far the most missing values: 1474.

data.isna().sum().sort_values(ascending=False)
Out[293]: 
Rating            1474
Current Ver          8
Android Ver          3
Content Rating       1
Type                 1
Last Updated         0
Genres               0
Price                0
Installs             0
Size                 0
Reviews              0
Category             0
App                  0
dtype: int64
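
To put the raw counts in context, it can help to look at the share of missing values per column as well. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the dataset); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, np.nan, 3.9],
    "Type": ["Free", "Free", None, "Paid"],
})

# Number of missing values per column, largest first
missing_count = toy.isna().sum().sort_values(ascending=False)

# Fraction of missing values per column (mean of the boolean mask)
missing_share = toy.isna().mean().sort_values(ascending=False)

print(missing_count)
print(missing_share)
```

On the real data, the same two lines applied to `data` would show what fraction of the rows have no Rating.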

II. Data cleaning

Version and update time are not analyzed here, so the first step is to drop these three columns.

data.drop(columns=['Android Ver','Current Ver','Last Updated'],
          inplace=True)

1. App

data['App'].unique().size
Out[295]: 9660

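App names are assumed not to repeat in the Google Play store, so the duplicate rows need to be dropped here to keep the analysis results accurate.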

data.drop_duplicates('App',inplace=True)
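
As a small illustration of what this call does: `drop_duplicates` keeps the first occurrence of each name by default. The sketch below uses a hypothetical toy frame (the app names and review counts are invented) and also shows one possible alternative, keeping the row with the most reviews, which is an assumption of this example and not what the article does.

```python
import pandas as pd

# Hypothetical toy frame: "Alpha" appears twice with different review counts
toy = pd.DataFrame({
    "App": ["Alpha", "Beta", "Alpha", "Gamma"],
    "Reviews": [10, 20, 15, 30],
})

# How many rows repeat an app name that was already seen
n_repeats = toy["App"].duplicated().sum()

# Default behaviour (keep='first'): the first row for each name survives
deduped = toy.drop_duplicates("App")

# Alternative: sort first so the row with the most reviews survives
most_reviewed = (
    toy.sort_values("Reviews", ascending=False)
       .drop_duplicates("App")
)

print(n_repeats)
print(deduped)
print(most_reviewed)
```

Which row to keep matters when the duplicates were scraped at different times and differ in Reviews or Rating.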

2. Category

print(data.Category.unique())
['ART_AND_DESIGN' 'AUTO_AND_VEHICLES' 'BEAUTY' 'BOOKS_AND_REFERENCE'
 'BUSINESS' 'COMICS' 'COMMUNICATION' 'DATING' 'EDUCATION' 'ENTERTAINMENT'
 'EVENTS' 'FINANCE' 'FOOD_AND_DRINK' 'HEALTH_AND_FITNESS' 'HOUSE_AND_HOME'
 'LIBRARIES_AND_DEMO' 'LIFESTYLE' 'GAME' 'FAMILY' 'MEDICAL' 'SOCIAL'
 'SHOPPING' 'PHOTOGRAPHY' 'SPORTS' 'TRAVEL_AND_LOCAL' 'TOOLS'
 'PERSONALIZATION' 'PRODUCTIVITY' 'PARENTING' 'WEATHER' 'VIDEO_PLAYERS'
 'NEWS_AND_MAGAZINES' 'MAPS_AND_NAVIGATION' '1.9']

Category has one anomalous value, 1.9. Drop that row.

data=data[data.Category != '1.9']
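
Instead of hard-coding the bad value, a pattern check can flag anything that does not look like a category name. This is a sketch under the assumption that valid categories consist only of upper-case letters and underscores, which matches the list printed above; the toy frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with one malformed category
toy = pd.DataFrame({"Category": ["GAME", "FAMILY", "1.9", "ART_AND_DESIGN"]})

# Assumption: a valid category is upper-case letters and underscores only
is_valid = toy["Category"].str.fullmatch(r"[A-Z_]+")

suspicious = toy[~is_valid]   # rows to inspect by hand
cleaned = toy[is_valid]       # rows that pass the check

print(suspicious)
print(cleaned)
```

Inspecting `suspicious` before dropping it makes it easier to see whether the row is simply malformed or whether its values are shifted into the wrong columns.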

3. Rating

data.Rating.isna().sum()

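Rating has a large number of missing values. Dropping those rows would lose a lot of data, and filling them with the mean or median is not appropriate either, so the NaN values are ignored and left unprocessed.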
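
Leaving the NaN values in place works because pandas aggregations skip them by default (`skipna=True`). A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing rating
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS"],
    "Rating": [4.0, np.nan, 3.0, 4.0],
})

# mean() skips NaN by default, so only the three rated apps contribute
overall_mean = toy["Rating"].mean()

# groupby aggregations skip NaN in the same way
by_category = toy.groupby("Category")["Rating"].mean()

# count() reports how many non-missing ratings there are
rated_count = toy["Rating"].count()

print(overall_mean)
print(by_category)
print(rated_count)
```

The unrated apps still contribute to statistics on other columns, such as Installs or Reviews, which is the benefit of not dropping them.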

4. Reviews

Convert to a numeric type.

data.Reviews.dtype
Out[300]: dtype('O')

data.Reviews = pd.to_numeric(data.Reviews)
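
By default `pd.to_numeric` raises an error on the first string it cannot parse. If stray non-numeric values might remain, `errors="coerce"` turns them into NaN so they can be listed and inspected. The strings below are hypothetical examples, not values quoted from the dataset.

```python
import pandas as pd

# Hypothetical review counts stored as strings, one of them malformed
reviews = pd.Series(["159", "967", "3.0M"])

# Unparseable strings become NaN instead of raising an error
converted = pd.to_numeric(reviews, errors="coerce")

# Show the original strings that failed to convert
bad_values = reviews[converted.isna()]

print(converted)
print(bad_values)
```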

5. Size

data.Size.value_counts()
Out[306]: 
Varies with device    1227
11M                    182
12M                    181
13M                    177
14M                    177
                       ...
226k                     1
903k                     1
190k                     1
400k                     1
54k                      1
Name: Size, Length: 461, dtype: int64

Strip the units from the Size data and convert everything to numeric values in units of k.

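def f(x):
    # Convert a size string such as '11M' or '226k' to a number in k
    if x[-1] == 'M':
        res = float(x[:-1])*1024   # M to k
    elif x[-1] == 'k':
        res = float(x[:-1])        # already in k
    else:
        res = np.nan               # e.g. 'Varies with device'
    return res
data.Size = data.Size.apply(f)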
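
The same conversion can also be written without a row-by-row function. The following is a sketch of a vectorized variant on a hypothetical sample of size strings; the names `number`, `unit`, `factor` and `size_k` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of Size strings
sizes = pd.Series(["11M", "226k", "Varies with device", "1.5M"])

# Numeric part: everything except the last character; non-numbers become NaN
number = pd.to_numeric(sizes.str[:-1], errors="coerce")

# Unit: the last character, mapped to a multiplier (unknown units become NaN)
unit = sizes.str[-1]
factor = unit.map({"M": 1024, "k": 1})

# Size in k; rows such as 'Varies with device' end up as NaN
size_k = number * factor

print(size_k)
```

On roughly ten thousand rows the speed difference is small, so this is mainly a matter of style.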

6. Installs

data.Installs.unique()
Out[308]: 
array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+
