Google Play Store Game Data Analysis

Dataset: Google Play Store Apps. URL: https://www.kaggle.com/lava18/google-play-store-apps?select=googleplaystore.csv

The dataset contains two CSV files: one with overall data on Google Play Store apps and one with user reviews.
The user-review data is highly subjective and sparse, so this analysis uses the overall app data.
The Google Play Store file contains 13 fields:
App: Application name
Category: Category the app belongs to
Rating: Overall user rating of the app (as when scraped)
Reviews: Number of user reviews for the app (as when scraped)
Size: Size of the app (as when scraped)
Installs: Number of user downloads/installs for the app (as when scraped)
Type: Paid or Free
Price: Price of the app (as when scraped)
Content Rating: Age group the app is targeted at - Children / Mature 21+ / Adult
Genres: An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.
Last Updated: Date when the app was last updated on Play Store (as when scraped)
Current Ver: Current version of the app available on Play Store (as when scraped)
Android Ver: Min required Android version (as when scraped)

I. Importing the Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('D:/Textbooks/Kaggle/Google Play Store Apps/googleplaystore.csv')

The dataset contains 10,841 rows and 13 columns.

data.shape
Out[291]: (10841, 13)

The Rating column has by far the most missing values: 1,474.

data.isna().sum().sort_values(ascending=False)
Out[293]: 
Rating            1474
Current Ver          8
Android Ver          3
Content Rating       1
Type                 1
Last Updated         0
Genres               0
Price                0
Installs             0
Size                 0
Reviews              0
Category             0
App                  0
dtype: int64
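
Raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a small made-up frame (the `toy` data and the name `missing_share` are hypothetical, not taken from the dataset) showing how `isna().mean()` gives the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, np.nan, 3.9],
    "Type": ["Free", "Paid", None, "Free"],
})

# isna() yields a boolean frame; the mean of booleans is the missing share
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, 1,474 missing ratings out of 10,841 rows is roughly 13.6%.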

II. Data Cleaning

Version and last-update information is not analyzed here, so these three columns are dropped first.

data.drop(columns=['Android Ver','Current Ver','Last Updated'],
          inplace=True)
1. App
data['App'].unique().size
Out[295]: 9660

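Each app should appear only once in the analysis, but the 10,841 rows contain only 9,660 distinct app names, so duplicate entries are removed here to keep the results accurate.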

data.drop_duplicates('App',inplace=True)
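
As a minimal sketch of what this call does, the toy example below (hypothetical data, not from the dataset) counts the repeated names and shows that `drop_duplicates` keeps the first row for each App by default:

```python
import pandas as pd

# Hypothetical toy frame with one repeated app name
toy = pd.DataFrame({
    "App": ["Alpha", "Beta", "Alpha"],
    "Reviews": [10, 20, 15],
})

# duplicated() flags every repeat after the first occurrence
n_dupes = toy.duplicated("App").sum()

# keep="first" is the default: the first row per App survives
deduped = toy.drop_duplicates("App", keep="first")
print(n_dupes, deduped["Reviews"].tolist())
```

Which duplicate row is the better one to keep depends on the data; keeping the first is simply the default behaviour.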
2. Category
print(data.Category.unique())
['ART_AND_DESIGN' 'AUTO_AND_VEHICLES' 'BEAUTY' 'BOOKS_AND_REFERENCE'
 'BUSINESS' 'COMICS' 'COMMUNICATION' 'DATING' 'EDUCATION' 'ENTERTAINMENT'
 'EVENTS' 'FINANCE' 'FOOD_AND_DRINK' 'HEALTH_AND_FITNESS' 'HOUSE_AND_HOME'
 'LIBRARIES_AND_DEMO' 'LIFESTYLE' 'GAME' 'FAMILY' 'MEDICAL' 'SOCIAL'
 'SHOPPING' 'PHOTOGRAPHY' 'SPORTS' 'TRAVEL_AND_LOCAL' 'TOOLS'
 'PERSONALIZATION' 'PRODUCTIVITY' 'PARENTING' 'WEATHER' 'VIDEO_PLAYERS'
 'NEWS_AND_MAGAZINES' 'MAPS_AND_NAVIGATION' '1.9']

Category contains one anomalous value, '1.9'; that row is dropped.

data = data[data.Category != '1.9'].copy()  # copy so later column assignments act on an independent frame
3. Rating
data.Rating.isna().sum()

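Rating has a large number of missing values. Dropping those rows would lose a lot of data, while filling them with the mean or median is not appropriate either, so the NaN values are left as they are.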
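
Leaving the NaN values in place works because pandas aggregations skip missing values by default. A minimal sketch with made-up ratings (the `ratings` series is hypothetical):

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.0, np.nan, 5.0, np.nan, 3.0])  # hypothetical values

mean_skipping_nan = ratings.mean()          # skipna=True by default
mean_with_nan = ratings.mean(skipna=False)  # NaN propagates into the result
n_used = ratings.count()                    # number of non-missing values
print(mean_skipping_nan, mean_with_nan, n_used)
```

Any statistic on Rating is therefore computed only over the apps that actually have a rating, which is worth keeping in mind when comparing it with counts over all rows.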

4. Reviews

Convert to a numeric type.

data.Reviews.dtype
Out[300]: dtype('O')
data.Reviews = pd.to_numeric(data.Reviews)
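
With its default settings, `pd.to_numeric` raises an error on any string it cannot parse. If that happens, one way to locate the offending values is `errors="coerce"`. The following is a minimal sketch on made-up strings (the `reviews` series is hypothetical):

```python
import pandas as pd

reviews = pd.Series(["159", "967", "3.0M", "87510"])  # hypothetical values

# errors="coerce" turns unparseable strings into NaN instead of raising
parsed = pd.to_numeric(reviews, errors="coerce")

# Rows that failed to parse can be inspected before deciding how to handle them
bad = reviews[parsed.isna()]
print(bad.tolist())
```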
5. Size
data.Size.value_counts()
Out[306]: 
Varies with device    1227
11M                    182
12M                    181
13M                    177
14M                    177
                       ...
226k                     1
903k                     1
190k                     1
400k                     1
54k                      1
Name: Size, Length: 461, dtype: int64

Strip the unit from the Size values and convert them all to numbers expressed in kilobytes (k).

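def f(x):
    # Convert a Size string such as '11M' or '226k' to a number in k
    if x[-1] == 'M':
        res = float(x[:-1])*1024   # M -> k
    elif x[-1] == 'k':
        res = float(x[:-1])        # already in k
    else:
        res = np.nan               # e.g. 'Varies with device'
    return res

data.Size = data.Size.apply(f)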
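
The same conversion can also be written without a row-wise function. The following is a sketch of a vectorized alternative using pandas string methods, shown on made-up values (the `sizes` series and the names `parts`, `number`, `factor`, and `size_kb` are hypothetical); it is not the approach used in this analysis, only an equivalent formulation:

```python
import pandas as pd

sizes = pd.Series(["11M", "226k", "Varies with device", "1.5M"])  # hypothetical

# Split into a numeric part and a unit suffix; non-matching rows become NaN
parts = sizes.str.extract(r"^([\d.]+)([Mk])$")
number = pd.to_numeric(parts[0])
factor = parts[1].map({"M": 1024.0, "k": 1.0})

# Multiply the number by its unit factor to get a size in k
size_kb = number * factor
print(size_kb.tolist())
```

Values such as 'Varies with device' do not match the pattern and stay missing, matching the `np.nan` branch of the function above.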
6. Installs
data.Installs.unique()
Out[308]: 
array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+
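
Strings such as '10,000+' cannot be used numerically as they stand. The following is a minimal sketch, assuming the goal is to turn them into integer lower bounds; that goal is an assumption, since the conversion step itself is not shown here. The `installs` series reuses a few of the values listed above, and `installs_num` is a hypothetical name:

```python
import pandas as pd

installs = pd.Series(["10,000+", "500,000+", "1,000,000,000+"])

# Remove thousands separators and the trailing "+", then parse as numbers
installs_num = pd.to_numeric(
    installs.str.replace(",", "", regex=False)
            .str.replace("+", "", regex=False)
)
print(installs_num.tolist())
```

Passing `regex=False` matters for the "+" replacement, because "+" is a special character in regular expressions.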