Google Play Store App Data Analysis

Google Play Store app data source (extraction code: 38jk)

1. Loading the data

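  • Import the libraries used for the analysis
  • Before loading the data, skim the file in a text editor
  • Once the data is loaded, start by inspecting it with shape, head, count, describe and info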
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

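# Load the file
# This analysis only uses 'App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type'
df = pd.read_csv('./googleplaystore.csv', usecols=(0, 1, 2, 3, 4, 5, 6))

# Take a quick look at the data
print(df.head())
# Check the number of rows and columns
print(df.shape)
# Check the non-null count of each column
print(df.count())

# Use describe and info to see the rough distribution of the data
print(df.describe())
# info() prints its report itself and returns None, hence the trailing "None" in the output
print(df.info())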
                                           App        Category  Rating  \
0     Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1   
1                                Coloring book moana  ART_AND_DESIGN     3.9   
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN     4.7   
3                              Sketch - Draw & Paint  ART_AND_DESIGN     4.5   
4              Pixel Draw - Number Art Coloring Book  ART_AND_DESIGN     4.3   

  Reviews  Size     Installs  Type  
0     159   19M      10,000+  Free  
1     967   14M     500,000+  Free  
2   87510  8.7M   5,000,000+  Free  
3  215644   25M  50,000,000+  Free  
4     967  2.8M     100,000+  Free  
(10841, 7)
App         10841
Category    10841
Rating       9367
Reviews     10841
Size        10841
Installs    10841
Type        10840
dtype: int64
            Rating
count  9367.000000
mean      4.193338
std       0.537431
min       1.000000
25%       4.000000
50%       4.300000
75%       4.500000
max      19.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 7 columns):
App         10841 non-null object
Category    10841 non-null object
Rating      9367 non-null float64
Reviews     10841 non-null object
Size        10841 non-null object
Installs    10841 non-null object
Type        10840 non-null object
dtypes: float64(1), object(6)
memory usage: 592.9+ KB
None
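  • From the output above:
  • The data has 10,841 rows in total
  • Rating and Type have missing values
  • Rating has an anomalous value of 19
  • The 'M' and 'k' suffixes in Size and the '+' (and thousands-separator commas) in Installs need to be handled before any further computation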
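The Size and Installs clean-up noted above could look roughly like the sketch below. The helper names `parse_size` and `parse_installs` are hypothetical and not from the original post. The sketch also assumes that sizes are normalized to megabytes with 1 MB = 1024 kB, that any Size value without an 'M' or 'k' suffix should become NaN, and that the anomalous row has already been removed.

```python
import math


def parse_size(text):
    """Convert a Size string such as '19M' or '512k' to megabytes.

    Assumption: values without an 'M' or 'k' suffix are treated as missing.
    """
    text = text.strip()
    if text.endswith('M'):
        return float(text[:-1])
    if text.endswith('k'):
        return float(text[:-1]) / 1024  # assumed conversion factor
    return math.nan


def parse_installs(text):
    """Convert an Installs string such as '5,000,000+' to an integer."""
    return int(text.replace(',', '').replace('+', ''))


sizes = [parse_size(s) for s in ['19M', '8.7M', '512k']]
installs = [parse_installs(s) for s in ['10,000+', '5,000,000+']]
print(sizes)
print(installs)
```

On the real DataFrame these helpers could be applied column-wise, for example with `df['Size'].map(parse_size)` and `df['Installs'].map(parse_installs)`.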

2. Data cleaning: App

  • Check whether there are duplicate values
print(df['App'].unique().size)
9660
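  • There are duplicates (9,660 unique names across 10,841 rows). Don't drop them yet: to avoid keeping a row whose other columns hold anomalous values, clean the columns with anomalous values first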
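As a rough illustration of the deferred de-duplication step, the sketch below counts repeated App names and then keeps one row per name. The toy data is made up, and keeping the first occurrence is an assumption: the original post does not say which duplicate should survive.

```python
import pandas as pd

# Toy data standing in for the real DataFrame; the values are made up.
toy = pd.DataFrame({
    'App': ['A', 'B', 'A', 'C', 'B'],
    'Reviews': ['10', '20', '12', '30', '20'],
})

# True for every row whose App name already appeared earlier in the frame
dup_mask = toy['App'].duplicated()
print(dup_mask.sum())

# Once the other columns are clean, keep the first row for each App name
# (assumption: "first" is an arbitrary choice made for this sketch)
deduped = toy.drop_duplicates(subset='App', keep='first')
print(deduped['App'].tolist())
```

If the cleaned Reviews column is numeric, sorting by it before calling `drop_duplicates` would be one way to keep the most-reviewed copy instead.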

3. Data cleaning: Category

print(df['Category'].value_counts(dropna=False))
print(df[df['Category'] == '1.9'])
FAMILY                 1972
GAME                   1144
TOOLS                   843
MEDICAL                 463
BUSINESS                460
PRODUCTIVITY            424
PERSONALIZATION         392
COMMUNICATION           387
SPORTS                  384
LIFESTYLE               382
FINANCE                 366
HEALTH_AND_FITNESS      341
PHOTOGRAPHY             335
SOCIAL                  295
NEWS_AND_MAGAZINES      283
SHOPPING                260
TRAVEL_AND_LOCAL        258
DATING                  234
BOOKS_AND_REFERENCE     231
VIDEO_PLAYERS           175
EDUCATION               156
ENTERTAINMENT           149
MAPS_AND_NAVIGATION     137
FOOD_AND_DRINK          127
HOUSE_AND_HOME           88
AUTO_AND_VEHICLES        85
LIBRARIES_AND_DEMO       85
WEATHER                  82
ART_AND_DESIGN           65
EVENTS                   64
COMICS                   60
PARENTING                60
BEAUTY                   53
1.9                       1
Name: Category, dtype: int64
                                           App Category  Rating Reviews  \
10472  Life Made WI-Fi Touchscreen Photo Frame      1.9    19.0    3.0M   

         Size Installs Type  
10472  1,000+     Free    0  
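  • There is one anomalous row. On inspection, its Category value appears to be missing, which leaves the remaining fields shifted one column to the left, so we drop this row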
df.drop(index=10472, inplace=True)
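Instead of hard-coding the row label, the malformed row could also be located with a boolean mask. This is only a sketch on made-up data, and it assumes that every valid category consists solely of upper-case letters and underscores, as in the listing above.

```python
import pandas as pd

# Toy data; the middle row mimics the shifted record found above.
toy = pd.DataFrame({
    'App': ['Sketch', 'Photo Frame', 'Maps'],
    'Category': ['ART_AND_DESIGN', '1.9', 'MAPS_AND_NAVIGATION'],
})

# Assumption: valid categories contain only upper-case letters and underscores
valid = toy['Category'].str.match(r'^[A-Z_]+$')
bad_index = toy.index[~valid]
print(list(bad_index))

# Drop whatever rows failed the check
cleaned = toy.drop(index=bad_index)
print(cleaned.shape)
```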

4. Data cleaning: Rating

print(df['Rating'].value_counts(dropna=False))
NaN     1474
4.4     1109
4.3     1076
4.5     1038
4.2      952
4.6      823
4.1      708
4.0      568
4.7      499
3.9      386
3.8      303
5.0      274
3.7      239
4.8      234
3.6      174
3.5      163
3.4      128
3.3      102
4.9       87
3.0       83
3.1       69
3.2       64
2.9       45
2.8       42
2.6       25
2.7       25
2.5       21
2.3       20
2.4       19
1.0       16
2.2       14
1.9       13
2.0       12
1.8        8
1.7        8
2.1        8
1.6        4
1.5        3
1.4        3
1.2        1
Name: Rating, dtype: int64
  • There are 1,474 NaN values in total; fill them with the mean
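# Assign the result back; calling fillna with inplace=True on a selected column is deprecated in recent pandas
df['Rating'] = df['Rating'].fillna(df['Rating'].mean())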
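To make the effect of the mean fill concrete, here is a small sketch on made-up ratings. It relies on the fact that `Series.mean()` skips NaN by default, so the fill value is the mean of the observed ratings only, and filling with it leaves the column mean unchanged.

```python
import numpy as np
import pandas as pd

# Made-up ratings with two missing entries
ratings = pd.Series([4.0, np.nan, 5.0, 3.0, np.nan])

# mean() ignores NaN by default, so this averages 4.0, 5.0 and 3.0
mean_rating = ratings.mean()
filled = ratings.fillna(mean_rating)

print(mean_rating)
print(filled.isna().sum())
print(filled.mean())
```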

5. Data cleaning: Reviews

print(df['Rating'].value_counts(dropna=False))
print(df['Reviews'].str.isnumeric().sum())
4.193338     1474
4.400000     1109
4.300000     1076
4.500000     1038
4.200000      952
4.600000      823
4.100000      708
4.000000      568
4.700000      499
3.900000      386
3.800000      303
5.000000      274
3.700000      239
4.800000      234
3.600000      174
3.500000      163
3.400000      128
3.300000      102
4.900000       87
3.000000       83
3.100000       69
3.200000       64
2.900000       45
2.800000       42
2.700000       25
2.600000       25
2.500000       21
2.300000       20
2.400000       19
1.000000       16
2.200000       14
1.900000       13
2.000000       12
2.100000        8
1.800000        8
1.700000        8
1.600000        4
1.400000        3
1.500000        3
1.200000        1
Name: Ratin