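Google Play Store app data source (extraction code: 38jk)
Data analysis of Google Play Store apps
1. Loading the data
- Load the libraries used for the analysis
- Before loading the data, skim through the file in a text editor
- Once the data is loaded, start by inspecting it with shape, head, count, describe and info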
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
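# Load the file
# This analysis only uses 'App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type'
df = pd.read_csv('./googleplaystore.csv', usecols=(0, 1, 2, 3, 4, 5, 6))
# Take a quick look at the data
print(df.head())
# Check the number of rows and columns
print(df.shape)
# Check the non-null count of each column
print(df.count())
# Use describe and info to get a rough picture of the data distribution
print(df.describe())
# info() prints its report itself and returns None, so print() also emits a trailing "None"
print(df.info())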
                                                 App        Category  Rating  \
0     Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1
1                                Coloring book moana  ART_AND_DESIGN     3.9
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN     4.7
3                              Sketch - Draw & Paint  ART_AND_DESIGN     4.5
4              Pixel Draw - Number Art Coloring Book  ART_AND_DESIGN     4.3

   Reviews  Size     Installs  Type
0      159   19M      10,000+  Free
1      967   14M     500,000+  Free
2    87510  8.7M   5,000,000+  Free
3   215644   25M  50,000,000+  Free
4      967  2.8M     100,000+  Free
(10841, 7)
App 10841
Category 10841
Rating 9367
Reviews 10841
Size 10841
Installs 10841
Type 10840
dtype: int64
Rating
count 9367.000000
mean 4.193338
std 0.537431
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 19.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 7 columns):
App 10841 non-null object
Category 10841 non-null object
Rating 9367 non-null float64
Reviews 10841 non-null object
Size 10841 non-null object
Installs 10841 non-null object
Type 10840 non-null object
dtypes: float64(1), object(6)
memory usage: 592.9+ KB
None
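- Conclusions from the output above
- The data has 10841 rows in total
- Rating and Type have missing values
- Rating contains an outlier of 19 (ratings should not exceed 5)
- The 'M' and 'k' suffixes in Size and the '+' in Installs need to be handled so the values can be used in further calculations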
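The Size and Installs clean-up mentioned in the last point could look roughly like the sketch below. The helper names `parse_size` and `parse_installs`, the toy values, and the choice to express sizes in megabytes (treating 1 MB as 1024 kB) are assumptions made for illustration. Any Size value that does not end in 'M' or 'k' is simply mapped to NaN here.

```python
import numpy as np
import pandas as pd

def parse_size(value):
    """Convert a size string such as '19M' or '512k' to megabytes (hypothetical helper)."""
    if isinstance(value, str):
        if value.endswith('M'):
            return float(value[:-1])
        if value.endswith('k'):
            # Assumption: 1 MB = 1024 kB
            return float(value[:-1]) / 1024
    # Anything else is treated as missing
    return np.nan

def parse_installs(value):
    """Convert an installs string such as '10,000+' to an int (hypothetical helper)."""
    return int(value.replace(',', '').replace('+', ''))

# Toy data shaped like the Size and Installs columns
sample = pd.DataFrame({
    'Size': ['19M', '8.7M', '512k'],
    'Installs': ['10,000+', '5,000,000+', '100,000+'],
})
sample['Size_MB'] = sample['Size'].map(parse_size)
sample['Installs_num'] = sample['Installs'].map(parse_installs)
print(sample)
```

Keeping the parsed values in new columns leaves the raw strings available for checking the conversion.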
2. Data cleaning # App
- Check whether there are duplicate values
print(df['App'].unique().size)
9660
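- There are duplicates (9660 unique names in 10841 rows), but there is no rush to delete them. To avoid keeping rows that carry anomalous values in other columns, clean the columns with anomalous values first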
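For reference, duplicate app names can be counted and inspected before deciding which copy to keep. This is a minimal sketch on made-up toy data. Keeping the first occurrence is only one possible policy and is an assumption, not the choice made in this analysis.

```python
import pandas as pd

# Toy data: app 'A' appears twice with different review counts
toy = pd.DataFrame({
    'App': ['A', 'B', 'A', 'C'],
    'Reviews': ['10', '20', '15', '30'],
})

# Number of distinct app names
n_unique = toy['App'].nunique()
# Number of rows that repeat an earlier app name
n_dupes = toy['App'].duplicated().sum()
# Every row involved in a duplication, for manual inspection
dup_rows = toy[toy['App'].duplicated(keep=False)]
# One possible policy: keep the first occurrence of each app
deduped = toy.drop_duplicates(subset='App', keep='first')

print(n_unique, n_dupes)
print(dup_rows)
print(deduped)
```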
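3. Data cleaning # Category
# Count each category, including missing values
print(df['Category'].value_counts(dropna=False))
# Inspect the row whose Category is the odd value '1.9'
print(df[df['Category'] == '1.9'])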
FAMILY 1972
GAME 1144
TOOLS 843
MEDICAL 463
BUSINESS 460
PRODUCTIVITY 424
PERSONALIZATION 392
COMMUNICATION 387
SPORTS 384
LIFESTYLE 382
FINANCE 366
HEALTH_AND_FITNESS 341
PHOTOGRAPHY 335
SOCIAL 295
NEWS_AND_MAGAZINES 283
SHOPPING 260
TRAVEL_AND_LOCAL 258
DATING 234
BOOKS_AND_REFERENCE 231
VIDEO_PLAYERS 175
EDUCATION 156
ENTERTAINMENT 149
MAPS_AND_NAVIGATION 137
FOOD_AND_DRINK 127
HOUSE_AND_HOME 88
AUTO_AND_VEHICLES 85
LIBRARIES_AND_DEMO 85
WEATHER 82
ART_AND_DESIGN 65
EVENTS 64
COMICS 60
PARENTING 60
BEAUTY 53
1.9 1
Name: Category, dtype: int64
                                           App  Category  Rating  Reviews  \
10472  Life Made WI-Fi Touchscreen Photo Frame       1.9    19.0     3.0M

         Size  Installs  Type
10472  1,000+      Free     0
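- There is one anomalous record. On inspection, its Category value appears to be missing, which shifted every later field one column to the left, so this row is deleted here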
df.drop(index=10472, inplace=True)
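Instead of hard-coding the index 10472, malformed rows could also be located by pattern. The sketch below assumes that every valid category consists only of uppercase letters and underscores, which matches the value counts listed above. The toy data and the variable names are made up for illustration.

```python
import pandas as pd

# Toy data: the middle row has a number where a category label should be
toy = pd.DataFrame({
    'App': ['Sketch', 'Photo Frame', 'Weather Now'],
    'Category': ['ART_AND_DESIGN', '1.9', 'WEATHER'],
})

# Assumption: valid categories contain only uppercase letters and underscores
valid = toy['Category'].str.match(r'^[A-Z_]+$')
bad_rows = toy[~valid]
cleaned = toy[valid].copy()

print(bad_rows)
print(cleaned)
```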
4. Data cleaning # Rating
print(df['Rating'].value_counts(dropna=False))
NaN 1474
4.4 1109
4.3 1076
4.5 1038
4.2 952
4.6 823
4.1 708
4.0 568
4.7 499
3.9 386
3.8 303
5.0 274
3.7 239
4.8 234
3.6 174
3.5 163
3.4 128
3.3 102
4.9 87
3.0 83
3.1 69
3.2 64
2.9 45
2.8 42
2.6 25
2.7 25
2.5 21
2.3 20
2.4 19
1.0 16
2.2 14
1.9 13
2.0 12
1.8 8
1.7 8
2.1 8
1.6 4
1.5 3
1.4 3
1.2 1
Name: Rating, dtype: int64
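- There are 1474 NaN values in total; fill them with the mean
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['Rating'] = df['Rating'].fillna(df['Rating'].mean())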
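Mean imputation can be checked on a small example. The ratings below are made up. Note that `Series.mean()` skips NaN by default, so the mean is computed from the known values only.

```python
import numpy as np
import pandas as pd

# Toy ratings with two missing values
ratings = pd.Series([4.0, np.nan, 5.0, 3.0, np.nan], name='Rating')

# mean() ignores NaN: (4.0 + 5.0 + 3.0) / 3
mean_value = ratings.mean()
# Replace every NaN with that mean
filled = ratings.fillna(mean_value)

print(mean_value)
print(filled)
```

Filling with the mean keeps the column average unchanged but concentrates many rows on a single value, which is worth keeping in mind when reading the distribution afterwards.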
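5. Data cleaning # Reviews
# Confirm that the NaN ratings were replaced by the mean
print(df['Rating'].value_counts(dropna=False))
# Count how many Reviews values are purely numeric strings
print(df['Reviews'].str.isnumeric().sum())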
4.193338 1474
4.400000 1109
4.300000 1076
4.500000 1038
4.200000 952
4.600000 823
4.100000 708
4.000000 568
4.700000 499
3.900000 386
3.800000 303
5.000000 274
3.700000 239
4.800000 234
3.600000 174
3.500000 163
3.400000 128
3.300000 102
4.900000 87
3.000000 83
3.100000 69
3.200000 64
2.900000 45
2.800000 42
2.700000 25
2.600000 25
2.500000 21
2.300000 20
2.400000 19
1.000000 16
2.200000 14
1.900000 13
2.000000 12
2.100000 8
1.800000 8
1.700000 8
1.600000 4
1.400000