pandas Practice Log 20211114: Data Preprocessing

Author: 特仑苏的数据分析之路

Tags: python, data analysis, data mining

Article link: https://blog.csdn.net/TerrenceMo/article/details/121328362

1. Data preprocessing: checking for null values

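# Ways to check for null values
# (assumes `shop` is a pandas DataFrame that has already been loaded)

shop.info()  # show the table structure: each column's dtype and non-null count

print(shop.isnull().sum())  # count the null values in each column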

# Output 1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   shop_id      2000 non-null   int64  
 1   city_name    2000 non-null   object 
 2   location_id  2000 non-null   int64  
 3   per_pay      2000 non-null   int64  
 4   score        1709 non-null   float64
 5   comment_cnt  1709 non-null   float64
 6   shop_level   2000 non-null   int64  
 7   cate_1_name  2000 non-null   object 
 8   cate_2_name  2000 non-null   object 
 9   cate_3_name  1415 non-null   object 
dtypes: float64(2), int64(4), object(4)
memory usage: 156.4+ KB

# Output 2:
shop_id          0
city_name        0
location_id      0
per_pay          0
score          291
comment_cnt    291
shop_level       0
cate_1_name      0
cate_2_name      0
cate_3_name    585
dtype: int64
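
The null check above can be tried on a small table. The sketch below is only an illustration: `toy` and its values are made up and stand in for the real `shop` data, which is not reproduced here. It also shows `isnull().mean()`, which gives the share of nulls per column instead of the count.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real `shop` table (all values are made up).
toy = pd.DataFrame({
    "shop_id": [1, 2, 3, 4],
    "score": [3.0, np.nan, 4.0, np.nan],
    "cate_3_name": ["a", None, "b", "c"],
})

null_counts = toy.isnull().sum()   # number of nulls in each column
null_ratio = toy.isnull().mean()   # fraction of nulls in each column

print(null_counts)
print(null_ratio)
```

The ratio is often easier to read than the raw count when deciding whether a column is worth keeping.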

Ways to handle null values:

a. Drop them: df.dropna()

b. Fill them: df.fillna()
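
A minimal sketch of both options on made-up data. The `toy` frame and the fill values (column mean, 0, "unknown") are assumptions chosen for illustration, not the strategy the original analysis used.

```python
import numpy as np
import pandas as pd

# Toy data with the same kind of gaps as the real table (values are made up).
toy = pd.DataFrame({
    "shop_id": [1, 2, 3, 4],
    "score": [3.0, np.nan, 4.0, np.nan],
    "comment_cnt": [10.0, np.nan, 2.0, np.nan],
    "cate_3_name": ["a", None, "b", "c"],
})

# a. Drop rows that have a null in the selected columns only.
dropped = toy.dropna(subset=["score", "comment_cnt"])

# b. Fill nulls column by column (the fill values are illustrative choices).
filled = toy.fillna({
    "score": toy["score"].mean(),   # mean of the non-null scores
    "comment_cnt": 0,
    "cate_3_name": "unknown",
})

print(dropped)
print(filled)
```

Both calls return new DataFrames and leave `toy` unchanged, so the result has to be assigned to a variable to be kept.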

2. Data preprocessing: checking for duplicates (the check found no duplicates in this data)

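# Use duplicated() to find duplicate rows:
# DataFrame.duplicated(self, subset=None, keep='first')
# subset: only consider these columns when identifying duplicates; all columns are used by default
# keep: {'first', 'last', False}, default 'first'; controls which occurrences are marked as duplicates
#   ('first' marks every occurrence except the first, 'last' every one except the last, False marks all of them)

print(shop[shop.duplicated(subset='shop_id')])  # show rows whose shop_id is duplicated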

To remove duplicates: df.drop_duplicates()
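
Since the real data contains no duplicates, the effect of `keep` is easier to see on a made-up table. The sketch below uses a hypothetical `toy` frame in which `shop_id` 2 appears twice.

```python
import pandas as pd

# Toy data with one repeated shop_id (values are made up).
toy = pd.DataFrame({
    "shop_id": [1, 2, 2, 3],
    "per_pay": [5, 8, 8, 12],
})

# Default keep='first': only the second occurrence of shop_id 2 is flagged.
mask_first = toy.duplicated(subset="shop_id")

# keep=False: every row that shares a shop_id with another row is flagged.
mask_all = toy.duplicated(subset="shop_id", keep=False)

# Drop the repeats, keeping the first row for each shop_id.
deduped = toy.drop_duplicates(subset="shop_id", keep="first")

print(toy[mask_all])
print(deduped)
```

Using `keep=False` for inspection shows all copies side by side, which helps when deciding which one to keep.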

3. Data preprocessing: checking for outliers

# Mainly use sort_values() to look for outliers

print(shop.sort_values(by='per_pay', ascending=False))  # inspect the distribution of per-capita spending (1~20)
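
Sorting can be combined with `describe()` and a range filter to make suspicious values stand out. The sketch below is an illustration under assumptions: `toy` is made-up data, and treating 1 to 20 as the valid range for `per_pay` is an assumption based on the comment above.

```python
import pandas as pd

# Toy data with two implausible per_pay values (all values are made up).
toy = pd.DataFrame({
    "shop_id": [1, 2, 3, 4, 5],
    "per_pay": [3, 20, 7, 250, 0],
})

# Sort descending so the largest values appear first.
ranked = toy.sort_values(by="per_pay", ascending=False)
print(ranked.head())

# Summary statistics (min, max, quartiles) give a quick view of the spread.
print(toy["per_pay"].describe())

# Rows outside the assumed valid range 1..20 (between() is inclusive on both ends).
out_of_range = toy[~toy["per_pay"].between(1, 20)]
print(out_of_range)
```

Sorting alone only reveals extremes at the top or bottom; the range filter lists every row that falls outside the expected interval.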
