Data Cleaning with Python (Part 1)

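This article walks through data cleaning with Python. It covers missing data: visualizing it with a heatmap, a percentage table, and a histogram, and handling it by dropping, imputing, or replacing missing values. It also covers detecting irregular data (outliers) with bar charts, box plots, and descriptive statistics; identifying and handling unnecessary data such as duplicates and irrelevant records; and fixing inconsistent data such as mismatched capitalization, formats, and categorical values. Worked examples show why data cleaning matters and where it fits in data preprocessing.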

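Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records in a dataset, table, or database. It means identifying the incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting that dirty or coarse data.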

To make this simpler, we put together this complete step-by-step guide in Python. You will learn techniques for finding and cleaning:

  • Missing data
  • Irregular data (outliers)
  • Unnecessary data (such as duplicates)
  • Inconsistent data
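
To make the four categories concrete before turning to the real data, here is a minimal sketch that detects each one on a tiny hand-made table. The `toy` DataFrame and all of its values are invented for illustration; only the column names are borrowed from the housing dataset. The 1.5 × IQR rule used for outliers is an assumption (it is the whisker rule behind a standard box plot), not necessarily the method this guide uses later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the values are made up, not taken from the Kaggle dataset.
toy = pd.DataFrame({
    "full_sq": [43, 34, 60, 60, 5326, 50, 38, 45],
    "life_sq": [27.0, np.nan, 40.0, 40.0, 22.0, np.nan, 19.0, 30.0],
    "sub_area": ["Tverskoe", "tverskoe", "Arbat", "Arbat",
                 "Arbat", "ARBAT", "Tverskoe", "Arbat"],
})

# 1. Missing data: percentage of missing values per column.
missing_pct = toy.isna().mean() * 100

# 2. Irregular data: flag values outside 1.5 * IQR of the quartiles
#    (assumed rule, the same one a box plot uses for its whiskers).
q1, q3 = toy["full_sq"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["full_sq"] < q1 - 1.5 * iqr) | (toy["full_sq"] > q3 + 1.5 * iqr)]

# 3. Unnecessary data: rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

# 4. Inconsistent data: the same category written with different capitalization.
raw_categories = toy["sub_area"].nunique()
normalized_categories = toy["sub_area"].str.lower().nunique()

print(missing_pct)
print(outliers)
print(n_duplicates, raw_categories, normalized_categories)
```

Each check only reports a problem; deciding whether to drop, impute, or rewrite the affected values is a separate step.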

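In this guide, we use the Russian housing dataset from Kaggle. The goal of that project is to predict housing price fluctuations in Russia. We will not clean the entire dataset, but we will show examples from it.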

Before getting into the cleaning process, let's take a brief look at the data.

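# import packages
import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from matplotlib.pyplot import figure

# Jupyter/IPython magic that renders plots inline; remove it in a plain .py script
%matplotlib inline

# plot style and default figure size
plt.style.use('ggplot')
matplotlib.rcParams['figure.figsize'] = (12, 8)

# silence pandas' chained-assignment warning
pd.options.mode.chained_assignment = None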



# read the data
df = pd.read_csv('train.csv')

# shape and data types of the data (30471, 292)
print(df.shape)
print(df.dtypes)

# print the numeric columns (int/float)
df_numeric = df.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
print(numeric_cols)

# print the non-numeric columns (object)
df_non_numeric = df.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
print(non_numeric_cols)

Output

(30471, 292)
id                                         int64
timestamp                                 object
full_sq                                    int64
life_sq                                  float64
floor                                    float64
max_floor                                float64
material                                 float64
build_year                               float64
num_room                                 float64
kitch_sq                                 float64
state                                    float64
product_type                              object
sub_area                                  object
area_m                                   float64
raion_popul                                int64
green_zone_part                          float64
indust_part                              float64
children_preschool                         int64
preschool_quota                          float64
preschool_education_centers_raion          int64
children_school                            int64
school_quota                             float64
school_education_centers_raion             int64
school_education_centers_top_20_raion      int64
hospital_beds_raion                      float64
healthcare_centers_raion                   int64
university_top_20_raion                    int64
sport_objects_raion                        int64
additional_education_raion                 int64
culture_objects_top_25                    object
                                          ...   
big_church_count_3000                      int64
church_count_3000                          int64
mosque_count_3000                          int64
leisure_count_3000                         int64
sport_count_3000                           int64
market_count_3000                          int64
green_part_5000                          float64
prom_part_5000                           float64
office_count_5000                          int64
office_sqm_5000                            int64
trc_count_5000                             int64
trc_sqm_5000                               int64
cafe_count_5000                            int64
cafe_sum_5000_min_price_avg              float64
cafe_sum_5000_max_price_avg              float64
cafe_avg_price_5000                      float64
cafe_count_5000_na_price                   int64
cafe_count_5000_price_500                  int64
cafe_count_5000_price_1000                 int64
cafe_count_5000_price_1500                 int64
cafe_count_5000_price_2500                 int64
cafe_count_5000_price_4000                 int64
cafe_count_5000_price_high                 int64
big_church_count_5000                      int64
church_count_5000                          int64
mosque_count_5000                          int64
leisure_count_5000                         int64
sport_count_5000                           int64
market_count_5000                          int64
price_doc                                  int64
Length: 292, dtype: object
['id' 'full_sq' 'life_sq' 'floor' 'max_floor' 'material' 'build_year'
 'num_room' 'kitch_sq' 'state' 'area_m' 'raion_popul' 'green_zone_part'
 'indust_part' 'children_preschool' 'preschool_quota'
 'preschool_education_centers_raion' 'children_school' 'school_quota'
 'school_education_centers_raion' 'school_education_centers_top_20_raion'
 'hospital_beds_raion' 'healthcare_centers_raion'
 'university_top_20_raion' 'sport_objects_raion'
 'additional_education_raion' 'culture_objects_top_25_raion'
 'shopping_centers_raion' 'office_raion' 'full_all' 'male_f' 'female_f'
 'young_all' 'young_male' 'young_female' 'work_all' 'work_male'
 'work_female' 'ekder_all' 'ekder_male' 'ekder_female' '0_6_all'
 '0_6_male' '0_6_female' '7_14_all' '7_14_male' '7_14_female' '0_17_all'
 '0_17_male' '0_17_female' '16_29_all' '16_29_male' '16_29_female'
 '0_13_all' '0_13_male' '0_13_female'
 'raion_build_count_with_material_info' 'build_count_block'
 'build_count_wood' 'build_count_frame' 'build_count_brick'
 'build_count_monolith' 'build_count_panel' 'build_count_foam'
 'build_count_slag' 'build_count_mix'
 'raion_build_count_with_builddate_info' 'build_count_before_1920'
 'build_count_1921-1945' 'build_count_1946-1970' 'build_count_1971-1995'
 'build_count_after_1995' 'ID_metro' 'metro_min_avto' 'metro_km_avto'
 'metro_min_walk' 'metro_km_walk' 'kindergarten_km' 'school_km' 'park_km'
 'green_zone_km' 'industrial_km' 'water_treatment_km' 'cemetery_km'
 'incineration_km' 'railroad_station_walk_km' 'railroad_station_walk_min'
 'ID_railroad_station_walk' 'railroad_station_avto_km'
 'railroad_station_avto_min' 'ID_railroad_station_avto'
 'public_transport_station_km' 'public_transport_station_min_walk'
 'water_km' 'mkad_km' 'ttk_km' 'sadovoe_km' 'bulvar_ring_km' 'kremlin_km'
 'big_road1_km' 'ID_big_road1' 'big_road2_km' 'ID_big_road2' 'railroad_km'
 'zd_vokzaly_avto_km' 'ID_railroad_terminal' 'bus_terminal_avto_km'
 'ID_bus_terminal' 'oil_chemistry_km' 'nuclear_reactor_km' 'radiation_km'
 'power_transmission_line_km' 'thermal_power_plant_km' 'ts_km'
 'big_market_km' 'market_shop_km' 'fitness_km' 'swim_pool_km'
 'ice_rink_km' 'stadium_km' 'basketball_km' 'hospice_morgue_km'
 'detention_facility_km' 'public_healthcare_km' 'university_km'
 'workplaces_km' 'shopping_centers_km' 'office_km'
 'additional_education_km' 'preschool_km' 'big_church_km'
 'church_synagogue_km' 'mosque_km' 'theater_km' 'museum_km'
 'exhibition_km' 'catering_km' 'green_part_500' 'prom_part_500'
 'office_count_500' 'office_sqm_500' '
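
With 292 columns, the full listing is hard to scan. A more compact first look is to count how many columns fall under each data type. The sketch below does this on a small stand-in frame, since `train.csv` may not be at hand; `sample` and every value in it are invented for illustration, and only the column names come from the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: column names from the dataset, values invented.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "timestamp": ["2011-08-20", "2011-08-23", "2011-08-27"],
    "full_sq": [43, 34, 43],
    "life_sq": [27.0, 19.0, 29.0],
    "product_type": ["Investment", "Investment", "OwnerOccupier"],
})

# Number of columns per data type (dtype names converted to plain strings).
dtype_counts = sample.dtypes.astype(str).value_counts()
print(dtype_counts)

# The same numeric / non-numeric split as above, returned as plain Python lists.
numeric_cols = sample.select_dtypes(include=[np.number]).columns.tolist()
non_numeric_cols = sample.select_dtypes(exclude=[np.number]).columns.tolist()
print(len(numeric_cols), "numeric:", numeric_cols)
print(len(non_numeric_cols), "non-numeric:", non_numeric_cols)
```

On the real data, the same three statements can be run with `df` in place of `sample`.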