【数据清理与特征工程】2-数据清理

超级虚空

已于 2022-07-27 19:06:32 修改

阅读量1.6k

点赞数

文章标签： python pandas 数据分析

于 2022-07-27 19:06:05 首次发布

本文链接：https://blog.csdn.net/m0_49376775/article/details/126021336

版权

数据准备和特征工程专栏收录该内容

4 篇文章 0 订阅

订阅专栏

2-0 基本概念

import pandas as pd

# 数据下载地址：https://aistudio.baidu.com/aistudio/projectdetail/4359784
df = pd.read_csv("data/pm2.csv")
df.sample(10)

	RANK	CITY_ID	CITY_NAME	Exposed days
42	47	610	嘉峪关	49
134	147	98	辽阳	100
104	116	91	锦州	88
36	41	487	阳江	47
31	36	495	揭阳	45
238	261	377	漯河	197
192	215	204	宿迁	140
220	243	252	阜阳	172
158	180	544	广元	119
110	122	478	肇庆	89

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   RANK          264 non-null    int64 
 1   CITY_ID       264 non-null    int64 
 2   CITY_NAME     264 non-null    object
 3   Exposed days  264 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 8.4+ KB

数据一共264行4列，其中不存在缺失值，

2-1 转化数据类型

注意下面的代码，我们未定义数据类型时默认为object

df = pd.DataFrame([{'col1':'a', 'col2':'1'},
                   {'col1':'b', 'col2':'2'}])
df.dtypes

col1    object
col2    object
dtype: object

现在来改变数据类型，使用astype函数

df['col2-int'] = df['col2'].astype(int)    # ①
df

	col1	col2	col2-int
0	a	1	1
1	b	2	2

现在我们成功将数据类型设置为int了

df.dtypes

col1        object
col2        object
col2-int     int32
dtype: object

考虑另一种情况，将不同数据类型看作一个数据类型

s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
try:
    s.astype(float)
except ValueError:
    print("ValueError")

ValueError

发现不能将 string 转化为 float

我们可以忽略错误并转换，得到 object
或者使用to_numeric函数，强制转换

#
# s.astype(float, errors='ignore')
pd.to_numeric(s, errors='coerce')

0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64

案例

import pandas as pd

df = pd.read_csv('data/sales_types.csv')  # 教材中数据集文件名称为sales_data_types.csv，此处为适应平台要求稍作修改
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Customer Number  5 non-null      float64
 1   Customer Name    5 non-null      object 
 2   2016             5 non-null      object 
 3   2017             5 non-null      object 
 4   Percent Growth   5 non-null      object 
 5   Jan Units        5 non-null      object 
 6   Month            5 non-null      int64  
 7   Day              5 non-null      int64  
 8   Year             5 non-null      int64  
 9   Active           5 non-null      object 
dtypes: float64(1), int64(3), object(6)
memory usage: 528.0+ bytes

df

	Customer Number	Customer Name	2016	2017	Percent Growth	Jan Units	Month	Day	Year	Active
0	10002.0	Quest Industries	$125,000.00	$162500.00	30.00%	500	1	10	2015	Y
1	552278.0	Smith Plumbing	$920,000.00	$101,2000.00	10.00%	700	6	15	2014	Y
2	23477.0	ACME Industrial	$50,000.00	$62500.00	25.00%	125	3	29	2016	Y
3	24900.0	Brekke LTD	$350,000.00	$490000.00	4.00%	75	10	27	2015	Y
4	651029.0	Harbor Co	$15,000.00	$12750.00	-15.00%	Closed	2	2	2014	N

分析数据发现有几个 object 对象没有数据类型

Customer Number

应该是ID之类的

df['Customer Number'] = df['Customer Number'].astype(int).astype(str)

Percent Growth

df['Percent Growth'] = df['Percent Growth'].apply(lambda x: float(x.replace("%", "")) / 100)

2016&2017

def cvt_money(val):
    new_value = val.replace("$","").replace(",","")  # ②
    return float(new_value)
df['2016'] = df['2016'].apply(cvt_money)
df['2017'] = df['2017'].apply(cvt_money)

Jan Units

df['Jan Units'] = pd.to_numeric(df['Jan Units'], errors='coerce')

Month-day-year

df['date'] = pd.to_datetime(df[['Month', 'Day', 'Year']])

df['date']

0   2015-01-10
1   2014-06-15
2   2016-03-29
3   2015-10-27
4   2014-02-02
Name: date, dtype: datetime64[ns]

Active

import numpy as np
df['Active'] = np.where(df['Active']=='Y', 1, 0)

df

	Customer Number	Customer Name	2016	2017	Percent Growth	Jan Units	Month	Day	Year	Active	date
0	10002	Quest Industries	125000.0	162500.0	0.30	500.0	1	10	2015	1	2015-01-10
1	552278	Smith Plumbing	920000.0	1012000.0	0.10	700.0	6	15	2014	1	2014-06-15
2	23477	ACME Industrial	50000.0	62500.0	0.25	125.0	3	29	2016	1	2016-03-29
3	24900	Brekke LTD	350000.0	490000.0	0.04	75.0	10	27	2015	1	2015-10-27
4	651029	Harbor Co	15000.0	12750.0	-0.15	NaN	2	2	2014	0	2014-02-02

练习

movies = pd.read_csv("data/movies.csv", index_col=0)
movies

	上映日期	片名	类型	制片国家/地区	想看	ID	导演	主演
0	05月31日	哥斯拉2：怪兽之王	动作 / 科幻 / 冒险	美国	40734人	25890017	迈克尔·道赫蒂	维拉·法米加\|米莉·波比·布朗\|章子怡\|莎莉·霍金斯\|布莱德利·惠特福德\|查尔斯·丹斯\|凯尔...
1	05月31日	尺八·一声一世	纪录片 / 音乐	中国大陆	5305人	27185648	聿馨	佐藤康夫\|小凑昭尚\|蔡鸿文\|徐浩鹏\|海山\|三桥贵风\|星梵竹\|三冢幸彦\|梁文道\|陆川\|龚琳娜
2	05月31日	卡拉斯：为爱而声	纪录片	法国	2047人	27089205	汤姆·沃尔夫	玛丽亚·卡拉斯\|维托里奥·德·西卡\|亚里士多德·奥纳西斯\|皮埃尔·保罗·帕索里尼\|奥马尔·沙...
3	05月31日	托马斯大电影之世界探险记	儿童 / 动画	英国	972人	30236340	大卫·斯特登	蒂娜·德赛\|约瑟夫·梅\|泰莉莎·加拉赫\|凯瑞·莎勒\|约翰·哈斯勒\|大卫·麦金\|金宝·张\|彼得...
4	05月31日	花儿与歌声	剧情 / 儿童 / 家庭	中国大陆	136人	33393269	王蕾	魏歆惠\|刘晨毅\|王润泽\|曹一诺\|周琳翌\|周北辰\|曹德祥\|郑陈皓淼
...	...	...	...	...	...	...	...	...
91	2020年01月25日	囧妈	剧情 / 喜剧	中国大陆	6987人	30306570	徐峥	徐峥\|王祖蓝\|彭昱畅\|潘虹
92	2020年01月25日	中国女排	剧情	中国大陆 / 香港	2671人	30128916	陈可辛	巩俐
93	2020年01月25日	大红包	喜剧 / 爱情	中国大陆	20人	33457717	李克龙	包贝尔\|李成敏\|许君聪\|王小利\|廖蔚蔚
94	2020年06月21日	六月的秘密	剧情 / 悬疑 / 音乐	中国大陆 / 美国	542人	30216731	王暘	郭富城\|苗苗\|吴建飞
95	2020年10月01日	黑色假面	剧情 / 悬疑	中国大陆	3957人	26986136	NaN	NaN

96 rows × 8 columns

movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96 entries, 0 to 95
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   上映日期     96 non-null     object
 1   片名       96 non-null     object
 2   类型       96 non-null     object
 3   制片国家/地区  96 non-null     object
 4   想看       96 non-null     object
 5   ID       96 non-null     int64 
 6   导演       93 non-null     object
 7   主演       88 non-null     object
dtypes: int64(1), object(7)
memory usage: 6.8+ KB

movies['想看'] = movies['想看'].apply(lambda x: int(x.replace('人', "")))

练习

movies = pd.read_csv("data/movies.csv", index_col=0)
movies

movies.info()

movies['想看'] = movies['想看'].apply(lambda x: int(x.replace('人', "")))

2-2 处理重复数据

import pandas as pd
d = {'Name':['Newton', 'Galilei', 'Einstein', 'Feynman', 'Newton', 'Maxwell', 'Galilei'],
             'Age':[26, 30, 28, 28, 26, 39, 40],
             'Score':[90, 80, 90, 100, 90, 70, 90]}
df = pd.DataFrame(d,columns=['Name','Age','Score'])
df

基础知识

使用duplicated判断是不是重复的数据

df.duplicated()

df.drop_duplicates()

df.drop_duplicates('Age', keep='last')

df[df.duplicated()].count() / df.count()

项目案例

cpi = pd.read_excel("data/cpi.xls")
cpi.columns = cpi.iloc[1]    
cpi = cpi[2:]    
cpi.drop([11, 12], axis=0, inplace=True)    
cpi['cpi_index'] = ['总体消费', '食品烟酒', '衣着', '居住', '生活服务', '交通通信', '教育娱乐', '医保', '其他']
cpi.drop(['指标'], axis=1, inplace=True)    
cpi.reset_index(drop=True, inplace=True)    
cpi.columns.rename('', inplace=True)    
cpi

dup_ratio = []
for column in cpi.columns:
    col = cpi[column]
    ratio = col[col.duplicated()].count() / col.count()
    dup_ratio.append(round(ratio, 2))
dup_ratio

dr = pd.Series(dup_ratio, index=cpi.columns)
dr

2-3 处理缺失数据

在进行数据挖掘的过程中，理解和清洗数据是最耗费时间的事情。你应该知道数据是如何产生的，哪些特征对业务有影响，只有这样你才能给出最好的数据结果。

在本文中，我们将介绍缺失值的产生原因和缺失值具体的处理方法。

问题1：为什么有缺失值？

现实世界中的数据在大多数情况下都有很多缺失的数据。每个值丢失的原因可能不同。可能有数据丢失或损坏，或者也可能有特定原因。

缺失值产生的原因

数据丢失背后的一些可能原因（产生过程、传输过程、存储过程）：

人们不会在数据收集调查中提供有关某些问题的信息。
数据是从各种可用的过去记录中积累的，而不是直接积累的。
数据收集过程中的不准确也会导致数据丢失。

缺失值类型

数据丢失的原因多种多样，但整体可以将它们分为三个主要组：完全随机丢失、随机丢失、不随机丢失。

完全随机缺失 (MCAR)

现象：缺失的数据不遵循任何特定模式，它们只是随机的。
特点：不可能用其余的变量数据来预测这些值，数据的缺失与其余变量无关或独立。
案例：如在数据收集过程中，由于粗心大意丢失了特定样本

随机丢失 (MAR)

现象：数据在特定子集中丢失
特点：可以借助其他功能来预测数据是否存在/不存在，无法自己预测丢失的数据。
案例：如在数据收集过程中，有一些默认选项，可以不做填写

不随机丢失 (NMAR)

现象：确实的数据遵循某种模式，且与数据样本相关。
特点：删除行/列、插补等常用方法将不起作用，缺失的数据与字段相关。
案例：如在数据收集中，采集者根据字段来选择填写某些字段。

问题2：如何分析缺失值？

在Pandas中可以很方便的使用isnull函数来计算是否包含缺失值。

missing_values=train.isnull().sum()

同时也可以使用missingno库来查看缺失值的分布规律：

bar：统计每列缺失值的次数
matrix：统计缺失值和行数分布规律
heatmap：统计列缺失值的相关性
dendrogram：统计列确实的组合性

问题3：缺失值需要处理吗？

处理缺失值可以从两个角度考虑：

从数据角度：如果某列的缺失比例大于某一阈值（如大于90%），则可以考虑剔除列；类似的对行的角度也可以这样操作。
从模型角度：如果使用树模型则不用考虑处理，其他模型则需要进行填充或者剔除。

问题4：缺失值如何填充？

使用值填充

使用特殊值填充是最简单的填充方法，主要的优势是速度，可能会带来一定的噪音。

数值列：中位数、中位数、特殊值
类别列：众数、特殊值

使用模型预测

数据样本各列之间存在联系，此时可以从列与列的关系完成缺失值填充。

使用回归/分类模型预测列：使用其他列作为特征，待填充列作为标签；
使用自编码器预测缺失值：使用缺失数据作为输入，完整数据作为标签，完成自监督训练。

df = pd.DataFrame({"one":[10, 11, 12], 'two':[np.nan, 21, 22], "three":[30, np.nan, 33]})
df

df = pd.DataFrame({'ColA':[1, np.nan, np.nan, 4, 5, 6, 7], 'ColB':[1, 1, 1, 1, 2, 2, 2]})
df['ColA'].fillna(method='ffill')

df['ColA'].fillna(method='bfill')

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("/home/aistudio/data/data20510/experiment.csv", index_col=0)

fig, ax = plt.subplots()
ax.scatter(df['alpha'], df['belta'])

基础知识

使用duplicated判断是不是重复的数据

df.duplicated()

0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool

df.drop_duplicates()

	Name	Age	Score
0	Newton	26	90
1	Galilei	30	80
2	Einstein	28	90
3	Feynman	28	100
5	Maxwell	39	70
6	Galilei	40	90

df.drop_duplicates('Age', keep='last')

	Name	Age	Score
1	Galilei	30	80
3	Feynman	28	100
4	Newton	26	90
5	Maxwell	39	70
6	Galilei	40	90

df[df.duplicated()].count() / df.count()

Name     0.142857
Age      0.142857
Score    0.142857
dtype: float64

项目案例

cpi = pd.read_excel("data/cpi.xls")
cpi.columns = cpi.iloc[1]    
cpi = cpi[2:]    
cpi.drop([11, 12], axis=0, inplace=True)    
cpi['cpi_index'] = ['总体消费', '食品烟酒', '衣着', '居住', '生活服务', '交通通信', '教育娱乐', '医保', '其他']
cpi.drop(['指标'], axis=1, inplace=True)    
cpi.reset_index(drop=True, inplace=True)    
cpi.columns.rename('', inplace=True)    
cpi

	2019年3月	2019年2月	2019年1月	2018年12月	2018年11月	2018年10月	2018年9月	2018年8月	2018年7月	2018年6月	2018年5月	2018年4月	cpi_index
0	102.3	101.5	101.7	101.9	102.2	102.5	102.5	102.3	102.1	101.9	101.8	101.8	总体消费
1	103.5	101.2	102	102.4	102.5	102.9	103	101.9	101	100.8	100.7	101.1	食品烟酒
2	102	102	101.6	101.5	101.4	101.4	101.2	101.3	101.2	101.1	101.1	101.1	衣着
3	102.1	102.2	102.1	102.2	102.4	102.5	102.6	102.5	102.4	102.3	102.2	102.2	居住
4	101.2	101.3	101.5	101.4	101.5	101.5	101.6	101.6	101.6	101.5	101.5	101.5	生活服务
5	100.1	98.8	98.7	99.3	101.6	103.2	102.8	102.7	103	102.4	101.8	101.1	交通通信
6	102.4	102.4	102.9	102.3	102.5	102.5	102.2	102.6	102.3	101.8	101.9	102	教育娱乐
7	102.7	102.8	102.7	102.5	102.6	102.6	102.7	104.3	104.6	105	105.1	105.2	医保
8	101.9	102	102.3	101.6	101.5	101.3	100.7	101.2	101.2	100.9	101	100.9	其他

dup_ratio = []
for column in cpi.columns:
    col = cpi[column]
    ratio = col[col.duplicated()].count() / col.count()
    dup_ratio.append(round(ratio, 2))
dup_ratio

[0.0, 0.11, 0.0, 0.0, 0.22, 0.22, 0.0, 0.0, 0.11, 0.0, 0.11, 0.22, 0.0]

dr = pd.Series(dup_ratio, index=cpi.columns)
dr

2019年3月      0.00
2019年2月      0.11
2019年1月      0.00
2018年12月     0.00
2018年11月     0.22
2018年10月     0.22
2018年9月      0.00
2018年8月      0.00
2018年7月      0.11
2018年6月      0.00
2018年5月      0.11
2018年4月      0.22
cpi_index    0.00
dtype: float64

df = pd.DataFrame({"one":[10, 11, 12], 'two':[np.nan, 21, 22], "three":[30, np.nan, 33]})
df

	one	two	three
0	10	NaN	30.0
1	11	21.0	NaN
2	12	22.0	33.0

df = pd.DataFrame({'ColA':[1, np.nan, np.nan, 4, 5, 6, 7], 'ColB':[1, 1, 1, 1, 2, 2, 2]})
df['ColA'].fillna(method='ffill')

0    1.0
1    1.0
2    1.0
3    4.0
4    5.0
5    6.0
6    7.0
Name: ColA, dtype: float64

2-4 处理离群数据

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data//experiment.csv", index_col=0)

fig, ax = plt.subplots()
ax.scatter(df['alpha'], df['belta'])

<matplotlib.collections.PathCollection at 0x17285178520>

png

# https://github.com/mwaskom/seaborn-data.git
# 如果报错，请去官网下载并解压到 seaborn-data文件夹下，一般目录为 C:\Users\23859\seaborn-data
import seaborn as sns
sns.set(style="whitegrid")

tips = sns.load_dataset("tips")    #加载数据集
tips.sample(5)

	total_bill	tip	sex	smoker	day	time	size
84	15.98	2.03	Male	No	Thur	Lunch	2
119	24.08	2.92	Female	No	Thur	Lunch	4
168	10.59	1.61	Female	Yes	Sat	Dinner	2
209	12.76	2.23	Female	Yes	Sat	Dinner	2
138	16.00	2.00	Male	Yes	Thur	Lunch	2

画出箱线图，发现存在离散值

sns.boxplot(x="day", y="tip", data=tips, palette="Set3")

<AxesSubplot:xlabel='day', ylabel='tip'>

png

ax = sns.boxplot(x="day", y="tip", data=tips)
ax = sns.swarmplot(x="day", y="tip", data=tips, color=".25")

d:\CS\Apps\anaconda\anaconda3\envs\ai\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 6.5% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
d:\CS\Apps\anaconda\anaconda3\envs\ai\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 5.7% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)

png

项目案例

让我们看看著名的波士顿房价

# 加载数据集
from sklearn.datasets import load_boston
import pandas as pd
boston = load_boston()
x = boston.data
y = boston.target
columns = boston.feature_names

#为了操作方便，将数据集转化为DataFrame类型
boston_df = pd.DataFrame(boston.data)    
boston_df.columns = columns
boston_df.head()

d:\CS\Apps\anaconda\anaconda3\envs\ai\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.

    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_housing
        housing = fetch_california_housing()

    for the California housing dataset and::

        from sklearn.datasets import fetch_openml
        housing = fetch_openml(name="house_prices", as_frame=True)

    for the Ames housing dataset.
    
  warnings.warn(msg, category=FutureWarning)

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33

boston_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
dtypes: float64(13)
memory usage: 51.5 KB

percentlier = boston_df.quantile([0, 0.25, 0.5, 0.75, 1], axis=0)    # ①
IQR = percentlier.iloc[3] - percentlier.iloc[1]
IQR

CRIM         3.595038
ZN          12.500000
INDUS       12.910000
CHAS         0.000000
NOX          0.175000
RM           0.738000
AGE         49.050000
DIS          3.088250
RAD         20.000000
TAX        387.000000
PTRATIO      2.800000
B           20.847500
LSTAT       10.005000
dtype: float64

Q1 = percentlier.iloc[1]    #下四分位
Q3 = percentlier.iloc[3]    #上四分位
(boston_df < (Q1 - 1.5 * IQR)).any()    # ②

CRIM       False
ZN         False
INDUS      False
CHAS       False
NOX        False
RM          True
AGE        False
DIS        False
RAD        False
TAX        False
PTRATIO     True
B           True
LSTAT      False
dtype: bool

(boston_df > (Q3 + 1.5 * IQR)).any()

CRIM        True
ZN          True
INDUS      False
CHAS        True
NOX        False
RM          True
AGE        False
DIS         True
RAD        False
TAX        False
PTRATIO    False
B          False
LSTAT       True
dtype: bool

boston_df_out = boston_df[~((boston_df < (Q1 - 1.5 * IQR)) |(boston_df > (Q3 + 1.5 * IQR))).any(axis=1)]
boston_df_out.shape

(274, 13)

# 计算z值
from scipy import stats    #统计专用模块
import numpy as np
rm = boston_df['RM']
z = np.abs(stats.zscore(rm))    # ③
st = boston_df['RM'].std()    # ④
st

0.7026171434153237

threshold = 3 * st   #阈值，不是“阀值”
print(np.where(z > threshold))    # ⑤

(array([ 97,  98, 162, 163, 166, 180, 186, 195, 203, 204, 224, 225, 226,
       232, 233, 253, 257, 262, 267, 280, 283, 364, 365, 367, 374, 384,
       386, 406, 412, 414], dtype=int64),)

rm_in = rm[(z < threshold)]    # ⑥
rm_in.shape

(476,)

超级虚空

关注

0
点赞
踩
3

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录

【数据清理与特征工程】2-数据清理

文章目录

2-0 基本概念

2-1 转化数据类型

案例

Customer Number

Percent Growth

2016&2017

Jan Units

Month-day-year

Active

练习

练习

2-2 处理重复数据

基础知识

项目案例

2-3 处理缺失数据

问题1：为什么有缺失值？

缺失值产生的原因

缺失值类型

完全随机缺失 (MCAR)

随机丢失 (MAR)

不随机丢失 (NMAR)

问题2：如何分析缺失值？

问题3：缺失值需要处理吗？

问题4：缺失值如何填充？

使用值填充

最近邻样本填充

使用模型预测

基础知识

项目案例

2-4 处理离群数据

项目案例