Machine Learning 4: Pandas 4: Advanced Processing 1: Handling Missing Values

那就叫老王吧

Category: Machine Learning. Tags: python, artificial intelligence, machine learning, pandas

Article link: https://blog.csdn.net/Dihuib/article/details/120481971

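I. Advanced Processing

1. Handling Missing Values

Learning Objectives

Objectives
- Describe the missing-value types in Pandas
- Use replace to substitute values
- Use dropna to delete missing values
- Use fillna to fill missing values
- Use isnull to check whether the data contains NaN
Application
- Handle missing values in the movie data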

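1 How to handle NaN

Check whether data is NaN:
- pd.isnull(df)
- pd.notnull(df)
Ways to handle it:
- Missing values exist and are np.nan:
  - 1. Delete the rows that contain missing values: dropna(axis='rows')
    - Note: this does not modify the original data; you need to capture the return value
  - 2. Fill the missing values: fillna(value, inplace=True)
    - value: the value to fill in
    - inplace: True modifies the original data; False leaves the original data unchanged and returns a new object
- Missing values are not NaN but are marked with a placeholder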
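
The calls listed above can be tried on a small table before touching the real data. This is a minimal sketch; the DataFrame and its column names are made up for illustration and are not part of the movie dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "score": [7.1, np.nan, 6.5],
    "votes": [100.0, 250.0, np.nan],
})

# Element-wise masks: True where a value is missing / present.
missing_mask = pd.isnull(df)
present_mask = pd.notnull(df)

# dropna returns a new object; df itself is left unchanged.
dropped = df.dropna(axis=0)

# fillna also returns a new object unless inplace=True is passed.
filled = df.fillna(0)

print(missing_mask)
print(len(df), len(dropped))
print(filled)
```

Because dropna and fillna return new objects by default, the original df still contains its NaN values after both calls.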

2 Handling missing values in the movie data

Getting the movie data file

import numpy as np
import pandas as pd

# Read the movie data
movie = pd.read_csv("./data/IMDB-Movie-Data.csv")
989    Martyrs    Horror    A young woman's quest for revenge against the ...    Pascal Laugier    Morjana Alaoui, Mylène Jampanoï, Catherine Bég...    2008    99    7.1    63785    NaN    89.0
990    Selma    Biography,Drama,History    A chronicle of Martin Luther King's campaign t...    Ava DuVernay    David Oyelowo, Carmen Ejogo, Tim Roth, Lorrain...    2014    128    7.5    67637    52.07    NaN

2.1 Checking whether missing values exist

pd.notnull()

pd.notnull(movie)
Rank    Title    Genre    Description    Director    Actors    Year    Runtime (Minutes)    Rating    Votes    Revenue (Millions)    Metascore
0    True    True    True    True    True    True    True    True    True    True    True    True
1    True    True    True    True    True    True    True    True    True    True    True    True
2    True    True    True    True    True    True    True    True    True    True    True    True
3    True    True    True    True    True    True    True    True    True    True    True    True
4    True    True    True    True    True    True    True    True    True    True    True    True
5    True    True    True    True    True    True    True    True    True    True    True    True
6    True    True    True    True    True    True    True    True    True    True    True    True
7    True    True    True    True    True    True    True    True    True    True    False    True
np.all(pd.notnull(movie))  # False here, because at least one value is missing
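
A full True/False table is hard to scan on 1000 rows, so counting the missing values per column is often more practical. The sketch below uses a hypothetical three-row stand-in for the movie table (the values are made up); the same calls should apply to movie itself.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the movie table (values are made up).
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Revenue (Millions)": [333.13, np.nan, 8.01],
    "Metascore": [76.0, 65.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# A single True/False answer for the whole table.
has_missing = bool(toy.isnull().to_numpy().any())

print(missing_per_column)
print(has_missing)
```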

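2.2 Missing values exist and are np.nan

1. Delete

dropna only removes values that pandas recognizes as missing (np.nan, None, or NaT), so the missing values must already be in that form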

# Does not modify the original data
movie.dropna()

# Capture the result in a new variable, or reassign it to the original name
data = movie.dropna()
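
By default dropna removes every row that has a missing value in any column, which can discard more data than intended. The following sketch, on a hypothetical toy table, contrasts the default with the subset and axis parameters of DataFrame.dropna.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the values are made up for illustration.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Revenue (Millions)": [1.0, np.nan, 3.0],
    "Metascore": [np.nan, np.nan, 50.0],
})

# Default: drop rows with a missing value in any column.
any_missing = toy.dropna()

# Drop only the rows where the revenue column is missing.
by_revenue = toy.dropna(subset=["Revenue (Millions)"])

# Drop the columns (not rows) that contain missing values.
complete_cols = toy.dropna(axis=1)

print(len(any_missing), len(by_revenue), list(complete_cols.columns))
```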

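2. Fill the missing values

# Fill the two columns that contain missing values
# Common fill values are the column mean or median
# Assigning the result back avoids calling inplace=True on a column selection, which newer pandas versions warn about
# movie['Revenue (Millions)'] = movie['Revenue (Millions)'].fillna(movie['Revenue (Millions)'].mean())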

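Fill all missing values:

for i in movie.columns:
    # Only process columns that contain missing values
    if movie[i].isnull().any():
        print(i)
        # mean() requires a numeric column; assign the result back to the column
        movie[i] = movie[i].fillna(movie[i].mean())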
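
The text mentions the median as an alternative fill value. Below is a minimal sketch of a median fill restricted to numeric columns, so that text columns such as Title are never passed to median(). The toy table is hypothetical, and select_dtypes is used here as one possible way to pick the numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the values are made up for illustration.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Revenue (Millions)": [10.0, np.nan, 30.0, 200.0],
    "Metascore": [40.0, 60.0, np.nan, 80.0],
})

# Restrict the loop to numeric columns only.
numeric_cols = toy.select_dtypes(include="number").columns

for col in numeric_cols:
    if toy[col].isnull().any():
        # The median is less sensitive to outliers (such as 200.0) than the mean.
        toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```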

2.3 Missing values that are not NaN but marked with a placeholder

In this data, missing values are marked with '?' instead of np.nan:

# The file has no header row, so header=None keeps the first record as data
wis = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", header=None)

Reading the data above may raise the following error:

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)>

Workaround:

# Disable certificate verification globally
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

Approach:

1. First replace '?' with np.nan
- df.replace(to_replace=, value=)
  - to_replace: the value to be replaced
  - value: the replacement value

# Replace missing values marked with another value by np.nan
wis = wis.replace(to_replace='?', value=np.nan)

2. Then handle the missing values

# Delete
wis = wis.dropna()
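
The two steps can be checked offline on a few rows. The sketch below reads a small in-memory CSV instead of the UCI URL; the rows and the column names (id, thickness, nuclei, label) are hypothetical and shortened for illustration. It also shows an alternative, not used in the original walkthrough: letting read_csv convert the placeholder directly through its na_values parameter.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical, shortened rows in the same style as the real file.
raw = "1000025,5,1,2\n1057013,8,?,4\n1096800,6,?,2\n1017122,8,10,4\n"
names = ["id", "thickness", "nuclei", "label"]  # illustrative names

wis_toy = pd.read_csv(io.StringIO(raw), header=None, names=names)

# '?' is an ordinary string, so pandas does not count it as missing.
missing_before = int(wis_toy.isnull().sum().sum())

# Step 1: turn the placeholder into np.nan. Step 2: drop those rows.
cleaned = wis_toy.replace(to_replace="?", value=np.nan).dropna()

# Alternative: convert the placeholder while reading.
wis_na = pd.read_csv(io.StringIO(raw), header=None, names=names, na_values="?")

print(missing_before, len(cleaned), int(wis_na["nuclei"].isnull().sum()))
```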

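3 Summary

isnull, notnull: check whether missing values exist [know]
dropna: delete missing values marked as np.nan [know]
fillna: fill missing values [know]
replace: replace specific values [know]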

II. Case Implementation

1 Handling missing values

In [3]:

movie = pd.read_csv("./data/IMDB-Movie-Data.csv")

In [4]:

movie

Out[4]:

	Rank	Title	Genre	Description	Director	Actors	Year	Runtime (Minutes)	Rating	Votes	Revenue (Millions)	Metascore
0	1	Guardians of the Galaxy	Action,Adventure,Sci-Fi	A group of intergalactic criminals are forced …	James Gunn	Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S…	2014	121	8.1	757074	333.13	76.0
1	2	Prometheus	Adventure,Mystery,Sci-Fi	Following clues to the origin of mankind, a te…	Ridley Scott	Noomi Rapace, Logan Marshall-Green, Michael Fa…	2012	124	7.0	485820	126.46	65.0
2	3	Split	Horror,Thriller	Three girls are kidnapped by a man with a diag…	M. Night Shyamalan	James McAvoy, Anya Taylor-Joy, Haley Lu Richar…	2016	117	7.3	157606	138.12	62.0
3	4	Sing	Animation,Comedy,Family	In a city of humanoid animals, a hustling thea…	Christophe Lourdelet	Matthew McConaughey,Reese Witherspoon, Seth Ma…	2016	108	7.2	60545	270.32	59.0
4	5	Suicide Squad	Action,Adventure,Fantasy	A secret government agency recruits some of th…	David Ayer	Will Smith, Jared Leto, Margot Robbie, Viola D…	2016	123	6.2	393727	325.02	40.0
…	…	…	…	…	…	…	…	…	…	…	…	…
995	996	Secret in Their Eyes	Crime,Drama,Mystery	A tight-knit team of rising investigators, alo…	Billy Ray	Chiwetel Ejiofor, Nicole Kidman, Julia Roberts…	2015	111	6.2	27585	NaN	45.0
996	997	Hostel: Part II	Horror	Three American college students studying abroa…	Eli Roth	Lauren German, Heather Matarazzo, Bijou Philli…	2007	94	5.5	73152	17.54	46.0
997	998	Step Up 2: The Streets	Drama,Music,Romance	Romantic sparks occur between two dance studen…	Jon M. Chu	Robert Hoffman, Briana Evigan, Cassie Ventura,…	2008	98	6.2	70699	58.01	50.0
998	999	Search Party	Adventure,Comedy	A pair of friends embark on a mission to reuni…	Scot Armstrong	Adam Pally, T.J. Miller, Thomas Middleditch,Sh…	2014	93	5.6	4881	NaN	22.0
999	1000	Nine Lives	Comedy,Family,Fantasy	A stuffy businessman finds himself trapped ins…	Barry Sonnenfeld	Kevin Spacey, Jennifer Garner, Robbie Amell,Ch…	2016	87	5.3	12435	19.64	11.0

1000 rows × 12 columns

In [5]:

np.any(pd.isnull(movie))  # Returns True if at least one value is missing

Out[5]:

True

In [6]:

np.all(pd.isnull(movie))  # Returns True only if every value is missing; any present value makes it False

Out[6]:

False
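
The difference between the two checks is easy to confuse, so here is a minimal sketch on a hypothetical two-row table with exactly one missing value. It converts the mask to a NumPy array first so that any() and all() reduce over the whole table.

```python
import numpy as np
import pandas as pd

# Hypothetical table with exactly one missing value.
toy = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

missing = pd.isnull(toy).to_numpy()
present = pd.notnull(toy).to_numpy()

any_missing = bool(missing.any())    # is at least one value missing?
all_missing = bool(missing.all())    # is every value missing?
none_missing = bool(present.all())   # is the table complete?

print(any_missing, all_missing, none_missing)
```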

In [7]:

data = movie.dropna()  # dropna() deletes rows with missing values; it only removes values pandas recognizes as missing, such as np.nan

In [8]:

np.any(pd.isnull(data))  # Check again

Out[8]:

False

In [9]:

movie["Revenue (Millions)"].head(10)
# The value at index 7 is missing

Out[9]:

0    333.13
1    126.46
2    138.12
3    270.32
4    325.02
5     45.13
6    151.06
7       NaN
8      8.01
9    100.01
Name: Revenue (Millions), dtype: float64

In [10]:

# Fill missing values with the column mean; mean() computes the average
movie["Revenue (Millions)"] = movie["Revenue (Millions)"].fillna(movie["Revenue (Millions)"].mean())

In [11]:

movie["Revenue (Millions)"].head(10)

Out[11]:

0    333.130000
1    126.460000
2    138.120000
3    270.320000
4    325.020000
5     45.130000
6    151.060000
7     82.956376
8      8.010000
9    100.010000
Name: Revenue (Millions), dtype: float64

In [12]:

# Fill the missing values in every column
for i in movie.columns:
    print(i)  # Print every column name
    if movie[i].isnull().any():
        print(i)  # Print again only for columns that contain missing values
        movie[i] = movie[i].fillna(value=movie[i].mean())
Rank
Title
Genre
Description
Director
Actors
Year
Runtime (Minutes)
Rating
Votes
Revenue (Millions)
Metascore
Metascore

In [13]:

# Check again for missing values
# Returns True if at least one value is missing
np.any(pd.isnull(movie))

Out[13]:

False

In [14]:

wis = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", header=None)
wis.head()

In [15]:

wis = wis.replace(to_replace="?", value=np.nan)
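
After the replacement, a column that contained '?' still holds its numbers as strings, so numeric operations such as mean() are not yet available on it. One possible follow-up, sketched here on a hypothetical single column, is to convert it with pd.to_numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical column read as text because of the '?' placeholder.
col = pd.Series(["1", "?", "10"], name="nuclei")

# Replace the placeholder, then convert the remaining strings to numbers.
as_nan = col.replace(to_replace="?", value=np.nan)
numeric = pd.to_numeric(as_nan)

print(numeric.dtype)
print(numeric.mean())
```

The resulting column has a floating-point dtype, so fillna with the mean or median can then be applied as in the movie example.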
