Machine Learning 4: Pandas, Part 4: Advanced Processing 1: Handling Missing Values

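I. Advanced Processing

1. Handling Missing Values

Learning objectives

  • Objectives
    • Describe the missing-value types in Pandas
    • Use replace to substitute values
    • Use dropna to delete missing values
    • Use fillna to fill missing values
    • Use isnull to check whether the data contains missing values (NaN)
  • Application
    • Handle the missing values in the movie dataset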


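1 How to handle NaN

  • Check whether the data contains NaN:
    • pd.isnull(df)
    • pd.notnull(df)
  • Handling options:
    • The missing values are NaN, stored as np.nan:
      • 1. Delete the rows that contain missing values: dropna(axis='rows')
        • Note: this does not modify the original data; you must capture the return value
      • 2. Replace the missing values: fillna(value, inplace=True)
        • value: the value to fill in
        • inplace: True modifies the original data; False leaves the original data unchanged and returns a new object
    • The missing values are not NaN but are marked with a placeholder (such as '?')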
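
The options above can be sketched on a small, made-up DataFrame. The column names and values are illustrative assumptions, not the movie data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from the movie dataset
df = pd.DataFrame({"score": [7.5, np.nan, 6.0], "votes": [100.0, 250.0, np.nan]})

# Element-wise check: True where a value is missing
missing_mask = pd.isnull(df)

# Option 1: drop every row that contains a missing value (returns a new object)
dropped = df.dropna(axis=0)

# Option 2: fill missing values with a constant (returns a new object by default)
filled = df.fillna(0)

print(missing_mask)
print(dropped)
print(filled)
```

Neither call touches df itself here, which is why both results are assigned to new names.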

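2 Handling missing values in the movie data

  • Loading the movie data file
import numpy as np
import pandas as pd
# Read the movie data
movie = pd.read_csv("./data/IMDB-Movie-Data.csv")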
989    Martyrs    Horror    A young woman's quest for revenge against the ...    Pascal Laugier    Morjana Alaoui, Mylène Jampanoï, Catherine Bég...    2008    99    7.1    63785    NaN    89.0
990    Selma    Biography,Drama,History    A chronicle of Martin Luther King's campaign t...    Ava DuVernay    David Oyelowo, Carmen Ejogo, Tim Roth, Lorrain...    2014    128    7.5    67637    52.07    NaN

2.1 Checking whether missing values exist

  • pd.notnull()
pd.notnull(movie)
Rank    Title    Genre    Description    Director    Actors    Year    Runtime (Minutes)    Rating    Votes    Revenue (Millions)    Metascore
0    True    True    True    True    True    True    True    True    True    True    True    True
1    True    True    True    True    True    True    True    True    True    True    True    True
2    True    True    True    True    True    True    True    True    True    True    True    True
3    True    True    True    True    True    True    True    True    True    True    True    True
4    True    True    True    True    True    True    True    True    True    True    True    True
5    True    True    True    True    True    True    True    True    True    True    True    True
6    True    True    True    True    True    True    True    True    True    True    True    True
7    True    True    True    True    True    True    True    True    True    True    False    True
np.all(pd.notnull(movie))
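
A boolean table of this size is hard to scan by eye, and np.all reduces it to a single answer. As a rough sketch on a made-up frame (the names `toy`, `all_present`, and `missing_per_column` are illustrative), summing the mask per column shows which columns contain missing values and how many:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few movie columns
toy = pd.DataFrame({
    "Rating": [8.1, 7.0, 7.3],
    "Revenue": [333.13, np.nan, 138.12],
})

# Single answer: is every value present?
all_present = bool(np.all(pd.notnull(toy).to_numpy()))

# Per-column count of missing values (True counts as 1 when summed)
missing_per_column = pd.isnull(toy).sum()

print(all_present)
print(missing_per_column)
```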

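2.2 The missing values are NaN, stored as np.nan

  • 1. Deletion

dropna only removes values that pandas recognizes as missing (such as np.nan), so any other placeholder must first be converted to np.nan.

# Does not modify the original data
movie.dropna()

# Assign the result to a new variable, or reuse the original variable name
data = movie.dropna()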
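
Called without arguments, dropna() drops a row as soon as any column in it is missing. As a hedged sketch on made-up data (the frame and its values are assumptions for illustration), the `subset` and `how` parameters narrow that behaviour:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the movie dataset
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Revenue": [10.0, np.nan, np.nan],
    "Metascore": [50.0, 60.0, np.nan],
})

# Drop rows only when "Revenue" is missing
only_revenue = toy.dropna(subset=["Revenue"])

# Drop rows only when both listed columns are missing
both_missing = toy.dropna(how="all", subset=["Revenue", "Metascore"])

print(only_revenue)
print(both_missing)
```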
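  • 2. Replacing missing values
# Fill the two columns that contain missing values
# Fill with the mean or the median
# movie['Revenue (Millions)'] = movie['Revenue (Millions)'].fillna(movie['Revenue (Millions)'].mean())

To replace the missing values in every column:

for i in movie.columns:
    # Only columns that contain missing values need filling
    if movie[i].isnull().any():
        print(i)
        # Assign the result back; fillna(inplace=True) on a selected column may not update movie
        movie[i] = movie[i].fillna(movie[i].mean())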
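
The loop above relies on every column with missing values being numeric, because mean() is not meaningful for text columns. A minimal sketch that restricts the fill to numeric columns, using a made-up frame (the data and the name `numeric_cols` are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Revenue": [10.0, np.nan, 30.0],
    "Metascore": [50.0, 70.0, np.nan],
})

# Select numeric columns only, then fill each with its own column mean
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

print(toy)
```

Passing a Series of means to fillna fills each column with the value that matches its label, so no explicit loop is needed.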

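2.3 The missing values are not NaN but are marked with a placeholder

In this dataset the missing entries are marked with '?' instead of NaN: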

wis = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data")

Reading this data may raise the following error:

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)>

Workaround:

# Disable certificate verification globally
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

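Approach:

  • 1. First replace '?' with np.nan
    • df.replace(to_replace=, value=)
      • to_replace: the value to be replaced
      • value: the replacement value
# Convert missing values marked by other placeholders into np.nan
wis = wis.replace(to_replace='?', value=np.nan)
  • 2. Then handle the missing values
# Delete
wis = wis.dropna()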
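
As an alternative sketch, read_csv can convert a placeholder to NaN while parsing via its `na_values` parameter, which folds step 1 into the read. The in-memory rows and column names below are made up for illustration and are not the real dataset:

```python
import io

import pandas as pd

# Made-up rows imitating a headerless CSV that marks missing entries with '?'
raw = io.StringIO("1,5,1,2\n2,?,3,4\n3,1,?,2\n")

# Column names are hypothetical; na_values tells the parser to treat '?' as NaN
toy = pd.read_csv(raw, header=None, names=["id", "a", "b", "label"], na_values="?")

# The usual missing-value tools now apply directly
cleaned = toy.dropna()

print(toy)
print(cleaned)
```

Because '?' never reaches the DataFrame as a string, the affected columns are parsed as numbers rather than as text.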

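3 Summary

  • isnull and notnull check whether missing values exist [know]
  • dropna deletes missing values marked as np.nan [know]
  • fillna fills missing values [know]
  • replace substitutes specific values [know]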

II. Worked Example

1 Handling missing values

In [3]:

movie = pd.read_csv("./data/IMDB-Movie-Data.csv")

In [4]:

movie

Out[4]:

Rank    Title    Genre    Description    Director    Actors    Year    Runtime (Minutes)    Rating    Votes    Revenue (Millions)    Metascore
0    1    Guardians of the Galaxy    Action,Adventure,Sci-Fi    A group of intergalactic criminals are forced …    James Gunn    Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S…    2014    121    8.1    757074    333.13    76.0
1    2    Prometheus    Adventure,Mystery,Sci-Fi    Following clues to the origin of mankind, a te…    Ridley Scott    Noomi Rapace, Logan Marshall-Green, Michael Fa…    2012    124    7.0    485820    126.46    65.0
2    3    Split    Horror,Thriller    Three girls are kidnapped by a man with a diag…    M. Night Shyamalan    James McAvoy, Anya Taylor-Joy, Haley Lu Richar…    2016    117    7.3    157606    138.12    62.0
3    4    Sing    Animation,Comedy,Family    In a city of humanoid animals, a hustling thea…    Christophe Lourdelet    Matthew McConaughey,Reese Witherspoon, Seth Ma…    2016    108    7.2    60545    270.32    59.0
4    5    Suicide Squad    Action,Adventure,Fantasy    A secret government agency recruits some of th…    David Ayer    Will Smith, Jared Leto, Margot Robbie, Viola D…    2016    123    6.2    393727    325.02    40.0
...
995    996    Secret in Their Eyes    Crime,Drama,Mystery    A tight-knit team of rising investigators, alo…    Billy Ray    Chiwetel Ejiofor, Nicole Kidman, Julia Roberts…    2015    111    6.2    27585    NaN    45.0
996    997    Hostel: Part II    Horror    Three American college students studying abroa…    Eli Roth    Lauren German, Heather Matarazzo, Bijou Philli…    2007    94    5.5    73152    17.54    46.0
997    998    Step Up 2: The Streets    Drama,Music,Romance    Romantic sparks occur between two dance studen…    Jon M. Chu    Robert Hoffman, Briana Evigan, Cassie Ventura,…    2008    98    6.2    70699    58.01    50.0
998    999    Search Party    Adventure,Comedy    A pair of friends embark on a mission to reuni…    Scot Armstrong    Adam Pally, T.J. Miller, Thomas Middleditch,Sh…    2014    93    5.6    4881    NaN    22.0
999    1000    Nine Lives    Comedy,Family,Fantasy    A stuffy businessman finds himself trapped ins…    Barry Sonnenfeld    Kevin Spacey, Jennifer Garner, Robbie Amell,Ch…    2016    87    5.3    12435    19.64    11.0

1000 rows × 12 columns

In [5]:

np.any(pd.isnull(movie))  # Returns True if at least one value is missing

Out[5]:

True

In [6]:

np.all(pd.isnull(movie))  # Returns True only if every value is missing; a single non-missing value makes it False

Out[6]:

False
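
The difference between np.any and np.all on a missing-value mask can be checked on two tiny made-up frames (all names below are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical frames: one partly missing, one entirely missing
some_missing = pd.DataFrame({"x": [1.0, np.nan]})
all_missing = pd.DataFrame({"x": [np.nan, np.nan]})

mask_some = pd.isnull(some_missing).to_numpy()
mask_all = pd.isnull(all_missing).to_numpy()

any_some = bool(np.any(mask_some))  # at least one value is missing
all_some = bool(np.all(mask_some))  # not every value is missing
all_all = bool(np.all(mask_all))    # every value is missing

print(any_some, all_some, all_all)
```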

In [7]:

data = movie.dropna()  # dropna() deletes rows with missing values; the missing values must be stored as np.nan

In [8]:

np.any(pd.isnull(data))  # Check again

Out[8]:

False

In [9]:

movie["Revenue (Millions)"].head(10)
# A missing value shows up at index 7

Out[9]:

0    333.13
1    126.46
2    138.12
3    270.32
4    325.02
5     45.13
6    151.06
7       NaN
8      8.01
9    100.01
Name: Revenue (Millions), dtype: float64

In [10]:

# Replace missing values with the mean; mean() computes the average
movie["Revenue (Millions)"] = movie["Revenue (Millions)"].fillna(movie["Revenue (Millions)"].mean())

In [11]:

movie["Revenue (Millions)"].head(10)

Out[11]:

0    333.130000
1    126.460000
2    138.120000
3    270.320000
4    325.020000
5     45.130000
6    151.060000
7     82.956376
8      8.010000
9    100.010000
Name: Revenue (Millions), dtype: float64

In [12]:

# Fill the missing values in every column
for i in movie.columns:
    print(i)
    if movie[i].isnull().any():
        print(i)
        movie[i] = movie[i].fillna(value=movie[i].mean())
Rank
Title
Genre
Description
Director
Actors
Year
Runtime (Minutes)
Rating
Votes
Revenue (Millions)
Metascore
Metascore

In [13]:

# Check again for missing values
# Returns True if at least one value is missing
np.any(pd.isnull(movie))

Out[13]:

False

In [14]:

wis = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data")
wis.head()

In [15]:

wis = wis.replace(to_replace="?", value=np.nan)