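I. Advanced Processing
1. Handling Missing Values
Learning objectives
- Objectives
  - Describe the missing-value types in pandas
  - Use replace to substitute values
  - Use dropna to delete missing values
  - Use fillna to fill missing values
  - Use isnull to check whether the data contains NaN
- Application
  - Handle the missing values in the movie dataset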
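1 How to handle NaN
- Check whether values are NaN:
  - pd.isnull(df)
  - pd.notnull(df)
- How to handle them:
  - The missing values are NaN, i.e. np.nan:
    - 1. Delete the rows that contain missing values: dropna(axis='rows')
      - Note: this does not modify the original data; you need to capture the return value
    - 2. Replace the missing values: fillna(value, inplace=True)
      - value: the value to fill in
      - inplace: True modifies the original data; False leaves the original data unchanged and returns a new object
  - The missing values are not NaN but are marked with a default placeholder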
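The calls listed above can be tried on a small table first. The following is a minimal sketch; the DataFrame `df` and its columns are hypothetical toy data, not part of the movie dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the movie dataset
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "revenue": [10.0, np.nan, 30.0],
})

# Element-wise masks: True where a value is missing / present
missing_mask = pd.isnull(df)
present_mask = pd.notnull(df)

# dropna returns a new object; df itself is unchanged
dropped = df.dropna(axis="index")

# fillna also returns a new object unless inplace=True is passed
filled = df.fillna(value=0)
```

Here `dropped` holds only the two complete rows, while `df` still has all three, which is why the return value has to be captured.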
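2 Handling missing values in the movie data
- Getting the movie data file
import numpy as np
import pandas as pd

# Read the movie data
movie = pd.read_csv("./data/IMDB-Movie-Data.csv")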
989 Martyrs Horror A young woman's quest for revenge against the ... Pascal Laugier Morjana Alaoui, Mylène Jampanoï, Catherine Bég... 2008 99 7.1 63785 NaN 89.0
990 Selma Biography,Drama,History A chronicle of Martin Luther King's campaign t... Ava DuVernay David Oyelowo, Carmen Ejogo, Tim Roth, Lorrain... 2014 128 7.5 67637 52.07 NaN
2.1 Checking whether missing values exist
- pd.notnull()
pd.notnull(movie)
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 True True True True True True True True True True True True
1 True True True True True True True True True True True True
2 True True True True True True True True True True True True
3 True True True True True True True True True True True True
4 True True True True True True True True True True True True
5 True True True True True True True True True True True True
6 True True True True True True True True True True True True
7 True True True True True True True True True True False True
np.all(pd.notnull(movie))
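A single True/False answer does not say where the gaps are. As a rough sketch, the per-column counts can be inspected as follows; `toy` is a hypothetical three-row stand-in for the movie table, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the movie table
toy = pd.DataFrame({
    "Rating": [8.1, 7.0, 7.3],
    "Revenue (Millions)": [333.13, np.nan, 138.12],
    "Metascore": [76.0, 65.0, np.nan],
})

# True only if no value anywhere in the table is missing
no_missing = bool(pd.notnull(toy).all().all())

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Names of the columns that contain at least one missing value
columns_with_missing = toy.columns[toy.isnull().any()].tolist()
```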
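2.2 The missing values are NaN (np.nan)
- 1. Delete
dropna only removes values that pandas recognizes as missing, such as np.nan; placeholder strings like '?' are left in place.
# Does not modify the original data
movie.dropna()
# Assign the result to a new variable, or back to the original name
data = movie.dropna()
- 2. Replace missing values
# Fill the two columns that contain missing values
# with a statistic such as the mean or the median
# movie['Revenue (Millions)'] = movie['Revenue (Millions)'].fillna(movie['Revenue (Millions)'].mean())
Replace all missing values: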
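for i in movie.columns:
    # Only columns that contain at least one missing value need filling
    if not np.all(pd.notnull(movie[i])):
        print(i)
        # Assign the result back; inplace=True on a selected column
        # may not update movie in recent pandas versions
        movie[i] = movie[i].fillna(movie[i].mean())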
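The loop above fills with the mean; the median mentioned in the comment works the same way. The sketch below shows both on a hypothetical table `toy` (not the real movie data), restricting the fill to numeric columns so that text columns are never passed to mean() or median().

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the movie table
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Revenue (Millions)": [10.0, np.nan, 30.0, 80.0],
    "Metascore": [60.0, 80.0, np.nan, 40.0],
})

# Only numeric columns have a meaningful mean or median
numeric_cols = toy.select_dtypes(include="number").columns

# Fill each numeric column with its own mean (toy itself is not modified)
filled_mean = toy.copy()
filled_mean[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# Same idea with the median, which is less sensitive to outliers
filled_median = toy.copy()
filled_median[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())
```

Passing a Series of per-column statistics to fillna fills every column with its own value in one call, so no explicit loop is needed.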
2.3 The missing values are not NaN but are marked with a default placeholder
The data looks like this:
wis = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data")
Reading this data may raise the following error:
URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)>
Workaround:
# Disable certificate verification globally
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
Approach:
- 1. First replace '?' with np.nan
  - df.replace(to_replace=, value=)
    - to_replace: the value to be replaced
    - value: the replacement value
# Replace missing values that are marked with some other placeholder by np.nan
wis = wis.replace(to_replace='?', value=np.nan)
- 2. Then handle the missing values
# Delete
wis = wis.dropna()
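The two steps above can be checked without a network connection. The following is a minimal sketch that assumes a hypothetical three-row CSV string `raw` in which '?' marks a missing value; the column names are invented for illustration. It also shows the `na_values` parameter of read_csv, which converts the marker while the file is being parsed.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row sample in which "?" marks a missing value
raw = "id,size,nuclei\n1,5,1\n2,3,?\n3,8,4\n"

# Option 1: read first, then replace the marker and drop incomplete rows
sample = pd.read_csv(io.StringIO(raw))
cleaned = sample.replace(to_replace="?", value=np.nan).dropna()

# Option 2: let read_csv convert the marker to NaN while parsing
parsed = pd.read_csv(io.StringIO(raw), na_values="?")
cleaned_on_read = parsed.dropna()
```

With option 1 the "nuclei" column was read as text because of the '?', so it would still need a numeric conversion afterwards; with option 2 it is parsed as a float column directly.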
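3 Summary
- isnull, notnull: check whether missing values exist [know]
- dropna: delete missing values marked as np.nan [know]
- fillna: fill missing values [know]
- replace: replace specific values [know]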
II. Worked Example
1 Handling missing values
In [3]:
movie = pd.read_csv("./data/IMDB-Movie-Data.csv")
In [4]:
movie
Out[4]:
| | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced … | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S… | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te… | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa… | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag… | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar… | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea… | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma… | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th… | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D… | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
… | … | … | … | … | … | … | … | … | … | … | … | … |
995 | 996 | Secret in Their Eyes | Crime,Drama,Mystery | A tight-knit team of rising investigators, alo… | Billy Ray | Chiwetel Ejiofor, Nicole Kidman, Julia Roberts… | 2015 | 111 | 6.2 | 27585 | NaN | 45.0 |
996 | 997 | Hostel: Part II | Horror | Three American college students studying abroa… | Eli Roth | Lauren German, Heather Matarazzo, Bijou Philli… | 2007 | 94 | 5.5 | 73152 | 17.54 | 46.0 |
997 | 998 | Step Up 2: The Streets | Drama,Music,Romance | Romantic sparks occur between two dance studen… | Jon M. Chu | Robert Hoffman, Briana Evigan, Cassie Ventura,… | 2008 | 98 | 6.2 | 70699 | 58.01 | 50.0 |
998 | 999 | Search Party | Adventure,Comedy | A pair of friends embark on a mission to reuni… | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh… | 2014 | 93 | 5.6 | 4881 | NaN | 22.0 |
999 | 1000 | Nine Lives | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins… | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell,Ch… | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |
1000 rows × 12 columns
In [5]:
np.any(pd.isnull(movie)) # returns True if there is at least one missing value
Out[5]:
True
In [6]:
np.all(pd.notnull(movie)) # returns False if there is at least one missing value
Out[6]:
False
In [7]:
data = movie.dropna() # dropna(): delete rows with missing values; it only works when the missing values are np.nan
In [8]:
np.any(pd.isnull(data)) # check again
Out[8]:
False
In [9]:
movie["Revenue (Millions)"].head(10)
# The output shows a missing value (row 7)
Out[9]:
0 333.13
1 126.46
2 138.12
3 270.32
4 325.02
5 45.13
6 151.06
7 NaN
8 8.01
9 100.01
Name: Revenue (Millions), dtype: float64
In [10]:
# Replace the missing values with the column mean; mean() computes the average
movie["Revenue (Millions)"] = movie["Revenue (Millions)"].fillna(movie["Revenue (Millions)"].mean())
In [11]:
movie["Revenue (Millions)"].head(10)
Out[11]:
0 333.130000
1 126.460000
2 138.120000
3 270.320000
4 325.020000
5 45.130000
6 151.060000
7 82.956376
8 8.010000
9 100.010000
Name: Revenue (Millions), dtype: float64
In [12]:
# Fill the missing values in every column
for i in movie.columns:
    print(i)
    if np.any(pd.isnull(movie[i])):
        # Printed a second time only for columns that still have missing values
        print(i)
        movie[i] = movie[i].fillna(value=movie[i].mean())
Rank
Title
Genre
Description
Director
Actors
Year
Runtime (Minutes)
Rating
Votes
Revenue (Millions)
Metascore
Metascore
In [13]:
# Check again whether any missing values remain
# Returns True if there is at least one missing value
np.any(pd.isnull(movie))
Out[13]:
False
In [14]:
wis = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data")
wis.head()
In [15]:
wis = wis.replace(to_replace="?", value=np.nan)