Pandas对缺失数据和文本数据的处理

最新推荐文章于 2024-08-08 14:28:23 发布

Miss小姐姐

最新推荐文章于 2024-08-08 14:28:23 发布

阅读量1.7k

点赞数 1

分类专栏： Python学习笔记文章标签： python

本文链接：https://blog.csdn.net/qq_26982913/article/details/106926147

版权

Python学习笔记专栏收录该内容

5 篇文章 0 订阅

订阅专栏

我们知道Pandas中最常用的两种数据类型就是Series 和DataFrame，因此主要学习者两种数据类型的缺失数据的处理方法

拿到一份数据，首先查看数据完整性，了解缺失信息；
然后针对缺失的数据，是要进行填充？还是删除？还是插值？选择合适的方式对拿到的数据进行数据预处理。

1. 了解缺失信息：主要就是用isna和notna方法，对于isna方法，如果为缺失值，则返回True，如果不为缺失值，则返回False；notna方法则反之

import numpy as np
import pandas as pd
#首选读取一份数据
df = pd.read_csv('E:/data/table_missing.csv')
#这样子读出来的数据结构是DataFrame
df.head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	NaN	M	street_1	173	NaN	34.0	A+
1	S_1	C_1	NaN	F	street_2	192	NaN	32.5	B+
2	S_1	C_1	1103.0	M	street_2	186	NaN	87.2	B+
3	S_1	NaN	NaN	F	street_2	167	81.0	80.4	NaN
4	S_1	C_1	1105.0	NaN	street_4	159	64.0	84.8	A-

#然后通过isna方法或者notna方法来查看缺失值
df['Physics'].isna().head()

0    False
1    False
2    False
3     True
4    False
Name: Physics, dtype: bool

有时候对于一份数据，我们并不关心某一个缺失值，而是关心某个需要分析的属性值的数据完整性。如果该属性数据缺失较少，
可以考虑用合适的插值方法来对数据进行拟合（模拟）；如果数据缺失较多，或许该数据数据就不适合应用到我们的分析当中。
可以利用如下方法来了解拿到手的数据的一个完整性情况：

#计算每一列有多少个缺失值
df.isna().sum()

School      0
Class       4
ID          6
Gender      7
Address     0
Height      0
Weight     13
Math        5
Physics     4
dtype: int64

#也可以通过info函数来查看每一列的数据缺失情况
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   School   35 non-null     object 
 1   Class    31 non-null     object 
 2   ID       29 non-null     float64
 3   Gender   28 non-null     object 
 4   Address  35 non-null     object 
 5   Height   35 non-null     int64  
 6   Weight   22 non-null     float64
 7   Math     30 non-null     float64
 8   Physics  31 non-null     object 
dtypes: float64(3), int64(1), object(5)
memory usage: 2.6+ KB

2.处理缺失值

2.1 填充

主要用到的方法是：DataFrame.fillna(value = None,method = None,axis = None, inplace = False,limit=None, downcast=None, **kwargs)
其中，
value是要用去填充缺失值的值；

# 1.直接用missing去填充所有的缺失值
df.fillna('missing').head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1103.0	M	street_1	173	81.0	34.0	A+
1	S_1	C_1	1103.0	F	street_2	192	81.0	32.5	B+
2	S_1	C_1	1103.0	M	street_2	186	81.0	87.2	B+
3	S_1	C_1	1105.0	F	street_2	167	81.0	80.4	A-
4	S_1	C_1	1105.0	M	street_4	159	64.0	84.8	A-

# 2.用缺失值上面的值替换缺失值
df.fillna(method= 'ffill').head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	NaN	M	street_1	173	NaN	34.0	A+
1	S_1	C_1	NaN	F	street_2	192	NaN	32.5	B+
2	S_1	C_1	1103.0	M	street_2	186	NaN	87.2	B+
3	S_1	C_1	1103.0	F	street_2	167	81.0	80.4	B+
4	S_1	C_1	1105.0	F	street_4	159	64.0	84.8	A-

# 3.用缺失值下面的值替换缺失值
df.fillna(method='backfill').head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1103.0	M	street_1	173	81.0	34.0	A+
1	S_1	C_1	1103.0	F	street_2	192	81.0	32.5	B+
2	S_1	C_1	1103.0	M	street_2	186	81.0	87.2	B+
3	S_1	C_1	1105.0	F	street_2	167	81.0	80.4	A-
4	S_1	C_1	1105.0	M	street_4	159	64.0	84.8	A-

2.2 剔除

主要用的是函数：DataFrame.dropna(axis=0, how=‘any’, thresh=None, subset=None, inplace=False)
删除含有空值的行或者列

# 1.删除行,只要哪行有空值就删掉哪行；
df.dropna().head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
5	S_1	C_2	1201.0	M	street_5	159	68.0	97.0	A-
6	S_1	C_2	1202.0	F	street_4	176	94.0	63.5	B-
12	S_1	C_3	1303.0	M	street_7	188	82.0	49.7	B
17	S_2	C_1	2103.0	M	street_4	157	61.0	52.5	B-
21	S_2	C_2	2202.0	F	street_7	194	77.0	68.5	B+

# 2. 删除列，只要哪一列有空值，就删掉哪一列
df.dropna(axis= 1).head()

	School	Address	Height
0	S_1	street_1	173
1	S_1	street_2	192
2	S_1	street_2	186
3	S_1	street_2	167
4	S_1	street_4	159

# 3.某一行所有值全为缺失值才删除
df.dropna(how= 'all').head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	NaN	M	street_1	173	NaN	34.0	A+
1	S_1	C_1	NaN	F	street_2	192	NaN	32.5	B+
2	S_1	C_1	1103.0	M	street_2	186	NaN	87.2	B+
3	S_1	NaN	NaN	F	street_2	167	81.0	80.4	NaN
4	S_1	C_1	1105.0	NaN	street_4	159	64.0	84.8	A-

# 4.某一行至少出现过2个缺失值才删除
df.dropna(thresh=2).head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	NaN	M	street_1	173	NaN	34.0	A+
1	S_1	C_1	NaN	F	street_2	192	NaN	32.5	B+
2	S_1	C_1	1103.0	M	street_2	186	NaN	87.2	B+
3	S_1	NaN	NaN	F	street_2	167	81.0	80.4	NaN
4	S_1	C_1	1105.0	NaN	street_4	159	64.0	84.8	A-

2.3插值

2.3.1 线性插值

s = pd.Series([1,10,15,-5,-2,np.nan,np.nan,28])
s.plot()

在这里插入图片描述

s.interpolate().plot()

在这里插入图片描述

Miss小姐姐

关注

1
点赞
踩
6

收藏

觉得还不错? 一键收藏
0
评论
Pandas对缺失数据和文本数据的处理

Pandas对缺失数据和文本数据的处理我们知道Pandas中最常用的两种数据类型就是Series 和DataFrame，因此主要学习者两种数据类型的缺失数据的处理方法拿到一份数据，首先查看数据完整性，了解缺失信息；然后针对缺失的数据，是要进行填充？还是删除？还是插值？选择合适的方式对拿到的数据进行数据预处理。1. 了解缺失信息：主要就是用isna和notna方法，对于isna方法，如果为缺失值，则返回True，如果不为缺失值，则返回False；notna方法则反之import numpy as n
复制链接

扫一扫