Learning pandas: Chapter 6

lx12633036

Tags: python, data analysis, big data

Article link: https://blog.csdn.net/lx12633036/article/details/106911107

Missing Data

import pandas as pd 
import numpy as np
data=pd.read_csv(r'D:\jupyter Notebook\天池比赛\pandas学习\joyful-pandas-master\data\table_missing.csv')
data.head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	NaN	M	street_1	173	NaN	34.0	A+
1	S_1	C_1	NaN	F	street_2	192	NaN	32.5	B+
2	S_1	C_1	1103.0	M	street_2	186	NaN	87.2	B+
3	S_1	NaN	NaN	F	street_2	167	81.0	80.4	NaN
4	S_1	C_1	1105.0	NaN	street_4	159	64.0	84.8	A-

Missing observations and their types

Inspecting missing information

The isna and notna methods

data['Physics'].isna().head()

0    False
1    False
2    False
3     True
4    False
Name: Physics, dtype: bool

data['Physics'].notna().head()

0     True
1     True
2     True
3    False
4     True
Name: Physics, dtype: bool

Applied to a whole DataFrame, they return a boolean table of the same shape

data.isna().head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	False	False	True	False	False	False	True	False	False
1	False	False	True	False	False	False	True	False	False
2	False	False	False	False	False	False	True	False	False
3	False	True	True	False	False	False	False	False	True
4	False	False	False	True	False	False	False	False	False

Count the missing values in each column

data.isna().sum()

School      0
Class       4
ID          6
Gender      7
Address     0
Height      0
Weight     13
Math        5
Physics     4
dtype: int64

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   School   35 non-null     object 
 1   Class    31 non-null     object 
 2   ID       29 non-null     float64
 3   Gender   28 non-null     object 
 4   Address  35 non-null     object 
 5   Height   35 non-null     int64  
 6   Weight   22 non-null     float64
 7   Math     30 non-null     float64
 8   Physics  31 non-null     object 
dtypes: float64(3), int64(1), object(5)
memory usage: 2.6+ KB

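I personally prefer info(), because it shows the dtypes as well as the non-null counts.

The relationship between isna and isnull

pd.isna==pd.isnull

True

As shown above, isnull is simply an alias of isna.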

View the rows where a given column is missing

data[data['Physics'].isna()]

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
3	S_1	NaN	NaN	F	street_2	167	81.0	80.4	NaN
8	S_1	C_2	1204.0	F	street_5	162	63.0	33.8	NaN
13	S_1	C_3	1304.0	NaN	street_2	195	70.0	85.2	NaN
22	S_2	C_2	2203.0	M	street_4	155	91.0	73.8	NaN

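Select the rows that have no missing value in any column

With all, every value in the row must be non-missing; with any, at least one value must be non-missing.

data[data.notna().all(axis=1)]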

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
5	S_1	C_2	1201.0	M	street_5	159	68.0	97.0	A-
6	S_1	C_2	1202.0	F	street_4	176	94.0	63.5	B-
12	S_1	C_3	1303.0	M	street_7	188	82.0	49.7	B
17	S_2	C_1	2103.0	M	street_4	157	61.0	52.5	B-
21	S_2	C_2	2202.0	F	street_7	194	77.0	68.5	B+
25	S_2	C_3	2301.0	F	street_4	157	78.0	72.3	B+
27	S_2	C_3	2303.0	F	street_7	190	99.0	65.9	C
28	S_2	C_3	2304.0	F	street_6	164	81.0	95.5	A-
29	S_2	C_3	2305.0	M	street_4	187	73.0	48.9	B

Three missing-value symbols

np.nan

None

NaT

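NaT is the missing value for time series. It is built into pandas and can be treated as the time-series counterpart of np.nan: it is not equal to itself, and it is likewise skipped when comparing with equals.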
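
The self-inequality of these symbols can be checked directly. The snippet below is a minimal sketch; the variable names are illustrative and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

# np.nan and pd.NaT are not equal to themselves, while None is
nan_equal = np.nan == np.nan
none_equal = None == None
nat_equal = pd.NaT == pd.NaT
print(nan_equal, none_equal, nat_equal)

# A missing entry in a datetime Series is stored as NaT
s_time = pd.Series([pd.Timestamp('2020-01-01'), None])
print(s_time.iloc[1] is pd.NaT)

# equals treats missing values in the same position as equal
s_num = pd.Series([1.0, np.nan])
same = s_num.equals(pd.Series([1.0, np.nan]))
print(same)
```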

Nullable types and the NA symbol

Nullable integer

The name of this dtype differs from the original int dtype only in its capitalized first letter: 'Int'

s_original=pd.Series([1,2],dtype='int64')
s_original

0    1
1    2
dtype: int64

s_new=pd.Series([1,2],dtype='Int64')
s_new

0    1
1    2
dtype: Int64

Its advantage is that all three missing-value symbols mentioned above are replaced by the unified NA symbol, and the dtype does not change

s_original[1]=np.nan
s_original

0    1.0
1    NaN
dtype: float64

s_new[1]=np.nan
s_new

0       1
1    <NA>
dtype: Int64
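
The same contrast can be seen at construction time, without assigning into an existing Series. This is a small sketch with illustrative variable names:

```python
import pandas as pd

# Without a Nullable dtype, a missing value forces the integers to float64
s_plain = pd.Series([1, None])
# With Int64 the integers stay integers and the gap becomes <NA>
s_nullable = pd.Series([1, None], dtype='Int64')

print(s_plain.dtype)
print(s_nullable.dtype)
print(s_nullable.sum())  # missing values are skipped in the sum
```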

Nullable boolean

This type works like the one above; its dtype name is boolean

s_original=pd.Series([1,0],dtype='bool')
s_original

0     True
1    False
dtype: bool

s_new=pd.Series([1,0],dtype='boolean')
s_new

0     True
1    False
dtype: boolean

s_original[0] = np.nan
s_original

0    NaN
1    0.0
dtype: float64

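s_original=pd.Series([1,0],dtype='bool')  # re-create the bool Series, since the previous assignment turned it into float64
s_original[0] = None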
s_original

0    False
1    False
dtype: bool

s_new[0] = np.nan
s_new

0     <NA>
1    False
dtype: boolean

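The string type

This type is a major addition in pandas 1.0. One of its purposes is to disambiguate the vague object dtype. It is only mentioned briefly here because it is the topic of Chapter 7.
It is essentially a Nullable type as well, since its dtype does not change when it contains missing values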

s=pd.Series(['dog','cat'],dtype='string')
s

0    dog
1    cat
dtype: string

s[0]=np.nan
s

0    <NA>
1     cat
dtype: string

s[0]=None
s

0    <NA>
1     cat
dtype: string

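Another important difference from the object dtype is that string methods called on a string Series return Nullable types, whereas for object the result dtype depends on the missing values and the data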

s1=pd.Series(['a',None,'b'],dtype='string')
s1.str.count('a')

0       1
1    <NA>
2       0
dtype: Int64

s2=pd.Series(['a',None,'b'],dtype='object')
s2.str.count('a')

0    1.0
1    NaN
2    0.0
dtype: float64

s1.str.isdigit()

0    False
1     <NA>
2    False
dtype: boolean

s2.str.isdigit()

0    False
1     None
2    False
dtype: object

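Properties of NA

Logical operations

Simply check whether the result of the logical operation depends on the value of pd.NA. If it does, the result is still NA; if it does not, the result is computed directly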

True|pd.NA

True

False|pd.NA

<NA>

1|100

False &pd.NA

False

True & pd.NA

<NA>

Arithmetic operators

Except for the following two cases, every result is NA

pd.NA**0

1**pd.NA
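
The rules above can be collected in one short check. This is a sketch with illustrative variable names; each comment states the result of that expression.

```python
import pandas as pd

# Logical operations: the result is NA only when it depends on the unknown value
or_true = True | pd.NA      # True
or_false = False | pd.NA    # <NA>
and_false = False & pd.NA   # False
and_true = True & pd.NA     # <NA>

# Arithmetic: the two exceptions whose result does not depend on NA
pow_zero = pd.NA ** 0       # 1
one_pow = 1 ** pd.NA        # 1
plus_one = pd.NA + 1        # <NA>
print(or_true, or_false, and_false, and_true, pow_zero, one_pow, plus_one)
```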

The convert_dtypes method

This method is typically used right after reading data, to convert the columns to Nullable types

data.dtypes

School      object
Class       object
ID         float64
Gender      object
Address     object
Height       int64
Weight     float64
Math       float64
Physics     object
dtype: object

data.convert_dtypes().dtypes

School      string
Class       string
ID           Int64
Gender      string
Address     string
Height       Int64
Weight       Int64
Math       float64
Physics     string
dtype: object
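
For readers without the CSV file, the effect on a float column holding integer-valued data can be reproduced on a small frame. This is a sketch: the frame below is hypothetical and only mimics two columns of the table, and the dtype chosen for the text column may vary with the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the ID and Gender columns
df = pd.DataFrame({'ID': [1103.0, np.nan, 1105.0],
                   'Gender': ['M', 'F', None]})
converted = df.convert_dtypes()

print(df.dtypes)
print(converted.dtypes)
print(converted['ID'])
```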

Operations and grouping with missing data

Rules for addition and multiplication

In a sum, a missing value counts as 0

s=pd.Series([2,3,np.nan,4])
s.sum()

9.0

In a product, a missing value counts as 1

s.prod()

24.0
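
Both rules come from the default skipna=True. The sketch below (illustrative variable names) shows how skipna and min_count change the result when skipping is not wanted:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, np.nan, 4])
total = s.sum()                      # NaN is skipped
strict_total = s.sum(skipna=False)   # NaN propagates

# An all-missing Series sums to 0 unless min_count requires valid values
all_missing = pd.Series([np.nan])
empty_total = all_missing.sum()
guarded_total = all_missing.sum(min_count=1)
print(total, strict_total, empty_total, guarded_total)
```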

Cumulative functions skip missing values automatically

s.cumsum()

0    2.0
1    5.0
2    NaN
3    9.0
dtype: float64

s.cummax()

0    2.0
1    3.0
2    NaN
3    4.0
dtype: float64

s.pct_change()  # rate of change: (current value - previous value) / previous value; in the pandas version used here, missing values are forward-filled first

0         NaN
1    0.500000
2    0.000000
3    0.333333
dtype: float64

Missing values in groupby

Groups whose key is missing are ignored automatically

df_g=pd.DataFrame({'one':['A','B','C','D',np.nan],'two':np.random.randn(5)})
df_g

	one	two
0	A	-1.123148
1	B	-0.610931
2	C	1.662203
3	D	-0.790061
4	NaN	-0.616937

df_g.groupby('one').groups

{'A': Int64Index([0], dtype='int64'),
 'B': Int64Index([1], dtype='int64'),
 'C': Int64Index([2], dtype='int64'),
 'D': Int64Index([3], dtype='int64')}
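
If the missing key should form its own group, groupby accepts dropna=False. This parameter was added in pandas 1.1, so it is an assumption that a recent enough version is installed; the frame below is a fixed-value stand-in for df_g.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': ['A', 'B', 'C', 'D', np.nan],
                   'two': [1.0, 2.0, 3.0, 4.0, 5.0]})

default_sizes = df.groupby('one').size()                  # NaN key dropped
with_nan_sizes = df.groupby('one', dropna=False).size()   # NaN key kept as a group
print(default_sizes)
print(with_nan_sizes)
```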

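Filling and dropping

The fillna method

Filling with a value, and forward or backward filling (equivalent to the ffill and bfill methods respectively)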

data['Physics'].fillna('missing').head()

0         A+
1         B+
2         B+
3    missing
4         A-
Name: Physics, dtype: object

data['Physics'].fillna(method='ffill').head()

0    A+
1    B+
2    B+
3    B+
4    A-
Name: Physics, dtype: object

This works the same as before
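
The equivalent ffill and bfill methods can be called directly, and limit caps how many consecutive gaps are filled. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()          # carry the last valid value forward
backward = s.bfill()         # carry the next valid value backward
limited = s.ffill(limit=1)   # fill at most one consecutive gap
print(forward.tolist(), backward.tolist(), limited.tolist())
```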

Alignment when filling

data_f=pd.DataFrame({'A':[1,3,np.nan],'B':[2,4,np.nan],'c':[3,5,np.nan]})
data_f.fillna(data_f.mean())

	A	B	c
0	1.0	2.0	3.0
1	3.0	4.0	5.0
2	2.0	3.0	4.0

When the fill values contain no entry for column c, alignment means that column is not filled

data_f.fillna(data_f.mean()[['A','B']])

	A	B	c
0	1.0	2.0	3.0
1	3.0	4.0	5.0
2	2.0	3.0	NaN

data_f.fillna(data_f[['A','B']].mean())

	A	B	c
0	1.0	2.0	3.0
1	3.0	4.0	5.0
2	2.0	3.0	NaN

The dropna method

The axis parameter

df_d=pd.DataFrame({'A':[np.nan,np.nan,np.nan],'B':[np.nan,3,2],'C':[3,2,1]})
df_d

	A	B	C
0	NaN	NaN	3
1	NaN	3.0	2
2	NaN	2.0	1

df_d.dropna(axis=0)  # drop the rows that contain missing values

	A	B	C

df_d.dropna(axis=1)

	C
0	3
1	2
2	1

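The how parameter (either all or any: drop when all values are missing, or when any value is missing)

The subset parameter (search for missing values only within a given set of columns)

This part is the same as before and needs no further elaboration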
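
For completeness, here is a short sketch of how and subset on the same df_d frame (the result variable names are illustrative):

```python
import numpy as np
import pandas as pd

df_d = pd.DataFrame({'A': [np.nan, np.nan, np.nan],
                     'B': [np.nan, 3, 2],
                     'C': [3, 2, 1]})

all_cols = df_d.dropna(axis=1, how='all')               # drops only the all-missing column A
any_cols = df_d.dropna(axis=1, how='any')               # drops A and B
subset_rows = df_d.dropna(axis=0, subset=['B', 'C'])    # looks only at B and C, drops row 0
print(all_cols.columns.tolist(), any_cols.columns.tolist(), subset_rows.index.tolist())
```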

Interpolation

Linear interpolation

Independent of the index

By default, interpolate fills missing values by linear interpolation

s=pd.Series([1,10,15,-5,-2,np.nan,np.nan,28])
s

0     1.0
1    10.0
2    15.0
3    -5.0
4    -2.0
5     NaN
6     NaN
7    28.0
dtype: float64

s.interpolate()

0     1.0
1    10.0
2    15.0
3    -5.0
4    -2.0
5     8.0
6    18.0
7    28.0
dtype: float64

s.interpolate().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x261a0e64988>

[Figure output_100_1.png: plot of s.interpolate(); image unavailable]

s.index=np.sort(np.random.randint(50,300,8))
s.interpolate()

75      1.0
108    10.0
130    15.0
133    -5.0
145    -2.0
155     8.0
194    18.0
208    28.0
dtype: float64

s.interpolate().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x261a3a46f08>

[Figure output_102_1.png: plot of s.interpolate() after re-indexing; image unavailable]

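As shown, the interpolation here is independent of the index: the interpolated values are unchanged

Index-dependent interpolation

The index and time options of method make the interpolation depend linearly on the index, i.e. the interpolated value is a linear function of the index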

s.interpolate(method='index').plot()

<matplotlib.axes._subplots.AxesSubplot at 0x261a3740688>

[Figure output_105_1.png: plot of s.interpolate(method='index'); image unavailable]
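
Since the plots are not available, the difference between the two modes can be seen numerically. This is a sketch on a made-up three-point Series whose index spacing is deliberately uneven:

```python
import numpy as np
import pandas as pd

# The gap sits at index 1, between index 0 (value 0) and index 10 (value 10)
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

by_position = s.interpolate()                 # treats points as equally spaced
by_index = s.interpolate(method='index')      # linear in the index values
print(by_position.loc[1], by_index.loc[1])
```

The default mode places the gap halfway between its neighbours, while method='index' places it one tenth of the way, matching its position on the index axis.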

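Questions and exercises

Questions

How do you drop the columns in which more than 25% of the values are missing?

# illustrate with the given dataset
data.info()  # data has 35 rows in total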

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   School   35 non-null     object 
 1   Class    31 non-null     object 
 2   ID       29 non-null     float64
 3   Gender   28 non-null     object 
 4   Address  35 non-null     object 
 5   Height   35 non-null     int64  
 6   Weight   22 non-null     float64
 7   Math     30 non-null     float64
 8   Physics  31 non-null     object 
dtypes: float64(3), int64(1), object(5)
memory usage: 2.6+ KB

data.isna().sum()/35  # Weight is the column with more than 25% missing
# data.count() cannot be used here, because it skips missing values!

School     0.000000
Class      0.114286
ID         0.171429
Gender     0.200000
Address    0.000000
Height     0.000000
Weight     0.371429
Math       0.142857
Physics    0.114286
dtype: float64

data.drop(['Gender','Weight'],axis=1)  # only Weight exceeds 25%; Gender (20%) is below the threshold but is dropped here as well

	School	Class	ID	Address	Height	Math	Physics
0	S_1	C_1	NaN	street_1	173	34.0	A+
1	S_1	C_1	NaN	street_2	192	32.5	B+
2	S_1	C_1	1103.0	street_2	186	87.2	B+
3	S_1	NaN	NaN	street_2	167	80.4	NaN
4	S_1	C_1	1105.0	street_4	159	84.8	A-
5	S_1	C_2	1201.0	street_5	159	97.0	A-
6	S_1	C_2	1202.0	street_4	176	63.5	B-
7	S_1	C_2	NaN	street_6	160	58.8	A+
8	S_1	C_2	1204.0	street_5	162	33.8	NaN
9	S_1	C_2	1205.0	street_6	167	68.4	B-
10	S_1	C_3	1301.0	street_4	161	NaN	B+
11	S_1	NaN	1302.0	street_1	175	87.7	A-
12	S_1	C_3	1303.0	street_7	188	49.7	B
13	S_1	C_3	1304.0	street_2	195	85.2	NaN
14	S_1	C_3	NaN	street_5	187	61.7	B-
15	S_2	C_1	2101.0	street_7	159	NaN	C
16	S_2	C_1	2102.0	street_6	161	50.6	B+
17	S_2	C_1	2103.0	street_4	157	52.5	B-
18	S_2	NaN	2104.0	street_5	159	72.2	B+
19	S_2	C_1	2105.0	street_4	170	34.2	A
20	S_2	C_2	2201.0	street_5	193	NaN	B
21	S_2	C_2	2202.0	street_7	194	68.5	B+
22	S_2	C_2	2203.0	street_4	155	73.8	NaN
23	S_2	C_2	NaN	street_1	175	47.2	B-
24	S_2	C_2	2205.0	street_7	159	NaN	B
25	S_2	C_3	2301.0	street_4	157	72.3	B+
26	S_2	NaN	2302.0	street_5	171	NaN	A
27	S_2	C_3	2303.0	street_7	190	65.9	C
28	S_2	C_3	2304.0	street_6	164	95.5	A-
29	S_2	C_3	2305.0	street_4	187	48.9	B
30	S_2	C_4	2401.0	street_2	159	45.3	A
31	S_2	C_4	2402.0	street_7	166	48.7	B
32	S_2	C_4	2403.0	street_6	158	59.7	B+
33	S_2	C_4	2404.0	street_2	160	67.7	B
34	S_2	C_4	2405.0	street_6	193	47.6	B

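# Method 2: compute the count that corresponds to the 25% threshold
35*0.75
# a column must keep at least 26.25 non-missing values, i.e. columns with more than 8 missing values are dropped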

26.25

data.isna().sum()  # so Weight is the column to drop

School      0
Class       4
ID          6
Gender      7
Address     0
Height      0
Weight     13
Math        5
Physics     4
dtype: int64
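
The same selection can be done without hard-coding the row count or the column names. The following is a minimal sketch on a hypothetical frame (column names a, b, c are made up), using the mean of the isna mask as the missing ratio:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [1, np.nan, 3, 4],
                   'c': [np.nan, np.nan, 3, 4]})

missing_ratio = df.isna().mean()           # fraction of missing values per column
kept = df.loc[:, missing_ratio <= 0.25]    # keep columns at or below the 25% threshold
print(missing_ratio)
print(kept.columns.tolist())
```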

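What is a Nullable type? Why was this design introduced?

It is a family of special dtypes (such as Int64, boolean and string) that can hold missing values

It avoids the dtype changes that missing values used to cause
It unifies the symbols used for missing values across different types

For a dataset with missing values, what strategies or methods can deepen your understanding of it?

First assess the number of missing values and the importance of each feature, then choose a strategy
For features that are not very important and have many missing values, consider removing the feature with dropna or drop
For important features, consider interpolation. Besides the interpolation described in this article, machine learning methods such as decision trees can also be used for imputation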

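Exercises

Given a synthetic dataset whose columns are of string, float and integer type, solve the following problems:

Read the data with the given column types, and select the rows where C is missing.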

data_1=pd.read_csv(r'D:\jupyter Notebook\天池比赛\pandas学习\joyful-pandas-master\data\Missing_data_one.csv')

data_1.convert_dtypes()

	A	B	C
0	not_NaN	0.922	4
1	not_NaN	0.700	<NA>
2	not_NaN	0.503	8
3	not_NaN	0.938	4
4	not_NaN	0.952	10
5	not_NaN	0.972	<NA>
6	not_NaN	0.572	2
7	not_NaN	0.523	10
8	not_NaN	0.557	10
9	not_NaN	0.695	4
10	not_NaN	0.782	1
11	not_NaN	0.736	<NA>
12	not_NaN	0.706	0
13	not_NaN	0.682	3
14	not_NaN	0.916	8
15	not_NaN	0.935	5
16	not_NaN	0.823	1
17	not_NaN	0.763	2
18	not_NaN	0.976	5
19	not_NaN	0.684	<NA>
20	not_NaN	0.935	2
21	not_NaN	0.913	<NA>
22	not_NaN	0.538	5
23	not_NaN	0.552	2
24	not_NaN	0.892	5
25	not_NaN	0.891	7
26	not_NaN	0.960	2
27	not_NaN	0.799	6
28	not_NaN	0.577	0
29	not_NaN	0.801	4

data_1.dtypes

A     object
B    float64
C    float64
dtype: object

data_1[data_1['C'].isna()]

	A	B	C
1	not_NaN	0.700	NaN
5	not_NaN	0.972	NaN
11	not_NaN	0.736	NaN
19	not_NaN	0.684	NaN
21	not_NaN	0.913	NaN
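
The exercise asks for the column types to be set when reading, whereas the code above leaves data_1 with object and float64 columns. One possible way is the dtype argument of read_csv. The sketch below reads from an in-memory string instead of the real file, so the CSV text is a made-up stand-in and it is an assumption that C is stored as plain integers in the file.

```python
import io
import pandas as pd

# Made-up stand-in for Missing_data_one.csv
csv_text = "A,B,C\nnot_NaN,0.922,4\nnot_NaN,0.700,\nnot_NaN,0.503,8\n"

df = pd.read_csv(io.StringIO(csv_text),
                 dtype={'A': 'string', 'B': 'float64', 'C': 'Int64'})
missing_c = df[df['C'].isna()]   # rows where C is missing
print(df.dtypes)
print(missing_c)
```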

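Now convert some cells of A to missing values. The minimum conversion probability of a cell is 25%, and the probability is proportional to the value of column B in the same row.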

total_b=data_1['B'].sum()
total_b

23.194999999999997

min_b=data_1['B'].min()

data_1['A']=pd.Series(list(zip(data_1['A'].values,
                              data_1['B'].values))).apply(lambda x:x[0] 
                                                         if np.random.rand()>0.25*x[1]/min_b
                                                         else np.nan)
data_1.head()

	A	B	C
0	NaN	0.922	4.0
1	NaN	0.700	NaN
2	not_NaN	0.503	8.0
3	NaN	0.938	4.0
4	NaN	0.952	10.0
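
The same idea can be written without zip and apply. This is a sketch under the same reading of the exercise (probability 0.25 * B / min(B)); the frame, the seed and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed only to make the run repeatable
df = pd.DataFrame({'A': ['not_NaN'] * 5,
                   'B': [0.5, 0.6, 0.7, 0.8, 0.9]})

# Probability proportional to B, scaled so the smallest B maps to 25%
prob = 0.25 * df['B'] / df['B'].min()
to_blank = prob > rng.random(len(df))   # True where the cell is converted
df['A'] = df['A'].mask(to_blank)        # masked cells become missing
print(df)
```

The scaling only yields valid probabilities while max(B) is at most four times min(B), which holds for the B values shown in the table.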

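Given a dataset with missing values that records the region (地区), height (身高), weight (体重), age (年龄) and salary (工资) of 36 people, solve the following problems

Compute the missing ratio of each column, and select the rows that have at least two non-missing values in the last three columns.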

data_2=pd.read_csv(r'D:\jupyter Notebook\天池比赛\pandas学习\joyful-pandas-master\data\Missing_data_two.csv')

data_2.head()

	编号	地区	身高	体重	年龄	工资
0	1	A	157.50	NaN	47.0	15905.0
1	2	B	202.00	91.80	25.0	NaN
2	3	C	169.09	62.18	NaN	NaN
3	4	A	166.61	59.95	77.0	5434.0
4	5	B	185.19	NaN	62.0	4242.0

data_2.info()  # 36 rows in total

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   编号      36 non-null     int64  
 1   地区      36 non-null     object 
 2   身高      36 non-null     float64
 3   体重      28 non-null     float64
 4   年龄      27 non-null     float64
 5   工资      28 non-null     float64
dtypes: float64(4), int64(1), object(1)
memory usage: 1.8+ KB

data_2.isna().sum()/36
# The reference answer uses shape[0]; as learned earlier, the index length or len() also works, and both are better than hard-coding 36

编号    0.000000
地区    0.000000
身高    0.000000
体重    0.222222
年龄    0.250000
工资    0.222222
dtype: float64

data_2[data_2.iloc[:,3:].isna().sum(1)<=1]

	编号	地区	身高	体重	年龄	工资
0	1	A	157.50	NaN	47.0	15905.0
1	2	B	202.00	91.80	25.0	NaN
3	4	A	166.61	59.95	77.0	5434.0
4	5	B	185.19	NaN	62.0	4242.0
5	6	A	187.13	78.42	55.0	13959.0
6	7	C	163.81	57.43	43.0	6533.0
7	8	A	183.80	75.42	48.0	19779.0
8	9	B	179.67	71.70	65.0	8608.0
9	10	C	186.08	77.47	65.0	12433.0
10	11	B	163.41	57.07	NaN	6495.0
13	14	B	175.99	68.39	NaN	13130.0
15	16	A	165.68	NaN	46.0	13683.0
16	17	B	166.48	59.83	31.0	17673.0
17	18	C	191.62	82.46	NaN	12447.0
18	19	A	172.83	65.55	23.0	13768.0
19	20	B	156.99	51.29	62.0	3054.0
20	21	C	200.22	90.20	41.0	NaN
21	22	A	154.63	49.17	35.0	14559.0
22	23	B	157.87	52.08	67.0	7398.0
23	24	A	165.55	NaN	66.0	19890.0
24	25	C	181.78	73.60	63.0	11383.0
25	26	A	164.43	57.99	34.0	19899.0
27	28	C	172.39	65.15	43.0	10362.0
28	29	B	162.12	55.91	NaN	13362.0
29	30	A	183.73	75.36	58.0	8270.0
30	31	C	181.19	NaN	41.0	12616.0
31	32	B	167.28	60.55	64.0	18317.0
34	35	B	170.12	63.11	77.0	7398.0
35	36	C	180.47	72.42	78.0	9554.0

Using the height and region columns, interpolate the weight column in a reasonable way.

data_21=data_2.copy()

data_missing=data_21[data_21['体重'].isna()]

data_missing.groupby('地区').head()

	编号	地区	身高	体重	年龄	工资
0	1	A	157.50	NaN	47.0	15905.0
4	5	B	185.19	NaN	62.0	4242.0
12	13	C	177.37	NaN	79.0	NaN
15	16	A	165.68	NaN	46.0	13683.0
23	24	A	165.55	NaN	66.0	19890.0
26	27	B	158.28	NaN	51.0	NaN
30	31	C	181.19	NaN	41.0	12616.0
32	33	C	181.01	NaN	NaN	13021.0
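
The attempt above stops after listing the rows with missing weight. One possible way to finish it, offered only as a sketch and not as the reference answer, is to interpolate weight linearly in height within each region. The frame and the English column names region, height and weight below are hypothetical stand-ins for 地区, 身高 and 体重.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'region': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'height': [160, 180, 170, 150, 190, 160],
                   'weight': [50.0, 70.0, np.nan, 45.0, 85.0, np.nan]})

parts = []
for _, group in df.groupby('region'):
    group = group.sort_values('height')
    # With height as the index, method='index' makes weight linear in height
    filled = group.set_index('height')['weight'].interpolate(method='index')
    parts.append(group.assign(weight=filled.to_numpy()))

result = pd.concat(parts).sort_index()   # restore the original row order
print(result)
```

A missing weight at the smallest height of a region has no lower neighbour and stays missing with these defaults, so real data may need an extra fill step.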

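Summary

This is Chapter 6 of my pandas study notes. To summarize:

Many of the topics here were already covered in my previous round of study, so this chapter was not hard to work through.
The main difficulty is still the comprehensive exercises: I could only finish part of them, which comes down to programming mindset. The same issue runs through my own data processing work.
Don't be afraid of tedious work, and don't try to solve many problems with one line of code. Iterate in small steps, and take as many steps as needed to reach the desired result.
This round of study ran in parallel with a review of the previous round.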
