Python Basics: Pandas Data Analysis

Latest recommended article published on 2024-09-09 15:48:35

若拂雪色


Tags: python, pandas, data analysis

Article link: https://blog.csdn.net/qq_52670137/article/details/128548832


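This article introduces the three pandas data structures, Series, DataFrame, and Panel, and shows how to create, index, and manipulate them. It also covers basic DataFrame functionality, including arithmetic, function application, mapping, sorting, iteration, and descriptive statistics. Finally, it discusses how to detect and handle missing values, using deletion, fixed-value replacement, filling, and interpolation.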


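Pandas Data Structures

Pandas has historically offered three data structures: Series, DataFrame, and Panel, all of which can be built on top of NumPy arrays. Panel was deprecated in pandas 0.20 and removed in 0.25, so current releases provide only Series and DataFrame.
1. Series: a one-dimensional labeled array of homogeneous data. pandas.Series(data, index, dtype, copy)
2. DataFrame: a two-dimensional labeled table whose columns may hold different types. pandas.DataFrame(data, index, columns, dtype, copy)
3. Panel: a three-dimensional structure of heterogeneous data (older pandas only). pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)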

Series

# Series
import pandas as pd
import numpy as np
data = np.array(['a', 'b', 'c', 'd', 'e'])
s1 = pd.Series(data)
print('Default index:')
print(s1)
print('#####################')

s2 = pd.Series(data, index=[100, 101, 102, 103, 200]) # specify the index explicitly
print('Explicit index:')
print(s2)
print('#####################')

print('Slice s1[0:2]=') # half-open interval: start included, end excluded
print(s1[0:2])
print('#####################')

print('Selection s2[[100, 102]]=')
print(s2[[100, 102]]) # select several labels at once
print('#####################')

print('Lookup s2[100]=', s2[100])
print('#####################')

print("Elements of s1 smaller than 'c': s1[s1<'c']=")
print(s1[s1<'c'])
print('#####################')

print('s2.index=', s2.index)

## Output ##
Default index:
0    a
1    b
2    c
3    d
4    e
dtype: object
#####################
Explicit index:
100    a
101    b
102    c
103    d
200    e
dtype: object
#####################
Slice s1[0:2]=
0    a
1    b
dtype: object
#####################
Selection s2[[100, 102]]=
100    a
102    c
dtype: object
#####################
Lookup s2[100]= a
#####################
Elements of s1 smaller than 'c': s1[s1<'c']=
0    a
1    b
dtype: object
#####################
s2.index= Int64Index([100, 101, 102, 103, 200], dtype='int64')
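
Because s2 uses integer labels, s2[100] looks up the label 100, not the element at position 100. The following is a minimal sketch of making that intent explicit with .loc (label-based) and .iloc (position-based); the variable names are illustrative and not part of the original listing.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[100, 101, 102, 103, 200])

by_label = s.loc[100]         # element whose label is 100
by_position = s.iloc[0]       # element at position 0
label_slice = s.loc[100:102]  # label slices include both endpoints
pos_slice = s.iloc[0:2]       # position slices exclude the end

print(by_label, by_position)
print(label_slice.tolist())
print(pos_slice.tolist())
```

Note that the label slice returns three elements while the position slice returns two, which is an easy source of off-by-one mistakes.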

DataFrame

# DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame()
print('Create an empty DataFrame:')
print(df)
print('#####################')

data = np.arange(11, 15)
df1 = pd.DataFrame(data)
print('df1=')
print(df1)
print('#####################')

# Sample data: 男 = male, 女 = female
data = {'name':['小明','小花','小兰','小胜'], 'gender':['男','女','女','男']}
df2 = pd.DataFrame(data)
print('df2=\n',df2)
print('#####################')

data = [{'name':'小明', 'gender':'男'}, {'name':'小花', 'gender':'女', 'age':32}]
df3 = pd.DataFrame(data, index=['1', '2'], columns=['name', 'gender', 'age']) # set row and column labels
print('df3=')
print(df3)
print('#####################')

d = {'a':pd.Series(np.arange(3), index=['1', '2', '3']),
    'b':pd.Series(np.arange(4), index=['1', '2', '4', '5'])} # build a DataFrame from Series; labels missing from one Series are filled with NaN
df4 = pd.DataFrame(d)
print('df4=')
print(df4)
print('#####################')

print('df2.index=', df2.index)
print('df2.columns=', df2.columns)
print('#####################')

## Output ##
Create an empty DataFrame:
Empty DataFrame
Columns: []
Index: []
#####################
df1=
    0
0  11
1  12
2  13
3  14
#####################
df2=
   name gender
0   小明      男
1   小花      女
2   小兰      女
3   小胜      男
#####################
df3=
  name gender   age
1   小明      男   NaN
2   小花      女  32.0
#####################
df4=
     a    b
1  0.0  0.0
2  1.0  1.0
3  2.0  NaN
4  NaN  2.0
5  NaN  3.0
#####################
df2.index= RangeIndex(start=0, stop=4, step=1)
df2.columns= Index(['name', 'gender'], dtype='object')
#####################

Panel

# Panel (requires pandas < 0.25; pd.Panel was removed in later versions)
import pandas as pd 
import numpy as np
data = np.random.rand(2, 4, 5) # rand samples the uniform distribution on [0, 1); randn samples the standard normal distribution
p = pd.Panel(data)
print('First p=')
print(p)
print('#####################')

data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
       'Item2': pd.DataFrame(np.random.randn(4, 3))}
p = pd.Panel(data)
print('Second p=')
print(p)
print("p['Item1']=")
print(p['Item1'])

## Output ##
First p=
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4
#####################
Second p=
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
p['Item1']=
          0         1         2
0 -1.216973 -0.804201  0.612168
1 -0.607057  0.694872 -0.113100
2 -2.127714 -0.560315  2.004696
3 -0.013660  0.894818 -0.642299
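
pd.Panel is unavailable in pandas 0.25 and later. One commonly used substitute, sketched here under the assumption that every item is a DataFrame of the same shape, is to stack the frames into a single DataFrame with a two-level MultiIndex via pd.concat. The names frames, stacked, and item1 are illustrative.

```python
import numpy as np
import pandas as pd

# Two "items", each a 4 x 3 DataFrame (deterministic values for clarity)
frames = {
    'Item1': pd.DataFrame(np.arange(12).reshape(4, 3)),
    'Item2': pd.DataFrame(np.arange(12, 24).reshape(4, 3)),
}

# concat stacks the frames and uses the dict keys as the outer index level
stacked = pd.concat(frames)
print(stacked.shape)
print(stacked.index.nlevels)

# Selecting on the outer level recovers a single item, like p['Item1']
item1 = stacked.loc['Item1']
print(item1)
```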

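DataFrame Basics

The basic functionality of a DataFrame consists of its key attributes and methods.
To inspect a small sample of a DataFrame, use head() and tail(): head(n) returns the first n rows and tail(n) returns the last n rows; n defaults to 5 for both.
df.T: transpose; df.axes: list of axis labels (row index, then columns); df.dtypes: data type of each column; df.empty: whether the DataFrame is empty; df.ndim: number of dimensions; df.shape: (rows, columns) tuple; df.size: number of elements; df.values: the data as a NumPy array.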

import pandas as pd
import numpy as np
d = {'name': pd.Series(['小明','小花','小兰','小胜']),
    'gender': pd.Series(['男','女','女','男']),
    'age': pd.Series([20, 22, 19, 23]),
    'class': pd.Series(['1班','1班','2班','1班'])}

df = pd.DataFrame(d)
print('Original DataFrame:')
print(df)
print('#####################')

print('Transpose:')
print(df.T)
print('#####################')

print('Axes:', df.axes)
print('Data types:')
print(df.dtypes)
print('Empty:', df.empty)
print('Dimensions:', df.ndim)
print('Shape:', df.shape)
print('Size:', df.size)
print('NumPy representation of the data:')
print(df.values)
print('#####################')

print('First 3 rows:')
print(df.head(3))
print('Last 3 rows:')
print(df.tail(3))

## Output ##
Original DataFrame:
  name gender  age class
0   小明      男   20    1班
1   小花      女   22    1班
2   小兰      女   19    2班
3   小胜      男   23    1班
#####################
Transpose:
         0   1   2   3
name    小明  小花  小兰  小胜
gender   男   女   女   男
age     20  22  19  23
class   1班  1班  2班  1班
#####################
Axes: [RangeIndex(start=0, stop=4, step=1), Index(['name', 'gender', 'age', 'class'], dtype='object')]
Data types:
name      object
gender    object
age        int64
class     object
dtype: object
Empty: False
Dimensions: 2
Shape: (4, 4)
Size: 16
NumPy representation of the data:
[['小明' '男' 20 '1班']
 ['小花' '女' 22 '1班']
 ['小兰' '女' 19 '2班']
 ['小胜' '男' 23 '1班']]
#####################
First 3 rows:
  name gender  age class
0   小明      男   20    1班
1   小花      女   22    1班
2   小兰      女   19    2班
Last 3 rows:
  name gender  age class
1   小花      女   22    1班
2   小兰      女   19    2班
3   小胜      男   23    1班

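Pandas Data Operations

Arithmetic Operations

When Series or DataFrame objects take part in arithmetic, pandas aligns them by label: values that share a label are combined, and a label present in only one operand produces a missing value (NaN).
A Series and a DataFrame can also be combined arithmetically. Because their dimensions differ, the operation follows broadcasting rules: by default the Series index is matched against the DataFrame columns and the operation is repeated for every row.

Series Arithmetic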

import pandas as pd
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([11, 12, 13], index=['a', 'b', 'd'])
print('s1=')
print(s1)

print('s2=')
print(s2)

print('s1+s2=')
print(s1+s2) # matching labels are added; a label found in only one Series yields NaN

s3 = pd.Series([1, 2, 3])
s4 = pd.Series([11, 12, 13])
print('s3+s4=')
print(s3+s4)

## Output ##
s1=
a    1
b    2
c    3
dtype: int64
s2=
a    11
b    12
d    13
dtype: int64
s1+s2=
a    12.0
b    14.0
c     NaN
d     NaN
dtype: float64
s3+s4=
0    12
1    14
2    16
dtype: int64
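
When NaN is not the desired result for labels that appear in only one operand, the method form of the operator accepts a fill_value. A minimal sketch reusing s1 and s2 from the listing above; the name total is illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([11, 12, 13], index=['a', 'b', 'd'])

# A label missing from one operand is treated as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
```

Here 'c' keeps its value from s1 and 'd' keeps its value from s2, while 'a' and 'b' are summed as before.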

DataFrame Arithmetic

import pandas as pd
d1 = {'a':[1, 2],
     'b': [3, 4],
     'c': [5, 6]}
df1 = pd.DataFrame(d1)

d2 = {'a': [11, 12],
     'b': [13, 14],
     'd': [15, 16]}
df2 = pd.DataFrame(d2)

print(df1+df2) # aligned on both row index and column labels; a label found in only one operand yields NaN

## Output ##
    a   b   c   d
0  12  16 NaN NaN
1  14  18 NaN NaN

Series and DataFrame Arithmetic

import pandas as pd
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
d = {'a': [1, 2],
    'b': [3, 4],
    'c': [5, 6]}
df = pd.DataFrame(d)
print(df+s) # the Series index is matched to the columns and added to every row

## Output ##
   a  b  c
0  2  5  8
1  3  6  9
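
The + operator always matches the Series index against the column labels. To broadcast down the rows instead, the method form takes an axis argument. A minimal sketch; row_offsets is a hypothetical Series introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
row_offsets = pd.Series([10, 20], index=[0, 1])  # hypothetical per-row values

# axis=0 matches the Series index against the row index,
# so 10 is added to every cell of row 0 and 20 to every cell of row 1
result = df.add(row_offsets, axis=0)
print(result)
```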

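Function Application and Mapping

Function application and mapping let you apply library functions or your own functions to pandas objects. The main tools are pipe(), apply(), applymap(), and map().
pipe() applies a function to the whole DataFrame;
apply() applies a function to each row or each column of a DataFrame;
applymap() applies a function to every element of a DataFrame (since pandas 2.1 it is deprecated in favor of DataFrame.map());
map() applies a function to every element of a Series; each row or column of a DataFrame is a Series.

Using pipe()

# Using pipe()
import pandas as pd
import numpy as np
def fun(ele):
    # receives the whole DataFrame and doubles every value
    return ele*2
data = np.arange(9).reshape(3, 3)
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
print(df.pipe(fun))

## Output ##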
    a   b   c
0   0   2   4
1   6   8  10
2  12  14  16

Using apply()

# Using apply()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['a', 'b', 'c', 'd'])
print(df)
print(df.apply(np.mean)) # mean of each column (axis=0 is the default)
print(df.apply(np.mean, axis=1)) # mean of each row
print(df.apply(lambda x: np.sum(x))) # sum of each column

## Output ##
   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
a    4.0
b    5.0
c    6.0
d    7.0
dtype: float64
0    1.5
1    5.5
2    9.5
dtype: float64
a    12
b    15
c    18
d    21
dtype: int64

Using applymap()

# Using applymap() (deprecated since pandas 2.1 in favor of DataFrame.map())
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['a','b','c','d'])
print(df)
print(df.applymap(lambda x: np.power(x, 2))) # square every element

## Output ##
   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
    a   b    c    d
0   0   1    4    9
1  16  25   36   49
2  64  81  100  121
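
map() is listed among the mapping functions but has no listing of its own. A minimal sketch of Series.map with both a dictionary and a function; the column names and the labels lookup are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'score': [1, 2, 3]})
labels = {1: 'low', 2: 'mid', 3: 'high'}  # hypothetical lookup table

# A dict maps each value to its replacement
df['level'] = df['score'].map(labels)
# A function is called once per element
df['double'] = df['score'].map(lambda x: x * 2)
print(df)
```

With a dictionary, any value that has no matching key becomes NaN, so it is worth checking the result for missing values afterwards.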

Sorting

Pandas supports two kinds of sorting: by label and by value.
1. sort_index(): sort by label
2. sort_values(): sort by value

import pandas as pd
import numpy as np
data = np.array([[2, 5, 3, 7], [16, 14, 2, 16], [29, 27, 2, 25]])
df = pd.DataFrame(data, index=[0, 4, 2], columns=['a','c','b','d'])
print('Original DataFrame:')
print(df)

print('Sorted by row index, ascending:')
print(df.sort_index())

print('Sorted by row index, descending:')
print(df.sort_index(ascending=False))

print('Sorted by column labels, ascending:')
print(df.sort_index(axis=1))

print('Sorted by column labels, descending:')
print(df.sort_index(axis=1, ascending=False))

print('#############')
print('Sorted by column b:')
print(df.sort_values(by='b'))
print('Sorted by b, then by c to break ties:')
print(df.sort_values(by=['b', 'c'])) # sort by b first; rows with equal b are ordered by c

## Output ##
Original DataFrame:
    a   c  b   d
0   2   5  3   7
4  16  14  2  16
2  29  27  2  25
Sorted by row index, ascending:
    a   c  b   d
0   2   5  3   7
2  29  27  2  25
4  16  14  2  16
Sorted by row index, descending:
    a   c  b   d
4  16  14  2  16
2  29  27  2  25
0   2   5  3   7
Sorted by column labels, ascending:
    a  b   c   d
0   2  3   5   7
4  16  2  14  16
2  29  2  27  25
Sorted by column labels, descending:
    d   c  b   a
0   7   5  3   2
4  16  14  2  16
2  25  27  2  29
#############
Sorted by column b:
    a   c  b   d
4  16  14  2  16
2  29  27  2  25
0   2   5  3   7
Sorted by b, then by c to break ties:
    a   c  b   d
4  16  14  2  16
2  29  27  2  25
0   2   5  3   7
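
sort_values() also accepts one sort direction per key. A minimal sketch using the same values as the listing above (column d omitted for brevity): b ascending, with ties broken by c descending. The name ordered is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 16, 29], 'c': [5, 14, 27], 'b': [3, 2, 2]},
                  index=[0, 4, 2])

# One ascending flag per key: b ascending, c descending within equal b
ordered = df.sort_values(by=['b', 'c'], ascending=[True, False])
print(ordered)
```

Compared with the all-ascending sort above, rows 4 and 2 swap places because their tie on b is now resolved by the larger c first.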

Iteration

Iterating over DataFrame Columns

Iterating over a DataFrame Directly

# Iterating over a DataFrame
import pandas as pd
import numpy as np
data = np.array([[2, 5, 3, 7], [16, 14, 2, 16], [29, 27, 2, 25]])
df = pd.DataFrame(data, index=[0, 4, 2], columns=['a','c','b','d'])
print(df)

for col in df: # iterating over a DataFrame yields its column labels
    print(col)
    print(df[col])

## Output ##
    a   c  b   d
0   2   5  3   7
4  16  14  2  16
2  29  27  2  25
a
0     2
4    16
2    29
Name: a, dtype: int32
c
0     5
4    14
2    27
Name: c, dtype: int32
b
0    3
4    2
2    2
Name: b, dtype: int32
d
0     7
4    16
2    25
Name: d, dtype: int32

Iterating with items()

# Iterating with items(); iteritems() was deprecated in pandas 1.5 and removed in 2.0
import pandas as pd
import numpy as np
data = np.array([[2, 5, 3, 7], [16, 14, 2, 16], [29, 27, 2, 25]])
df = pd.DataFrame(data, index=[0, 4, 2], columns=['a', 'c', 'b', 'd'])
for key, value in df.items(): # yields (column label, column Series) pairs
    print('key=',key)
    print(value)

## Output ##
key= a
0     2
4    16
2    29
Name: a, dtype: int32
key= c
0     5
4    14
2    27
Name: c, dtype: int32
key= b
0    3
4    2
2    2
Name: b, dtype: int32
key= d
0     7
4    16
2    25
Name: d, dtype: int32

Iterating over DataFrame Rows

The following functions iterate over the rows of a DataFrame:
① iterrows(): yields (index, Series) pairs, that is, each row's index label together with a Series holding that row's data;
② itertuples(): yields one namedtuple per row, whose fields are the row's index and values.

Using iterrows()

# Using iterrows()
import pandas as pd
import numpy as np
data = np.array([[2, 5, 3, 7], [16, 14, 2, 16], [29, 27, 2, 25]])
df = pd.DataFrame(data, index=[0, 4, 2], columns=['a', 'c', 'b', 'd'])
for row_index, row in df.iterrows():
    print('row_index=',row_index)
    print(row)

## Output ##
row_index= 0
a    2
c    5
b    3
d    7
Name: 0, dtype: int32
row_index= 4
a    16
c    14
b     2
d    16
Name: 4, dtype: int32
row_index= 2
a    29
c    27
b     2
d    25
Name: 2, dtype: int32

Using itertuples()

# Using itertuples()
import pandas as pd
import numpy as np
data = np.array([[2, 5, 3, 7], [16, 14, 2, 16], [29, 27, 2, 25]])
df = pd.DataFrame(data, index=[0, 4, 2], columns=['a', 'c', 'b', 'd'])
for row in df.itertuples():
    print(row)

## Output ##
Pandas(Index=0, a=2, c=5, b=3, d=7)
Pandas(Index=4, a=16, c=14, b=2, d=16)
Pandas(Index=2, a=29, c=27, b=2, d=25)
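
Each namedtuple produced by itertuples() exposes the row label as Index and every column as an attribute. A minimal sketch on a two-column subset of the data above; the names totals and vectorized are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[2, 5], [16, 14]], index=[0, 4], columns=['a', 'c'])

# Access fields by attribute name while iterating
totals = {}
for row in df.itertuples():
    totals[row.Index] = row.a + row.c

# The same result without an explicit loop
vectorized = (df['a'] + df['c']).to_dict()
print(totals, vectorized)
```

For simple arithmetic like this, the column-wise expression is usually preferable; row iteration is mainly useful when each row needs logic that is hard to express on whole columns.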

Unique Values and Value Counts

unique() removes duplicates and returns only the distinct elements;
value_counts() counts how many times each distinct element occurs.

import pandas as pd
s = pd.Series(list('everyone should learn pandas'))
print('s.unique=', s.unique())
print('Counts:')
print(s.value_counts())
print('--------------')
df = pd.DataFrame({'a': [1,2,3,4,3,2,2], 'b': [3,2,3,3,2,3,4]})
print(df['a'].unique())
print('--------------')
print(df.iloc[1].value_counts()) # value counts within row 1

## Output ##
s.unique= ['e' 'v' 'r' 'y' 'o' 'n' ' ' 's' 'h' 'u' 'l' 'd' 'a' 'p']
Counts:
e    4
n    3
     3
a    3
o    2
l    2
r    2
d    2
s    2
u    1
h    1
p    1
y    1
v    1
dtype: int64
--------------
[1 2 3 4]
--------------
2    2
Name: 1, dtype: int64
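
Two closely related helpers are often useful alongside unique() and value_counts(): nunique() returns only the number of distinct values, and value_counts(normalize=True) returns relative frequencies instead of counts. A minimal sketch on a shorter string; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(list('pandas'))

counts = s.value_counts()                # absolute counts
shares = s.value_counts(normalize=True)  # fractions that sum to 1
distinct = s.nunique()                   # number of distinct values

print(counts['a'], shares['a'], distinct)
```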

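Descriptive Statistics

Notes:
① abs(), prod(), and cumprod() cannot operate on data that contains characters or strings; calling them on such data raises an exception.
② Most of these functions take an axis argument, given either by name or by integer. With axis=0 (the default, 'index') the function works down the rows and returns one result per column; with axis=1 ('columns') it works across the columns and returns one result per row.
③ The include parameter of describe() controls which columns are summarized: 'all' summarizes every column, and a list of dtypes such as [np.number] or [object] restricts the summary to numeric or string columns. With the default (include=None), only the numeric columns are summarized when the DataFrame has any.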

import pandas as pd
import numpy as np
d = {'a': [1, 2, 3, 4],
    'b': [5, 6, 7, 8],
    'c': [9, 10, 11, 12]}
df = pd.DataFrame(d)
print(df)
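print('count=', df.count()) # number of non-null observations
print('mean=', df.mean()) # mean of each column
print('sum=', df.sum()) # sum of each column
print('median=', df.median()) # median of each column
print('mode=', df.mode()) # most frequent value(s) in each column
print('std=', df.std()) # sample standard deviation
print('min=', df.min()) # minimum
print('max=', df.max()) # maximum
print('abs=', df.abs()) # absolute value of every element
print('prod=', df.prod()) # product of the elements in each column
print('cumsum=', df.cumsum()) # cumulative sum
print('cumprod=', df.cumprod()) # cumulative product

## Output ##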
   a  b   c
0  1  5   9
1  2  6  10
2  3  7  11
3  4  8  12
count= a    4
b    4
c    4
dtype: int64
mean= a     2.5
b     6.5
c    10.5
dtype: float64
sum= a    10
b    26
c    42
dtype: int64
median= a     2.5
b     6.5
c    10.5
dtype: float64
mode=    a  b   c
0  1  5   9
1  2  6  10
2  3  7  11
3  4  8  12
std= a    1.290994
b    1.290994
c    1.290994
dtype: float64
min= a    1
b    5
c    9
dtype: int64
max= a     4
b     8
c    12
dtype: int64
abs=    a  b   c
0  1  5   9
1  2  6  10
2  3  7  11
3  4  8  12
prod= a       24
b     1680
c    11880
dtype: int64
cumsum=     a   b   c
0   1   5   9
1   3  11  19
2   6  18  30
3  10  26  42
cumprod=     a     b      c
0   1     5      9
1   2    30     90
2   6   210    990
3  24  1680  11880

import numpy as np
import pandas as pd
d = {'a': [1, 2, 3, 4],
    'b': [5, 6, 7, 8],
    'c': [9, 10, 11, 12]}
df = pd.DataFrame(d)
print(df)

print('mean(axis=1)=', df.mean(axis=1))
print('sum(axis=1)=', df.sum(axis=1))

print('cumsum(axis=1)=', df.cumsum(axis=1))
print('cumprod(axis=1)=', df.cumprod(axis=1))

print('describe=', df.describe())

## Output ##
   a  b   c
0  1  5   9
1  2  6  10
2  3  7  11
3  4  8  12
mean(axis=1)= 0    5.0
1    6.0
2    7.0
3    8.0
dtype: float64
sum(axis=1)= 0    15
1    18
2    21
3    24
dtype: int64
cumsum(axis=1)=    a   b   c
0  1   6  15
1  2   8  18
2  3  10  21
3  4  12  24
cumprod(axis=1)=    a   b    c
0  1   5   45
1  2  12  120
2  3  21  231
3  4  32  384
describe=               a         b          c
count  4.000000  4.000000   4.000000
mean   2.500000  6.500000  10.500000
std    1.290994  1.290994   1.290994
min    1.000000  5.000000   9.000000
25%    1.750000  5.750000   9.750000
50%    2.500000  6.500000  10.500000
75%    3.250000  7.250000  11.250000
max    4.000000  8.000000  12.000000
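
The effect of the axis argument and of describe()'s include parameter is easiest to see on a small frame with mixed column types. A minimal sketch; the column name and all variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'name': ['x', 'y', 'x'], 'a': [1, 2, 3], 'b': [5, 6, 7]})
numeric = df[['a', 'b']]

per_column = numeric.sum(axis=0)  # collapses the rows: one value per column
per_row = numeric.sum(axis=1)     # collapses the columns: one value per row

default_summary = df.describe()            # numeric columns only
full_summary = df.describe(include='all')  # every column, NaN where a statistic does not apply

print(per_column.tolist(), per_row.tolist())
print(default_summary.columns.tolist())
print(full_summary.columns.tolist())
```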

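Missing Values

Detecting Missing Values

isnull() checks the data for missing values and returns a Boolean mask of the same shape, with True marking each missing entry;
notnull() is the inverse of isnull(): True marks each entry that is not missing.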

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                  'b': [5, np.nan, 7, 8],
                  'c': [9, 10, 11, np.nan],
                  'd': [13, 14, 15, 16]})
print('Original data:')
print(df)
print(df.isnull())
print(df.notnull())

## Output ##
Original data:
     a    b     c   d
0  1.0  5.0   9.0  13
1  2.0  NaN  10.0  14
2  NaN  7.0  11.0  15
3  4.0  8.0   NaN  16
       a      b      c      d
0  False  False  False  False
1  False   True  False  False
2   True  False  False  False
3  False  False   True  False
       a      b      c     d
0   True   True   True  True
1   True  False   True  True
2  False   True   True  True
3   True   True  False  True
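
In practice the Boolean mask is usually reduced to a summary. A minimal sketch, using the same data as above, that counts the missing values per column and selects the rows containing at least one; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                   'b': [5, np.nan, 7, 8],
                   'c': [9, 10, 11, np.nan],
                   'd': [13, 14, 15, 16]})

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = df.isnull().sum()

# any(axis=1) is True for rows with at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```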

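Handling Missing Values

There are four main ways to handle missing values:
Deletion: dropna()
Fixed-value replacement: replace()
Filling: fillna()
Interpolation: Lagrange interpolation (lagrange) and Newton interpolation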

Deletion

# Deletion
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                  'b': [5, np.nan, 7, 8],
                  'c': [9, 10, 11, np.nan],
                  'd': [13, 14, 15, 16]})
print('Original data:')
print(df)
print('Drop rows containing NaN:', df.dropna())
print('------------------')

print(df)
print('Drop columns containing NaN:', df.dropna(axis=1))
print('------------------')
print('Drop rows that are entirely NaN:', df.dropna(how='all'))
print('------------------')

df.iloc[0] = np.nan
print('Row 0 is now entirely NaN and is dropped:', df.dropna(how='all'))

## Output ##
Original data:
     a    b     c   d
0  1.0  5.0   9.0  13
1  2.0  NaN  10.0  14
2  NaN  7.0  11.0  15
3  4.0  8.0   NaN  16
Drop rows containing NaN:      a    b    c   d
0  1.0  5.0  9.0  13
------------------
     a    b     c   d
0  1.0  5.0   9.0  13
1  2.0  NaN  10.0  14
2  NaN  7.0  11.0  15
3  4.0  8.0   NaN  16
Drop columns containing NaN:     d
0  13
1  14
2  15
3  16
------------------
Drop rows that are entirely NaN:      a    b     c   d
0  1.0  5.0   9.0  13
1  2.0  NaN  10.0  14
2  NaN  7.0  11.0  15
3  4.0  8.0   NaN  16
------------------
Row 0 is now entirely NaN and is dropped:      a    b     c     d
1  2.0  NaN  10.0  14.0
2  NaN  7.0  11.0  15.0
3  4.0  8.0   NaN  16.0
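
dropna() can also be restricted to certain columns or relaxed to tolerate some missing values. A minimal sketch of the subset and thresh parameters on a small hypothetical frame; all names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, np.nan],
                   'b': [5, np.nan, 7],
                   'c': [9, 10, 11]})

# Drop a row only if column 'a' is missing
by_subset = df.dropna(subset=['a'])

# Keep rows that have at least 2 non-missing values
by_thresh = df.dropna(thresh=2)

print(by_subset)
print(by_thresh)
```

Row 1 has only one non-missing value and is removed by thresh=2, while row 2 has two and is kept even though 'a' is missing.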

Fixed-Value Replacement

# Fixed-value replacement
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                  'b': [5, np.nan, 7, 8],
                  'c': [9, 10, 11, np.nan],
                  'd': [13, 14, 15, 16]})
print('Original data:')
print(df)
print('Data after replacement:')
print(df.replace(np.nan, 0)) # replace every NaN with 0

## Output ##
Original data:
     a    b     c   d
0  1.0  5.0   9.0  13
1  2.0  NaN  10.0  14
2  NaN  7.0  11.0  15
3  4.0  8.0   NaN  16

Data after replacement:
     a    b     c   d
0  1.0  5.0   9.0  13
1  2.0  0.0  10.0  14
2  0.0  7.0  11.0  15
3  4.0  8.0   0.0  16

Filling

# Filling
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                  'b': [5, np.nan, 7, 8],
                  'c': [9, 10, 11, np.nan],
                  'd': [13, 14, 15, 16]})
print('Original data:')
print(df)
print('#########################')
print(df.fillna(0)) # fill with a fixed value
print(df.fillna(df.mean())) # fill each column with its own mean
print(df.bfill()) # backward fill: use the next valid value below; a trailing NaN stays NaN

## Output ##
Original data:
     a    b     c   d
0  1.0  5.0   9.0  13
1  2.0  NaN  10.0  14
2  NaN  7.0  11.0  15
3  4.0  8.0   NaN  16
#########################
     a    b     c   d
0  1.0  5.0   9.0  13
1  2.0  0.0  10.0  14
2  0.0  7.0  11.0  15
3  4.0  8.0   0.0  16
          a         b     c   d
0  1.000000  5.000000   9.0  13
1  2.000000  6.666667  10.0  14
2  2.333333  7.000000  11.0  15
3  4.000000  8.000000  10.0  16
     a    b     c   d
0  1.0  5.0   9.0  13
1  2.0  7.0  10.0  14
2  4.0  7.0  11.0  15
3  4.0  8.0   NaN  16
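
The backward-fill result above leaves the last value of column c missing because there is no later value to copy. A minimal sketch contrasting forward and backward filling on a single Series, including the limit parameter that caps how many consecutive gaps are filled; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()         # copy the previous valid value downward
backward = s.bfill()        # copy the next valid value upward
limited = s.ffill(limit=1)  # fill at most one consecutive NaN

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```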

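Interpolation

# Interpolation with scipy.interpolate: interp1d (linear) and lagrange
import scipy.interpolate as interpolate
import numpy as np
import matplotlib.pyplot as plt
a = [1, 2, 3, 4, 5, 6, 9, 10, 11, 12]
b = [10, 16, 21, 32, 35, 43, 58, 62, 67, 70]
print(a)
print(b)
x = np.arange(1, 13) # x = 7 and x = 8 are the missing points
linear = interpolate.interp1d(a, b, kind='linear') # interp1d ends in the digit 1, not the letter l
plt.plot(x, linear(x), '-.')
print('Linear interpolation at x=7, 8:', linear([7, 8]))
lagrange = interpolate.lagrange(a, b)
plt.plot(x, lagrange(x), '--')
print('Lagrange interpolation at x=7, 8:', lagrange([7, 8]))
plt.show()

## Output ##
[1, 2, 3, 4, 5, 6, 9, 10, 11, 12]
[10, 16, 21, 32, 35, 43, 58, 62, 67, 70]
Linear interpolation at x=7, 8: [48. 53.]

The original listing called interpolate.interpld (the letter l instead of the digit 1) and stopped with AttributeError: module 'scipy.interpolate' has no attribute 'interpld'. The error comes from the misspelled name, not from a package update; with interp1d the code runs. The Lagrange values are not reproduced here because the original run never reached that line.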
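
pandas can also interpolate missing values directly with interpolate(). A minimal sketch, assuming the two missing points at x=7 and x=8 from the data above are stored as NaN in a Series; the Series s is illustrative and not part of the original listing.

```python
import numpy as np
import pandas as pd

# Known values at x=6 and x=9, missing values at x=7 and x=8
s = pd.Series([43.0, np.nan, np.nan, 58.0], index=[6, 7, 8, 9])

# method='linear' treats the entries as equally spaced and ignores the index values
filled = s.interpolate(method='linear')
print(filled)
```

Because the index here is evenly spaced, the result agrees with the linear interp1d values for x=7 and x=8.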