Pandas Tutorial (Very Detailed) (Part 3)

路由跳变

Last modified: 2024-01-05 22:12:48


First published: 2023-11-08 09:00:00

Article link: https://blog.csdn.net/sinat_41942180/article/details/134149338


This part continues from Pandas Tutorial (Very Detailed) (Part 1).

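13. Setting Display Options in Pandas

When you analyze data with Pandas you constantly print results, and with larger datasets the output is often truncated (rows or columns replaced by an ellipsis) or wrapped awkwardly. Pandas lets you control how data is displayed through its options API. The five functions involved are: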

get_option()
set_option()
reset_option()
describe_option()
option_context()

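They work as follows:

Function	Description
get_option	Returns the current value of an option.
set_option	Sets an option to a new value.
reset_option	Resets an option to its default value.
describe_option	Prints the description of an option.
option_context	Temporarily sets options inside a with block; the previous values are restored when the block exits.

Each function is described below.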

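1. get_option()

This function takes a single option name and returns its current value. The two options below control how many rows and columns are displayed:

(1) display.max_rows

Get the maximum number of rows displayed: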

import pandas as pd
print (pd.get_option("display.max_rows"))

Output:

60

(2) display.max_columns

Get the maximum number of columns displayed:

import pandas as pd
print (pd.get_option("display.max_columns"))

Output:

20

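So the default display limits here are (60, 20). Note that display.max_columns may default to 0 (auto-detect the available width) when Python runs in a terminal.

2. set_option()

This function changes the value of an option, for example the maximum number of rows or columns displayed:

(1) Changing the row limit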

import pandas as pd
pd.set_option("display.max_rows",70)
print (pd.get_option("display.max_rows"))

Output:

70

(2) Changing the column limit

import pandas as pd
pd.set_option("display.max_columns",40)
print (pd.get_option("display.max_columns"))

Output:

40

3. reset_option()

This function takes an option name and resets that option to its default value. For example:

import pandas as pd
pd.reset_option("display.max_rows")
# The option is now back to its default value
print(pd.get_option("display.max_rows"))

Output:

60

4. describe_option()

This function prints the description of an option. For example:

import pandas as pd
pd.describe_option("display.max_rows")

Output:

display.max_rows : int
If max_rows is exceeded, switch to truncate view. Depending on
`large_repr`, objects are either centrally truncated or printed as
a summary view. 'None' value means unlimited.

In case python/IPython is running in a terminal and `large_repr`
equals 'truncate' this can be set to 0 and pandas will auto-detect
the height of the terminal and print a truncated object which fits
the screen height. The IPython notebook, IPython qtconsole, or
IDLE do not run in a terminal and hence it is not possible to do
correct auto-detection.
[default: 60] [currently: 60]

5. option_context()

option_context() is a context manager that temporarily sets options inside a with block. When the block exits, the previous values are restored automatically. For example:

import pandas as pd
with pd.option_context("display.max_rows",10):
    print(pd.get_option("display.max_rows"))
print(pd.get_option("display.max_rows"))

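Output:

10
60

Note: the first print call shows the temporary value set by option_context(). After the with block exits, the second print call shows the value that was in effect before the block, which here is the default.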

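6. Common options

The most commonly used options are summarized below:

Option	Description
display.max_rows	Maximum number of rows displayed; beyond this the output is truncated with an ellipsis. None shows all rows.
display.max_columns	Maximum number of columns displayed; beyond this the output is truncated with an ellipsis. None shows all columns.
display.expand_frame_repr	Whether a DataFrame wider than display.width is wrapped across multiple lines: True wraps, False does not.
display.max_colwidth	Maximum width of a single column, in characters; longer values are truncated with an ellipsis.
display.precision	Number of digits shown after the decimal point.
display.width	Width of the display area, in characters.
display.show_dimensions	Whether the dimensions are printed at the end of the output: True always prints them, False never does, and 'truncate' (the default) prints them only when the frame is shown truncated.

These options cover most day-to-day needs.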
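As a small sketch of how these options combine, the example below sets two of them at once inside option_context() and checks that the earlier value comes back afterwards. The frame and the variable names (df, before, inside, after) are made up for illustration.

```python
import pandas as pd

# Small frame with long floats, used only to demonstrate the display options.
df = pd.DataFrame({"x": [1.23456789, 2.3456789], "y": [3.456789, 4.56789]})

before = pd.get_option("display.precision")

# option_context accepts several option/value pairs in one call.
with pd.option_context("display.precision", 2, "display.max_columns", None):
    inside = pd.get_option("display.precision")
    text = repr(df)  # rendered with 2 digits after the decimal point

after = pd.get_option("display.precision")

print(text)
print(before, inside, after)
```

Because the change is scoped to the with block, this is usually safer than calling set_option() and remembering to call reset_option() later.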

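14. Using loc and iloc in Pandas

Data analysis often requires extracting part of a table, which means selecting ("indexing") that part first. The Python index operator "[]" and the attribute operator "." can access data in a Series or DataFrame, but they only suit simple cases. For more general selection, Pandas provides two dedicated indexers.

This section explains how to use loc and iloc. Both are indexer attributes used with square brackets rather than functions:

Indexer	Description
.loc[]	Selects data by label
.iloc[]	Selects data by integer position

1. .loc[]

df.loc[] selects by label, not by integer position (an integer is accepted only when it is itself a label in the index). When slicing by label, both endpoints are included, so the start label and the stop label are both part of the result.

.loc[] accepts several kinds of input:

A single label
A list of labels
A slice object
A boolean array

.loc[] takes up to two arguments separated by ','. The first selects rows and the second selects columns. For example: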

import numpy as np
import pandas as pd
# Create a sample dataset
data = {'name': ['John', 'Mike', 'Mozla', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jack', 'Alic'],
        'age': [20, 32, 29, np.nan, 15, 28, 21, 30, 37, 25],
        'gender': [0, 0, 1, 1, 0, 1, 0, 0, 1, 1],
        'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
label = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=label)
print(df)
# Select rows 'a' through 'd'
print(df.loc['a':'d', :])  # same as df.loc['a':'d']

Output:

    name   age  gender isMarried
a   John  20.0       0       yes
b   Mike  32.0       0       yes
c  Mozla  29.0       1        no
d   Rose   NaN       1       yes
e  David  15.0       0        no
f  Marry  28.0       1        no
g  Wansi  21.0       0        no
h   Sidy  30.0       0       yes
i   Jack  37.0       1        no
j   Alic  25.0       1        no

Rows a through d (note that d is included):

    name   age  gender isMarried
a   John  20.0       0       yes
b   Mike  32.0       0       yes
c  Mozla  29.0       1        no
d   Rose   NaN       1       yes

Selecting a column works the same way:

import numpy as np
import pandas as pd
# Create a sample dataset
data = {'name': ['John', 'Mike', 'Mozla', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jack', 'Alic'],
        'age': [20, 32, 29, np.nan, 15, 28, 21, 30, 37, 25],
        'gender': [0, 0, 1, 1, 0, 1, 0, 0, 1, 1],
        'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
label = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=label)
# All rows, column 'name'
print(df.loc[:, 'name'])

Output:

a     John
b     Mike
c    Mozla
d     Rose
e    David
f    Marry
g    Wansi
h     Sidy
i     Jack
j     Alic
Name: name, dtype: object

Selecting rows and columns at the same time:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
print(df.loc[['a','b','f','h'],['A','C']])

Output:

          A         C
a  1.168658  0.008070
b -0.076196  0.455495
f  1.224038  1.234725
h  0.050292 -0.031327

Boolean operations. Comparing a selected row with a value returns a boolean Series:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4, 4),index = ['a','b','c','d'], columns = ['A', 'B', 'C', 'D'])
# Returns a Series of boolean values
print(df.loc['b']>0)

Output:

A     True
B     True
C    False
D     True
Name: b, dtype: bool
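The boolean-array input listed earlier is most often used to filter rows. The following is a minimal sketch on a cut-down version of the sample data; the names mask and older are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"name": ["John", "Mike", "Mozla", "Rose"], "age": [20, 32, 29, np.nan]},
    index=["a", "b", "c", "d"],
)

# One boolean per row; a comparison against NaN evaluates to False.
mask = df["age"] > 25

# Rows where the mask is True, restricted to two columns.
older = df.loc[mask, ["name", "age"]]
print(older)
```

Row d is left out because its age is missing, not because it is known to be 25 or under.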

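2. .iloc[]

df.iloc[] selects by integer position, not by label. When slicing by position, the start is included and the stop is excluded. As in Python and NumPy, positions start at 0.

With a single selector, .iloc[] operates on rows.

.iloc[] accepts the following inputs:

1) An integer
2) A list of integers
3) A slice of integers

For example: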

import numpy as np
import pandas as pd
data = {'name': ['John', 'Mike', 'Mozla', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jack', 'Alic'],
        'age': [20, 32, 29, np.nan, 15, 28, 21, 30, 37, 25],
        'gender': [0, 0, 1, 1, 0, 1, 0, 0, 1, 1],
        'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
label = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=label)
print(df)
# All rows from position 2 onward
print(df.iloc[2:, ])

Output:

    name   age  gender isMarried
a   John  20.0       0       yes
b   Mike  32.0       0       yes
c  Mozla  29.0       1        no
d   Rose   NaN       1       yes
e  David  15.0       0        no
f  Marry  28.0       1        no
g  Wansi  21.0       0        no
h   Sidy  30.0       0       yes
i   Jack  37.0       1        no
j   Alic  25.0       1        no
    name   age  gender isMarried
c  Mozla  29.0       1        no
d   Rose   NaN       1       yes
e  David  15.0       0        no
f  Marry  28.0       1        no
g  Wansi  21.0       0        no
h   Sidy  30.0       0       yes
i   Jack  37.0       1        no
j   Alic  25.0       1        no

Another example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

print(df.iloc[[1, 3, 5], [1, 3]])
print(df.iloc[1:3, :])
print(df.iloc[:, 1:3])

Output:

          B         D
1  0.773595 -0.206061
3 -1.740403 -0.464383
5  1.046009  0.606808
          A         B         C         D
1 -0.093711  0.773595  0.966408 -0.206061
2 -1.122587 -0.135011  0.546475 -0.551403
          B         C
0  0.623488  3.328406
1  0.773595  0.966408
2 -0.135011  0.546475
3 -1.740403 -0.869073
4  0.591573 -1.463275
5  1.046009  2.330035
6 -0.266607  0.873971
7 -1.059625 -0.405340
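To contrast the two indexers side by side, here is a minimal sketch of the endpoint rule and of what happens when the index labels are themselves integers. The frames df and df2 and the other names are hypothetical and exist only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"v": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_label = df.loc["a":"c"]    # label slice: the stop label 'c' is included
by_position = df.iloc[0:2]    # position slice: position 2 is excluded

# With an integer index, .loc still looks up labels and .iloc positions.
df2 = pd.DataFrame({"v": [1, 2, 3]}, index=[10, 20, 30])
first_by_label = df2.loc[10, "v"]   # the row labeled 10
first_by_pos = df2.iloc[0, 0]       # the row at position 0

print(by_label)
print(by_position)
print(first_by_label, first_by_pos)
```

Here df2.loc[0] would raise a KeyError, because 0 is a position in df2 but not one of its labels.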

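15. Statistical Functions in Pandas

Pandas applies statistical methods to data programmatically, so that data can be described and analyzed in code. Statistical functions are the tools that compute those descriptions, and using them helps us understand a dataset. This section covers several common ones: percent change, covariance, correlation and ranking.

1. Percent change (pct_change)

Both Series and DataFrame provide pct_change(). It compares each element with the previous one and returns the fractional change between them (multiply by 100 for a percentage). For example: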

import pandas as pd
import numpy as np
# Series
s = pd.Series([1, 2, 3, 4, 5, 4])
print(s.pct_change())
# DataFrame
df = pd.DataFrame(np.random.randn(5, 2))
print(df.pct_change())

Output:

0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
5   -0.200000
dtype: float64
           0          1
0        NaN        NaN
1  74.779242   0.624260
2  -0.353652  -1.104352
3  -2.422813 -13.994103
4  -3.828316  -1.853092

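By default, pct_change() works down each column (axis=0), comparing each row with the row above it. To compare each column with the column to its left within the same row, pass axis=1. For example: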

import pandas as pd
import numpy as np
#DataFrame
df = pd.DataFrame(np.random.randn(3, 2))
print(df.pct_change(axis=1))

Output:

    0         1
0 NaN  3.035670
1 NaN -0.318259
2 NaN  0.227580
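The calculation behind pct_change() can be written out by hand as (current - previous) / previous. The sketch below reproduces the Series example with shift(); the names manual and builtin are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 4])

previous = s.shift(1)              # each element's predecessor; NaN for the first
manual = (s - previous) / previous
builtin = s.pct_change()

print(manual)
# The two agree up to floating-point rounding; NaN positions match as well.
print(np.allclose(manual, builtin, equal_nan=True))
```

The first element has no predecessor, which is why the first result is always NaN.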

2. Covariance (cov)

A Series has a cov() method that computes its covariance with another Series. Missing values (NaN) are excluded automatically.

For example:

import pandas as pd
import numpy as np
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print (s1.cov(s2))

Output:

0.20789380904226645

When called on a DataFrame, cov() computes the pairwise covariance between all columns.

import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
# Covariance between columns a and b
print(frame['a'].cov(frame['b']))
# Covariance matrix of all columns
print(frame.cov())

Output:

-0.37822395480394827
          a         b         c         d         e
a  1.643529 -0.378224  0.181642  0.049969 -0.113700
b -0.378224  1.561760 -0.054868  0.144664 -0.231134
c  0.181642 -0.054868  0.628367 -0.125703  0.324442
d  0.049969  0.144664 -0.125703  0.480301 -0.388879
e -0.113700 -0.231134  0.324442 -0.388879  0.848377
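For reference, cov() returns the sample covariance: the sum of the products of deviations from the mean, divided by n - 1. The following sketch checks that on a small hand-made pair of Series; x, y and manual are names assumed for this illustration.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 6.0, 9.0])

# Sample covariance: sum of products of deviations, divided by (n - 1).
manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

print(manual)
print(x.cov(y))
```

The diagonal of DataFrame.cov() is the covariance of each column with itself, which is its sample variance.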

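3. Correlation (corr)

The correlation coefficient measures the strength of the relationship between two Series. corr() supports three methods, chosen with the method parameter: pearson (the default, which measures linear correlation), spearman and kendall (both rank-based).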

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print(df['b'].corr(df['c']))
print (df.corr())

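Output (the data is random, so the values differ on every run; the scalar shown here came from a different frame and therefore does not match the b/c entry of the matrix):

0.5540831507407936
          a         b         c         d         e
a  1.000000 -0.500903 -0.058497 -0.767226  0.218416
b -0.500903  1.000000 -0.091239  0.805388 -0.020172
c -0.058497 -0.091239  1.000000  0.115905  0.083969
d -0.767226  0.805388  0.115905  1.000000  0.015028
e  0.218416 -0.020172  0.083969  0.015028  1.000000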

Note: missing values (NaN) are excluded automatically, pair by pair, when the correlation is computed.
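To see how the three methods differ, the sketch below uses a relationship that is perfectly monotonic but not linear (y = x cubed). The data and variable names are made up for illustration, and DataFrame.corr() is used for all three methods.

```python
import pandas as pd

# y grows with x, but not along a straight line.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 8, 27, 64, 125]})

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]
kendall = df.corr(method="kendall").loc["x", "y"]

print(pearson, spearman, kendall)
```

The rank-based methods report a perfect relationship because the ordering of y follows the ordering of x exactly, while the Pearson coefficient stays below 1 because the points do not lie on a line.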

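4. Ranking (rank)

rank() ranks the elements of a sequence in ascending or descending order and returns a sequence of the same shape that holds each element's rank. By default, equal values each receive the average of the ranks they span. For example: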

import pandas as pd
import numpy as np
# Generate 5 random values, then rank them
s = pd.Series(np.random.randn(5), index=list('abcde'))
s['d'] = s['b']
print(s)
# In the run shown below, b and d tie for ranks 2 and 3, so each gets the average rank 2.5
print(s.rank())

Output:

a   -0.689585
b   -0.545871
c    0.148264
d   -0.545871
e   -0.205043
dtype: float64

After ranking:

a    1.0
b    2.5
c    5.0
d    2.5
e    4.0
dtype: float64

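(1) The method parameter

rank() has a method parameter that controls how tied values are ranked:

average: the default; tied values each receive the average of their ranks;
min: tied values all receive the lowest rank in the group;
max: tied values all receive the highest rank in the group;
first: tied values are ranked in the order they appear in the data.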
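The four tie-handling rules listed above can be compared on one small Series with a single tie. This is a sketch with made-up data; the names s and ranks are assumptions of the example.

```python
import pandas as pd

# 'b' and 'c' are tied, occupying ranks 2 and 3.
s = pd.Series([10, 20, 20, 30], index=list("abcd"))

# One column per tie-handling method.
ranks = pd.DataFrame(
    {m: s.rank(method=m) for m in ["average", "min", "max", "first"]}
)
print(ranks)
```

Only the rows for b and c differ between the columns; values without ties get the same rank under every method.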

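(2) axis & ascending

rank() has an ascending parameter. The default, True, ranks in ascending order; False ranks in descending order, so that larger values receive smaller rank numbers.

By default rank() ranks down each column (axis=0). Pass axis=1 to rank across each row instead. For example: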

import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4),columns = list("abdc"))
a = a.sort_index(axis=1, ascending=False)
a.iloc[1, [1, 2]] = 6
# Rank across each row; tied values receive the highest rank in their group
print(a.rank(axis=1, method="max"))

Output:

     d    c    b    a
0  3.0  4.0  2.0  1.0
1  4.0  4.0  4.0  1.0
2  3.0  4.0  2.0  1.0

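Step by step:

(1) Sorting the columns by name in descending order gives:

    d   c  b  a
0   2   3  1  0
1   6   7  5  4
2  10  11  9  8

(2) The values in row 1 at column positions 1 and 2 are then set to 6:

    d   c  b  a
0   2   3  1  0
1   6   6  6  4
2  10  11  9  8

(3) rank() then ranks across each row with method="max", so tied values in a row all receive the highest rank in their group. In row 1 the three 6s span ranks 2 to 4, so each gets 4.0:

     d    c    b    a
0  3.0  4.0  2.0  1.0
1  4.0  4.0  4.0  1.0
2  3.0  4.0  2.0  1.0

For comparison, here is the same data ranked with method="min":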

import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4),columns = list("abdc"))
a = a.sort_index(axis=1, ascending=False)
a.iloc[1, [1, 2]] = 6
# Rank across each row; tied values receive the lowest rank in their group
print(a.rank(axis=1, method="min"))

Output:

     d    c    b    a
0  3.0  4.0  2.0  1.0
1  2.0  2.0  2.0  1.0
2  3.0  4.0  2.0  1.0

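16. Window Functions in Pandas

To make numeric data easier to work with, Pandas provides several window functions: rolling windows (rolling), expanding windows (expanding) and exponentially weighted windows (ewm).

Window functions have many uses. A simple example: you have 10 days of sales figures and want the total for every 3 consecutive days, so that the value for day 5 is the sum of the sales on days 3, 4 and 5. This is exactly what a window function does.

"Window" is a figurative name: the function behaves like a window sliding over the data, performing a calculation on whatever falls inside it.

This section explains how to apply window functions to DataFrame and Series objects.

1. rolling()

rolling() is the moving-window function. It is combined with aggregations such as mean, count, sum, median and std. Early versions of Pandas had standalone helpers such as rolling_mean(), rolling_count() and rolling_sum(); these have since been removed, and the aggregation is now called on the object returned by rolling(), for example df.rolling(3).mean(). The basic syntax is:

rolling(window=n, min_periods=None, center=False)

Common parameters:

Parameter	Description
window	Size of the window, that is, the number of observations used for each calculation. Required.
min_periods	Minimum number of observations the window must contain to produce a value; for an integer window it defaults to the window size.
center	Whether the result is labeled at the center of the window instead of at its right edge. Default False.

Here is an example: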

import pandas as pd
import numpy as np
# Build a DataFrame indexed by dates
df = pd.DataFrame(np.random.randn(8, 4), index=pd.date_range('12/1/2020', periods=8), columns=['A', 'B', 'C', 'D'])
print(df)
# Mean over every 3 consecutive values
print(df.rolling(window=3).mean())

Output:

                   A         B         C         D
2020-12-01  0.580058 -0.715246  0.440427 -1.106783
2020-12-02 -1.313982  0.068954 -0.906665  1.382941
2020-12-03  0.349844 -0.549509 -0.806577  0.261794
2020-12-04 -0.497054  0.921995  0.232008 -0.815291
2020-12-05  2.658108  0.447783  0.049340  0.329209
2020-12-06 -0.271670 -0.070299  0.860684 -0.095122
2020-12-07 -0.706780 -0.949392  0.679680  0.230930
2020-12-08  0.027379 -0.056543 -1.067625  1.386399
                   A         B         C         D
2020-12-01       NaN       NaN       NaN       NaN
2020-12-02       NaN       NaN       NaN       NaN
2020-12-03 -0.128027 -0.398600 -0.424272  0.179317
2020-12-04 -0.487064  0.147147 -0.493745  0.276481
2020-12-05  0.836966  0.273423 -0.175076 -0.074763
2020-12-06  0.629794  0.433160  0.380677 -0.193734
2020-12-07  0.559886 -0.190636  0.529901  0.155006
2020-12-08 -0.317024 -0.358745  0.157580  0.507402

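window=3 means the mean is computed over every 3 consecutive values in each column. Where fewer than 3 values are available the result is NaN, which is why the first two rows are NaN; the third row is the first one that satisfies window=3. Each value at position i is computed as:

(x[i-2] + x[i-1] + x[i]) / 3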
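The 10-day sales scenario from the start of this section can be sketched with deterministic numbers, which makes the window arithmetic easy to verify by hand. The sales figures and variable names are invented for this illustration.

```python
import pandas as pd

sales = pd.Series(
    [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    index=pd.date_range("2020-12-01", periods=10),
    name="sales",
)

# Total over each 3-day window; the first two days have no full window yet.
three_day_total = sales.rolling(window=3).sum()

# With min_periods=1 a value is produced as soon as one observation exists.
partial_total = sales.rolling(window=3, min_periods=1).sum()

print(three_day_total)
print(partial_total)
```

Day 5 comes out as 30 + 40 + 50 = 120, matching the description of the scenario.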

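2. expanding()

expanding() is the expanding-window function: starting from the first element of the sequence, the window grows by one element at a time and the aggregate is computed over everything seen so far.

In the example below, min_periods=n means that at least n observations are required before a value is produced: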

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2020', periods=10),
columns = ['A', 'B', 'C', 'D'])
print (df.expanding(min_periods=3).mean())

Output:

                   A         B         C         D
2020-01-01       NaN       NaN       NaN       NaN
2020-01-02       NaN       NaN       NaN       NaN
2020-01-03 -0.567833  0.258723  0.498782  0.403639
2020-01-04 -0.384198 -0.093490  0.456058  0.459122
2020-01-05 -0.193821  0.085318  0.389533  0.552429
2020-01-06 -0.113941  0.252397  0.214789  0.455281
2020-01-07  0.147863  0.400141 -0.062493  0.565990
2020-01-08 -0.036038  0.452132 -0.091939  0.371364
2020-01-09 -0.043203  0.368912 -0.033141  0.328143
2020-01-10 -0.100571  0.349378 -0.078225  0.225649

With min_periods=3, the first value appears once 3 observations are available and is computed as (x[0] + x[1] + x[2]) / 3; the next one is (x[0] + x[1] + x[2] + x[3]) / 4, and so on.
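An expanding mean is the running total divided by the running count, which the following sketch verifies on four made-up values. The names exp_mean and manual are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0])

# Built-in expanding mean; no value until 3 observations are available.
exp_mean = s.expanding(min_periods=3).mean()

# The same quantity by hand: cumulative sum over the number of values so far.
manual = s.cumsum() / np.arange(1, len(s) + 1)

print(exp_mean)
print(manual)
```

From the third element on the two results match; the hand-computed version also has values for the first two positions because it applies no min_periods rule.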

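3. ewm()

ewm stands for exponentially weighted moving. ewm() assigns exponentially decreasing weights to older elements, and mean() then computes the weighted average. The rate of decay is specified with one of the com, span or halflife parameters. For example: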

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('12/1/2020', periods=10),
columns = ['A', 'B', 'C', 'D'])
# With com=0.5, weight the values and then take the mean
print(df.ewm(com=0.5).mean())

Output:

                   A         B         C         D
2020-12-01 -1.511428  1.427826  0.252652  0.093601
2020-12-02 -1.245101 -0.118346  0.170232 -0.207065
2020-12-03  0.131456 -0.271979 -0.679315 -0.589689
2020-12-04 -0.835228  0.094073 -0.973924 -0.081684
2020-12-05  1.279812  1.099368  0.203033  0.019014
2020-12-06  0.132027 -0.625744 -0.145090 -0.318155
2020-12-07  0.820230  0.371620  0.119683 -0.227101
2020-12-08  1.088283 -0.275570  0.358557 -1.050606
2020-12-09  0.538304 -1.288146  0.590358 -0.164057
2020-12-10  0.589177 -1.514472 -0.613158  0.367322
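To make the weighting concrete, the sketch below recomputes ewm(com=0.5).mean() by hand. It assumes the default adjust=True behavior, in which the smoothing factor is alpha = 1 / (1 + com) and each result is a weighted average with weights (1 - alpha)^k for the value k steps in the past. The helper manual_ewm_mean is hypothetical and written only for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
com = 0.5
alpha = 1 / (1 + com)  # smoothing factor implied by com


def manual_ewm_mean(values, alpha):
    """Weighted mean at each step, weights (1 - alpha) ** age (adjust=True form)."""
    out = []
    for t in range(len(values)):
        # Oldest value gets the largest exponent, the newest gets exponent 0.
        weights = (1 - alpha) ** np.arange(t, -1, -1)
        out.append(np.dot(weights, values[: t + 1]) / weights.sum())
    return np.array(out)


manual = manual_ewm_mean(s.to_numpy(), alpha)
builtin = s.ewm(com=com).mean().to_numpy()

print(manual)
print(builtin)
```

A smaller com gives a larger alpha, so recent values dominate and the result follows the data more closely.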

In data analysis, window functions smooth out short-term fluctuations, so the underlying trend of a series becomes easier to see.

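17. Aggregation in Pandas

Section 16 introduced window functions. Window functions are used together with aggregate functions, which reduce a group of values to a sum, maximum, minimum, mean and so on. This section shows how to apply them.

Applying aggregate functions

First create a DataFrame and a rolling window over it; the aggregations below are applied to that window.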

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 4),index = pd.date_range('12/14/2020', periods=5),columns = ['A', 'B', 'C', 'D'])
print(df)
# Window size 3; at least 1 observation is required
r = df.rolling(window=3, min_periods=1)
print(r)

Output:

                   A         B         C         D
2020-12-14  0.941621  1.205489  0.473771 -0.348169
2020-12-15 -0.276954  0.076387  0.104194  1.537357
2020-12-16  0.582515  0.481999 -0.652332 -1.893678
2020-12-17 -0.286432  0.923514  0.285255 -0.739378
2020-12-18  2.063422 -0.465873 -0.946809  1.590234
Rolling [window=3,min_periods=1,center=False,axis=0]

(1) Aggregating the whole DataFrame

An aggregate function can be applied to every column of the rolling window at once. For example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 4),index = pd.date_range('12/14/2020', periods=5),columns = ['A', 'B', 'C', 'D'])
print (df)
# Window size 3; at least 1 observation is required
r = df.rolling(window=3, min_periods=1)
# Aggregate with aggregate()
print(r.aggregate(np.sum))

Output:

                   A         B         C         D
2020-12-14  0.133713  0.746781  0.499385  0.589799
2020-12-15 -0.777572  0.531269  0.600577 -0.393623
2020-12-16  0.408115 -0.874079  0.584320  0.507580
2020-12-17 -1.033055 -1.185399 -0.546567  2.094643
2020-12-18  0.469394 -1.110549 -0.856245  0.260827
                   A         B         C         D
2020-12-14  0.133713  0.746781  0.499385  0.589799
2020-12-15 -0.643859  1.278050  1.099962  0.196176
2020-12-16 -0.235744  0.403971  1.684281  0.703756
2020-12-17 -1.402513 -1.528209  0.638330  2.208601
2020-12-18 -0.155546 -3.170027 -0.818492  2.863051

(2) Aggregating a single column

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 4),index = pd.date_range('12/14/2020', periods=5),columns = ['A', 'B', 'C', 'D'])
# Window size 3; at least 1 observation is required
r = df.rolling(window=3, min_periods=1)
# Aggregate column A
print(r['A'].aggregate(np.sum))

Output:

2020-12-14    1.051501
2020-12-15    1.354574
2020-12-16    0.896335
2020-12-17    0.508470
2020-12-18    2.333732
Freq: D, Name: A, dtype: float64

(3) Aggregating multiple columns

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 4),index = pd.date_range('12/14/2020', periods=5),columns = ['A', 'B', 'C', 'D'])
# Window size 3; at least 1 observation is required
r = df.rolling(window=3, min_periods=1)
# Aggregate columns A and B (the column names are passed as a list)
print(r[['A', 'B']].aggregate(np.sum))

Output:

                   A         B
2020-12-14  0.639867 -0.229990
2020-12-15  0.352028  0.257918
2020-12-16  0.637845  2.643628
2020-12-17  0.432715  2.428604
2020-12-18 -1.575766  0.969600

(4) Applying multiple functions to a single column

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 4),index = pd.date_range('12/14/2020', periods=5),columns = ['A', 'B', 'C', 'D'])
# Window size 3; at least 1 observation is required
r = df.rolling(window=3, min_periods=1)
# Apply both sum and mean to column A
print(r['A'].aggregate([np.sum, np.mean]))

Output:

                 sum      mean
2020-12-14 -0.469643 -0.469643
2020-12-15 -0.626856 -0.313428
2020-12-16 -1.820226 -0.606742
2020-12-17 -2.007323 -0.669108
2020-12-18 -0.595736 -0.198579

(5) Applying multiple functions to multiple columns

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 4),
                  index=pd.date_range('12/14/2020', periods=5),
                  columns=['A', 'B', 'C', 'D'])
r = df.rolling(window=3, min_periods=1)
# Apply both sum and mean to columns A and B
print(r[['A', 'B']].aggregate([np.sum, np.mean]))

Output:

                   A                   B
                 sum      mean       sum      mean
2020-12-14 -1.428882 -1.428882 -0.417241 -0.417241
2020-12-15 -1.315151 -0.657576 -1.580616 -0.790308
2020-12-16 -2.093907 -0.697969 -2.260181 -0.753394
2020-12-17 -1.324490 -0.441497 -1.578467 -0.526156
2020-12-18 -2.400948 -0.800316 -0.452740 -0.150913

(6) Applying different functions to different columns

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 4),
index = pd.date_range('12/14/2020', periods=3),
columns = ['A', 'B', 'C', 'D'])
r = df.rolling(window=3,min_periods=1)
print(r.aggregate({'A': np.sum,'B': np.mean}))

Output:

                   A         B
2020-12-14  0.503535 -1.301423
2020-12-15  0.170056 -0.550289
2020-12-16 -0.086081 -0.140532
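Because the examples in this section use random data, their outputs cannot be checked by hand. The sketch below repeats two of the patterns on fixed numbers, using the string names "sum" and "mean" in place of the NumPy functions; the frame and the variable names are assumptions of this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]},
    index=pd.date_range("2020-12-14", periods=3),
)
r = df.rolling(window=2, min_periods=1)

# A different aggregation per column, selected by string name.
per_column = r.agg({"A": "sum", "B": "mean"})

# Several aggregations applied to one column.
both = r["A"].agg(["sum", "mean"])

print(per_column)
print(both)
```

With window=2 and min_periods=1, the first row aggregates a single value and every later row aggregates the current value together with the one before it.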

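18. Handling Missing Values in Pandas

Missing data is a common problem in data analysis. Missing values lower data quality and, in turn, the accuracy of model predictions, which matters especially in machine learning and data mining. Handling them properly makes predictions more accurate and reliable.

1. Why are there missing values?

Earlier examples contained many NaN values. You may wonder why data goes missing and at what point. The following scenario gives an answer.

People are often reluctant to disclose much about themselves. Suppose you are surveying users about their experience with a product. Some users are happy to describe their experience but will not give their name or contact details; others share everything, including name and contact details. So some data will always be missing for reasons outside your control, and this happens frequently in practice.

2. What is sparse data?

A dataset that contains a large proportion of missing or empty values is called a sparse dataset. Sparse data is not invalid data; it is merely incomplete, and with suitable methods it can still be put to good use.

Sparse data arises for many reasons, which fall roughly into these categories:

Sparse data caused by poor survey design;
Sparse data caused by natural limitations;
Sparse data produced in text mining.

3. Handling missing values

Let's see how Pandas deals with missing values.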

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)

Output:

          0         1         2
a  0.187208 -0.951407  0.316340
b       NaN       NaN       NaN
c -0.365741 -1.983977 -1.052170
d       NaN       NaN       NaN
e -1.024180  1.550515  0.317156
f -0.799921 -0.686590  1.383229
g       NaN       NaN       NaN
h -0.207958  0.426733 -0.325951

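In the example above, reindex() created a DataFrame with missing values: the new labels b, d and g have no data, so their rows are filled with NaN.

4. Checking for missing values

To make missing values easy to detect, Pandas provides isnull() and notnull(), both available on Series and DataFrame objects.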

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# The frame has the default integer column labels 0, 1, 2; check column 1
print(df[1].isnull())

Output:

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: 1, dtype: bool

Using notnull():

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# The frame has the default integer column labels 0, 1, 2; check column 1
print(df[1].notnull())

Output:

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: 1, dtype: bool
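A common follow-up is to count the missing values rather than list them. Since True counts as 1 when summed, isnull() combines naturally with sum() and any(). The frame below is a hand-made sketch and its names are assumptions of this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, np.nan, 6.0]},
    index=["a", "b", "c"],
)

# Missing values per column: True is counted as 1.
missing_per_column = df.isnull().sum()

# Number of rows with at least one missing value.
rows_with_missing = df.isnull().any(axis=1).sum()

print(missing_per_column)
print(rows_with_missing)
```

Calling notnull().sum() in the same way gives the number of values that are present in each column.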

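5. Calculations with missing data

Two points to keep in mind: when summing, NA values are treated as 0; and when every value is NA, sum() returns 0 by default in current Pandas versions (pass min_count=1 to get NaN instead), whereas mean() returns NaN. For example: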

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# NaN entries are skipped when summing
print(df['one'].sum())

Output:

3.4516595395128
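The behavior of sum() and mean() around NaN can be checked on fixed values. This is a sketch assuming a reasonably recent Pandas version, where the sum of an all-NaN Series is 0 unless min_count is given; the variable names are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])
all_na = pd.Series([np.nan, np.nan])

total = s.sum()                     # NaN is skipped
strict_total = s.sum(skipna=False)  # NaN propagates into the result
average = s.mean()                  # mean of the two values that are present

empty_total = all_na.sum()                 # nothing to add
required_total = all_na.sum(min_count=1)   # needs at least one valid value

print(total, strict_total, average, empty_total, required_total)
```

Note that mean() divides by the number of values that are present, not by the length of the Series.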

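6. Cleaning and filling missing values

Pandas provides several ways to clean up missing values. fillna() fills NaN values with non-null data.

(1) Replacing NaN with a scalar

The program below replaces NaN with 0: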

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
# Fill NaN with 0
print(df.fillna(0))

Output:

        one       two     three
a  1.497185 -0.703897 -0.050513
b       NaN       NaN       NaN
c  2.008315  1.342690 -0.255855
        one       two     three
a  1.497185 -0.703897 -0.050513
b  0.000000  0.000000  0.000000
c  2.008315  1.342690 -0.255855

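You can of course fill with any other value that suits your needs.

(2) Forward and backward filling

Section 8, "Pandas reindex", introduced forward filling with ffill() and backward filling with bfill(). Both can also be used to handle NA values. For example: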

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# Forward fill; equivalent to the older df.fillna(method='ffill'), which recent Pandas versions deprecate
print(df.ffill())

Output:

        one       two     three
a  0.871741  0.311057  0.091005
b  0.871741  0.311057  0.091005
c  0.107345 -0.662864  0.826716
d  0.107345 -0.662864  0.826716
e  1.630221  0.482504 -0.728767
f  1.283206 -0.145178  0.109155
g  1.283206 -0.145178  0.109155
h  0.222176  0.886768  0.347820

You can also fill backward with bfill().
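The difference between the two directions is easiest to see on a short Series with a gap in the middle. The following is a minimal sketch with made-up values; the limit argument, which caps how many consecutive NaN values are filled, is included as an extra illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()           # copy the last valid value forward
backward = s.bfill()          # copy the next valid value backward
limited = s.ffill(limit=1)    # fill at most one NaN per gap

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

Forward filling cannot fill a NaN at the very start of a Series, and backward filling cannot fill one at the very end, because there is no valid value to copy from.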

(3) Replacing generic values with replace()

Sometimes you need replace() to substitute specific values in a DataFrame with others, much as fillna() substitutes NaN. For example:

import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,666], 'two':[99,0,30,40,50,60]})
# Replace 99 with 10, 666 with 60 and 0 with 20
print(df.replace({99: 10, 666: 60, 0: 20}))

Output:

   one  two
0   10   10
1   20   20
2   30   30
3   40   40
4   50   50
5   60   60

7. Dropping missing values

To remove missing values, use dropna() together with its axis parameter. The default, axis=0, works on rows: any row that contains a NaN is dropped entirely. For example:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
# Drop every row that contains a missing value
print(df.dropna())

Output:

        one       two     three
a -2.025435  0.617616  0.862096
b       NaN       NaN       NaN
c -1.710705  1.780539 -2.313227
d       NaN       NaN       NaN
e -2.347188 -0.498857 -1.070605
f -0.159588  1.205773 -0.046752
g       NaN       NaN       NaN
h -0.549372 -1.740350  0.444356
        one       two     three
a -2.025435  0.617616  0.862096
c -1.710705  1.780539 -2.313227
e -2.347188 -0.498857 -1.070605
f -0.159588  1.205773 -0.046752
h -0.549372 -1.740350  0.444356
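Dropping every row that has any NaN is often too aggressive. As a sketch of the finer controls, the example below uses the how, thresh, subset and axis parameters of dropna() on a small hand-made frame; the data and variable names are assumptions of this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "one": [1.0, np.nan, np.nan],
        "two": [2.0, 5.0, np.nan],
        "three": [np.nan, np.nan, np.nan],
    }
)

any_na = df.dropna()                       # rows with any NaN are dropped
all_na = df.dropna(how="all")              # only rows that are entirely NaN
enough = df.dropna(thresh=2)               # keep rows with at least 2 valid values
by_column = df.dropna(subset=["two"])      # only look at column 'two'
no_empty_cols = df.dropna(axis=1, how="all")  # drop columns that are entirely NaN

print(len(any_na), len(all_na), len(enough), len(by_column))
print(list(no_empty_cols.columns))
```

Here the default call removes every row, because the column three is empty throughout; dropping the empty column first, or restricting the check with subset, keeps the usable rows.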