10 Minutes to pandas

This post walks through the basics of pandas: creating DataFrames, viewing and selecting data, handling missing values, operations, merging, grouping, time series, and reading/writing data. Worked examples show how to use pandas for data processing and analysis; it is aimed at beginners.
This is a short introduction to pandas, geared mainly toward new users. For more detailed material, see the [Cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook). First, import the required Python packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Creating objects

pandas provides several data structures, including Series, DataFrame, Panel and Panel4D; see the [intro to data structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro) for details. The two used most often are Series and DataFrame. Create a Series by passing a Python list:
s = pd.Series([1,3,4,np.nan,6,8])
s
0    1.0
1    3.0
2    4.0
3    NaN
4    6.0
5    8.0
dtype: float64

Create a DataFrame by passing a numpy array, using a datetime index for the rows and labeled columns.
dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
# create a DataFrame, specifying the index and column names
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
                   A         B         C         D
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197

Create a DataFrame by passing a Python dict.

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20160102'),
                    'C': pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D': np.array([3]*4, dtype='int32'),
                    'E': pd.Categorical(['test','train','test','train']),
                    'F': 'foo'})
df2
     A          B    C  D      E    F
0  1.0 2016-01-02  1.0  3   test  foo
1  1.0 2016-01-02  1.0  3  train  foo
2  1.0 2016-01-02  1.0  3   test  foo
3  1.0 2016-01-02  1.0  3  train  foo
# the columns of a DataFrame can have different dtypes
df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In IPython, tab completion for a DataFrame's column names (as well as its public attributes) is automatically enabled.

Viewing data

View the top and bottom rows of a DataFrame.

df.head()
                   A         B         C         D
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
df.tail()
                   A         B         C         D
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197
df.head(3)
                   A         B         C         D
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457

Display the index, columns, and the underlying numpy data (values).

df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')
df.values
array([[-0.28589413,  0.49001051,  0.17112101, -1.54980655],
       [-0.06837701, -0.45280422, -0.39189213, -0.85252018],
       [ 1.30438846, -1.80848416, -0.28648908, -0.43745725],
       [ 1.44781215, -1.86212061,  0.11594994, -0.66413402],
       [ 0.5204089 , -1.4027399 , -0.35604882,  0.4609499 ],
       [-0.40489995,  0.58541997, -0.07392295, -0.5011969 ]])

describe() shows a quick statistic summary of the data.

df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.418906 -0.741786 -0.136880 -0.590694
std    0.808192  1.112849  0.244213  0.652884
min   -0.404900 -1.862121 -0.391892 -1.549807
25%   -0.231515 -1.707048 -0.338659 -0.805424
50%    0.226016 -0.927772 -0.180206 -0.582665
75%    1.108394  0.254307  0.068482 -0.453392
max    1.447812  0.585420  0.171121  0.460950

Transpose the DataFrame.

df.T
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -0.285894   -0.068377    1.304388    1.447812    0.520409   -0.404900
B    0.490011   -0.452804   -1.808484   -1.862121   -1.402740    0.585420
C    0.171121   -0.391892   -0.286489    0.115950   -0.356049   -0.073923
D   -1.549807   -0.852520   -0.437457   -0.664134    0.460950   -0.501197
df
                   A         B         C         D
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197

Sort by an axis.

df.sort_index(axis=1, ascending=False)
                   D         C         B         A
2013-01-01 -1.549807  0.171121  0.490011 -0.285894
2013-01-02 -0.852520 -0.391892 -0.452804 -0.068377
2013-01-03 -0.437457 -0.286489 -1.808484  1.304388
2013-01-04 -0.664134  0.115950 -1.862121  1.447812
2013-01-05  0.460950 -0.356049 -1.402740  0.520409
2013-01-06 -0.501197 -0.073923  0.585420 -0.404900
df.sort_index(axis=0, ascending=False)
                   A         B         C         D
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-01 -0.285894  0.490011  0.171121 -1.549807

Sort by values.

df.sort_values(by='B')
                   A         B         C         D
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197
df.sort_values(by='B',ascending=False)
                   A         B         C         D
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134

Selection

Data can be selected with standard Python/numpy expressions.

df['A']
2013-01-01   -0.285894
2013-01-02   -0.068377
2013-01-03    1.304388
2013-01-04    1.447812
2013-01-05    0.520409
2013-01-06   -0.404900
Freq: D, Name: A, dtype: float64
# selecting a single column yields a Series; df['A'] is equivalent to df.A
df.A
2013-01-01   -0.285894
2013-01-02   -0.068377
2013-01-03    1.304388
2013-01-04    1.447812
2013-01-05    0.520409
2013-01-06   -0.404900
Freq: D, Name: A, dtype: float64

Slice the rows.

df[0:3]
                   A         B         C         D
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
df['20130103':'20130105']
                   A         B         C         D
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950

Getting data with standard Python/numpy expressions is intuitive, but for production code the optimized pandas access methods are recommended: .at, .iat, .loc and .iloc. (The .ix accessor mentioned in older docs has since been deprecated.)

df
                   A         B         C         D
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df.loc[dates[0]]
A   -0.285894
B    0.490011
C    0.171121
D   -1.549807
Name: 2013-01-01 00:00:00, dtype: float64

Select on multiple axes by label.

df.loc[:,['A','B']]
                   A         B
2013-01-01 -0.285894  0.490011
2013-01-02 -0.068377 -0.452804
2013-01-03  1.304388 -1.808484
2013-01-04  1.447812 -1.862121
2013-01-05  0.520409 -1.402740
2013-01-06 -0.404900  0.585420
# when slicing by label, both endpoints are included, unlike Python's half-open slices
df.loc['20130102':'20130104',['A','B']]
                   A         B
2013-01-02 -0.068377 -0.452804
2013-01-03  1.304388 -1.808484
2013-01-04  1.447812 -1.862121
# when the selection collapses to one dimension, the returned object's dimensionality is reduced
df.loc['20130105',['A','B']]
A    0.520409
B   -1.402740
Name: 2013-01-05 00:00:00, dtype: float64
df.loc['20130105','A']
0.52040890430486719
# .at is a faster way than .loc to access a single scalar
df.at[dates[0],'A']
-0.28589413005579967

Select by position: pass integers, get the data back.

df
                   A         B         C         D
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197
df.iloc[3]
A    1.447812
B   -1.862121
C    0.115950
D   -0.664134
Name: 2013-01-04 00:00:00, dtype: float64
df.iloc[3:5,0:2]
                   A         B
2013-01-04  1.447812 -1.862121
2013-01-05  0.520409 -1.402740

Selecting or slicing by integer position works as in Python/numpy: indices start at 0, the start is included, and the end is excluded.

df.iloc[[1,2,4],[0,2]]
                   A         C
2013-01-02 -0.068377 -0.391892
2013-01-03  1.304388 -0.286489
2013-01-05  0.520409 -0.356049
# slicing rows
df.iloc[1:3,:]
                   A         B         C         D
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
# slicing columns
df.iloc[:,1:3]
                   B         C
2013-01-01  0.490011  0.171121
2013-01-02 -0.452804 -0.391892
2013-01-03 -1.808484 -0.286489
2013-01-04 -1.862121  0.115950
2013-01-05 -1.402740 -0.356049
2013-01-06  0.585420 -0.073923
df.iloc[1,1]
-0.45280421688689004
# .iat is faster than .iloc for scalar access
df.iat[1,1]
-0.45280421688689004

Indexing with boolean values.

df[df.A > 0]
                   A         B         C         D
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
df[df > 0]
                   A         B         C        D
2013-01-01       NaN  0.490011  0.171121      NaN
2013-01-02       NaN       NaN       NaN      NaN
2013-01-03  1.304388       NaN       NaN      NaN
2013-01-04  1.447812       NaN  0.115950      NaN
2013-01-05  0.520409       NaN       NaN  0.46095
2013-01-06       NaN  0.585420       NaN      NaN

Filtering with the isin() method.

df2 = df.copy()
df2['E'] = ['one','one','two','three','four','three']
df2
                   A         B         C         D      E
2013-01-01 -0.285894  0.490011  0.171121 -1.549807    one
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520    one
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457    two
2013-01-04  1.447812 -1.862121  0.115950 -0.664134  three
2013-01-05  0.520409 -1.402740 -0.356049  0.460950   four
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197  three
df2[df2['E'].isin(['one','four'])]
                   A         B         C         D     E
2013-01-01 -0.285894  0.490011  0.171121 -1.549807   one
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520   one
2013-01-05  0.520409 -1.402740 -0.356049  0.460950  four

Setting

Setting a new column automatically aligns the data by index.

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102',periods=6))
s1
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
df['F'] = s1
df
                   A         B         C         D    F
2013-01-01 -0.285894  0.490011  0.171121 -1.549807  NaN
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520  1.0
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457  2.0
2013-01-04  1.447812 -1.862121  0.115950 -0.664134  3.0
2013-01-05  0.520409 -1.402740 -0.356049  0.460950  4.0
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197  5.0

Because s1 starts at '20130102', the F value for '20130101' is NaN.

df.at[dates[0],'A'] = 0
df
                   A         B         C         D    F
2013-01-01  0.000000  0.490011  0.171121 -1.549807  NaN
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520  1.0
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457  2.0
2013-01-04  1.447812 -1.862121  0.115950 -0.664134  3.0
2013-01-05  0.520409 -1.402740 -0.356049  0.460950  4.0
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197  5.0
df.iat[0,1] = 0
df.loc[:,'D'] = np.array([5] * len(df))
df
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  0.171121  5  NaN
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0
2013-01-03  1.304388 -1.808484 -0.286489  5  2.0
2013-01-04  1.447812 -1.862121  0.115950  5  3.0
2013-01-05  0.520409 -1.402740 -0.356049  5  4.0
2013-01-06 -0.404900  0.585420 -0.073923  5  5.0
df2 = df.copy()
df2[df2 > 0] = -df2
df2
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -0.171121 -5  NaN
2013-01-02 -0.068377 -0.452804 -0.391892 -5 -1.0
2013-01-03 -1.304388 -1.808484 -0.286489 -5 -2.0
2013-01-04 -1.447812 -1.862121 -0.115950 -5 -3.0
2013-01-05 -0.520409 -1.402740 -0.356049 -5 -4.0
2013-01-06 -0.404900 -0.585420 -0.073923 -5 -5.0

Missing data

pandas primarily uses np.nan to represent missing data; it is excluded from computations by default.
reindex lets you add, delete, or change the index on an axis; it returns a copy of the data.
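Since reindex returns a copy, writing into the result leaves the original untouched; a minimal sketch on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
# reindex returns a new object; the added label 'd' is filled with NaN
df_copy = df.reindex(index=['a', 'b', 'c', 'd'])
df_copy.loc['a', 'x'] = 99.0
print(df.loc['a', 'x'])       # original value is unchanged: 1.0
print(df_copy.loc['d', 'x'])  # NaN
```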

df1 = df.reindex(index=dates[0:4], columns=list(df.columns)+['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  0.171121  5  NaN  1.0
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0  1.0
2013-01-03  1.304388 -1.808484 -0.286489  5  2.0  NaN
2013-01-04  1.447812 -1.862121  0.115950  5  3.0  NaN
# drop any rows that contain missing data
df1.dropna(how='any')
                   A         B         C  D    F    E
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0  1.0
# fill in missing data
df1.fillna(value=5)
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  0.171121  5  5.0  1.0
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0  1.0
2013-01-03  1.304388 -1.808484 -0.286489  5  2.0  5.0
2013-01-04  1.447812 -1.862121  0.115950  5  3.0  5.0
# get the boolean mask where values are NaN
pd.isnull(df1)
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True
df1
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  0.171121  5  NaN  1.0
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0  1.0
2013-01-03  1.304388 -1.808484 -0.286489  5  2.0  NaN
2013-01-04  1.447812 -1.862121  0.115950  5  3.0  NaN

Operations

Operations in general exclude missing data.
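For instance, Series.mean() skips NaN by default, while the plain numpy mean propagates it; a small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.mean())           # NaN is skipped: (1 + 3) / 2 = 2.0
print(np.mean(s.values))  # plain numpy propagates NaN -> nan
```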

# descriptive statistics
df
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  0.171121  5  NaN
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0
2013-01-03  1.304388 -1.808484 -0.286489  5  2.0
2013-01-04  1.447812 -1.862121  0.115950  5  3.0
2013-01-05  0.520409 -1.402740 -0.356049  5  4.0
2013-01-06 -0.404900  0.585420 -0.073923  5  5.0
df.mean()
A    0.466555
B   -0.823455
C   -0.136880
D    5.000000
F    3.000000
dtype: float64
df.mean(1)
2013-01-01    1.292780
2013-01-02    1.017385
2013-01-03    1.241883
2013-01-04    1.540328
2013-01-05    1.552324
2013-01-06    2.021319
Freq: D, dtype: float64

Operating between objects of different dimensionality requires alignment; pandas automatically broadcasts along the specified dimension.

s = pd.Series([1,3,5,np.nan,6,8], index=dates)
s
2013-01-01    1.0
2013-01-02    3.0
2013-01-03    5.0
2013-01-04    NaN
2013-01-05    6.0
2013-01-06    8.0
Freq: D, dtype: float64
s = s.shift(2)
s
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
df
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  0.171121  5  NaN
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0
2013-01-03  1.304388 -1.808484 -0.286489  5  2.0
2013-01-04  1.447812 -1.862121  0.115950  5  3.0
2013-01-05  0.520409 -1.402740 -0.356049  5  4.0
2013-01-06 -0.404900  0.585420 -0.073923  5  5.0
df.sub(s,axis='index')
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03  0.304388 -2.808484 -1.286489  4.0  1.0
2013-01-04 -1.552188 -4.862121 -2.884050  2.0  0.0
2013-01-05 -4.479591 -6.402740 -5.356049  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN

sub() is subtraction; when df subtracts s, s is automatically broadcast across df's columns after aligning on the index.

apply
apply applies a function to the data.

df.apply(np.cumsum)
                   A         B         C   D     F
2013-01-01  0.000000  0.000000  0.171121   5   NaN
2013-01-02 -0.068377 -0.452804 -0.220771  10   1.0
2013-01-03  1.236011 -2.261288 -0.507260  15   3.0
2013-01-04  2.683824 -4.123409 -0.391310  20   6.0
2013-01-05  3.204233 -5.526149 -0.747359  25  10.0
2013-01-06  2.799333 -4.940729 -0.821282  30  15.0
df
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  0.171121  5  NaN
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0
2013-01-03  1.304388 -1.808484 -0.286489  5  2.0
2013-01-04  1.447812 -1.862121  0.115950  5  3.0
2013-01-05  0.520409 -1.402740 -0.356049  5  4.0
2013-01-06 -0.404900  0.585420 -0.073923  5  5.0

np.cumsum computes the cumulative sum; the operation above accumulates each row into the next, column by column.

df.apply(lambda x: x.max()-x.min())
A    1.852712
B    2.447541
C    0.563013
D    0.000000
F    4.000000
dtype: float64

Histogramming
Count how often each value occurs.

# generate 10 random integers in the range [0, 7)
s = pd.Series(np.random.randint(0,7,size=10))
s
0    1
1    5
2    0
3    4
4    3
5    5
6    6
7    6
8    5
9    1
dtype: int64
s.value_counts()
5    3
6    2
1    2
4    1
3    1
0    1
dtype: int64

String methods

s = pd.Series(['A','B','C','Aaba','Baca',np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
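Other vectorized string methods follow the same .str pattern and likewise pass NaN through; a small sketch (the choice of methods here is just illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'Aaba', np.nan, 'cat'])
print(s.str.upper())        # NaN stays NaN
print(s.str.contains('a'))  # case-sensitive substring test per element
```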

Merge

pandas provides various facilities for combining Series, DataFrame, and Panel objects with several kinds of set logic.

Concatenating pandas objects with concat()

df = pd.DataFrame(np.random.randn(10,4))
df
          0         1         2         3
0  0.526889  2.038465 -0.564220  0.263579
1 -0.987904 -0.306195  1.805246  0.030639
2  1.288416 -0.514634  0.450702  0.671194
3  0.209680 -0.868604  0.553508  0.173013
4 -0.443213 -0.998113 -0.237519 -0.401295
5  0.595207  0.845315 -0.914725  1.471180
6 -0.539326 -0.681776  0.491664  2.022497
7  1.083012  0.518738  0.707878 -0.337114
8 -1.322083  0.495178 -0.223462 -1.511751
9 -0.105515 -0.256568  1.591926  0.755486
# break df into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces
[          0         1         2         3
 0  0.526889  2.038465 -0.564220  0.263579
 1 -0.987904 -0.306195  1.805246  0.030639
 2  1.288416 -0.514634  0.450702  0.671194,
           0         1         2         3
 3  0.209680 -0.868604  0.553508  0.173013
 4 -0.443213 -0.998113 -0.237519 -0.401295
 5  0.595207  0.845315 -0.914725  1.471180
 6 -0.539326 -0.681776  0.491664  2.022497,
           0         1         2         3
 7  1.083012  0.518738  0.707878 -0.337114
 8 -1.322083  0.495178 -0.223462 -1.511751
 9 -0.105515 -0.256568  1.591926  0.755486]
pd.concat(pieces)
          0         1         2         3
0  0.526889  2.038465 -0.564220  0.263579
1 -0.987904 -0.306195  1.805246  0.030639
2  1.288416 -0.514634  0.450702  0.671194
3  0.209680 -0.868604  0.553508  0.173013
4 -0.443213 -0.998113 -0.237519 -0.401295
5  0.595207  0.845315 -0.914725  1.471180
6 -0.539326 -0.681776  0.491664  2.022497
7  1.083012  0.518738  0.707878 -0.337114
8 -1.322083  0.495178 -0.223462 -1.511751
9 -0.105515 -0.256568  1.591926  0.755486

join
SQL-style merges.

left = pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
left
   key  lval
0  foo     1
1  foo     2
right
   key  rval
0  foo     4
1  foo     5
pd.merge(left, right, on='key')
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

append
Append rows to a DataFrame.

df = pd.DataFrame(np.random.randn(8,4), columns=['A','B','C','D'])
df
          A         B         C         D
0 -0.346194 -1.878628  0.257169  0.445530
1  1.098394 -1.127943 -1.251522 -0.653498
2  1.296878 -0.757345 -2.423548 -2.233024
3  0.857649 -0.320409  0.267631 -1.337814
4  0.090567  1.460739  0.212409 -0.308281
5  0.951721  1.305034  0.721996  0.669566
6  0.104395  1.904366 -0.132059  0.436476
7  0.552328 -1.344539  0.459006  1.713434
s = df.iloc[3]
df.append(s, ignore_index=True)
          A         B         C         D
0 -0.346194 -1.878628  0.257169  0.445530
1  1.098394 -1.127943 -1.251522 -0.653498
2  1.296878 -0.757345 -2.423548 -2.233024
3  0.857649 -0.320409  0.267631 -1.337814
4  0.090567  1.460739  0.212409 -0.308281
5  0.951721  1.305034  0.721996  0.669566
6  0.104395  1.904366 -0.132059  0.436476
7  0.552328 -1.344539  0.459006  1.713434
8  0.857649 -0.320409  0.267631 -1.337814

Grouping

"group by" refers to one or more of the following steps:
- splitting the data into groups based on some criteria
- applying a function to each group independently
- combining the results into a data structure
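The three steps can be sketched on a tiny hand-made frame (hypothetical data):

```python
import pandas as pd

df_toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                       'val': [1, 2, 3, 4]})
# split by 'key', apply sum() to each group's 'val', combine into a Series
result = df_toy.groupby('key')['val'].sum()
print(result)  # a -> 1 + 3 = 4, b -> 2 + 4 = 6
```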

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
     A      B         C         D
0  foo    one  0.460761 -0.001011
1  bar    one  2.001010  0.282712
2  foo    two -1.171306 -0.085701
3  bar  three  0.723922  1.013934
4  foo    two  0.566774 -0.654899
5  bar    two  0.653483  1.013699
6  foo    one  0.072918 -0.590657
7  foo  three -0.161579 -0.485670
df.groupby('A').sum()
            C         D
A
bar  3.378415  2.310345
foo -0.232432 -1.817937
df.groupby(['A','B']).sum()
                  C         D
A   B
bar one    2.001010  0.282712
    three  0.723922  1.013934
    two    0.653483  1.013699
foo one    0.533679 -0.591667
    three -0.161579 -0.485670
    two   -0.604532 -0.740600

Reshaping

stack

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
tuples
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
                     A         B
first second
bar   one     0.055334  0.953745
      two     1.719361  0.419879
baz   one     0.180238  0.844578
      two     0.233350 -1.366278
foo   one    -0.285023 -0.353144
      two    -1.531769 -0.146243
qux   one    -0.419270  0.308597
      two     0.763019  0.631118
df2 = df[:4]
df2
                     A         B
first second
bar   one     0.055334  0.953745
      two     1.719361  0.419879
baz   one     0.180238  0.844578
      two     0.233350 -1.366278
stacked = df2.stack()
stacked
first  second   
bar    one     A    0.055334
               B    0.953745
       two     A    1.719361
               B    0.419879
baz    one     A    0.180238
               B    0.844578
       two     A    0.233350
               B   -1.366278
dtype: float64

stack() compresses a level in the DataFrame's columns.

With a stacked DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack().

stacked.unstack()
                     A         B
first second
bar   one     0.055334  0.953745
      two     1.719361  0.419879
baz   one     0.180238  0.844578
      two     0.233350 -1.366278
stacked.unstack(0)
first          bar       baz
second
one    A  0.055334  0.180238
       B  0.953745  0.844578
two    A  1.719361  0.233350
       B  0.419879 -1.366278
stacked.unstack(1)
second         one       two
first
bar    A  0.055334  1.719361
       B  0.953745  0.419879
baz    A  0.180238  0.233350
       B  0.844578 -1.366278

Pivot tables

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df
        A  B    C         D         E
0     one  A  foo  0.353420 -0.570327
1     one  B  foo  1.090713 -0.046794
2     two  C  foo -0.160874  0.595251
3   three  A  bar  0.884684 -0.027981
4     one  B  bar  0.379335 -0.387736
5     one  C  bar  0.045674  1.210791
6     two  A  foo  0.264520 -1.120149
7   three  B  foo  1.149012  0.213768
8     one  C  foo -0.965242 -0.232711
9     one  A  bar -0.464023  0.799239
10    two  B  bar  0.186186 -0.889300
11  three  C  bar  0.177992  1.352036
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C              bar       foo
A     B
one   A  -0.464023  0.353420
      B   0.379335  1.090713
      C   0.045674 -0.965242
three A   0.884684       NaN
      B        NaN  1.149012
      C   0.177992       NaN
two   A        NaN  0.264520
      B   0.186186       NaN
      C        NaN -0.160874

Time series

pandas has simple, powerful, and efficient functionality for resampling during frequency conversion (for example, converting per-second data into 5-minute bars).

rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
rng
DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:00:01',
               '2012-01-01 00:00:02', '2012-01-01 00:00:03',
               '2012-01-01 00:00:04', '2012-01-01 00:00:05',
               '2012-01-01 00:00:06', '2012-01-01 00:00:07',
               '2012-01-01 00:00:08', '2012-01-01 00:00:09',
               '2012-01-01 00:00:10', '2012-01-01 00:00:11',
               '2012-01-01 00:00:12', '2012-01-01 00:00:13',
               '2012-01-01 00:00:14', '2012-01-01 00:00:15',
               '2012-01-01 00:00:16', '2012-01-01 00:00:17',
               '2012-01-01 00:00:18', '2012-01-01 00:00:19',
               '2012-01-01 00:00:20', '2012-01-01 00:00:21',
               '2012-01-01 00:00:22', '2012-01-01 00:00:23',
               '2012-01-01 00:00:24', '2012-01-01 00:00:25',
               '2012-01-01 00:00:26', '2012-01-01 00:00:27',
               '2012-01-01 00:00:28', '2012-01-01 00:00:29',
               '2012-01-01 00:00:30', '2012-01-01 00:00:31',
               '2012-01-01 00:00:32', '2012-01-01 00:00:33',
               '2012-01-01 00:00:34', '2012-01-01 00:00:35',
               '2012-01-01 00:00:36', '2012-01-01 00:00:37',
               '2012-01-01 00:00:38', '2012-01-01 00:00:39',
               '2012-01-01 00:00:40', '2012-01-01 00:00:41',
               '2012-01-01 00:00:42', '2012-01-01 00:00:43',
               '2012-01-01 00:00:44', '2012-01-01 00:00:45',
               '2012-01-01 00:00:46', '2012-01-01 00:00:47',
               '2012-01-01 00:00:48', '2012-01-01 00:00:49',
               '2012-01-01 00:00:50', '2012-01-01 00:00:51',
               '2012-01-01 00:00:52', '2012-01-01 00:00:53',
               '2012-01-01 00:00:54', '2012-01-01 00:00:55',
               '2012-01-01 00:00:56', '2012-01-01 00:00:57',
               '2012-01-01 00:00:58', '2012-01-01 00:00:59',
               '2012-01-01 00:01:00', '2012-01-01 00:01:01',
               '2012-01-01 00:01:02', '2012-01-01 00:01:03',
               '2012-01-01 00:01:04', '2012-01-01 00:01:05',
               '2012-01-01 00:01:06', '2012-01-01 00:01:07',
               '2012-01-01 00:01:08', '2012-01-01 00:01:09',
               '2012-01-01 00:01:10', '2012-01-01 00:01:11',
               '2012-01-01 00:01:12', '2012-01-01 00:01:13',
               '2012-01-01 00:01:14', '2012-01-01 00:01:15',
               '2012-01-01 00:01:16', '2012-01-01 00:01:17',
               '2012-01-01 00:01:18', '2012-01-01 00:01:19',
               '2012-01-01 00:01:20', '2012-01-01 00:01:21',
               '2012-01-01 00:01:22', '2012-01-01 00:01:23',
               '2012-01-01 00:01:24', '2012-01-01 00:01:25',
               '2012-01-01 00:01:26', '2012-01-01 00:01:27',
               '2012-01-01 00:01:28', '2012-01-01 00:01:29',
               '2012-01-01 00:01:30', '2012-01-01 00:01:31',
               '2012-01-01 00:01:32', '2012-01-01 00:01:33',
               '2012-01-01 00:01:34', '2012-01-01 00:01:35',
               '2012-01-01 00:01:36', '2012-01-01 00:01:37',
               '2012-01-01 00:01:38', '2012-01-01 00:01:39'],
              dtype='datetime64[ns]', freq='S')
ts
2012-01-01 00:00:00    244
2012-01-01 00:00:01     57
2012-01-01 00:00:02      2
2012-01-01 00:00:03    175
2012-01-01 00:00:04    486
2012-01-01 00:00:05     71
2012-01-01 00:00:06     71
2012-01-01 00:00:07    430
2012-01-01 00:00:08    276
2012-01-01 00:00:09    283
2012-01-01 00:00:10    358
2012-01-01 00:00:11    465
2012-01-01 00:00:12    358
2012-01-01 00:00:13     20
2012-01-01 00:00:14    296
2012-01-01 00:00:15    397
2012-01-01 00:00:16    485
2012-01-01 00:00:17    358
2012-01-01 00:00:18    429
2012-01-01 00:00:19    148
2012-01-01 00:00:20    166
2012-01-01 00:00:21    333
2012-01-01 00:00:22     43
2012-01-01 00:00:23    352
2012-01-01 00:00:24    180
2012-01-01 00:00:25     79
2012-01-01 00:00:26     97
2012-01-01 00:00:27    344
2012-01-01 00:00:28    271
2012-01-01 00:00:29    434
                      ... 
2012-01-01 00:01:10    294
2012-01-01 00:01:11     22
2012-01-01 00:01:12    352
2012-01-01 00:01:13    383
2012-01-01 00:01:14    175
2012-01-01 00:01:15     62
2012-01-01 00:01:16     62
2012-01-01 00:01:17     32
2012-01-01 00:01:18     16
2012-01-01 00:01:19    110
2012-01-01 00:01:20    110
2012-01-01 00:01:21    302
2012-01-01 00:01:22    268
2012-01-01 00:01:23    342
2012-01-01 00:01:24     39
2012-01-01 00:01:25    346
2012-01-01 00:01:26    461
2012-01-01 00:01:27    305
2012-01-01 00:01:28    435
2012-01-01 00:01:29    370
2012-01-01 00:01:30    319
2012-01-01 00:01:31    376
2012-01-01 00:01:32     97
2012-01-01 00:01:33    437
2012-01-01 00:01:34    287
2012-01-01 00:01:35    335
2012-01-01 00:01:36    334
2012-01-01 00:01:37    106
2012-01-01 00:01:38    295
2012-01-01 00:01:39    122
Freq: S, dtype: int64
ts.resample('5Min').sum()
2012-01-01    24806
Freq: 5T, dtype: int64
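Because the data above are random, the resampled sum changes on every run; a deterministic sketch of the same idea:

```python
import pandas as pd

rng2 = pd.date_range('2012-01-01', periods=6, freq='s')
ts2 = pd.Series([1, 2, 3, 4, 5, 6], index=rng2)
# downsample the per-second points into 3-second bins and sum each bin
print(ts2.resample('3s').sum())  # bins: 1+2+3 = 6 and 4+5+6 = 15
```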
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2012-03-06    0.954522
2012-03-07    0.944713
2012-03-08    1.299799
2012-03-09    1.766374
2012-03-10   -0.703189
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')
ts_utc
2012-03-06 00:00:00+00:00    0.954522
2012-03-07 00:00:00+00:00    0.944713
2012-03-08 00:00:00+00:00    1.299799
2012-03-09 00:00:00+00:00    1.766374
2012-03-10 00:00:00+00:00   -0.703189
Freq: D, dtype: float64
# convert to another time zone
ts_utc.tz_convert('US/Eastern')
2012-03-05 19:00:00-05:00    0.954522
2012-03-06 19:00:00-05:00    0.944713
2012-03-07 19:00:00-05:00    1.299799
2012-03-08 19:00:00-05:00    1.766374
2012-03-09 19:00:00-05:00   -0.703189
Freq: D, dtype: float64

Categoricals

Since version 0.15, pandas can include categorical data in a DataFrame.

df = pd.DataFrame({"id":[1,2,3,4,5,6], 
                   "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
# Series.cat.categories lets you rename the categories to more meaningful names
df["grade"].cat.categories = ["very good", "good", "very bad"]
df
   id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])
df
   id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad
df.sort_values(by="grade")
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good

Sorting is per the order of the categories, not lexical order.

# count occurrences of each category (empty categories are shown too)
df.groupby("grade").size()
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

Plotting

# Series
ts = pd.Series(np.random.randn(1000), 
               index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
%matplotlib inline
ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f7584dafc90>

[figure: line plot of the cumulative Series ts]

# DataFrame
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, 
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); 
plt.legend(loc='best')  # loc='best' automatically places the legend in the best position
<matplotlib.legend.Legend at 0x7f7574834e50>




<matplotlib.figure.Figure at 0x7f7584daf310>

[figure: line plot of the four cumulative columns of df]

Getting data in and out

CSV:
df.to_csv('foo.csv')
pd.read_csv('foo.csv')

HDF5:
df.to_hdf('foo.h5', 'df')
pd.read_hdf('foo.h5', 'df')

Excel:
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
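A quick CSV round trip, written to a temporary directory so nothing is left behind, might look like this sketch:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df_io = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])

# index=False avoids writing the row labels as an extra unnamed column
path = os.path.join(tempfile.mkdtemp(), 'foo.csv')
df_io.to_csv(path, index=False)
df_back = pd.read_csv(path)
print(df_io.equals(df_back))  # True: the frame survives the round trip
```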

Appendix

This post is a study log of the pandas 0.18.1 documentation; the original is 10 Minutes to pandas. Despite the name, ten minutes yields only a skin-deep understanding, or at best a quick desk reference. Typing the code into a jupyter-notebook step by step and digesting it along the way took me about four hours.
