10 Minutes to pandas

This post walks through the basics of pandas: creating DataFrames, viewing and selecting data, handling missing values, operations, merging, grouping, time series, and reading/writing data. Worked examples show how to use pandas for data processing and analysis; it is aimed at beginners.
This is a short introduction to pandas, geared mainly toward new users. For more detailed material, see the [Cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook). First, import the required Python packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Creating objects

pandas provides several data structures, including Series, DataFrame, Panel and Panel4D; see the [intro to data structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro) for details. The two used most often are Series and DataFrame. Create a Series by passing a Python list:
s = pd.Series([1,3,4,np.nan,6,8])
s
0    1.0
1    3.0
2    4.0
3    NaN
4    6.0
5    8.0
dtype: float64

Create a DataFrame by passing a numpy array, using a datetime index for the rows and labeled columns.
dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
# create a DataFrame, specifying the index and column names
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
                   A         B         C         D
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197

Create a DataFrame by passing a Python dict.

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20160102'),
                    'C': pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D': np.array([3]*4, dtype='int32'),
                    'E': pd.Categorical(['test','train','test','train']),
                    'F': 'foo'})
df2
     A          B    C  D      E    F
0  1.0 2016-01-02  1.0  3   test  foo
1  1.0 2016-01-02  1.0  3  train  foo
2  1.0 2016-01-02  1.0  3   test  foo
3  1.0 2016-01-02  1.0  3  train  foo
# the columns of a DataFrame can have different dtypes
df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In IPython, tab completion for a DataFrame's column names (as well as its public attributes) is automatically enabled.

Viewing data

View the top and bottom rows of a DataFrame.

df.head()
                   A         B         C         D
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
df.tail()
                   A         B         C         D
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197
df.head(3)
                   A         B         C         D
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457

Display the index, columns, and the underlying numpy data (values).

df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')
df.values
array([[-0.28589413,  0.49001051,  0.17112101, -1.54980655],
       [-0.06837701, -0.45280422, -0.39189213, -0.85252018],
       [ 1.30438846, -1.80848416, -0.28648908, -0.43745725],
       [ 1.44781215, -1.86212061,  0.11594994, -0.66413402],
       [ 0.5204089 , -1.4027399 , -0.35604882,  0.4609499 ],
       [-0.40489995,  0.58541997, -0.07392295, -0.5011969 ]])

describe() shows a quick statistic summary of the data.

df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.418906 -0.741786 -0.136880 -0.590694
std    0.808192  1.112849  0.244213  0.652884
min   -0.404900 -1.862121 -0.391892 -1.549807
25%   -0.231515 -1.707048 -0.338659 -0.805424
50%    0.226016 -0.927772 -0.180206 -0.582665
75%    1.108394  0.254307  0.068482 -0.453392
max    1.447812  0.585420  0.171121  0.460950

Transpose the DataFrame.

df.T
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -0.285894   -0.068377    1.304388    1.447812    0.520409   -0.404900
B    0.490011   -0.452804   -1.808484   -1.862121   -1.402740    0.585420
C    0.171121   -0.391892   -0.286489    0.115950   -0.356049   -0.073923
D   -1.549807   -0.852520   -0.437457   -0.664134    0.460950   -0.501197
df
                   A         B         C         D
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197

Sort by an axis.

df.sort_index(axis=1, ascending=False)
                   D         C         B         A
2013-01-01 -1.549807  0.171121  0.490011 -0.285894
2013-01-02 -0.852520 -0.391892 -0.452804 -0.068377
2013-01-03 -0.437457 -0.286489 -1.808484  1.304388
2013-01-04 -0.664134  0.115950 -1.862121  1.447812
2013-01-05  0.460950 -0.356049 -1.402740  0.520409
2013-01-06 -0.501197 -0.073923  0.585420 -0.404900
df.sort_index(axis=0, ascending=False)
                   A         B         C         D
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-01 -0.285894  0.490011  0.171121 -1.549807

Sort by values.

df.sort_values(by='B')
                   A         B         C         D
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197
df.sort_values(by='B',ascending=False)
                   A         B         C         D
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134

Selection

Data can be selected with standard Python/numpy expressions.

df['A']
2013-01-01   -0.285894
2013-01-02   -0.068377
2013-01-03    1.304388
2013-01-04    1.447812
2013-01-05    0.520409
2013-01-06   -0.404900
Freq: D, Name: A, dtype: float64
# selecting a single column yields a Series; df['A'] is equivalent to df.A
df.A
2013-01-01   -0.285894
2013-01-02   -0.068377
2013-01-03    1.304388
2013-01-04    1.447812
2013-01-05    0.520409
2013-01-06   -0.404900
Freq: D, Name: A, dtype: float64

Slice the rows.

df[0:3]
                   A         B         C         D
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
df['20130103':'20130105']
                   A         B         C         D
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950

Getting data with standard Python/numpy expressions is intuitive, but for production code the optimized pandas access methods are recommended: .at, .iat, .loc and .iloc. (The .ix accessor mentioned in older docs has since been deprecated.)

df
                   A         B         C         D
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df.loc[dates[0]]
A   -0.285894
B    0.490011
C    0.171121
D   -1.549807
Name: 2013-01-01 00:00:00, dtype: float64

Select on multiple axes by label.

df.loc[:,['A','B']]
                   A         B
2013-01-01 -0.285894  0.490011
2013-01-02 -0.068377 -0.452804
2013-01-03  1.304388 -1.808484
2013-01-04  1.447812 -1.862121
2013-01-05  0.520409 -1.402740
2013-01-06 -0.404900  0.585420
# when slicing by label, both endpoints are included, unlike Python's half-open slices
df.loc['20130102':'20130104',['A','B']]
                   A         B
2013-01-02 -0.068377 -0.452804
2013-01-03  1.304388 -1.808484
2013-01-04  1.447812 -1.862121
# when the selection collapses to one dimension, the returned object's dimensionality is reduced
df.loc['20130105',['A','B']]
A    0.520409
B   -1.402740
Name: 2013-01-05 00:00:00, dtype: float64
df.loc['20130105','A']
0.52040890430486719
# .at is a faster way than .loc to access a single scalar
df.at[dates[0],'A']
-0.28589413005579967

Select by position: pass integers, get the data back.

df
                   A         B         C         D
2013-01-01 -0.285894  0.490011  0.171121 -1.549807
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197
df.iloc[3]
A    1.447812
B   -1.862121
C    0.115950
D   -0.664134
Name: 2013-01-04 00:00:00, dtype: float64
df.iloc[3:5,0:2]
                   A         B
2013-01-04  1.447812 -1.862121
2013-01-05  0.520409 -1.402740

Selecting or slicing by integer position works as in Python/numpy: indices start at 0, the start is included, and the end is excluded.

df.iloc[[1,2,4],[0,2]]
                   A         C
2013-01-02 -0.068377 -0.391892
2013-01-03  1.304388 -0.286489
2013-01-05  0.520409 -0.356049
# slicing rows
df.iloc[1:3,:]
                   A         B         C         D
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
# slicing columns
df.iloc[:,1:3]
                   B         C
2013-01-01  0.490011  0.171121
2013-01-02 -0.452804 -0.391892
2013-01-03 -1.808484 -0.286489
2013-01-04 -1.862121  0.115950
2013-01-05 -1.402740 -0.356049
2013-01-06  0.585420 -0.073923
df.iloc[1,1]
-0.45280421688689004
# .iat is faster than .iloc for scalar access
df.iat[1,1]
-0.45280421688689004

Indexing with boolean values.

df[df.A > 0]
                   A         B         C         D
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457
2013-01-04  1.447812 -1.862121  0.115950 -0.664134
2013-01-05  0.520409 -1.402740 -0.356049  0.460950
df[df > 0]
                   A         B         C        D
2013-01-01       NaN  0.490011  0.171121      NaN
2013-01-02       NaN       NaN       NaN      NaN
2013-01-03  1.304388       NaN       NaN      NaN
2013-01-04  1.447812       NaN  0.115950      NaN
2013-01-05  0.520409       NaN       NaN  0.46095
2013-01-06       NaN  0.585420       NaN      NaN

Filtering with the isin() method.

df2 = df.copy()
df2['E'] = ['one','one','two','three','four','three']
df2
                   A         B         C         D      E
2013-01-01 -0.285894  0.490011  0.171121 -1.549807    one
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520    one
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457    two
2013-01-04  1.447812 -1.862121  0.115950 -0.664134  three
2013-01-05  0.520409 -1.402740 -0.356049  0.460950   four
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197  three
df2[df2['E'].isin(['one','four'])]
                   A         B         C         D     E
2013-01-01 -0.285894  0.490011  0.171121 -1.549807   one
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520   one
2013-01-05  0.520409 -1.402740 -0.356049  0.460950  four

Setting

Setting a new column automatically aligns the data by index.

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102',periods=6))
s1
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
df['F'] = s1
df
                   A         B         C         D    F
2013-01-01 -0.285894  0.490011  0.171121 -1.549807  NaN
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520  1.0
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457  2.0
2013-01-04  1.447812 -1.862121  0.115950 -0.664134  3.0
2013-01-05  0.520409 -1.402740 -0.356049  0.460950  4.0
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197  5.0

Because s1 starts at '20130102', the F value for '20130101' is NaN.

df.at[dates[0],'A'] = 0
df
                   A         B         C         D    F
2013-01-01  0.000000  0.490011  0.171121 -1.549807  NaN
2013-01-02 -0.068377 -0.452804 -0.391892 -0.852520  1.0
2013-01-03  1.304388 -1.808484 -0.286489 -0.437457  2.0
2013-01-04  1.447812 -1.862121  0.115950 -0.664134  3.0
2013-01-05  0.520409 -1.402740 -0.356049  0.460950  4.0
2013-01-06 -0.404900  0.585420 -0.073923 -0.501197  5.0
df.iat[0,1] = 0
df.loc[:,'D'] = np.array([5] * len(df))
df
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  0.171121  5  NaN
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0
2013-01-03  1.304388 -1.808484 -0.286489  5  2.0
2013-01-04  1.447812 -1.862121  0.115950  5  3.0
2013-01-05  0.520409 -1.402740 -0.356049  5  4.0
2013-01-06 -0.404900  0.585420 -0.073923  5  5.0
df2 = df.copy()
df2[df2 > 0] = -df2
df2
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -0.171121 -5  NaN
2013-01-02 -0.068377 -0.452804 -0.391892 -5 -1.0
2013-01-03 -1.304388 -1.808484 -0.286489 -5 -2.0
2013-01-04 -1.447812 -1.862121 -0.115950 -5 -3.0
2013-01-05 -0.520409 -1.402740 -0.356049 -5 -4.0
2013-01-06 -0.404900 -0.585420 -0.073923 -5 -5.0

Missing data

pandas primarily uses np.nan to represent missing data; it is excluded from computations by default.
reindex lets you add, delete, or change the index on an axis; it returns a copy of the data.
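Since reindex returns a copy, writing into the result leaves the original untouched; a minimal sketch on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
# reindex returns a new object; the added label 'd' is filled with NaN
df_copy = df.reindex(index=['a', 'b', 'c', 'd'])
df_copy.loc['a', 'x'] = 99.0
print(df.loc['a', 'x'])       # original value is unchanged: 1.0
print(df_copy.loc['d', 'x'])  # NaN
```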

df1 = df.reindex(index=dates[0:4], columns=list(df.columns)+['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  0.171121  5  NaN  1.0
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0  1.0
2013-01-03  1.304388 -1.808484 -0.286489  5  2.0  NaN
2013-01-04  1.447812 -1.862121  0.115950  5  3.0  NaN
# drop any rows that contain missing data
df1.dropna(how='any')
                   A         B         C  D    F    E
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0  1.0
# fill in missing data
df1.fillna(value=5)
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  0.171121  5  5.0  1.0
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0  1.0
2013-01-03  1.304388 -1.808484 -0.286489  5  2.0  5.0
2013-01-04  1.447812 -1.862121  0.115950  5  3.0  5.0
# get the boolean mask where values are NaN
pd.isnull(df1)
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True
df1
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  0.171121  5  NaN  1.0
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0  1.0
2013-01-03  1.304388 -1.808484 -0.286489  5  2.0  NaN
2013-01-04  1.447812 -1.862121  0.115950  5  3.0  NaN

Operations

Operations in general exclude missing data.
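For instance, Series.mean() skips NaN by default, while the plain numpy mean propagates it; a small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.mean())           # NaN is skipped: (1 + 3) / 2 = 2.0
print(np.mean(s.values))  # plain numpy propagates NaN -> nan
```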

# descriptive statistics
df
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  0.171121  5  NaN
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0
2013-01-03  1.304388 -1.808484 -0.286489  5  2.0
2013-01-04  1.447812 -1.862121  0.115950  5  3.0
2013-01-05  0.520409 -1.402740 -0.356049  5  4.0
2013-01-06 -0.404900  0.585420 -0.073923  5  5.0
df.mean()
A    0.466555
B   -0.823455
C   -0.136880
D    5.000000
F    3.000000
dtype: float64
df.mean(1)
2013-01-01    1.292780
2013-01-02    1.017385
2013-01-03    1.241883
2013-01-04    1.540328
2013-01-05    1.552324
2013-01-06    2.021319
Freq: D, dtype: float64

Operating between objects of different dimensionality requires alignment; pandas automatically broadcasts along the specified dimension.

s = pd.Series([1,3,5,np.nan,6,8], index=dates)
s
2013-01-01    1.0
2013-01-02    3.0
2013-01-03    5.0
2013-01-04    NaN
2013-01-05    6.0
2013-01-06    8.0
Freq: D, dtype: float64
s = s.shift(2)
s
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
df
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  0.171121  5  NaN
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0
2013-01-03  1.304388 -1.808484 -0.286489  5  2.0
2013-01-04  1.447812 -1.862121  0.115950  5  3.0
2013-01-05  0.520409 -1.402740 -0.356049  5  4.0
2013-01-06 -0.404900  0.585420 -0.073923  5  5.0
df.sub(s,axis='index')
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03  0.304388 -2.808484 -1.286489  4.0  1.0
2013-01-04 -1.552188 -4.862121 -2.884050  2.0  0.0
2013-01-05 -4.479591 -6.402740 -5.356049  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN

sub() is subtraction; when df subtracts s, s is automatically broadcast across df's columns after aligning on the index.

apply
apply applies a function to the data.

df.apply(np.cumsum)
                   A         B         C   D     F
2013-01-01  0.000000  0.000000  0.171121   5   NaN
2013-01-02 -0.068377 -0.452804 -0.220771  10   1.0
2013-01-03  1.236011 -2.261288 -0.507260  15   3.0
2013-01-04  2.683824 -4.123409 -0.391310  20   6.0
2013-01-05  3.204233 -5.526149 -0.747359  25  10.0
2013-01-06  2.799333 -4.940729 -0.821282  30  15.0
df
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  0.171121  5  NaN
2013-01-02 -0.068377 -0.452804 -0.391892  5  1.0
2013-01-03  1.304388 -1.808484 -0.286489  5  2.0
2013-01-04  1.447812 -1.862121  0.115950  5  3.0
2013-01-05  0.520409 -1.402740 -0.356049  5  4.0
2013-01-06 -0.404900  0.585420 -0.073923  5  5.0

np.cumsum computes the cumulative sum; the operation above accumulates each row into the next, column by column.

df.apply(lambda x: x.max()-x.min())
A    1.852712
B    2.447541
C    0.563013
D    0.000000
F    4.000000
dtype: float64

Histogramming
Count how often each value occurs.

# generate 10 random integers in the range [0, 7)
s = pd.Series(np.random.randint(0,7,size=10))
s
0    1
1    5
2    0
3    4
4    3
5    5
6    6
7    6
8    5
9    1
dtype: int64
s.value_counts()
5    3
6    2
1    2
4    1
3    1
0    1
dtype: int64

String methods

s = pd.Series(['A','B','C','Aaba','Baca',np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
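Other vectorized string methods follow the same .str pattern and likewise pass NaN through; a small sketch (the choice of methods here is just illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'Aaba', np.nan, 'cat'])
print(s.str.upper())        # NaN stays NaN
print(s.str.contains('a'))  # case-sensitive substring test per element
```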

Merge

pandas provides various facilities for combining Series, DataFrame, and Panel objects with several kinds of set logic.

Concatenating pandas objects with concat()

df = pd.DataFrame(np.random.randn(10,4))
df
          0         1         2         3
0  0.526889  2.038465 -0.564220  0.263579
1 -0.987904 -0.306195  1.805246  0.030639
2  1.288416 -0.514634  0.450702  0.671194
3  0.209680 -0.868604  0.553508  0.173013
4 -0.443213 -0.998113 -0.237519 -0.401295
5  0.595207  0.845315 -0.914725  1.471180
6 -0.539326 -0.681776  0.491664  2.022497
7  1.083012  0.518738  0.707878 -0.337114
8 -1.322083  0.495178 -0.223462 -1.511751
9 -0.105515 -0.256568  1.591926  0.755486
# break df into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces
[          0         1         2         3
 0  0.526889  2.038465 -0.564220  0.263579
 1 -0.987904 -0.306195  1.805246  0.030639
 2  1.288416 -0.514634  0.450702  0.671194,
           0         1         2         3
 3  0.209680 -0.868604  0.553508  0.173013
 4 -0.443213 -0.998113 -0.237519 -0.401295
 5  0.595207  0.845315 -0.914725  1.471180
 6 -0.539326 -0.681776  0.491664  2.022497,
           0         1         2         3
 7  1.083012  0.518738  0.707878 -0.337114
 8 -1.322083  0.495178 -0.223462 -1.511751
 9 -0.105515 -0.256568  1.591926  0.755486]
pd.concat(pieces)
          0         1         2         3
0  0.526889  2.038465 -0.564220  0.263579
1 -0.987904 -0.306195  1.805246  0.030639
2  1.288416 -0.514634  0.450702  0.671194
3  0.209680 -0.868604  0.553508  0.173013
4 -0.443213 -0.998113 -0.237519 -0.401295
5  0.595207  0.845315 -0.914725  1.471180
6 -0.539326 -0.681776  0.491664  2.022497
7  1.083012  0.518738  0.707878 -0.337114
8 -1.322083  0.495178 -0.223462 -1.511751
9 -0.105515 -0.256568  1.591926  0.755486

join
SQL-style merges.

left = pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
left
   key  lval
0  foo     1
1  foo     2
right
   key  rval
0  foo     4
1  foo     5
pd.merge(left, right, on='key')
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

append
Append rows to a DataFrame.

df = pd.DataFrame(np.random.randn(8,4), columns=['A','B','C','D'])
df
          A         B         C         D
0 -0.346194 -1.878628  0.257169  0.445530
1  1.098394 -1.127943 -1.251522 -0.653498
2  1.296878 -0.757345 -2.423548 -2.233024
3  0.857649 -0.320409  0.267631 -1.337814
4  0.090567  1.460739  0.212409 -0.308281
5  0.951721  1.305034  0.721996  0.669566
6  0.104395  1.904366 -0.132059  0.436476
7  0.552328 -1.344539  0.459006  1.713434
s = df.iloc[3]
df.append(s, ignore_index=True)
          A         B         C         D
0 -0.346194 -1.878628  0.257169  0.445530
1  1.098394 -1.127943 -1.251522 -0.653498
2  1.296878 -0.757345 -2.423548 -2.233024
3  0.857649 -0.320409  0.267631 -1.337814
4  0.090567  1.460739  0.212409 -0.308281
5  0.951721  1.305034  0.721996  0.669566
6  0.104395  1.904366 -0.132059  0.436476
7  0.552328 -1.344539  0.459006  1.713434
8  0.857649 -0.320409  0.267631 -1.337814

Grouping

"group by" refers to one or more of the following steps:
- splitting the data into groups based on some criteria
- applying a function to each group independently
- combining the results into a data structure
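The three steps can be sketched on a tiny hand-made frame (hypothetical data):

```python
import pandas as pd

df_toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                       'val': [1, 2, 3, 4]})
# split by 'key', apply sum() to each group's 'val', combine into a Series
result = df_toy.groupby('key')['val'].sum()
print(result)  # a -> 1 + 3 = 4, b -> 2 + 4 = 6
```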

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
     A      B         C         D
0  foo    one  0.460761 -0.001011
1  bar    one  2.001010  0.282712
2  foo    two -1.171306 -0.085701
3  bar  three  0.723922  1.013934
4  foo    two  0.566774 -0.654899
5  bar    two  0.653483  1.013699
6  foo    one  0.072918 -0.590657
7  foo  three -0.161579 -0.485670
df.groupby('A').sum()
            C         D
A
bar  3.378415  2.310345
foo -0.232432 -1.817937
df.groupby(['A','B']).sum()
                  C         D
A   B
bar one    2.001010  0.282712
    three  0.723922  1.013934
    two    0.653483  1.013699
foo one    0.533679 -0.591667
    three -0.161579 -0.485670
    two   -0.604532 -0.740600

Reshaping

stack

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
tuples
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
                     A         B
first second
bar   one     0.055334  0.953745
      two     1.719361  0.419879
baz   one     0.180238  0.844578
      two     0.233350 -1.366278
foo   one    -0.285023 -0.353144
      two    -1.531769 -0.146243
qux   one    -0.419270  0.308597
      two     0.763019  0.631118
df2 = df[:4]
df2
                     A         B
first second
bar   one     0.055334  0.953745
      two     1.719361  0.419879
baz   one     0.180238  0.844578
      two     0.233350 -1.366278
stacked = df2.stack()
stacked
first  second   
bar    one     A    0.055334
               B    0.953745
       two     A    1.719361
               B    0.419879
baz    one     A    0.180238
               B    0.844578
       two     A    0.233350
               B   -1.366278
dtype: float64

stack() compresses a level in the DataFrame's columns.

With a stacked DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack().

stacked.unstack()
                     A         B
first second
bar   one     0.055334  0.953745
      two     1.719361  0.419879
baz   one     0.180238  0.844578
      two     0.233350 -1.366278
stacked.unstack(0)
first          bar       baz
second
one    A  0.055334  0.180238
       B  0.953745  0.844578
two    A  1.719361  0.233350
       B  0.419879 -1.366278
stacked.unstack(1)
second         one       two
first
bar    A  0.055334  1.719361
       B  0.953745  0.419879
baz    A  0.180238  0.233350
       B  0.844578 -1.366278

Pivot tables

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df
        A  B    C         D         E
0     one  A  foo  0.353420 -0.570327
1     one  B  foo  1.090713 -0.046794
2     two  C  foo -0.160874  0.595251
3   three  A  bar  0.884684 -0.027981
4     one  B  bar  0.379335 -0.387736
5     one  C  bar  0.045674  1.210791
6     two  A  foo  0.264520 -1.120149
7   three  B  foo  1.149012  0.213768
8     one  C  foo -0.965242 -0.232711
9     one  A  bar -0.464023  0.799239
10    two  B  bar  0.186186 -0.889300
11  three  C  bar  0.177992  1.352036
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C              bar       foo
A     B
one   A  -0.464023  0.353420
      B   0.379335  1.090713
      C   0.045674 -0.965242
three A   0.884684       NaN
      B        NaN  1.149012
      C   0.177992       NaN
two   A        NaN  0.264520
      B   0.186186       NaN
      C        NaN -0.160874

Time series

pandas has simple, powerful, and efficient functionality for resampling during frequency conversion (for example, converting per-second data into 5-minute bars).

rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
rng
DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:00:01',
               '2012-01-01 00:00:02', '2012-01-01 00:00:03',
               '2012-01-01 00:00:04', '2012-01-01 00:00:05',
               '2012-01-01 00:00:06', '2012-01-01 00:00:07',
               '2012-01-01 00:00:08', '2012-01-01 00:00:09',
               '2012-01-01 00:00:10', '2012-01-01 00:00:11',
               '2012-01-01 00:00:12', '2012-01-01 00:00:13',
               '2012-01-01 00:00:14', '2012-01-01 00:00:15',
               '2012-01-01 00:00:16', '2012-01-01 00:00:17',
               '2012-01-01 00:00:18', '2012-01-01 00:00:19',
               '2012-01-01 00:00:20', '2012-01-01 00:00:21',
               '2012-01-01 00:00:22', '2012-01-01 00:00:23',
               '2012-01-01 00:00:24', '2012-01-01 00:00:25',
               '2012-01-01 00:00:26', '2012-01-01 00:00:27',
               '2012-01-01 00:00:28', '2012-01-01 00:00:29',
               '2012-01-01 00:00:30', '2012-01-01 00:00:31',
               '2012-01-01 00:00:32', '2012-01-01 00:00:33',
               '2012-01-01 00:00:34', '2012-01-01 00:00:35',
               '2012-01-01 00:00:36', '2012-01-01 00:00:37',
               '2012-01-01 00:00:38', '2012-01-01 00:00:39',
               '2012-01-01 00:00:40', '2012-01-01 00:00:41',
               '2012-01-01 00:00:42', '2012-01-01 00:00:43',
               '2012-01-01 00:00:44', '2012-01-01 00:00:45',
               '2012-01-01 00:00:46', '2012-01-01 00:00:47',
               '2012-01-01 00:00:48', '2012-01-01 00:00:49',
               '2012-01-01 00:00:50', '2012-01-01 00:00:51',
               '2012-01-01 00:00:52', '2012-01-01 00:00:53',
               '2012-01-01 00:00:54', '2012-01-01 00:00:55',
               '2012-01-01 00:00:56', '2012-01-01 00:00:57',
               '2012-01-01 00:00:58', '2012-01-01 00:00:59',
               '2012-01-01 00:01:00', '2012-01-01 00:01:01',
               '2012-01-01 00:01:02', '2012-01-01 00:01:03',
               '2012-01-01 00:01:04', '2012-01-01 00:01:05',
               '2012-01-01 00:01:06', '2012-01-01 00:01:07',
               '2012-01-01 00:01:08', '2012-01-01 00:01:09',
               '2012-01-01 00:01:10', '2012-01-01 00:01:11',
               '2012-01-01 00:01:12', '2012-01-01 00:01:13',
               '2012-01-01 00:01:14', '2012-01-01 00:01:15',
               '2012-01-01 00:01:16', '2012-01-01 00:01:17',
               '2012-01-01 00:01:18', '2012-01-01 00:01:19',
               '2012-01-01 00:01:20', '2012-01-01 00:01:21',
               '2012-01-01 00:01:22', '2012-01-01 00:01:23',
               '2012-01-01 00:01:24', '2012-01-01 00:01:25',
               '2012-01-01 00:01:26', '2012-01-01 00:01:27',
               '2012-01-01 00:01:28', '2012-01-01 00:01:29',
               '2012-01-01 00:01:30', '2012-01-01 00:01:31',
               '2012-01-01 00:01:32', '2012-01-01 00:01:33',
               '2012-01-01 00:01:34', '2012-01-01 00:01:35',
               '2012-01-01 00:01:36', '2012-01-01 00:01:37',
               '2012-01-01 00:01:38', '2012-01-01 00:01:39'],
              dtype='datetime64[ns]', freq='S')
ts
2012-01-01 00:00:00    244
2012-01-01 00:00:01     57
2012-01-01 00:00:02      2
2012-01-01 00:00:03    175
2012-01-01 00:00:04    486
2012-01-01 00:00:05     71
2012-01-01 00:00:06     71
2012-01-01 00:00:07    430
2012-01-01 00:00:08    276
2012-01-01 00:00:09    283
2012-01-01 00:00:10    358
2012-01-01 00:00:11    465
2012-01-01 00:00:12    358
2012-01-01 00:00:13     20
2012-01-01 00:00:14    296
2012-01-01 00:00:15    397
2012-01-01 00:00:16    485
2012-01-01 00:00:17    358
2012-01-01 00:00:18    429
2012-01-01 00:00:19    148
2012-01-01 00:00:20    166
2012-01-01 00:00:21    333
2012-01-01 00:00:22     43
2012-01-01 00:00:23    352
2012-01-01 00:00:24    180
2012-01-01 00:00:25     79
2012-01-01 00:00:26     97
2012-01-01 00:00:27    344
2012-01-01 00:00:28    271
2012-01-01 00:00:29    434
                      ... 
2012-01-01 00:01:10    294
2012-01-01 00:01:11     22
2012-01-01 00:01:12    352
2012-01-01 00:01:13    383
2012-01-01 00:01:14    175
2012-01-01 00:01:15     62
2012-01-01 00:01:16     62
2012-01-01 00:01:17     32
2012-01-01 00:01:18     16
2012-01-01 00:01:19    110
2012-01-01 00:01:20    110
2012-01-01 00:01:21    302
2012-01-01 00:01:22    268
2012-01-01 00:01:23    342
2012-01-01 00:01:24     39
2012-01-01 00:01:25    346
2012-01-01 00:01:26    461
2012-01-01 00:01:27    305
2012-01-01 00:01:28    435
2012-01-01 00:01:29    370
2012-01-01 00:01:30    319
2012-01-01 00:01:31    376
2012-01-01 00:01:32     97
2012-01-01 00:01:33    437
2012-01-01 00:01:34    287
2012-01-01 00:01:35    335
2012-01-01 00:01:36    334
2012-01-01 00:01:37    106
2012-01-01 00:01:38    295
2012-01-01 00:01:39    122
Freq: S, dtype: int64
ts.resample('5Min').sum()
2012-01-01    24806
Freq: 5T, dtype: int64
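Because the data above are random, the resampled sum changes on every run; a deterministic sketch of the same idea:

```python
import pandas as pd

rng2 = pd.date_range('2012-01-01', periods=6, freq='s')
ts2 = pd.Series([1, 2, 3, 4, 5, 6], index=rng2)
# downsample the per-second points into 3-second bins and sum each bin
print(ts2.resample('3s').sum())  # bins: 1+2+3 = 6 and 4+5+6 = 15
```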
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2012-03-06    0.954522
2012-03-07    0.944713
2012-03-08    1.299799
2012-03-09    1.766374
2012-03-10   -0.703189
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')
ts_utc
2012-03-06 00:00:00+00:00    0.954522
2012-03-07 00:00:00+00:00    0.944713
2012-03-08 00:00:00+00:00    1.299799
2012-03-09 00:00:00+00:00    1.766374
2012-03-10 00:00:00+00:00   -0.703189
Freq: D, dtype: float64
# convert to another time zone
ts_utc.tz_convert('US/Eastern')
2012-03-05 19:00:00-05:00    0.954522
2012-03-06 19:00:00-05:00    0.944713
2012-03-07 19:00:00-05:00    1.299799
2012-03-08 19:00:00-05:00    1.766374
2012-03-09 19:00:00-05:00   -0.703189
Freq: D, dtype: float64

Categoricals

Since version 0.15, pandas can include categorical data in a DataFrame.

df = pd.DataFrame({"id":[1,2,3,4,5,6], 
                   "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
# Series.cat.categories lets you rename the categories to more meaningful names
df["grade"].cat.categories = ["very good", "good", "very bad"]
df
   id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])
df
   id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad
df.sort_values(by="grade")
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good

Sorting is per the order of the categories, not lexical order.

# count occurrences of each category (empty categories are shown too)
df.groupby("grade").size()
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

Plotting

# Series
ts = pd.Series(np.random.randn(1000), 
               index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
%matplotlib inline
ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f7584dafc90>

[figure: line plot of the cumulative Series ts]

# DataFrame
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, 
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); 
plt.legend(loc='best')  # loc='best' automatically places the legend in the best position
<matplotlib.legend.Legend at 0x7f7574834e50>




<matplotlib.figure.Figure at 0x7f7584daf310>

[figure: line plot of the four cumulative columns of df]

Getting data in and out

CSV:
df.to_csv('foo.csv')
pd.read_csv('foo.csv')

HDF5:
df.to_hdf('foo.h5', 'df')
pd.read_hdf('foo.h5', 'df')

Excel:
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
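A quick CSV round trip, written to a temporary directory so nothing is left behind, might look like this sketch:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df_io = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])

# index=False avoids writing the row labels as an extra unnamed column
path = os.path.join(tempfile.mkdtemp(), 'foo.csv')
df_io.to_csv(path, index=False)
df_back = pd.read_csv(path)
print(df_io.equals(df_back))  # True: the frame survives the round trip
```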

Appendix

This post is a study log of the pandas 0.18.1 documentation; the original is 10 Minutes to pandas. Despite the name, ten minutes yields only a skin-deep understanding, or at best a quick desk reference. Typing the code into a jupyter-notebook step by step and digesting it along the way took me about four hours.
