10 Minutes to pandas
http://pandas.pydata.org/pandas-docs/stable/10min.html
安装
支持的python版本: 2.7, 3.5, 3.6
$ pip install pandas
检查本地的pandas运行环境是否完整,可以运行pandas的单元测试用例
$ pip install pytest>>> import pandas as pd>>> pd.test()
获取当前使用pandas的版本信息
>>> import pandas as pd>>> pd.__version__'0.21.1'
概览
pandas的基本数据结构:
- Series: 一维数据
- DataFrame: 二维数据
- Panel: 三维数据(从0.20.0版本开始,已经不再推荐使用)
- Panel4D, PanelND(不再推荐使用)
DataFrame是由Series构成的
创建Series
创建Series最简单的方法
>>> s = pd.Series(data, index=index)
data可以是不同的类型:
- python字典
- ndarray
- 标量(比如: 5)
使用ndarray创建(From ndarray)
如果data是ndarray,那么index的长度必须和data的长度相同,当没有明确index参数时,默认使用[0, ... len(data) - 1]作为index。
>>> import pandas as pd>>> import numpy as np>>> s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])>>> sa 0.654385b 0.055691 c 0.856054d 0.621810e 1.802872dtype: float64>>> s.indexIndex(['a', 'b', 'c', 'd', 'e'], dtype='object')>>> pd.Series(np.random.randn(5))0 -0.4671831 -1.3333232 -0.4938133 -0.0677054 -1.310332dtype: float64
需要注意的是: pandas里的索引并不要求唯一性,如果一个操作不支持重复的索引,会自动抛出异常。这么做的原因是很多操作不会用到索引,比如GroupBy。
>>> s = pd.Series(np.random.randn(5), index=['a', 'a', 'a', 'a', 'a'])>>> sa 0.847331a -2.138021a -0.364763a -0.603172a 0.363691dtype: float64
使用dict创建(From dict)
当data是dict类型时,如果指定了index参数,那么就使用index参数作为索引。否者,就使用排序后的data的key作为index。
>>> d = {'b': 0., 'a': 1., 'c': 2.}# 索引的值是排序后的>>> pd.Series(d)a 1.0b 0.0c 2.0dtype: float64# 字典中不存在的key, 直接赋值为NaN(Not a number)>>> pd.Series(d, index=['b', 'c', 'd', 'a'])b 0.0c 2.0d NaNa 1.0dtype: float64
使用标量创建(From scalar value)
当data是标量时,必须提供index, 值会被重复到index的长度
>>> pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])a 5.0b 5.0c 5.0d 5.0e 5.0dtype: float64
创建DataFrame
DataFrame是一个二维的数据结构,可以看做是一个excel表格或一张SQL表,或者值为Series的字典。 跟Series一样,DataFrame也可以通过多种类型的数据结构来创建
- 字典(包含一维ndarray数组,列表,字典或Series)
- 二维的ndarray数组
- 结构化的ndarray
- Series
- 另一个DataFrame
除了data之外,还接受index和columns参数来分布指定行和列的标签
从Series字典或嵌套的字典创建(From dict of Series or dicts)
结果的索引是多个Series索引的合集,如果没有指定columns,就用排序后的字典的key作为列标签。
>>> d = {'one': pd.Series([1,2,3], index=['a', 'b', 'c']),... 'two': pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])}...>>> df = pd.DataFrame(d)>>> df one twoa 1.0 1b 2.0 2c 3.0 3d NaN 4>>> pd.DataFrame(d, index=['d', 'b', 'a']) one twod NaN 4b 2.0 2a 1.0 1>>> pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three']) two threed 4 NaNb 2 NaNa 1 NaN>>> df.indexIndex(['a', 'b', 'c', 'd'], dtype='object')>>> df.columnsIndex(['one', 'two'], dtype='object')
从ndarray类型/列表类型的字典(From dict of ndarrays / lists)
>>> d = {'one': [1,2,3,4], 'two': [4,3,2,1]}>>> pd.DataFrame(d) one two0 1 41 2 32 3 23 4 1>>> pd.DataFrame(d, index=['a', 'b', 'c', 'd']) one twoa 1 4b 2 3c 3 2d 4 1
从结构化ndarray创建(From structured or record array)
>>> data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'a10')])>>> dataarray([(0, 0., b''), (0, 0., b'')], dtype=[('A', '>> data[:] = [(1, 2., 'Hello'), (2, 3., 'World')]>>> pd.DataFrame(data) A B C0 1 2.0 b'Hello'1 2 3.0 b'World'>>> pd.DataFrame(data, index=['first', 'second']) A B Cfirst 1 2.0 b'Hello'second 2 3.0 b'World'>>> pd.DataFrame(data, index=['first', 'second'], columns=['C', 'A', 'B']) C A Bfirst b'Hello' 1 2.0second b'World' 2 3.0
从字典列表里创建(a list of dicts)
>>> data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]>>> pd.DataFrame(data2) a b c0 1 2 NaN1 5 10 20.0>>> pd.DataFrame(data2, index=["first", "second"]) a b cfirst 1 2 NaNsecond 5 10 20.0>>> pd.DataFrame(data2, columns=["a", "b"]) a b0 1 21 5 10
从元祖字典创建(From a dict of tuples)
通过元祖字典,可以创建多索引的DataFrame
>>> pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},... ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},... ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},... ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},... ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})... a b a b c a bA B 4.0 1.0 5.0 8.0 10.0 C 3.0 2.0 6.0 7.0 NaN D NaN NaN NaN NaN 9.0
通过Series创建(From a Series)
>>> pd.DataFrame(pd.Series([1,2,3])) 00 11 22 3
查看数据
>>> dates = pd.date_range('20130101', periods=6)>>> datesDatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')>>> df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))>>> df A B C D2013-01-01 1.231897 -0.169839 1.333295 0.3671422013-01-02 -0.127450 -1.716671 0.910350 0.1511862013-01-03 -0.241652 -0.984647 0.788656 -0.2036392013-01-04 0.044990 -0.255158 -1.213848 1.0767152013-01-05 0.418213 0.107400 0.619448 1.4940872013-01-06 -1.831020 0.813526 0.403101 -1.251946# 获取前几行(默认前5行)>>> df.head() A B C D2013-01-01 1.231897 -0.169839 1.333295 0.3671422013-01-02 -0.127450 -1.716671 0.910350 0.1511862013-01-03 -0.241652 -0.984647 0.788656 -0.2036392013-01-04 0.044990 -0.255158 -1.213848 1.0767152013-01-05 0.418213 0.107400 0.619448 1.494087# 获取后3行>>> df.tail(3) A B C D2013-01-04 0.044990 -0.255158 -1.213848 1.0767152013-01-05 0.418213 0.107400 0.619448 1.4940872013-01-06 -1.831020 0.813526 0.403101 -1.251946# 获取索引>>> df.indexDatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')# 获取列信息>>> df.columnsIndex(['A', 'B', 'C', 'D'], dtype='object')# 获取数据信息>>> df.valuesarray([[ 1.23189704, -0.16983942, 1.3332949 , 0.36714191], [-0.12744988, -1.71667129, 0.91034961, 0.15118638], [-0.24165226, -0.98464711, 0.78865554, -0.20363944], [ 0.04498958, -0.25515787, -1.21384804, 1.07671506], [ 0.41821265, 0.10740007, 0.61944799, 1.49408712], [-1.8310196 , 0.81352564, 0.40310115, -1.25194611]]) # 获取简单的统计信息 >>> df.describe() A B C Dcount 6.000000 6.000000 6.000000 6.000000mean -0.084170 -0.367565 0.473500 0.272257std 1.007895 0.880134 0.883494 0.970912min -1.831020 -1.716671 -1.213848 -1.25194625% -0.213102 -0.802275 0.457188 -0.11493350% -0.041230 -0.212499 0.704052 0.25916475% 0.324907 0.038090 0.879926 0.899322max 1.231897 0.813526 1.333295 1.494087# 转置矩阵>>> df.T 2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06A 1.231897 -0.127450 -0.241652 0.044990 0.418213 -1.831020B -0.169839 -1.716671 -0.984647 -0.255158 0.107400 0.813526C 1.333295 0.910350 0.788656 -1.213848 0.619448 0.403101D 0.367142 0.151186 -0.203639 1.076715 1.494087 -1.251946# 按照列排序>>> df.sort_values(by='B') A B C D2013-01-02 -0.127450 -1.716671 0.910350 0.1511862013-01-03 -0.241652 -0.984647 0.788656 -0.2036392013-01-04 0.044990 -0.255158 -1.213848 1.0767152013-01-01 1.231897 -0.169839 1.333295 0.3671422013-01-05 0.418213 0.107400 0.619448 1.4940872013-01-06 -1.831020 0.813526 0.403101 -1.251946
选择数据
获取
选择列, 返回的是Series
>>> df['A']2013-01-01 1.2318972013-01-02 -0.1274502013-01-03 -0.2416522013-01-04 0.0449902013-01-05 0.4182132013-01-06 -1.831020Freq: D, Name: A, dtype: float64>>> df.A2013-01-01 1.2318972013-01-02 -0.1274502013-01-03 -0.2416522013-01-04 0.0449902013-01-05 0.4182132013-01-06 -1.831020Freq: D, Name: A, dtype: float64
选择行
>>> df[0:3] A B C D2013-01-01 1.231897 -0.169839 1.333295 0.3671422013-01-02 -0.127450 -1.716671 0.910350 0.1511862013-01-03 -0.241652 -0.984647 0.788656 -0.203639>>> df["20130102":"20130104"] A B C D2013-01-02 -0.127450 -1.716671 0.910350 0.1511862013-01-03 -0.241652 -0.984647 0.788656 -0.2036392013-01-04 0.044990 -0.255158 -1.213848 1.076715
通过Label选择
# 返回的Series>>> df.loc[dates[0]]A 1.231897B -0.169839C 1.333295D 0.367142Name: 2013-01-01 00:00:00, dtype: float64# 返回的DateFrame>>> df.loc[:, ['A', 'B']] A B2013-01-01 1.231897 -0.1698392013-01-02 -0.127450 -1.7166712013-01-03 -0.241652 -0.9846472013-01-04 0.044990 -0.2551582013-01-05 0.418213 0.1074002013-01-06 -1.831020 0.813526>>> df.loc['20130102':'20130104',['A','B']] A B2013-01-02 -0.127450 -1.7166712013-01-03 -0.241652 -0.9846472013-01-04 0.044990 -0.255158# 降维返回>>> df.loc['20130102',['A','B']]A -0.127450B -1.716671Name: 2013-01-02 00:00:00, dtype: float64
通过Position选择
# 返回第4行>>> df.iloc[3]A 0.044990B -0.255158C -1.213848D 1.076715Name: 2013-01-04 00:00:00, dtype: float64>>> df.iloc[3:5,0:2] A B2013-01-04 0.044990 -0.2551582013-01-05 0.418213 0.107400>>> df.iloc[1:3, :] A B C D2013-01-02 -0.127450 -1.716671 0.910350 0.1511862013-01-03 -0.241652 -0.984647 0.788656 -0.203639# 获得指定位置的元素>>> df.iloc[1,1]-1.7166712884342545>>> df.iat[1,1]-1.7166712884342545
布尔索引
>>> df[df.A > 0] A B C D2013-01-01 1.231897 -0.169839 1.333295 0.3671422013-01-04 0.044990 -0.255158 -1.213848 1.0767152013-01-05 0.418213 0.107400 0.619448 1.494087>>> df[df > 0] A B C D2013-01-01 1.231897 NaN 1.333295 0.3671422013-01-02 NaN NaN 0.910350 0.1511862013-01-03 NaN NaN 0.788656 NaN2013-01-04 0.044990 NaN NaN 1.0767152013-01-05 0.418213 0.107400 0.619448 1.4940872013-01-06 NaN 0.813526 0.403101 NaN>>> df2=df.copy()>>> df2['E'] = ['one','one','two','three','four','three']>>> df2 A B C D E2013-01-01 1.231897 -0.169839 1.333295 0.367142 one2013-01-02 -0.127450 -1.716671 0.910350 0.151186 one2013-01-03 -0.241652 -0.984647 0.788656 -0.203639 two2013-01-04 0.044990 -0.255158 -1.213848 1.076715 three2013-01-05 0.418213 0.107400 0.619448 1.494087 four2013-01-06 -1.831020 0.813526 0.403101 -1.251946 three# 使用isin()来过滤>>> df2[df2['E'].isin(['two', 'four'])] A B C D E2013-01-03 -0.241652 -0.984647 0.788656 -0.203639 two2013-01-05 0.418213 0.107400 0.619448 1.494087 four
赋值
根据日期新增加一列
>>> s12013-01-02 12013-01-03 22013-01-04 32013-01-05 42013-01-06 52013-01-07 6Freq: D, dtype: int64>>> df['F'] = s1>>> df A B C D F2013-01-01 1.231897 -0.169839 1.333295 0.367142 NaN2013-01-02 -0.127450 -1.716671 0.910350 0.151186 1.02013-01-03 -0.241652 -0.984647 0.788656 -0.203639 2.02013-01-04 0.044990 -0.255158 -1.213848 1.076715 3.02013-01-05 0.418213 0.107400 0.619448 1.494087 4.02013-01-06 -1.831020 0.813526 0.403101 -1.251946 5.0# 通过label赋值>>> df.at[dates[0], 'A'] = 0# 通过position赋值>>> df.iat[0,1] = 0# 通过ndarray赋值>>> df.loc[:, 'D'] = np.array([5] * len(df))>>> df A B C D F2013-01-01 0.000000 0.000000 1.333295 5 NaN2013-01-02 -0.127450 -1.716671 0.910350 5 1.02013-01-03 -0.241652 -0.984647 0.788656 5 2.02013-01-04 0.044990 -0.255158 -1.213848 5 3.02013-01-05 0.418213 0.107400 0.619448 5 4.02013-01-06 -1.831020 0.813526 0.403101 5 5.0# 通过where操作>>> df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))>>> df A B C D2013-01-01 -1.231777 -0.068987 -0.105402 1.5120762013-01-02 -1.120426 -0.240417 0.223964 -0.5597932013-01-03 0.697097 0.758780 -1.191408 -0.7938822013-01-04 0.332519 0.784564 0.805932 -1.1691862013-01-05 0.010235 0.156115 0.419567 -2.2792142013-01-06 0.294819 -0.691370 0.294119 -0.208475>>> df2 = df.copy()>>> df2[df > 0] = -df2>>> df2 A B C D2013-01-01 -1.231777 -0.068987 -0.105402 -1.5120762013-01-02 -1.120426 -0.240417 -0.223964 -0.5597932013-01-03 -0.697097 -0.758780 -1.191408 -0.7938822013-01-04 -0.332519 -0.784564 -0.805932 -1.1691862013-01-05 -0.010235 -0.156115 -0.419567 -2.2792142013-01-06 -0.294819 -0.691370 -0.294119 -0.208475
数据缺失
pandas使用np.nan来表示缺失的数据,它默认不参与任何运算
>>> df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])>>> df1 A B C D F E2013-01-01 0.000000 0.000000 1.333295 5 NaN NaN2013-01-02 -0.127450 -1.716671 0.910350 5 1.0 NaN2013-01-03 -0.241652 -0.984647 0.788656 5 2.0 NaN2013-01-04 0.044990 -0.255158 -1.213848 5 3.0 NaN>>> df1.loc[dates[0]:dates[1], 'E'] = 1>>> df1 A B C D F E2013-01-01 0.000000 0.000000 1.333295 5 NaN 1.02013-01-02 -0.127450 -1.716671 0.910350 5 1.0 1.02013-01-03 -0.241652 -0.984647 0.788656 5 2.0 NaN2013-01-04 0.044990 -0.255158 -1.213848 5 3.0 NaN# 丢弃所有包含NaN的行>>> df1.dropna(how='any') A B C D F E2013-01-02 -0.12745 -1.716671 0.91035 5 1.0 1.0# 填充所有包含NaN的元素>>> df1.fillna(value=5) A B C D F E2013-01-01 0.000000 0.000000 1.333295 5 5.0 1.02013-01-02 -0.127450 -1.716671 0.910350 5 1.0 1.02013-01-03 -0.241652 -0.984647 0.788656 5 2.0 5.02013-01-04 0.044990 -0.255158 -1.213848 5 3.0 5.0# 获取元素值为nan的布尔掩码>>> pd.isna(df1) A B C D F E2013-01-01 False False False False True False2013-01-02 False False False False False False2013-01-03 False False False False False True2013-01-04 False False False False False True
运算操作
Stats统计
运算操作都会排除NaN元素
>>> dates = pd.date_range('20130101', periods=6)>>> df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=list('ABCD'))>>> df A B C D2013-01-01 0 1 2 32013-01-02 4 5 6 72013-01-03 8 9 10 112013-01-04 12 13 14 152013-01-05 16 17 18 192013-01-06 20 21 22 23# 计算列的平均值>>> df.mean()A 10.0B 11.0C 12.0D 13.0dtype: float64计算行的平均值>>> df.mean(1)2013-01-01 1.52013-01-02 5.52013-01-03 9.52013-01-04 13.52013-01-05 17.52013-01-06 21.5Freq: D, dtype: float64# shift(n),按照列的方向,从上往下移动n个位置>>> s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)>>> s2013-01-01 NaN2013-01-02 NaN2013-01-03 1.02013-01-04 3.02013-01-05 5.02013-01-06 NaNFreq: D, dtype: float64# sub函数,DataFrame相减操作, 等于 df-s >>> df.sub(s, axis='index') A B C D2013-01-01 NaN NaN NaN NaN2013-01-02 NaN NaN NaN NaN2013-01-03 7.0 8.0 9.0 10.02013-01-04 9.0 10.0 11.0 12.02013-01-05 11.0 12.0 13.0 14.02013-01-06 NaN NaN NaN NaN
Apply
>>> df A B C D2013-01-01 0 1 2 32013-01-02 4 5 6 72013-01-03 8 9 10 112013-01-04 12 13 14 152013-01-05 16 17 18 192013-01-06 20 21 22 23# 在列方向累加>>> df.apply(np.cumsum) A B C D2013-01-01 0 1 2 32013-01-02 4 6 8 102013-01-03 12 15 18 212013-01-04 24 28 32 362013-01-05 40 45 50 552013-01-06 60 66 72 78# 列方向的最大值-最小值, 得到的是一个Series>>> df.apply(lambda x: x.max() - x.min())A 20B 20C 20D 20dtype: int64
直方图 Histogramming
>>> s = pd.Series(np.random.randint(0, 7, size=10))>>> s0 61 52 03 24 55 16 37 38 39 1dtype: int64# 索引是出现的数字,值是次数>>> s.value_counts()3 35 21 26 12 10 1dtype: int64
字符串方法
>>> s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])>>> s.str.lower()0 a1 b2 c3 aaba4 baca5 NaN6 caba7 dog8 catdtype: object
合并
Concat
>>> df = pd.DataFrame(np.random.randn(10, 4))>>> df 0 1 2 30 -1.710767 -2.107488 1.441790 0.9599241 0.509422 0.099733 0.845039 0.2324622 -0.609247 0.533162 -0.387640 0.6688033 0.946219 -0.326805 1.245303 1.3360904 -1.069114 0.755313 -1.003991 -0.3270095 1.169418 -1.225637 -2.137500 1.7663416 -1.751095 0.279439 0.018053 1.8004357 -0.328828 -1.513893 1.879333 0.9452178 2.440123 -0.260918 -0.232951 -1.3377759 -0.876878 -1.153583 -1.487573 -1.509871# 分成小块>>> pieces = [df[:3], df[3:7], df[7:]]# 合并>>> pd.concat(pieces) 0 1 2 30 -1.710767 -2.107488 1.441790 0.9599241 0.509422 0.099733 0.845039 0.2324622 -0.609247 0.533162 -0.387640 0.6688033 0.946219 -0.326805 1.245303 1.3360904 -1.069114 0.755313 -1.003991 -0.3270095 1.169418 -1.225637 -2.137500 1.7663416 -1.751095 0.279439 0.018053 1.8004357 -0.328828 -1.513893 1.879333 0.9452178 2.440123 -0.260918 -0.232951 -1.3377759 -0.876878 -1.153583 -1.487573 -1.509871
Join
跟数据库的Join操作一样
>>> left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})>>> left key lval0 foo 11 foo 2>>> right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})>>> right key rval0 foo 41 foo 5>>> pd.merge(left, right, on='key') key lval rval0 foo 1 41 foo 1 52 foo 2 43 foo 2 5
另一个例子
>>> left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})>>> left key lval0 foo 11 bar 2>>> right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})>>> right key rval0 foo 41 bar 5>>> pd.merge(left, right, on='key') key lval rval0 foo 1 41 bar 2 5
Append
>>> df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])>>> df A B C D0 -1.521762 -0.850721 1.322354 -0.2265621 -2.773304 -0.663303 0.895075 -0.1715242 0.322975 -0.796484 0.379920 0.0283333 -0.350795 1.839747 -0.359241 -0.0279214 -0.945340 1.062598 -2.208670 0.7690275 -0.329458 -0.145658 1.580258 -1.4148206 -0.261757 -1.435025 -0.512306 -0.2222877 -0.994207 -1.219057 0.781283 -1.795741>>> s = df.iloc[3]>>> df.append(s, ignore_index=True) A B C D0 -1.521762 -0.850721 1.322354 -0.2265621 -2.773304 -0.663303 0.895075 -0.1715242 0.322975 -0.796484 0.379920 0.0283333 -0.350795 1.839747 -0.359241 -0.0279214 -0.945340 1.062598 -2.208670 0.7690275 -0.329458 -0.145658 1.580258 -1.4148206 -0.261757 -1.435025 -0.512306 -0.2222877 -0.994207 -1.219057 0.781283 -1.7957418 -0.350795 1.839747 -0.359241 -0.027921
Grouping
group by的操作需要经过以下1个或多个步骤
- 根据条件分组数据(Spliting)
- 在各个分组上执行函数(Applying)
- 合并结果(Combining) >>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', ... 'foo', 'bar', 'foo', 'foo'], ... 'B' : ['one', 'one', 'two', 'three', ... 'two', 'two', 'one', 'three'], ... 'C' : np.arange(1, 9), ... 'D' : np.arange(2, 10)}) ... ... >>> df A B C D 0 foo one 1 2 1 bar one 2 3 2 foo two 3 4 3 bar three 4 5 4 foo two 5 6 5 bar two 6 7 6 foo one 7 8 7 foo three 8 9 # 分组求和 >>> df.groupby('A').sum() C D A bar 12 15 foo 24 29 # 多列分组 >>> df.groupby(['A','B']).sum() C D A B bar one 2 3 three 4 5 two 6 7 foo one 8 10 three 8 9 two 8 10 >>> b = df.groupby(['A','B']).sum() # 多索引 >>> b.index MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']], labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]], names=['A', 'B']) >>> b.columns Index(['C', 'D'], dtype='object')
Reshaping
Stack
>>> tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',... 'foo', 'foo', 'qux', 'qux'],... ['one', 'two', 'one', 'two',... 'one', 'two', 'one', 'two']]))...>>> tuples[('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]>>> index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])>>> df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])>>> df A Bfirst secondbar one 0.096893 0.479194 two -0.771606 0.331693baz one -0.022540 0.531284 two -0.039843 1.876942foo one 0.250473 1.163931 two -1.127163 1.447566qux one -0.410361 -0.734333 two -0.461247 0.018531 >>> df2 = df[:4]>>> df2 A Bfirst secondbar one 0.096893 0.479194 two -0.771606 0.331693baz one -0.022540 0.531284 two -0.039843 1.876942>>> stacked = df2.stack()>>> stackedfirst secondbar one A 0.096893 B 0.479194 two A -0.771606 B 0.331693baz one A -0.022540 B 0.531284 two A -0.039843 B 1.876942dtype: float64>>> type(stacked)pandas.core.series.Series>>> stacked.indexMultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two'], ['A', 'B']], labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]], names=['first', 'second', None])>>> stacked.valuesarray([ 0.09689327, 0.47919417, -0.77160574, 0.3316934 , -0.02253955, 0.53128436, -0.03984337, 1.8769416 ]) >>> stacked.unstack() A Bfirst secondbar one 0.096893 0.479194 two -0.771606 0.331693baz one -0.022540 0.531284 two -0.039843 1.876942 >>> stacked.unstack(1)second one twofirstbar A 0.096893 -0.771606 B 0.479194 0.331693baz A -0.022540 -0.039843 B 0.531284 1.876942>>> stacked.unstack(0)first bar bazsecondone A 0.096893 -0.022540 B 0.479194 0.531284two A -0.771606 -0.039843 B 0.331693 1.876942
数据透视表(Pivot Tables)
时间序列
pandas在时间序列上,提供了很方便的按照频率重新采样的功能,在财务分析上非常有用
# 把每秒的数据按5分钟聚合>>> rng = pd.date_range('1/1/2012', periods=100, freq='S')>>> ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)>>> ts.resample('5Min').sum()2012-01-01 22073Freq: 5T, dtype: int64
加上时区信息
>>> rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')>>> ts = pd.Series(np.random.randn(len(rng)), rng)>>> ts2012-03-06 -0.3869742012-03-07 0.6577852012-03-08 1.3902342012-03-09 0.4129042012-03-10 -1.189340Freq: D, dtype: float64>>> ts_utc = ts.tz_localize('UTC')>>> ts_utc2012-03-06 00:00:00+00:00 -0.3869742012-03-07 00:00:00+00:00 0.6577852012-03-08 00:00:00+00:00 1.3902342012-03-09 00:00:00+00:00 0.4129042012-03-10 00:00:00+00:00 -1.189340Freq: D, dtype: float64
转换成另一个时区
>>> ts_utc.tz_convert('Asia/Shanghai')2012-03-06 08:00:00+08:00 -0.3869742012-03-07 08:00:00+08:00 0.6577852012-03-08 08:00:00+08:00 1.3902342012-03-09 08:00:00+08:00 0.4129042012-03-10 08:00:00+08:00 -1.189340Freq: D, dtype: float64
时间跨度转换
>>> rng = pd.date_range('1/1/2012', periods=5, freq='M')>>> ts = pd.Series(np.random.randn(len(rng)), index=rng)>>> ts2012-01-31 0.8251742012-02-29 -2.1902582012-03-31 -0.0731712012-04-30 -0.4042082012-05-31 0.245025Freq: M, dtype: float64>>> ps = ts.to_period()>>> ps2012-01 0.8251742012-02 -2.1902582012-03 -0.0731712012-04 -0.4042082012-05 0.245025Freq: M, dtype: float64>>> ps.to_timestamp()2012-01-01 0.8251742012-02-01 -2.1902582012-03-01 -0.0731712012-04-01 -0.4042082012-05-01 0.245025Freq: MS, dtype: float64
转换季度时间
>>> prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')>>> ts = pd.Series(np.random.randn(len(prng)), prng)>>> ts.head()1990Q1 -0.5900401990Q2 -0.7503921990Q3 -0.3855171990Q4 -0.3808061991Q1 -1.252727Freq: Q-NOV, dtype: float64>>> ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9>>> ts.head()1990-03-01 09:00 -0.5900401990-06-01 09:00 -0.7503921990-09-01 09:00 -0.3855171990-12-01 09:00 -0.3808061991-03-01 09:00 -1.252727Freq: H, dtype: float64
Categoricals分类
>>> df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})>>> df id raw_grade0 1 a1 2 b2 3 b3 4 a4 5 a5 6 e
转换原始类别为分类数据类型
>>> df["grade"] = df["raw_grade"].astype("category")>>> df id raw_grade grade0 1 a a1 2 b b2 3 b b3 4 a a4 5 a a5 6 e e>>> df["grade"]0 a1 b2 b3 a4 a5 eName: grade, dtype: categoryCategories (3, object): [a, b, e]
重命名分类为更有意义的名称
>>> df["grade"].cat.categories = ["very good", "good", "very bad"]>>> df id raw_grade grade0 1 a very good1 2 b good2 3 b good3 4 a very good4 5 a very good5 6 e very bad
重新安排顺分类,同时添加缺少的分类(序列 .cat方法下返回新默认序列)
>>> df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])>>> df id raw_grade grade0 1 a very good1 2 b good2 3 b good3 4 a very good4 5 a very good5 6 e very bad>>> df["grade"]0 very good1 good2 good3 very good4 very good5 very badName: grade, dtype: categoryCategories (5, object): [very bad, bad, medium, good, very good]
按照分类排序
>>> df.sort_values(by="grade") id raw_grade grade5 6 e very bad1 2 b good2 3 b good0 1 a very good3 4 a very good4 5 a very good
按照分类分组,同时也会显示空的分类
>>> df.groupby("grade").size()gradevery bad 1bad 0medium 0good 2very good 3dtype: int64
Plotting
>>> import matplotlib.pyplot as plt>>> ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))>>> ts = ts.cumsum()>>> ts.plot()>>> plt.show()
画图带图例的图
>>> df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A','B'... ,'C', 'D'])>>> df.cumsum()>>> plt.figure();df.plot();plt.legend(loc='best')>>> plt.show()
数据In/Out
CSV
保存到csv文件
>>> df.to_csv('foo.csv')
从csv文件读取数据
>>> pd.read_csv('foo.csv')
HDF5
保存到HDF5仓库
>>> df.to_hdf('foo.h5','df')
从仓库读取
>>> pd.read_hdf('foo.h5','df')
Excel
保存到excel
>>> df.to_excel('foo.xlsx', sheet_name='Sheet1')
从excel文件读取
>>> pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
扩展阅读
https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html