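Once you are comfortable with NumPy, the next step is to learn pandas. pandas is built on top of NumPy and is both powerful and easy to use. The main learning resource is the documentation on the official pandas website, and this article is my set of study notes.
pandas data structures
pandas provides Series (one-dimensional), DataFrame (two-dimensional), Panel (three-dimensional), Panel4D (four-dimensional), and PanelND (N-dimensional) data structures. Panel and its higher-dimensional variants were deprecated and later removed from pandas, so Series and DataFrame are the ones used in practice, and they are the focus of these notes.
Series
A Series is a one-dimensional labeled array that can hold any data type: integers, strings, floats, Python objects, and so on. Its elements can be looked up by label.
Creation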
s = pd.Series(data, index=index)
data can be:
- a Python dict
- a NumPy ndarray
- a scalar value
From an ndarray
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: s = pd.Series(np.random.randn(5), index = list('ABCDE'))
In [4]: s
Out[4]:
A -1.130657
B -1.539251
C 1.503126
D 1.266908
E 0.335561
dtype: float64
From a dict
In [19]: d = {'a': 1, 'b': 2, 'c': 3}
In [20]: pd.Series(d)
Out[20]:
a 1
b 2
c 3
dtype: int64
In [21]: pd.Series(d, index=['b', 'c', 'd', 'a'])
Out[21]:
b 2
c 3
d NaN
a 1
dtype: float64
From a scalar
In [22]: pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[22]:
a 5
b 5
c 5
d 5
e 5
dtype: float64
Series operations
A Series behaves like an ndarray
In [24]: s[0]
Out[24]: -0.06036422206791571
In [25]: s[:3]
Out[25]:
a -0.060364
b 0.315560
c -0.520548
dtype: float64
In [26]: s[s > s.median()]
Out[26]:
a -0.060364
b 0.315560
dtype: float64
In [27]: s[[4, 2, 1]]
Out[27]:
e -1.900474
c -0.520548
b 0.315560
dtype: float64
In [28]: np.exp(s)
Out[28]:
a 0.941422
b 1.371027
c 0.594195
d 0.912396
e 0.149498
dtype: float64
A Series behaves like a dictionary
In [30]: s['a']
Out[30]: -0.06036422206791571
In [31]: 'e' in s
Out[31]: True
In [32]: s.get('f')
In [33]: s.get('f', np.nan)
Out[33]: nan
In [35]: s['f'] = 3.
In [36]: s
Out[36]:
a -0.060364
b 0.315560
c -0.520548
d -0.091681
e -1.900474
f 3.000000
dtype: float64
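If a label does not exist, get returns None, so nothing is printed. We can also pass a default value such as nan to be returned for the missing label.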
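The dict-like behaviour can be summarised in a small sketch. The Series values and the label 'z' are made up for illustration and are not taken from the session above.

```python
import numpy as np
import pandas as pd

# Illustrative data; the values and labels are arbitrary.
s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# Bracket access with an unknown label raises KeyError, like a dict.
try:
    s["z"]
    raised = False
except KeyError:
    raised = True

# get() returns None (or the supplied default) instead of raising.
missing = s.get("z")
fallback = s.get("z", np.nan)

# Assigning to a new label adds an entry, again like a dict.
s["z"] = 4.0
```

Using get is convenient when a missing label is an expected case, while bracket access is better when a missing label should be treated as an error.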
Arithmetic operations
A Series supports +, -, *, /, exp, and other NumPy operations.
In [6]: s + s
Out[6]:
a 0.648688
b -2.729308
c -0.919524
d 0.876880
e 5.863378
f 6.000000
dtype: float64
In [7]: s * 2
Out[7]:
a 0.648688
b -2.729308
c -0.919524
d 0.876880
e 5.863378
f 6.000000
dtype: float64
In [8]: np.exp(s)
Out[8]:
a 1.383123
b 0.255469
c 0.631434
d 1.550287
e 18.759286
f 20.085537
dtype: float64
When two Series with different indexes are combined, labels that appear in only one of them get nan:
In [9]: s[1:] + s[:-1]
Out[9]:
a NaN
b -2.729308
c -0.919524
d 0.876880
e 5.863378
f NaN
dtype: float64
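If the NaN values produced by label alignment are not wanted, there are at least two common ways to deal with them. The following is a minimal sketch with made-up values; the variable names are only illustrative.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=list("abcd"))

left = s.iloc[1:]    # labels b, c, d
right = s.iloc[:-1]  # labels a, b, c

# Plain + aligns on labels; labels present on only one side give NaN.
total = left + right

# Option 1: drop the unmatched labels afterwards.
matched_only = total.dropna()

# Option 2: treat a missing operand as 0 while adding.
filled = left.add(right, fill_value=0)
```

Which option fits depends on whether a missing label should mean "no data" or "zero".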
We can also give a Series a name:
In [10]: s = pd.Series(np.random.randn(5), name='something')
In [11]: s
Out[11]:
0 1.447208
1 -0.546760
2 0.858622
3 0.648803
4 -0.667612
Name: something, dtype: float64
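DataFrame
A DataFrame is a two-dimensional labeled data structure. Data can be located by row and column label, which plain NumPy arrays do not offer.
Creation
A DataFrame can be built from several kinds of input:
- a dict of one-dimensional ndarrays, lists, dicts, or Series
- a two-dimensional ndarray
- a Series
- external data such as CSV or Excel files
- another DataFrame
- and so on
I only show how to build one from a dict of Series. The other methods are much the same and are covered in the documentation.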
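For reference, a few of the other input types can be sketched as follows. The column names, labels, and values are illustrative assumptions, not data from these notes.

```python
import numpy as np
import pandas as pd

# From a dict of equal-length lists: keys become column labels.
df_lists = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]}, index=list("abc"))

# From a 2-D ndarray: pass row and column labels explicitly.
df_array = pd.DataFrame(
    np.arange(6).reshape(3, 2), index=list("abc"), columns=["one", "two"]
)

# From a list of dicts: each dict is one row; missing keys become NaN.
df_records = pd.DataFrame([{"one": 1, "two": 4}, {"one": 2}])
```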
From a dict of Series
In [17]: d = {'one': pd.Series([1, 2, 3], index=list('abc')), 'two': pd.Series([1, 2, 3, 4], index=list('abcd'))}
In [18]: df = pd.DataFrame(d)
In [19]: df
Out[19]:
one two
a 1 1
b 2 2
c 3 3
d NaN 4
In [20]: df.index
Out[20]: Index(['a', 'b', 'c', 'd'], dtype='object')
In [21]: df.columns
Out[21]: Index(['one', 'two'], dtype='object')
In [28]: df.index = ['A', 'B', 'C', 'D'] # the index can be reassigned
In [29]: df
Out[29]:
one two
A 1 1
B 2 2
C 3 3
D NaN 4
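Selection and arithmetic
A DataFrame can be manipulated much like a Series. Getting, setting, and deleting columns works like the corresponding dict operations.
In [22]: df['one'] # column selection by column label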
Out[22]:
a 1
b 2
c 3
d NaN
Name: one, dtype: float64
In [23]: df['three'] = df['one'] + df['two'] # create a new column, as with a dict
In [24]: df['flag'] = df['one'] > 2
In [25]: df
Out[25]:
one two three flag
a 1 1 2 False
b 2 2 4 False
c 3 3 6 True
d NaN 4 NaN False
In [26]: del df['two']
In [27]: three = df.pop('three') # pop the 'three' column out of df into the variable three
In [28]: df
Out[28]:
one flag
a 1 False
b 2 False
c 3 True
d NaN False
In [29]: three
Out[29]:
a 2
b 4
c 6
d NaN
Name: three, dtype: float64
In [30]: df['foo'] = 'bar'
In [31]: df
Out[31]:
one flag foo
a 1 False bar
b 2 False bar
c 3 True bar
d NaN False bar
In [32]: df['one_trunc'] = df['one'][:2] # labels not covered by the slice are filled with nan
In [33]: df
Out[33]:
one flag foo one_trunc
a 1 False bar 1
b 2 False bar 2
c 3 True bar NaN
d NaN False bar NaN
In [34]: df.insert(1, 'bar', df['one']) # insert() lets you choose the position: add column 'bar' at column position 1
In [35]: df
Out[35]:
one bar flag foo one_trunc
a 1 1 False bar 1
b 2 2 False bar 2
c 3 3 True bar NaN
d NaN NaN False bar NaN
In [36]: df.assign(ration = df['one'] / df['bar']) # assign returns a new DataFrame with the extra column; df itself is not modified
Out[36]:
one bar flag foo one_trunc ration
a 1 1 False bar 1 1
b 2 2 False bar 2 1
c 3 3 True bar NaN 1
d NaN NaN False bar NaN NaN
In [37]: df
Out[37]:
one bar flag foo one_trunc
a 1 1 False bar 1
b 2 2 False bar 2
c 3 3 True bar NaN
d NaN NaN False bar NaN
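The fact that assign leaves the original DataFrame untouched can be checked with a small sketch. The data and the column name ratio here are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "bar": [1.0, 2.0, 4.0]})

# assign() returns a new DataFrame; df itself is left untouched.
with_ratio = df.assign(ratio=df["one"] / df["bar"])

# A callable is evaluated against the DataFrame being assigned to,
# which is handy inside method chains.
df_kept = df.assign(ratio=lambda d: d["one"] / d["bar"])
```

To keep the new column, bind the result to a name (possibly the same name, as in df = df.assign(...)) or set the column directly with df['ratio'] = ....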
In [38]: df.loc['b'] # loc selects a row by its label
Out[38]:
one 2
bar 2
flag False
foo bar
one_trunc 2
Name: b, dtype: object
In [39]: df.iloc[2] # iloc selects by integer position: iloc[row positions, column positions]
Out[39]:
one 3
bar 3
flag True
foo bar
one_trunc NaN
Name: c, dtype: object
In [40]: df.iloc[2, :-1] # row at position 2 (the third row), all columns except the last
Out[40]:
one 3
bar 3
flag True
foo bar
Name: c, dtype: object
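The difference between label-based loc and position-based iloc is easy to mix up, so here is a minimal side-by-side sketch. The small DataFrame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 2, 3], "flag": [False, False, True], "foo": ["x", "y", "z"]},
    index=list("abc"),
)

row_by_label = df.loc["b"]        # label-based: the row labelled 'b'
row_by_position = df.iloc[1]      # position-based: the second row
all_but_last = df.iloc[2, :-1]    # third row, every column except the last
cell = df.loc["c", "one"]         # single value by row and column label

# Slices differ: loc includes the end label, iloc excludes the end position.
loc_slice = df.loc["a":"b"]
iloc_slice = df.iloc[0:1]
```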
In [44]: df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
In [46]: df2 = pd.DataFrame(np.random.randn(7, 3), columns=list('ABC'))
In [47]: df + df2 # adding two DataFrames: positions where the labels do not match get nan
Out[47]:
A B C D
0 1.239121 2.705995 1.365740 NaN
1 1.507655 -1.092202 0.083471 NaN
2 -0.485961 -0.131136 -1.677334 NaN
3 -0.858146 0.319006 -1.995003 NaN
4 -1.487327 2.030991 -0.565237 NaN
5 0.239241 -0.713864 -1.635968 NaN
6 -1.656484 -0.420657 0.125534 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
In [48]: df - df.iloc[0] # subtract one row from every row
Out[48]:
A B C D
0 0.000000 0.000000 0.000000 0.000000
1 -1.969524 -1.957223 -0.781471 0.278686
2 -2.215996 -0.172781 -1.736314 -1.050741
3 -2.264761 -1.402786 -2.713273 -0.247084
4 -1.157636 -1.445320 -1.985973 1.485799
5 -1.689059 -1.160161 -1.453136 0.588097
6 -3.359694 -1.415710 -0.493772 -1.002543
7 -0.889769 0.220577 -0.023013 0.024337
8 -2.223337 -0.068570 -1.117682 -0.875048
9 -0.678439 -1.591324 0.107048 -0.880545
In [49]: df - df['A'] # subtracting a column does not behave like subtracting a row
Out[49]:
A B C D 0 1 2 3 4 5 6 7 8 9
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
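In [52]: index = pd.date_range('2000-01-01', periods=8) # assumed definition, inferred from the dates shown in the output below
In [53]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))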
In [54]: df
Out[54]:
A B C
2000-01-01 0.581713 0.229262 -0.174359
2000-01-02 1.355298 -0.901488 1.082112
2000-01-03 -0.963151 0.285010 -1.275164
2000-01-04 -0.104592 0.744454 -0.722504
2000-01-05 -0.794036 0.268566 1.721357
2000-01-06 -1.415143 -0.863292 -0.674650
2000-01-07 0.505573 0.451317 -0.390972
2000-01-08 -1.341107 0.549922 0.120314
In [59]: df.sub(df['A'], axis = 0) # use sub with axis=0 for a true column-wise subtraction
Out[59]:
A B C
2000-01-01 0 -0.352452 -0.756072
2000-01-02 0 -2.256785 -0.273186
2000-01-03 0 1.248161 -0.312013
2000-01-04 0 0.849046 -0.617911
2000-01-05 0 1.062602 2.515393
2000-01-06 0 0.551851 0.740493
2000-01-07 0 -0.054256 -0.896545
2000-01-08 0 1.891028 1.461421
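Why the row and column cases differ comes down to which labels the Series is matched against. The sketch below uses a tiny made-up DataFrame to show the three cases next to each other.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [10.0, 20.0]}, index=["r0", "r1"])

# Default: the Series index is matched against the column labels.
# df.iloc[0] is indexed by 'A' and 'B', so it is subtracted from every row.
minus_first_row = df - df.iloc[0]

# df['A'] is indexed by row labels, so match on the row index with axis=0.
minus_col_a = df.sub(df["A"], axis=0)

# Without axis=0 the labels r0/r1 do not match A/B, so every cell is NaN.
mismatched = df - df["A"]
```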
In [60]: df
Out[60]:
A B C
2000-01-01 0.581713 0.229262 -0.174359
2000-01-02 1.355298 -0.901488 1.082112
2000-01-03 -0.963151 0.285010 -1.275164
2000-01-04 -0.104592 0.744454 -0.722504
2000-01-05 -0.794036 0.268566 1.721357
2000-01-06 -1.415143 -0.863292 -0.674650
2000-01-07 0.505573 0.451317 -0.390972
2000-01-08 -1.341107 0.549922 0.120314
In [70]: df * 5 + 2 # arithmetic with scalars
Out[70]:
A B C
2000-01-01 4.908567 3.146308 1.128205
2000-01-02 8.776488 -2.507439 7.410558
2000-01-03 -2.815756 3.425049 -4.375820
2000-01-04 1.477038 5.722268 -1.612518
2000-01-05 -1.970178 3.342832 10.606786
2000-01-06 -5.075717 -2.316460 -1.373250
2000-01-07 4.527865 4.256584 0.045139
2000-01-08 -4.705535 4.749608 2.601572
In [71]: 1 / df
Out[71]:
A B C
2000-01-01 1.719060 4.361830 -5.735293
2000-01-02 0.737845 -1.109277 0.924119
2000-01-03 -1.038259 3.508651 -0.784213
2000-01-04 -9.560923 1.343267 -1.384076
2000-01-05 -1.259389 3.723473 0.580937
2000-01-06 -0.706642 -1.158357 -1.482250
2000-01-07 1.977954 2.215738 -2.557726
2000-01-08 -0.745653 1.818441 8.311556
In [74]: df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype = bool)
In [75]: df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype = bool)
In [82]: df1
Out[82]:
a b
0 True False
1 False True
2 True True
In [83]: df2
Out[83]:
a b
0 False True
1 True True
2 True False
### Boolean operations are shown below
In [84]: df1 & df2
Out[84]:
a b
0 False False
1 False True
2 True False
In [85]: df1 | df2
Out[85]:
a b
0 True True
1 True True
2 True True
In [86]: df1 ^ df2
Out[86]:
a b
0 True True
1 True False
2 False True
In [87]: -df1
Out[87]:
a b
0 False True
1 True False
2 False False
In [112]: df = pd.DataFrame({'foot1': np.random.randn(5), 'foot2': np.random.randn(5)})
In [113]: df
Out[113]:
foot1 foot2
0 0.953419 -0.901983
1 -0.155681 -0.143213
2 -0.164418 1.519970
3 0.699752 -0.398224
4 -0.550058 2.115899
In [114]: df.T # transpose
Out[114]:
0 1 2 3 4
foot1 0.953419 -0.155681 -0.164418 0.699752 -0.550058
foot2 -0.901983 -0.143213 1.519970 -0.398224 2.115899
In [116]: df.T.dot(df) # matrix multiplication
Out[116]:
foot1 foot2
foot1 1.752495 -2.530108
foot2 -2.530108 7.780001
In [120]: np.exp(df) # NumPy functions work too
Out[120]:
foot1 foot2
0 2.594566 0.405764
1 0.855832 0.866569
2 0.848387 4.572087
3 2.013253 0.671512
4 0.576916 8.297037
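As a quick sanity check on the matrix product, the labeled result of dot can be compared with the plain NumPy computation. The values below are made up so the result is easy to verify by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"foot1": [1.0, 2.0], "foot2": [3.0, 4.0]})

# dot() aligns the columns of df.T with the index of df, then multiplies.
gram = df.T.dot(df)

# The same product on the underlying arrays, without labels.
expected = df.to_numpy().T @ df.to_numpy()
```

The pandas result carries the column names as both row and column labels, while the NumPy result is a bare 2x2 array.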
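Summary
That was a brief introduction to the two main pandas data structures, Series and DataFrame, covering how to create them and the common arithmetic, selection, insertion, deletion, and modification operations.
The data structures are easy to understand. Compared with a plain array, a Series simply adds labels, so it can also be thought of as a dict: the labels correspond to the keys and the data to the values. Likewise, a DataFrame adds row and column labels to an ordinary matrix. Each column is a Series and needs a column label of its own, so a DataFrame can be seen as a "dict of dicts" whose outer keys are the column labels.
By extension, the n-dimensional Panel resembles n nested dicts, with the labels of the outermost dimension acting as the keys of the outermost dict.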
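The "dict of dicts" view of a DataFrame can be checked directly. This is a minimal sketch with made-up keys and values.

```python
import pandas as pd

# Outer keys are column labels, inner keys are row labels.
nested = {"one": {"a": 1, "b": 2}, "two": {"a": 3, "b": 4}}
df = pd.DataFrame(nested)

column = df["one"]          # a Series, like nested["one"]
value = df["two"]["b"]      # chained lookup mirrors nested["two"]["b"]
round_trip = df.to_dict()   # back to a dict of dicts
```

The chained lookup is used here only to mirror the dict analogy; for real code df.loc['b', 'two'] is the more usual way to read a single cell.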