Key points
- index labels the rows; columns labels the columns
index/selection
Operation | Syntax | Result |
---|---|---|
Select column | df[col] | Series |
Select row by label | df.loc[label] | Series |
Select row by integer location | df.iloc[loc] | Series |
Slice rows | df[5:10] | DataFrame |
Select rows by boolean vector | df[bool_vec] | DataFrame |
info(), describe(), pd.set_option()
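A DataFrame is a labeled two-dimensional data structure whose columns can hold different types. You can think of it as a spreadsheet, a SQL table, or a dict of Series. It is the most commonly used pandas data structure. Like Series, DataFrame accepts many kinds of input: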
* Dict of 1D ndarray, list, dict, or Series
* 2-D numpy.ndarray
* Structured or record ndarray
* Series
* DataFrame
You can optionally pass the index (row labels) and columns (column labels) arguments.
From dict of Series or dict
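The resulting index is the union of the indexes of all the Series. Nested dicts are first converted to Series. If columns is not passed, the columns follow the order of the dict keys.
Translator's note: when a DataFrame is created without index/columns, the union of the input labels along the corresponding axis is used; otherwise only the index/columns given in the arguments are generated.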
In [34]: d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
....: 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
....:
In [35]: df = pd.DataFrame(d)
In [36]: df
Out[36]:
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
In [37]: pd.DataFrame(d, index=['d', 'b', 'a'])
Out[37]:
one two
d NaN 4.0
b 2.0 2.0
a 1.0 1.0
In [38]: pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[38]:
two three
d 4.0 NaN
b 2.0 NaN
a 1.0 NaN
Access the row and column labels through the index and columns attributes
In [39]: df.index
Out[39]: Index(['a', 'b', 'c', 'd'], dtype='object')
In [40]: df.columns
Out[40]: Index(['one', 'two'], dtype='object')
From dict of ndarray/lists
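The ndarrays must all have the same length. If an index is passed, it must have that length too. If no index is passed, the result gets an integer index as long as the arrays.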
In [41]: d = {'one' : [1., 2., 3., 4.],
....: 'two' : [4., 3., 2., 1.]}
....:
In [42]: pd.DataFrame(d)
Out[42]:
one two
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0
In [43]: pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
Out[43]:
one two
a 1.0 4.0
b 2.0 3.0
c 3.0 2.0
d 4.0 1.0
From structured or record array
This case is handled the same way as a dict of arrays.
In [44]: data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'S10')])
In [45]: data[:] = [(1,2.,'Hello'), (2,3.,"World")]
In [46]: pd.DataFrame(data)
Out[46]:
A B C
0 1 2.0 b'Hello'
1 2 3.0 b'World'
In [47]: pd.DataFrame(data, index=['first', 'second'])
Out[47]:
A B C
first 1 2.0 b'Hello'
second 2 3.0 b'World'
In [48]: pd.DataFrame(data, columns=['C', 'A', 'B'])
Out[48]:
C A B
0 b'Hello' 1 2.0
1 b'World' 2 3.0
Note: a DataFrame is not intended to behave exactly like a 2-D NumPy ndarray.
From a list of dicts
Translator's note: each dict becomes one row, and the dict keys become the columns.
In [49]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
In [50]: pd.DataFrame(data2)
Out[50]:
a b c
0 1 2 NaN
1 5 10 20.0
In [51]: pd.DataFrame(data2, index=['first', 'second'])
Out[51]:
a b c
first 1 2 NaN
second 5 10 20.0
In [52]: pd.DataFrame(data2, columns=['a', 'b'])
Out[52]:
a b
0 1 2
1 5 10
From a dict of tuples
Passing a dict whose keys are tuples creates a MultiIndexed frame
In [53]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
....: ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
....: ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
....: ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
....: ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
....:
Out[53]:
a b
b a c a b
A B 1.0 4.0 5.0 8.0 10.0
C 2.0 3.0 6.0 7.0 NaN
D NaN NaN NaN NaN 9.0
From a Series
The result is a DataFrame with the same index as the input Series and a single column named after the Series (unless another column name is given).
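The section above has no example, so here is a minimal sketch of building a DataFrame from a Series. The Series name `score` and the replacement label `value` are made up for illustration.

```python
import pandas as pd

# A named Series: its name becomes the column label.
s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"], name="score")

df_named = pd.DataFrame(s)             # single column labeled "score"
df_renamed = s.to_frame(name="value")  # same data, column label overridden

print(df_named)
print(df_renamed)
```

Both results keep the row labels a, b, c of the Series.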
Missing Data
Use np.nan to represent missing data when constructing a DataFrame. Alternatively, pass a numpy.ma.MaskedArray as the data; its masked entries are treated as missing.
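A short sketch of both ways to mark missing data. The column names and values are invented for illustration.

```python
import numpy as np
import numpy.ma as ma
import pandas as pd

# Missing entries written directly as np.nan.
df_nan = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

# A masked array: True in the mask marks an entry as missing.
masked = ma.masked_array(
    np.array([[1.0, 2.0], [3.0, 4.0]]),
    mask=[[False, True], [False, False]],
)
df_masked = pd.DataFrame(masked, columns=["p", "q"])

print(df_nan)
print(df_masked)
```

In both frames the missing positions show up as NaN and are reported by isna().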
Alternate Constructors
DataFrame.from_dict
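Takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It works like the DataFrame constructor except for the orient parameter, which controls whether the dict keys are used as column labels or as row labels.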
In [54]: pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))
Out[54]:
A B
0 1 4
1 2 5
2 3 6
orient defaults to 'columns'. With orient='index', the dict keys become the row labels.
In [55]: pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]),
....: orient='index', columns=['one', 'two', 'three'])
....:
Out[55]:
one two three
A 1 2 3
B 4 5 6
DataFrame.from_records
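Takes a list of tuples or an ndarray with a structured dtype and returns a DataFrame. It works like the normal DataFrame constructor, except that a field of the structured dtype can be used as the index.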
In [56]: data
Out[56]:
array([(1, 2., b'Hello'), (2, 3., b'World')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
In [57]: pd.DataFrame.from_records(data, index='C')
Out[57]:
A B
C
b'Hello' 1 2.0
b'World' 2 3.0
Column selection,addition,deletion
You can treat a DataFrame as a dict of Series: getting, setting, and deleting columns works like the corresponding dict operations.
In [58]: df['one']
Out[58]:
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
In [59]: df['three'] = df['one'] * df['two']
In [60]: df['flag'] = df['one'] > 2
In [61]: df
Out[61]:
one two three flag
a 1.0 1.0 1.0 False
b 2.0 2.0 4.0 False
c 3.0 3.0 9.0 True
d NaN 4.0 NaN False
Columns can be deleted or popped as with a dict
In [62]: del df['two']
In [63]: three = df.pop('three')
In [64]: df
Out[64]:
one flag
a 1.0 False
b 2.0 False
c 3.0 True
d NaN False
When a scalar is inserted, it is broadcast to fill the whole column
In [65]: df['foo'] = 'bar'
In [66]: df
Out[66]:
one flag foo
a 1.0 False bar
b 2.0 False bar
c 3.0 True bar
d NaN False bar
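When the inserted Series does not share the DataFrame's index, it is conformed to the DataFrame's index, and labels missing from the Series are filled with NaN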
In [67]: df['one_trunc'] = df['one'][:2]
In [68]: df
Out[68]:
one flag foo one_trunc
a 1.0 False bar 1.0
b 2.0 False bar 2.0
c 3.0 True bar NaN
d NaN False bar NaN
You can insert a raw ndarray, but its length must match the number of rows in the DataFrame.
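A minimal sketch of the length requirement for ndarray columns. The column names `squares` and `too_short` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0]}, index=["a", "b", "c", "d"])

# An ndarray carries no labels, so it is assigned by position and must
# have exactly len(df) elements.
df["squares"] = np.arange(4) ** 2

# A length mismatch raises ValueError instead of padding with NaN.
try:
    df["too_short"] = np.arange(3)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(df)
print(mismatch_rejected)
```

This differs from inserting a Series, which is aligned by label and padded with NaN where labels are missing.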
By default, new columns are appended at the end. The insert() function inserts a column at a given position
In [69]: df.insert(1, 'bar', df['one'])
In [70]: df
Out[70]:
one bar flag foo one_trunc
a 1.0 1.0 False bar 1.0
b 2.0 2.0 False bar 2.0
c 3.0 3.0 True bar NaN
d NaN NaN False bar NaN
Assigning New Columns in Method Chains
The assign() method creates new columns derived from existing ones
In [71]: iris = pd.read_csv('data/iris.data')
In [72]: iris.head()
Out[72]:
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
In [73]: (iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength'])
....: .head())
....:
Out[73]:
SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio
0 5.1 3.5 1.4 0.2 Iris-setosa 0.6863
1 4.9 3.0 1.4 0.2 Iris-setosa 0.6122
2 4.7 3.2 1.3 0.2 Iris-setosa 0.6809
3 4.6 3.1 1.5 0.2 Iris-setosa 0.6739
4 5.0 3.6 1.4 0.2 Iris-setosa 0.7200
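The example above inserts a precomputed value. You can also pass a function of one argument, which is evaluated on the DataFrame being assigned to.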
In [74]: iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] /
....: x['SepalLength'])).head()
....:
Out[74]:
SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio
0 5.1 3.5 1.4 0.2 Iris-setosa 0.6863
1 4.9 3.0 1.4 0.2 Iris-setosa 0.6122
2 4.7 3.2 1.3 0.2 Iris-setosa 0.6809
3 4.6 3.1 1.5 0.2 Iris-setosa 0.6739
4 5.0 3.6 1.4 0.2 Iris-setosa 0.7200
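assign() returns a copy of the DataFrame with the new columns added and leaves the original DataFrame unchanged. Because it returns a copy, the result can be passed straight to the next step of a method chain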
In [75]: (iris.query('SepalLength > 5')
....: .assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength,
....: PetalRatio = lambda x: x.PetalWidth / x.PetalLength)
....: .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
....:
Out[75]: <matplotlib.axes._subplots.AxesSubplot at 0x1c29ff5b70>
The plot call uses the two new columns directly, without needing an explicit reference to the intermediate DataFrame
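The copy semantics of assign() can be checked without the iris file. The sketch below uses a small made-up frame named `flowers` as a stand-in, with a callable so that the ratio is computed on the filtered rows rather than on the original frame.

```python
import pandas as pd

# Small hypothetical stand-in for the iris data used above.
flowers = pd.DataFrame(
    {"SepalLength": [5.1, 4.9, 6.3], "SepalWidth": [3.5, 3.0, 3.3]}
)

# The lambda receives the frame produced by query(), not `flowers` itself.
result = (
    flowers.query("SepalLength > 5")
    .assign(SepalRatio=lambda x: x.SepalWidth / x.SepalLength)
)

print(result)
print(flowers.columns.tolist())  # the original frame has no SepalRatio column
```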
Indexing/Selection
The basic indexing operations are
Operation | Syntax | Result |
---|---|---|
Select column | df[col] | Series |
Select row by label | df.loc[label] | Series |
Select row by integer location | df.iloc[loc] | Series |
Slice rows | df[5:10] | DataFrame |
Select rows by boolean vector | df[bool_vec] | DataFrame |
Selecting a single row returns a Series whose index is the DataFrame's columns
In [80]: df.loc['b']
Out[80]:
one 2
bar 2
flag False
foo bar
one_trunc 2
Name: b, dtype: object
In [81]: df.iloc[2]
Out[81]:
one 3
bar 3
flag True
foo bar
one_trunc NaN
Name: c, dtype: object
Translator's note: selecting multiple rows returns a DataFrame that keeps the original index labels
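The table lists row slicing and boolean selection, but only loc and iloc are demonstrated. A minimal sketch of the remaining cases, using a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]},
    index=["a", "b", "c", "d"],
)

row = df.loc["b"]             # single row -> Series indexed by column names
sliced = df[1:3]              # positional row slice -> DataFrame (rows b, c)
filtered = df[df["one"] > 2]  # boolean vector -> DataFrame (rows c, d)

print(row)
print(sliced)
print(filtered)
```

Both multi-row results keep the original row labels rather than being renumbered.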
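Data alignment and arithmetic
Operations between DataFrame objects align automatically on both rows and columns. By default the result has the union of the row and column labels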
In [82]: df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
In [83]: df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
In [84]: df + df2
Out[84]:
A B C D
0 0.0457 -0.0141 1.3809 NaN
1 -0.9554 -1.5010 0.0372 NaN
2 -0.6627 1.5348 -0.8597 NaN
3 -2.4529 1.2373 -0.1337 NaN
4 1.4145 1.9517 -2.3204 NaN
5 -0.4949 -1.6497 -1.0846 NaN
6 -1.0476 -0.7486 -0.8055 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
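In operations between a DataFrame and a Series, the default is to align the Series index with the DataFrame columns and broadcast row-wise (translator's note: the Series is treated as one row and tiled, as with tile(N, 1), to match the number of rows of the DataFrame)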
In [85]: df - df.iloc[0]
Out[85]:
A B C D
0 0.0000 0.0000 0.0000 0.0000
1 -1.3593 -0.2487 -0.4534 -1.7547
2 0.2531 0.8297 0.0100 -1.9912
3 -1.3111 0.0543 -1.7249 -1.6205
4 0.5730 1.5007 -0.6761 1.3673
5 -1.7412 0.7820 -1.2416 -2.0531
6 -1.2408 -0.8696 -0.1533 0.0004
7 -0.7439 0.4110 -0.9296 -0.2824
8 -1.1949 1.3207 0.2382 -1.4826
9 2.2938 1.8562 0.7733 -1.4465
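Operations between a DataFrame and a Series of time series data are a special case: the Series index (dates) is aligned with the DataFrame columns (A, B, C), no labels match, and the result is all NaN over the union of both label sets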
In [86]: index = pd.date_range('1/1/2000', periods=8)
In [87]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))
In [88]: df
Out[88]:
A B C
2000-01-01 -1.2268 0.7698 -1.2812
2000-01-02 -0.7277 -0.1213 -0.0979
2000-01-03 0.6958 0.3417 0.9597
2000-01-04 -1.1103 -0.6200 0.1497
2000-01-05 -0.7323 0.6877 0.1764
2000-01-06 0.4033 -0.1550 0.3016
2000-01-07 -2.1799 -1.3698 -0.9542
2000-01-08 1.4627 -1.7432 -0.8266
In [89]: type(df['A'])
Out[89]: pandas.core.series.Series
In [90]: df - df['A']
Out[90]:
2000-01-01 00:00:00 2000-01-02 00:00:00 2000-01-03 00:00:00 \
2000-01-01 NaN NaN NaN
2000-01-02 NaN NaN NaN
2000-01-03 NaN NaN NaN
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 NaN NaN NaN
2000-01-04 00:00:00 ... 2000-01-08 00:00:00 A B C
2000-01-01 NaN ... NaN NaN NaN NaN
2000-01-02 NaN ... NaN NaN NaN NaN
2000-01-03 NaN ... NaN NaN NaN NaN
2000-01-04 NaN ... NaN NaN NaN NaN
2000-01-05 NaN ... NaN NaN NaN NaN
2000-01-06 NaN ... NaN NaN NaN NaN
2000-01-07 NaN ... NaN NaN NaN NaN
2000-01-08 NaN ... NaN NaN NaN NaN
[8 rows x 11 columns]
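The all-NaN result above comes from aligning the Series index with the DataFrame columns. To subtract a column from every column instead, one option is the sub() method with axis=0. The sketch below uses deterministic made-up values rather than random data.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2000-01-01", periods=4)
df = pd.DataFrame(np.arange(12.0).reshape(4, 3), index=index, columns=list("ABC"))

# axis=0 aligns the Series index with the DataFrame's row index,
# so column A is subtracted from every column.
centered = df.sub(df["A"], axis=0)

print(centered)
```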
Operations between a DataFrame and a scalar are applied element-wise
In [91]: df * 5 + 2
Out[91]:
A B C
2000-01-01 -4.1341 5.8490 -4.4062
2000-01-02 -1.6385 1.3935 1.5106
2000-01-03 5.4789 3.7087 6.7986
2000-01-04 -3.5517 -1.0999 2.7487
2000-01-05 -1.6617 5.4387 2.8822
2000-01-06 4.0165 1.2252 3.5081
2000-01-07 -8.8993 -4.8492 -2.7710
2000-01-08 9.3135 -6.7158 -2.1330
In [92]: 1 / df
Out[92]:
A B C
2000-01-01 -0.8151 1.2990 -0.7805
2000-01-02 -1.3742 -8.2436 -10.2163
2000-01-03 1.4372 2.9262 1.0420
2000-01-04 -0.9006 -1.6130 6.6779
2000-01-05 -1.3655 1.4540 5.6675
2000-01-06 2.4795 -6.4537 3.3154
2000-01-07 -0.4587 -0.7300 -1.0480
2000-01-08 0.6837 -0.5737 -1.2098
In [93]: df ** 4
Out[93]:
A B C
2000-01-01 2.2653 0.3512 2.6948e+00
2000-01-02 0.2804 0.0002 9.1796e-05
2000-01-03 0.2344 0.0136 8.4838e-01
2000-01-04 1.5199 0.1477 5.0286e-04
2000-01-05 0.2876 0.2237 9.6924e-04
2000-01-06 0.0265 0.0006 8.2769e-03
2000-01-07 22.5795 3.5212 8.2903e-01
2000-01-08 4.5774 9.2332 4.6683e-01
Boolean operators work the same way
In [94]: df1 = pd.DataFrame({'a' : [1, 0, 1], 'b' : [0, 1, 1] }, dtype=bool)
In [95]: df2 = pd.DataFrame({'a' : [0, 1, 1], 'b' : [1, 1, 0] }, dtype=bool)
In [96]: df1 & df2
Out[96]:
a b
0 False False
1 False True
2 True False
In [97]: df1 | df2
Out[97]:
a b
0 True True
1 True True
2 True True
In [98]: df1 ^ df2
Out[98]:
a b
0 True True
1 True False
2 False True
In [99]: -df1
Out[99]:
a b
0 False True
1 True False
2 False False
Transpose
As with NumPy, transpose with either the T attribute or the transpose() function
# only show the first 5 rows
In [100]: df[:5].T
Out[100]:
2000-01-01 2000-01-02 2000-01-03 2000-01-04 2000-01-05
A -1.2268 -0.7277 0.6958 -1.1103 -0.7323
B 0.7698 -0.1213 0.3417 -0.6200 0.6877
C -1.2812 -0.0979 0.9597 0.1497 0.1764
DataFrame interoperability with numpy functions
As long as the data is numeric, element-wise NumPy functions such as log and exp can be applied to a DataFrame
In [101]: np.exp(df)
Out[101]:
A B C
2000-01-01 0.2932 2.1593 0.2777
2000-01-02 0.4830 0.8858 0.9068
2000-01-03 2.0053 1.4074 2.6110
2000-01-04 0.3294 0.5380 1.1615
2000-01-05 0.4808 1.9892 1.1930
2000-01-06 1.4968 0.8565 1.3521
2000-01-07 0.1131 0.2541 0.3851
2000-01-08 4.3176 0.1750 0.4375
In [102]: np.asarray(df)
Out[102]:
array([[-1.2268, 0.7698, -1.2812],
[-0.7277, -0.1213, -0.0979],
[ 0.6958, 0.3417, 0.9597],
[-1.1103, -0.62 , 0.1497],
[-0.7323, 0.6877, 0.1764],
[ 0.4033, -0.155 , 0.3016],
[-2.1799, -1.3698, -0.9542],
[ 1.4627, -1.7432, -0.8266]])
The dot() method of DataFrame implements matrix multiplication
In [103]: df.T.dot(df)
Out[103]:
A B C
A 11.3419 -0.0598 3.0080
B -0.0598 6.5206 2.0833
C 3.0080 2.0833 4.3105
On a Series, dot() computes the dot product
In [104]: s1 = pd.Series(np.arange(5,10))
In [105]: s1.dot(s1)
Out[105]: 255
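To make the relation to NumPy explicit, the sketch below compares dot() with the `@` operator on the underlying arrays. It uses a small integer frame instead of the random data above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# df.T has columns [0, 1] and df has index [0, 1], so the labels line up.
gram = df.T.dot(df)

# The same product computed on the raw arrays.
expected = df.to_numpy().T @ df.to_numpy()

s1 = pd.Series(np.arange(5, 10))
inner = s1.dot(s1)  # 5*5 + 6*6 + 7*7 + 8*8 + 9*9

print(gram)
print(inner)
```

Unlike the raw NumPy product, the pandas result keeps the labels A and B on both axes.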
Console display
A very large DataFrame is truncated when displayed in the console. Use info() to get a summary
In [106]: baseball = pd.read_csv('data/baseball.csv')
In [107]: print(baseball)
id player year stint ... hbp sh sf gidp
0 88641 womacto01 2006 2 ... 0.0 3.0 0.0 0.0
1 88643 schilcu01 2006 1 ... 0.0 0.0 0.0 0.0
.. ... ... ... ... ... ... ... ... ...
98 89533 aloumo01 2007 1 ... 2.0 0.0 3.0 13.0
99 89534 alomasa02 2007 1 ... 0.0 0.0 0.0 0.0
[100 rows x 23 columns]
In [108]: baseball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 23 columns):
id 100 non-null int64
player 100 non-null object
year 100 non-null int64
stint 100 non-null int64
team 100 non-null object
lg 100 non-null object
g 100 non-null int64
ab 100 non-null int64
r 100 non-null int64
h 100 non-null int64
X2b 100 non-null int64
X3b 100 non-null int64
hr 100 non-null int64
rbi 100 non-null float64
sb 100 non-null float64
cs 100 non-null float64
bb 100 non-null int64
so 100 non-null float64
ibb 100 non-null float64
hbp 100 non-null float64
sh 100 non-null float64
sf 100 non-null float64
gidp 100 non-null float64
dtypes: float64(9), int64(11), object(3)
memory usage: 18.0+ KB
to_string() returns the DataFrame in tabular form as a string, though the result may not fit the console width
In [109]: print(baseball.iloc[-20:, :12].to_string())
id player year stint team lg g ab r h X2b X3b
80 89474 finlest01 2007 1 COL NL 43 94 9 17 3 0
81 89480 embreal01 2007 1 OAK AL 4 0 0 0 0 0
82 89481 edmonji01 2007 1 SLN NL 117 365 39 92 15 2
83 89482 easleda01 2007 1 NYN NL 76 193 24 54 6 0
84 89489 delgaca01 2007 1 NYN NL 139 538 71 139 30 0
85 89493 cormirh01 2007 1 CIN NL 6 0 0 0 0 0
86 89494 coninje01 2007 2 NYN NL 21 41 2 8 2 0
87 89495 coninje01 2007 1 CIN NL 80 215 23 57 11 1
88 89497 clemero02 2007 1 NYA AL 2 2 0 1 0 0
89 89498 claytro01 2007 2 BOS AL 8 6 1 0 0 0
90 89499 claytro01 2007 1 TOR AL 69 189 23 48 14 0
91 89501 cirilje01 2007 2 ARI NL 28 40 6 8 4 0
92 89502 cirilje01 2007 1 MIN AL 50 153 18 40 9 2
93 89521 bondsba01 2007 1 SFN NL 126 340 75 94 14 0
94 89523 biggicr01 2007 1 HOU NL 141 517 68 130 31 3
95 89525 benitar01 2007 2 FLO NL 34 0 0 0 0 0
96 89526 benitar01 2007 1 SFN NL 19 0 0 0 0 0
97 89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 3
98 89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 1
99 89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 0
When there are many columns, set display.width to change how much is printed on one line
In [111]: pd.set_option('display.width', 40) # default is 80
In [112]: pd.DataFrame(np.random.randn(3, 12))
Out[112]:
0 1 2 3 4 5 6 7 8 9 10 11
0 1.262731 1.289997 0.082423 -0.055758 0.536580 -0.489682 0.369374 -0.034571 -2.484478 -0.281461 0.030711 0.109121
1 1.126203 -0.977349 1.474071 -0.064034 -1.282782 0.781836 -1.071357 0.441153 2.353925 0.583787 0.221471 -0.744471
2 0.758527 1.729689 -0.964980 -0.845696 -1.340896 1.846883 -1.328865 1.682706 -1.717693 0.888782 0.228440 0.901805
display.max_colwidth adjusts the maximum width of each column
In [113]: datafile = {'filename': ['filename_01', 'filename_02'],
.....: 'path': ["media/user_name/storage/folder_01/filename_01",
.....: "media/user_name/storage/folder_02/filename_02"]}
.....:
In [114]: pd.set_option('display.max_colwidth',30)
In [115]: pd.DataFrame(datafile)
Out[115]:
filename path
0 filename_01 media/user_name/storage/fo...
1 filename_02 media/user_name/storage/fo...
In [116]: pd.set_option('display.max_colwidth',100)
In [117]: pd.DataFrame(datafile)
Out[117]:
filename path
0 filename_01 media/user_name/storage/folder_01/filename_01
1 filename_02 media/user_name/storage/folder_02/filename_02
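pd.set_option changes the setting globally. When the change should apply to a single display only, pd.option_context can be used as a context manager, as in this sketch that reuses the datafile example.

```python
import pandas as pd

datafile = {
    "filename": ["filename_01", "filename_02"],
    "path": [
        "media/user_name/storage/folder_01/filename_01",
        "media/user_name/storage/folder_02/filename_02",
    ],
}

default_width = pd.get_option("display.max_colwidth")

# The option is changed only inside the with block.
with pd.option_context("display.max_colwidth", 30):
    inside = pd.get_option("display.max_colwidth")
    narrow = repr(pd.DataFrame(datafile))  # long paths are cut off with "..."

after = pd.get_option("display.max_colwidth")

print(narrow)
print(inside, after)
```

After the block exits, the option is back at its previous value, so later output is unaffected.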
DataFrame column attribute access and IPython completion
Skipped.
Panel
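Panel is the 3-D data structure in pandas. It is removed in newer versions, and a DataFrame with a MultiIndex is the recommended replacement, so it is skipped here.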
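As a rough illustration of the MultiIndex replacement, the sketch below stores hypothetical 3-D data (2 items x 3 dates x 2 fields) in a DataFrame with a two-level row index. The names `item`, `date`, `x`, and `y` are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-D data: 2 items x 3 dates x 2 fields.
items = ["item1", "item2"]
dates = pd.date_range("2000-01-01", periods=3)
cube = np.arange(12.0).reshape(2, 3, 2)

# Flatten the first two axes into a two-level row index.
row_index = pd.MultiIndex.from_product([items, dates], names=["item", "date"])
frame = pd.DataFrame(cube.reshape(6, 2), index=row_index, columns=["x", "y"])

# Selecting one item gives back an ordinary 2-D DataFrame indexed by date.
item1 = frame.loc["item1"]

print(frame)
print(item1)
```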