Official pandas documentation: http://pandas.pydata.org/pandas-docs/stable/
Intro to Data Structures
The two most important data structures: Series and DataFrame
Series
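A Series is a one-dimensional labeled array: it holds a one-dimensional array of values together with an index.
# Create a Series
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series(data, index=index)
data can be a Python dict, an ndarray, or a scalar value (such as 5).
# Create a Series from an array; an index can be passed, otherwise a default integer index is used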
In [3]: obj = pd.Series([4,7,-5,3])
In [4]: obj
Out[4]:
0 4
1 7
2 -5
3 3
dtype: int64
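# Create a Series from a dict; the dict keys become the index (older pandas versions sort the keys, as in the output below; newer versions keep insertion order)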
In [17]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
In [19]: obj3 = pd.Series(sdata)
In [20]: obj3
Out[20]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
In [21]: states = ['California', 'Ohio', 'Oregon', 'Texas']
In [22]: obj4 = pd.Series(sdata, index=states)
In [23]: obj4
Out[23]:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
# Create a Series from a scalar; an index must be provided, and the value is repeated to match the length of the index
In [24]: pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[24]:
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
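The three kinds of input described above can be compared side by side in a single script. This is a minimal sketch; the variable names (`from_array`, `population`, and so on) are illustrative and do not come from the session above.

```python
import numpy as np
import pandas as pd

# From an ndarray: a default integer index (0, 1, 2, ...) is created.
from_array = pd.Series(np.array([4, 7, -5, 3]))

# From a dict: the keys become the index labels.
population = {'Ohio': 35000, 'Texas': 71000}
from_dict = pd.Series(population)

# From a dict plus an explicit index: labels missing from the dict get NaN,
# and dict keys that are not listed in the index are dropped.
reindexed = pd.Series(population, index=['California', 'Ohio'])

# From a scalar: the value is repeated once per index label.
from_scalar = pd.Series(5.0, index=['a', 'b', 'c'])

print(from_array.index)
print(reindexed)
print(from_scalar)
```

Because California has no value, `reindexed` is stored as float64, which matches the dtype change seen in `obj4` above.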
A Series behaves like an ndarray and like a dict, and it supports vectorized operations
In [5]: obj.values
Out[5]: array([ 4, 7, -5, 3])
In [6]: obj.index
Out[6]: RangeIndex(start=0, stop=4, step=1)
# Index labels can be specified explicitly
In [8]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [9]: obj2
Out[9]:
d 4
b 7
a -5
c 3
dtype: int64
# Look up and modify values by index label
In [10]: obj2['a']
Out[10]: -5
In [11]: obj2['c']=9
In [12]: obj2
Out[12]:
d 4
b 7
a -5
c 9
dtype: int64
In [14]: 5 in obj
Out[14]: False
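The `in` check above tests index labels rather than values, in the same way that `in` on a dict tests keys. A small sketch of the difference, using the same values as `obj` (the result variable names are illustrative):

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])  # default index labels: 0, 1, 2, 3

label_hit = 3 in obj              # 3 is an index label
value_as_label = 7 in obj         # 7 is a value, but not a label
value_hit = bool(obj.isin([7]).any())  # test the values explicitly

print(label_hit, value_as_label, value_hit)
```

To test membership among the values, `Series.isin` or `obj.values` is needed; the bare `in` operator never looks at the data.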
# Filter with a boolean condition
In [15]: obj2[obj2>5]
Out[15]:
b 7
c 9
dtype: int64
# Arithmetic operations
In [16]: obj2 * 2
Out[16]:
d 8
b 14
a -10
c 18
dtype: int64
In [17]: obj[1:]+obj[:-1]
Out[17]:
0 NaN
1 14.0
2 -10.0
3 NaN
dtype: float64
# Labels that are missing from either operand produce NaN in the result
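The NaN values come from label alignment: the two slices share only labels 1 and 2. The sketch below reproduces that result and, as one possible alternative, uses `Series.add` with `fill_value` so that a label missing on one side is treated as 0. The variable names are illustrative.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

left = obj[1:]    # labels 1, 2, 3
right = obj[:-1]  # labels 0, 1, 2

# The + operator aligns on labels; a label present on only one side gives NaN.
aligned = left + right

# Series.add with fill_value substitutes 0 for the missing side instead.
filled = left.add(right, fill_value=0)

print(aligned)
print(filled)
```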
A Series also has a name attribute
In [27]: s = pd.Series(np.random.randn(5), name='something')
In [28]: s
Out[28]:
0 -0.4949
1 1.0718
2 0.7216
3 -0.7068
4 -1.0396
Name: something, dtype: float64
In [29]: s.name
Out[29]: 'something'
# Change the name with the pandas.Series.rename() method
In [30]: s2 = s.rename("different")
In [31]: s2.name
Out[31]: 'different'
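One detail worth checking is that `rename` with a scalar argument returns a new Series and leaves the original name untouched. A minimal sketch with fixed values instead of random ones:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], name='something')

# rename returns a new Series; s keeps its original name.
s2 = s.rename('different')

print(s.name, s2.name)
```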
DataFrame
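A DataFrame is a two-dimensional labeled data structure whose columns can have different types, similar to a SQL table.
Creating a DataFrame
The data can come from:
- A dict of 1D ndarrays, lists, dicts, or Series
- A 2-D NumPy ndarray
- A structured ndarray
- A Series
- Another DataFrame
# From a dict of Series or a dict of dicts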
In [32]: d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
....: 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
....:
In [33]: df = pd.DataFrame(d)
In [34]: df
Out[34]:
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
In [35]: pd.DataFrame(d, index=['d', 'b', 'a'])
Out[35]:
one two
d NaN 4.0
b 2.0 2.0
a 1.0 1.0
In [36]: pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[36]:
two three
d 4.0 NaN
b 2.0 NaN
a 1.0 NaN
# Note: each dict key becomes a column name, and each Series becomes one column
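The behavior shown above can be summarized as: the row index is the union of the Series indexes, and passing `index` or `columns` selects and reorders, filling anything unknown with NaN. A self-contained sketch of the same example (the name `sub` is illustrative):

```python
import pandas as pd

d = {
    'one': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']),
    'two': pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd']),
}

# Row index is the union of both Series indexes; 'one' has no value for 'd'.
df = pd.DataFrame(d)

# index/columns select and reorder; the unknown column 'three' is all NaN.
sub = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

print(df)
print(sub)
```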
# From a dict of arrays or lists
In [39]: d = {'one' : [1., 2., 3., 4.],
....: 'two' : [4., 3., 2., 1.]}
....:
In [40]: pd.DataFrame(d)
Out[40]:
one two
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0
In [41]: pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
Out[41]:
one two
a 1.0 4.0
b 2.0 3.0
c 3.0 2.0
d 4.0 1.0
# From a structured array
In [42]: data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'S10')])
In [43]: data[:] = [(1,2.,'Hello'), (2,3.,"World")]
In [44]: pd.DataFrame(data)
Out[44]:
A B C
0 1 2.0 b'Hello'
1 2 3.0 b'World'
In [45]: pd.DataFrame(data, index=['first', 'second'])
Out[45]:
A B C
first 1 2.0 b'Hello'
second 2 3.0 b'World'
In [46]: pd.DataFrame(data, columns=['C', 'A', 'B'])
Out[46]:
C A B
0 b'Hello' 1 2.0
1 b'World' 2 3.0
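The `C` column holds bytes objects because the field is a fixed-width byte string. If ordinary Python strings are wanted, one option is to decode the column after building the DataFrame. This is a sketch that assumes the bytes are valid UTF-8:

```python
import numpy as np
import pandas as pd

# Structured array with an int32, a float32, and a 10-byte string field.
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
data[:] = [(1, 2.0, b'Hello'), (2, 3.0, b'World')]

df = pd.DataFrame(data)

# Decode the bytes column into regular strings (assumes UTF-8 content).
df['C'] = df['C'].str.decode('utf-8')

print(df)
```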
# From a list of dicts
In [47]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
In [48]: pd.DataFrame(data2)
Out[48]:
a b c
0 1 2 NaN
1 5 10 20.0
In [49]: pd.DataFrame(data2, index=['first', 'second'])
Out[49]:
a b c
first 1 2 NaN
second 5 10 20.0
In [50]: pd.DataFrame(data2, columns=['a', 'b'])
Out[50]:
a b
0 1 2
1 5 10
# From a dict of tuples
In [51]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
....: ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
....: ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
....: ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
....: ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
....:
Out[51]:
a b
a b c a b
A B 4.0 1.0 5.0 8.0 10.0
C 3.0 2.0 6.0 7.0 NaN
D NaN NaN NaN NaN 9.0
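# Alternate constructors: from_records, from_items, etc. (from_items was deprecated in pandas 0.23 and removed in 1.0; DataFrame.from_dict covers the same cases)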
In [52]: data
Out[52]:
array([(1, 2., b'Hello'), (2, 3., b'World')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
In [53]: pd.DataFrame.from_records(data, index='C')
Out[53]:
A B
C
b'Hello' 1 2.0
b'World' 2 3.0
In [54]: pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])])
Out[54]:
A B
0 1 4
1 2 5
2 3 6
In [55]: pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])],
....: orient='index', columns=['one', 'two', 'three'])
....:
Out[55]:
one two three
A 1 2 3
B 4 5 6
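On pandas versions where `from_items` is not available, the two calls above can be approximated with `DataFrame.from_dict`. This sketch assumes a pandas version in which `from_dict` accepts `columns` together with `orient='index'`; the names `items`, `by_column` and `by_row` are illustrative.

```python
import pandas as pd

items = [('A', [1, 2, 3]), ('B', [4, 5, 6])]

# Each key becomes a column (the default orient='columns').
by_column = pd.DataFrame.from_dict(dict(items))

# Each key becomes a row label; columns names the positions in each list.
by_row = pd.DataFrame.from_dict(
    dict(items), orient='index', columns=['one', 'two', 'three']
)

print(by_column)
print(by_row)
```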
Column selection, addition, deletion, and insertion
In [56]: df['one']
Out[56]:
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
In [57]: df['three'] = df['one'] * df['two']
In [58]: df['flag'] = df['one'] > 2
In [59]: df
Out[59]:
one two three flag
a 1.0 1.0 1.0 False
b 2.0 2.0 4.0 False
c 3.0 3.0 9.0 True
d NaN 4.0 NaN False
# Delete columns
In [60]: del df['two']
In [61]: three = df.pop('three')
In [62]: df
Out[62]:
one flag
a 1.0 False
b 2.0 False
c 3.0 True
d NaN False
In [63]: df['foo'] = 'bar'
In [64]: df
Out[64]:
one flag foo
a 1.0 False bar
b 2.0 False bar
c 3.0 True bar
d NaN False bar
In [65]: df['one_trunc'] = df['one'][:2]
In [66]: df
Out[66]:
one flag foo one_trunc
a 1.0 False bar 1.0
b 2.0 False bar 2.0
c 3.0 True bar NaN
d NaN False bar NaN
# Insert a column at a given position: insert(position, column name, data)
In [67]: df.insert(1, 'bar', df['one'])
In [68]: df
Out[68]:
one bar flag foo one_trunc
a 1.0 1.0 False bar 1.0
b 2.0 2.0 False bar 2.0
c 3.0 3.0 True bar NaN
d NaN NaN False bar NaN
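The column operations above can be replayed as one script to check the final column order. This is a sketch with the same values as the session; it uses `.iloc[:2]` for the truncated column to make the positional slice explicit, which is a stylistic choice rather than what the session typed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0, np.nan], 'two': [1.0, 2.0, 3.0, 4.0]},
    index=['a', 'b', 'c', 'd'],
)

df['three'] = df['one'] * df['two']  # new column from existing columns
df['flag'] = df['one'] > 2           # comparisons with NaN evaluate to False

del df['two']                        # delete in place
three = df.pop('three')              # delete and return the column

df['foo'] = 'bar'                    # a scalar is broadcast to every row

# A shorter Series is aligned on the index; rows c and d become NaN.
df['one_trunc'] = df['one'].iloc[:2]

# insert(position, column name, data) places the column at position 1.
df.insert(1, 'bar', df['one'])

print(df)
```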
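Indexing and selection on a DataFrame
Operation | Syntax | Result |
---|---|---|
Select a column | df[col] | Series |
Select a row by label | df.loc[label] | Series |
Select a row by integer position | df.iloc[loc] | Series |
Slice rows | df[5:10] | DataFrame |
Select rows by boolean vector | df[bool_vec] | DataFrame |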
In [75]: df.loc['b']
Out[75]:
one 2
bar 2
flag False
foo bar
one_trunc 2
Name: b, dtype: object
In [76]: df.iloc[2]
Out[76]:
one 3
bar 3
flag True
foo bar
one_trunc NaN
Name: c, dtype: object
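Each row of the table can be checked against a small DataFrame. The sketch below uses a reduced three-row frame (not the `df` from the session) and records the type each operation returns.

```python
import pandas as pd

df = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0], 'flag': [False, False, True]},
    index=['a', 'b', 'c'],
)

col = df['one']             # select a column        -> Series
row_by_label = df.loc['b']  # select a row by label  -> Series
row_by_pos = df.iloc[2]     # select a row by position -> Series
row_slice = df[0:2]         # slice rows             -> DataFrame
masked = df[df['one'] > 1]  # boolean vector         -> DataFrame

for result in (col, row_by_label, row_by_pos, row_slice, masked):
    print(type(result).__name__)
```

A row returned as a Series is named after its label, and its dtype is object when the columns have mixed types, as in the `df.loc['b']` output above.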
Arithmetic on a DataFrame
# Generate a date index with pandas
In [81]: index = pd.date_range('1/1/2000', periods=8)
In [82]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))
In [83]: df
Out[83]:
A B C
2000-01-01 -1.2268 0.7698 -1.2812
2000-01-02 -0.7277 -0.1213 -0.0979
2000-01-03 0.6958 0.3417 0.9597
2000-01-04 -1.1103 -0.6200 0.1497
2000-01-05 -0.7323 0.6877 0.1764
2000-01-06 0.4033 -0.1550 0.3016
2000-01-07 -2.1799 -1.3698 -0.9542
2000-01-08 1.4627 -1.7432 -0.8266
In [86]: df * 5 + 2
Out[86]:
A B C
2000-01-01 -4.1341 5.8490 -4.4062
2000-01-02 -1.6385 1.3935 1.5106
2000-01-03 5.4789 3.7087 6.7986
2000-01-04 -3.5517 -1.0999 2.7487
2000-01-05 -1.6617 5.4387 2.8822
2000-01-06 4.0165 1.2252 3.5081
2000-01-07 -8.8993 -4.8492 -2.7710
2000-01-08 9.3135 -6.7158 -2.1330
In [87]: 1 / df
Out[87]:
A B C
2000-01-01 -0.8151 1.2990 -0.7805
2000-01-02 -1.3742 -8.2436 -10.2163
2000-01-03 1.4372 2.9262 1.0420
2000-01-04 -0.9006 -1.6130 6.6779
2000-01-05 -1.3655 1.4540 5.6675
2000-01-06 2.4795 -6.4537 3.3154
2000-01-07 -0.4587 -0.7300 -1.0480
2000-01-08 0.6837 -0.5737 -1.2098
In [88]: df ** 4
Out[88]:
A B C
2000-01-01 2.2653 0.3512 2.6948e+00
2000-01-02 0.2804 0.0002 9.1796e-05
2000-01-03 0.2344 0.0136 8.4838e-01
2000-01-04 1.5199 0.1477 5.0286e-04
2000-01-05 0.2876 0.2237 9.6924e-04
2000-01-06 0.0265 0.0006 8.2769e-03
2000-01-07 22.5795 3.5212 8.2903e-01
2000-01-08 4.5774 9.2332 4.6683e-01
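Since the session uses random numbers, its outputs cannot be reproduced exactly. The following sketch applies the same three element-wise operations to fixed values so the results are easy to verify by hand; the frame and result names are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=3)
# Values 1.0 through 9.0 laid out row by row.
df = pd.DataFrame(
    np.arange(1.0, 10.0).reshape(3, 3), index=index, columns=list('ABC')
)

scaled = df * 5 + 2   # applied to every element
inverse = 1 / df      # element-wise reciprocal
powered = df ** 4     # element-wise power

print(scaled)
print(inverse)
print(powered)
```

Scalar operations keep the original index and columns; only the values change.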
# Transposing a DataFrame
# only show the first 5 rows
In [95]: df[:5].T
Out[95]:
2000-01-01 2000-01-02 2000-01-03 2000-01-04 2000-01-05
A -1.2268 -0.7277 0.6958 -1.1103 -0.7323
B 0.7698 -0.1213 0.3417 -0.6200 0.6877
C -1.2812 -0.0979 0.9597 0.1497 0.1764
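Transposition swaps the row index and the column labels, so the dates become columns. A small sketch with fixed values (the name `transposed` is illustrative):

```python
import numpy as np
import pandas as pd

index = pd.date_range('2000-01-01', periods=3)
df = pd.DataFrame(
    np.arange(6.0).reshape(3, 2), index=index, columns=['A', 'B']
)

# Take the first two rows, then swap rows and columns.
transposed = df[:2].T

print(transposed)
```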