Indexes
s = pd.Series(np.random.rand(5), index=list('abcde'))
s
a    0.566924
b    0.603482
c    0.341639
d    0.164181
e    0.137200
dtype: float64
s.index
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
s.index.name = 'alpha'
s
alpha
a 0.566924
b 0.603482
c 0.341639
d 0.164181
e 0.137200
dtype: float64
df = pd.DataFrame(np.random.randn(4, 3), columns=['one', 'two', 'three'])
df
        one       two     three
0  0.131579  1.122725 -1.920064
1  0.383557 -0.039625 -0.716855
2 -0.391827 -1.886698 -1.131089
3 -0.116749  1.369576 -1.314630
df.index
Int64Index([0, 1, 2, 3], dtype='int64')
df.columns
Index([u'one', u'two', u'three'], dtype='object')
The following command lists the index types pandas provides:
pd.*Index?
pd.CategoricalIndex
pd.DatetimeIndex
pd.Float64Index
pd.Index
pd.Int64Index
pd.IntervalIndex
pd.MultiIndex
pd.PeriodIndex
pd.RangeIndex
pd.TimedeltaIndex
pd.UInt64Index
Duplicate indexes
s = pd.Series(np.arange(6), index=list('abcbda'))
s
a 0
b 1
c 2
b 3
d 4
a 5
dtype: int64
s['a']
a    0
a    5
dtype: int64
s['c']
2
s.index.is_unique
False
s.index.unique()
array(['a', 'b', 'c', 'd'], dtype=object)
s.groupby(s.index).sum()
a 5
b 4
c 2
d 4
dtype: int64
Hierarchical indexing
Hierarchical indexing lets one axis carry multiple index levels, so a two-dimensional table can express higher-dimensional data with a clearer layout. It is represented by the pd.MultiIndex class.
What is hierarchical indexing good for?
Suppose we are analyzing stock data: the outer row index can be the date, the inner row index the ticker symbol, and the columns can hold trading volume, open price, close price, and so on. That way, multiple stocks can be examined and analyzed along the same time dimension.
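A minimal sketch of that layout (the tickers and prices here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Outer level: date; inner level: ticker symbol (hypothetical values).
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2016-03-01', '2016-03-02']), ['000001', '000002']],
    names=['date', 'ticker'])
quotes = pd.DataFrame({'open': [10.0, 20.0, 10.2, 19.8],
                       'close': [10.2, 19.8, 10.5, 20.1]}, index=idx)

# All tickers for one day live under a single outer label; selecting
# that label drops the level, leaving one row per ticker.
day1 = quotes.loc[pd.Timestamp('2016-03-01')]
```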
MultiIndex on a Series
a = [['a', 'a', 'a', 'b', 'b', 'c', 'c'], [1, 2, 3, 1, 2, 2, 3]]
tuples = list(zip(*a))
tuples
[('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('c', 2), ('c', 3)]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index
MultiIndex(levels=[[u'a', u'b', u'c'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2], [0, 1, 2, 0, 1, 1, 2]],
           names=[u'first', u'second'])
s = pd.Series(np.random.randn(7), index=index)
s
first second
a 1 -1.192113
2 0.226627
3 0.390052
b 1 0.045297
2 1.552926
c 2 0.014007
3 -0.257103
dtype: float64
s.index
MultiIndex(levels=[[u'a', u'b', u'c'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2], [0, 1, 2, 0, 1, 1, 2]],
           names=[u'first', u'second'])
s.index.levels[0]
Index([u'a', u'b', u'c'], dtype='object', name=u'first')
s.index.levels[1]
Int64Index([1, 2, 3], dtype='int64', name=u'second')
s['b']
second
1    0.045297
2    1.552926
dtype: float64
s['b':'c']
first  second
b      1         0.045297
       2         1.552926
c      2         0.014007
       3        -0.257103
dtype: float64
s[['b', 'a']]
first  second
b      1         0.045297
       2         1.552926
a      1        -1.192113
       2         0.226627
       3         0.390052
dtype: float64
s['b', 1]
0.045297227336673768
s[:, 2]
first
a    0.226627
b    1.552926
c    0.014007
dtype: float64
MultiIndex on a DataFrame
df = pd.DataFrame(np.random.randint(1, 10, (4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['one', 'one', 'two'], ['blue', 'red', 'blue']])
df.index.names = ['row-1', 'row-2']
df.columns.names = ['col-1', 'col-2']
df
col-1        one      two
col-2       blue red  blue
row-1 row-2
a     1        7   6     7
      2        1   5     9
b     1        3   5     5
      2        2   8     3
df.loc['a']
col-1  one      two
col-2 blue red  blue
row-2
1        7   6     7
2        1   5     9
type(df.loc['a'])
pandas.core.frame.DataFrame
df.loc['a', 1]
col-1  col-2
one    blue     7
       red      6
two    blue     7
Name: (a, 1), dtype: int64
df.loc['a', 1].index
MultiIndex(levels=[[u'one', u'two'], [u'blue', u'red']],
labels=[[0, 0, 1], [0, 1, 0]],
names=[u'col-1', u'col-2'])
Selecting with both index levels automatically yields a Series object.
Swapping and sorting index levels
df
col-1        one      two
col-2       blue red  blue
row-1 row-2
a     1        7   6     7
      2        1   5     9
b     1        3   5     5
      2        2   8     3
df2 = df.swaplevel('row-1', 'row-2')
df2
col-1        one      two
col-2       blue red  blue
row-2 row-1
1     a        7   6     7
2     a        1   5     9
1     b        3   5     5
2     b        2   8     3
After swapping, the rows keep their old order, so the outer level needs to be re-sorted:
df2.sortlevel(0)  # in newer pandas: df2.sort_index(level=0)
col-1        one      two
col-2       blue red  blue
row-2 row-1
1     a        7   6     7
      b        3   5     5
2     a        1   5     9
      b        2   8     3
Aggregating by index level
df
col-1        one      two
col-2       blue red  blue
row-1 row-2
a     1        7   6     7
      2        1   5     9
b     1        3   5     5
      2        2   8     3
df.sum(level=0)
col-1  one      two
col-2 blue red  blue
row-1
a        8  11    16
b        5  13     8
df.sum(level=1)
col-1  one      two
col-2 blue red  blue
row-2
1       10  11    12
2        3  13    12
Converting between index and columns
df = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': [0, 1, 2, 0, 1, 2, 3]
})
df
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
df.set_index('c')
     a  b  d
c
one  0  7  0
one  1  6  1
one  2  5  2
two  3  4  0
two  4  3  1
two  5  2  2
two  6  1  3
df2 = df.set_index(['c', 'd'])
df2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
df3 = df2.reset_index().sort_index(axis='columns')
df3
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
reset_index() turns everything previously moved into the index by set_index back into regular columns in one go.
Group-by computation
Group-by computation follows three steps: split -> apply -> combine.
Split: what do we group by? Apply: what do we compute on each group? Combine: merge the per-group results back together.
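The three steps can be spelled out by hand; this sketch does manually what a single groupby call does internally:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 10]})

# Split: partition the rows by key.
pieces = {k: g for k, g in df.groupby('key')}

# Apply: run a computation on each piece.
sums = {k: g['val'].sum() for k, g in pieces.items()}

# Combine: stitch the per-group results back into one object.
result = pd.Series(sums).sort_index()
# The one-liner equivalent: df.groupby('key')['val'].sum()
```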
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randint(1, 10, 5),
                   'data2': np.random.randint(1, 10, 5)})
df
   data1  data2 key1 key2
0      1      6    a  one
1      5      9    a  two
2      4      7    b  one
3      3      7    b  two
4      3      5    a  one
Grouping a Series
The grouping key is aligned with the data through the index:
grouped = df['data1'].groupby(df['key1'])
grouped.mean()
key1
a    3.0
b    3.5
Name: data1, dtype: float64
df['data1'].groupby([df['key1'], df['key2']]).mean()
key1  key2
a     one     2
      two     5
b     one     4
      two     3
Name: data1, dtype: int32
You can also supply your own grouping key; its entries label the rows one by one:
key = [1, 2, 1, 1, 2]
df['data1'].groupby(key).mean()
1    6.333333
2    1.500000
Name: data1, dtype: float64
This puts rows one, three, and four into a group labeled 1, and rows two and five into a group labeled 2.
Grouping a DataFrame
df.groupby('key1').mean()
      data1     data2
key1
a       3.0  6.666667
b       3.5  7.000000
Non-numeric columns are dropped during the computation.
means = df.groupby(['key1', 'key2']).mean()['data1']
means
key1  key2
a     one     2
      two     5
b     one     4
      two     3
Name: data1, dtype: float64
means.unstack()
df.groupby(['key1', 'key2'])['data1'].mean()
key1 key2
a one 2
two 5
b one 4
two 3
Name: data1, dtype: int32
Number of elements in each group
df.groupby(['key1', 'key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
Iterating over groups
for name, group in df.groupby('key1'):
    print(name)
    print(group)
a
data1 data2 key1 key2
0 1 6 a one
1 5 9 a two
4 3 5 a one
b
data1 data2 key1 key2
2 4 7 b one
3 3 7 b two
for name, group in df.groupby(['key1', 'key2']):
    print(name)
    print(group)
('a', 'one')
data1 data2 key1 key2
0 1 6 a one
4 3 5 a one
('a', 'two')
data1 data2 key1 key2
1 5 9 a two
('b', 'one')
data1 data2 key1 key2
2 4 7 b one
('b', 'two')
data1 data2 key1 key2
3 3 7 b two
Converting groups to a dict
d = dict(list(df.groupby('key1')))
d
{'a':
   data1  data2 key1 key2
0      1      6    a  one
1      5      9    a  two
4      3      5    a  one,
 'b':
   data1  data2 key1 key2
2      4      7    b  one
3      3      7    b  two}
d['a']
   data1  data2 key1 key2
0      1      6    a  one
1      5      9    a  two
4      3      5    a  one
Grouping by columns
df.dtypes
data1     int32
data2     int32
key1     object
key2     object
dtype: object
grouped = df.groupby(df.dtypes, axis=1)
dict(list(grouped))
{dtype('int32'):
data1 data2
0 1 6
1 5 9
2 4 7
3 3 7
4 3 5,
dtype('O'):
key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one}
Grouping with a dict
df = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'],
                  index=['Alice', 'Bob', 'Candy', 'Dark', 'Emily'])
df
       a  b  c  d  e
Alice  7  2  7  9  9
Bob    8  7  7  6  8
Candy  4  2  6  6  5
Dark   6  3  9  4  8
Emily  8  4  9  6  5
df.ix[1, 1:3] = np.nan  # in newer pandas: df.iloc[1, 1:3] = np.nan
df
       a   b   c  d  e
Alice  7   2   7  9  9
Bob    8 NaN NaN  6  8
Candy  4   2   6  6  5
Dark   6   3   9  4  8
Emily  8   4   9  6  5
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'orange', 'e': 'blue'}
grouped = df.groupby(mapping, axis=1)
grouped.sum()
       blue  orange  red
Alice    16       9    9
Bob       8       6    8
Candy    11       6    6
Dark     17       4    9
Emily    14       6   12
count() tallies the non-missing values in each group, while size() reports how many columns fall into each group:
grouped.count()
       blue  orange  red
Alice     2       1    2
Bob       1       1    1
Candy     2       1    2
Dark      2       1    2
Emily     2       1    2
grouped.size()
blue 2
orange 1
red 2
dtype: int64
Grouping with a function
When a function is used as the grouping key, it is called once for every index label (row or column labels, depending on axis). The return value becomes the group key, so labels with the same return value end up in the same group.
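For instance, rows can be grouped by the first letter of their index label (the names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'score': [1, 2, 3, 4]},
                  index=['Alice', 'Adam', 'Bob', 'Bill'])

# The function is called once per index label; its return value is the
# group key, so 'Alice' and 'Adam' fall into group 'A', the rest into 'B'.
by_initial = df.groupby(lambda name: name[0]).sum()
```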
df = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'],
                  index=['Alice', 'Bob', 'Candy', 'Dark', 'Emily'])
df
       a  b  c  d  e
Alice  7  9  1  9  1
Bob    6  6  7  1  5
Candy  7  8  5  3  8
Dark   3  4  6  8  1
Emily  1  2  2  1  2
def _dummy_group(idx):
    print(idx)
    return idx

df.groupby(_dummy_group)
Alice
Bob
Candy
Dark
Emily
<pandas.core.groupby.DataFrameGroupBy object at 0x07525650>
df.groupby(_dummy_group, axis=1)
a
b
c
d
e
<pandas.core.groupby.DataFrameGroupBy object at 0x07525A10>
grouped = df.groupby(len)
grouped.sum()
    a   b  c   d   e
3   6   6  7   1   5
4   3   4  6   8   1
5  15  19  8  13  11
grouped. size( )
3 1
4 1
5 3
dtype: int64
grouped.count()
Grouping multi-level data by index level
columns = pd.MultiIndex.from_arrays([['China', 'USA', 'China', 'USA', 'China'],
                                     ['A', 'A', 'B', 'C', 'B']], names=['country', 'index'])
df = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
df
country China USA China USA China
index       A   A     B   C     B
0           9   6     9   6     2
1           5   6     1   8     7
2           2   5     4   5     2
3           4   8     9   4     9
4           7   2     9   1     8
df.groupby(level='country', axis=1).count()
country  China  USA
0            3    2
1            3    2
2            3    2
3            3    2
4            3    2
axis must be specified here, otherwise an error is raised.
df.groupby(level='country', axis=1).sum()
country  China  USA
0           20   12
1           13   14
2            8   10
3           22   12
4           24    3
df.groupby(level='index', axis=1).count()
index  A  B  C
0      2  2  1
1      2  2  1
2      2  2  1
3      2  2  1
4      2  2  1
Aggregation
A grouped computation first splits the data according to some rule and then aggregates each piece; mean() and sum() above are examples of aggregation. During aggregation, each group's data is passed to the aggregation function in turn, and the results are combined into the final output.
Besides built-in aggregations such as sum(), min(), max(), and mean(), you can define your own aggregation functions and apply them with agg() or aggregate().
count() reports the number of non-missing elements, size() reports the size of each group along the grouped axis, and describe() applies a whole battery of aggregation functions at once.
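The difference between count() and size() shows up when a group contains missing values (a small sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'val': [1.0, np.nan, 3.0]})
g = df.groupby('key')

sizes = g.size()           # rows per group, NaN included
counts = g['val'].count()  # non-missing values per group
```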
Built-in aggregation functions
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randint(1, 10, 5),
                   'data2': np.random.randint(1, 10, 5)})
df
   data1  data2 key1 key2
0      9      3    a  one
1      3      8    a  two
2      9      5    b  one
3      8      5    b  two
4      9      2    a  one
df['data1'].groupby(df['key1']).sum()
key1
a 21
b 17
Name: data1, dtype: int32
Custom aggregation functions
def peak_verbose(s):
    print(type(s))
    return s.max() - s.min()

def peak(s):
    return s.max() - s.min()

grouped = df.groupby('key1')
grouped.agg(peak_verbose)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
Applying multiple aggregation functions
grouped['data1', 'data2'].agg(['mean', 'std', peak])
     data1                   data2
      mean       std peak     mean      std peak
key1
a      7.0  3.464102    6  4.333333  3.21455    6
b      8.5  0.707107    1  5.000000  0.00000    0
grouped['data1'].agg([('average', 'mean'), ('max-range', peak)])
      average  max-range
key1
a         7.0          6
b         8.5          1
agg() normally takes a list of functions; to name a result column, put a (name, function) tuple in the list instead of a bare function.
Applying different aggregation functions to different columns
Pass a dict as the argument:
d = {'data1': ['mean', peak, 'max', 'min'],
     'data2': 'sum'}
grouped.agg(d)
     data1              data2
      mean peak max min   sum
key1
a      7.0    6   9   3    13
b      8.5    1   9   8    10
Custom functions are passed without quotes.
Resetting the index
grouped.agg(d).reset_index()
  key1 data1              data2
        mean peak max min   sum
0    a   7.0    6   9   3    13
1    b   8.5    1   9   8    10
df.groupby('key1', as_index=False).agg(d)
  key1 data1              data2
        mean peak max min   sum
0    a   7.0    6   9   3    13
1    b   8.5    1   9   8    10
as_index=False produces the same result as calling reset_index() afterwards: the grouping column stays a regular column instead of becoming the index.
Group-wise transformation and apply
Aggregation is only one special kind of grouped computation; the general pattern is "split - apply - combine". This section introduces transform() and apply() for more general group-wise operations.
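The essential difference, sketched with a tiny Series: agg() collapses each group to a single value, while transform() broadcasts the group result back onto the original rows.

```python
import pandas as pd

s = pd.Series([1, 2, 10], index=['x', 'y', 'z'])
key = ['one', 'one', 'two']

agg_result = s.groupby(key).mean()               # one row per group
trans_result = s.groupby(key).transform('mean')  # one row per original row
```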
transform
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randint(1, 10, 5),
                   'data2': np.random.randint(1, 10, 5)})
df
   data1  data2 key1 key2
0      4      9    a  one
1      5      2    a  two
2      1      9    b  one
3      3      9    b  two
4      1      8    a  one
k1_mean = df.groupby('key1').mean().add_prefix('mean_')
k1_mean
      mean_data1  mean_data2
key1
a       3.333333    6.333333
b       2.000000    9.000000
pd.merge(df, k1_mean, left_on='key1', right_index=True)
   data1  data2 key1 key2  mean_data1  mean_data2
0      4      9    a  one    3.333333    6.333333
1      5      2    a  two    3.333333    6.333333
4      1      8    a  one    3.333333    6.333333
2      1      9    b  one    2.000000    9.000000
3      3      9    b  two    2.000000    9.000000
k1_mean = df.groupby('key1').transform(np.mean).add_prefix('mean_')
k1_mean
   mean_data1  mean_data2
0    3.333333    6.333333
1    3.333333    6.333333
2    2.000000    9.000000
3    2.000000    9.000000
4    3.333333    6.333333
df[k1_mean.columns] = k1_mean
df
   data1  data2 key1 key2  mean_data1  mean_data2
0      4      9    a  one    3.333333    6.333333
1      5      2    a  two    3.333333    6.333333
2      1      9    b  one    2.000000    9.000000
3      3      9    b  two    2.000000    9.000000
4      1      8    a  one    3.333333    6.333333
Demeaning
Demeaning replaces each value with its deviation from the group mean.
df = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'],
                  index=['Alice', 'Bob', 'Candy', 'Dark', 'Emily'])
df
       a  b  c  d  e
Alice  4  8  1  7  6
Bob    4  4  4  9  7
Candy  6  2  2  4  6
Dark   4  2  1  4  5
Emily  4  3  4  2  4
def demean(s):
    return s - s.mean()

key = ['one', 'one', 'two', 'one', 'two']
demeaned = df.groupby(key).transform(demean)
demeaned
       a         b  c         d  e
Alice  0  3.333333 -1  0.333333  0
Bob    0 -0.666667  2  2.333333  1
Candy  1 -0.500000 -1  1.000000  1
Dark   0 -2.666667 -1 -2.666667 -1
Emily -1  0.500000  1 -1.000000 -1
demeaned.groupby(key).mean()
     a             b  c             d  e
one  0 -2.960595e-16  0 -2.960595e-16  0
two  0  0.000000e+00  0  0.000000e+00  0
The apply function
We saw earlier that a DataFrame's apply processes data row by row or column by column. A GroupBy object's apply runs the function once per group.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a', 'a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one', 'one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randint(1, 10, 10),
                   'data2': np.random.randint(1, 10, 10)})
df
   data1  data2 key1 key2
0      3      9    a  one
1      5      9    a  two
2      7      4    b  one
3      7      6    b  two
4      9      7    a  one
5      3      7    a  one
6      3      3    a  two
7      4      5    b  one
8      8      2    b  two
9      7      4    a  one
def top(df, n=2, column='data1'):
    return df.sort_values(by=column, ascending=False)[:n]

top(df, n=5)
   data1  data2 key1 key2
4      9      7    a  one
8      8      2    b  two
2      7      4    b  one
3      7      6    b  two
9      7      4    a  one
df.groupby('key1').apply(top)
        data1  data2 key1 key2
key1
a    4      9      7    a  one
     9      7      4    a  one
b    8      8      2    b  two
     2      7      4    b  one
df.groupby('key1').apply(top, n=3, column='data2')
        data1  data2 key1 key2
key1
a    0      3      9    a  one
     1      5      9    a  two
     4      9      7    a  one
b    3      7      6    b  two
     7      4      5    b  one
     2      7      4    b  one
df.groupby('key1', group_keys=False).apply(top)
   data1  data2 key1 key2
4      9      7    a  one
9      7      4    a  one
8      8      2    b  two
2      7      4    b  one
An apply example: filling missing values with group means
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data
Ohio 0.479291
New York -0.301839
Vermont NaN
Florida 0.476204
Oregon -0.753084
Nevada NaN
California -0.686005
Idaho NaN
dtype: float64
data.groupby(group_key).mean()
East 0.217885
West -0.719545
dtype: float64
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
Ohio 0.479291
New York -0.301839
Vermont 0.217885
Florida 0.476204
Oregon -0.753084
Nevada -0.719545
California -0.686005
Idaho -0.719545
dtype: float64
Loading data into pandas
Indexing: read one or more columns into the DataFrame, deciding whether the index and the column names come from the file
Type inference and conversion: including user-defined conversions and missing-value markers
Date parsing
Iteration: chunk-by-chunk iteration over large files; this is the biggest difference between pandas and Python's built-in csv library
Messy data: skipping certain rows, comments, and so on
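These options can be tried without touching the disk by reading from an in-memory buffer (a sketch; the column names are made up):

```python
import io
import pandas as pd

csv_text = "a,b,message\n1,2,hello\n3,4,world\n"

# index_col promotes a column to the row index during parsing.
df = pd.read_csv(io.StringIO(csv_text), index_col='message')
```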
df = pd.read_csv('data/ex1.csv')
df
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
The file itself can also be inspected with an IPython magic command such as %more.
df = pd.read_csv('data/ex1.csv')
df
df = pd.read_csv('data/ex1.csv', sep=',')
df
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
The delimiter can also be given as a regular expression.
When the file has no header row:
pd.read_csv('data/ex2.csv', header=None)
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
pd.read_csv('data/ex2.csv', header=None, names=['a', 'b', 'c', 'd', 'msg'])
   a   b   c   d    msg
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
pd.read_csv('data/ex2.csv', header=None, names=['a', 'b', 'c', 'd', 'msg'], index_col='msg')
       a   b   c   d
msg
hello  1   2   3   4
world  5   6   7   8
foo    9  10  11  12
pd.read_csv('data/ex2.csv', header=None, names=['a', 'b', 'c', 'd', 'msg'], index_col=['msg', 'a'])
          b   c   d
msg   a
hello 1   2   3   4
world 5   6   7   8
foo   9  10  11  12
Handling irregular delimiters
pd.read_table('data/ex3.csv', sep='\s+')
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491
When the first row has only 3 columns while every row from the second onward has 4, pandas is smart enough to use the first column as the row index and the first row as the column index.
Handling missing values
pd.read_csv('data/ex5.csv')
  something  a   b   c   d message
0       one  1   2   3   4     NaN
1       two  5   6 NaN   8   world
2     three  9  10  11  12     foo
In the source file, the value in row 1, column 6 is the string NA; it could be kept as a string or treated as a missing value. pandas treats it as missing by default.
pd.read_csv('data/ex5.csv', na_values=['NA', 'NULL', 'foo'])
  something  a   b   c   d message
0       one  1   2   3   4     NaN
1       two  5   6 NaN   8   world
2     three  9  10  11  12     NaN
na_values specifies which values should be treated as missing.
pd.read_csv('data/ex5.csv', na_values={'message': ['foo', 'NA'], 'something': ['two']})
  something  a   b   c   d message
0       one  1   2   3   4     NaN
1       NaN  5   6 NaN   8   world
2     three  9  10  11  12     NaN
A dict specifies, per column, which values count as missing.
Reading data in chunks
pd.read_csv('data/ex6.csv', nrows=10)
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q
5  1.817480  0.742273  0.419395 -2.251035   Q
6 -0.776764  0.935518 -0.332872 -1.875641   U
7 -0.913135  1.530624 -0.572657  0.477252   K
8  0.358480 -0.497572 -0.367016  0.507702   S
9 -1.740877 -1.160417 -1.637830  2.172201   G
nrows limits how many rows are read.
tr = pd.read_csv('data/ex6.csv', chunksize=1000)
key_count = pd.Series([])
for pieces in tr:
    key_count = key_count.add(pieces['key'].value_counts(), fill_value=0)

key_count = key_count.sort_values(ascending=False)
key_count[:10]
E 368
X 364
L 346
O 343
Q 340
M 338
J 337
F 335
K 334
H 330
dtype: float64
chunksize splits the file into chunks and returns a pandas.io.parsers.TextFileReader object. That object supports Python iteration, so a for loop can process each 1000-row chunk in turn. fill_value=0 in add() means that keys absent from either Series are treated as 0 instead of producing NaN.
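The same chunked pattern, sketched against an in-memory buffer so it runs without the data file:

```python
import io
import pandas as pd

csv_text = "key\n" + "a\n" * 5 + "b\n" * 3

counts = pd.Series(dtype=float)
for piece in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # fill_value=0 treats keys missing from either side as zero.
    counts = counts.add(piece['key'].value_counts(), fill_value=0)
```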
Saving data to disk
df = pd.read_csv('data/ex5.csv')
df
  something  a   b   c   d message
0       one  1   2   3   4     NaN
1       two  5   6 NaN   8   world
2     three  9  10  11  12     foo
df.to_csv('data/ex5_out.csv')
%more data/ex5_out.csv
df = pd.read_csv('data/ex5_out.csv')
df
   Unnamed: 0 something  a   b   c   d message
0           0       one  1   2   3   4     NaN
1           1       two  5   6 NaN   8   world
2           2     three  9  10  11  12     foo
By default, the index is written out too.
df = pd.read_csv('data/ex5.csv')
df.to_csv('data/ex5_out.csv', index=False)
%more data/ex5_out.csv
df = pd.read_csv('data/ex5.csv')
df.to_csv('data/ex5_out.csv', index=False, header=None)
%more data/ex5_out.csv
df = pd.read_csv('data/ex5.csv')
df.to_csv('data/ex5_out.csv', index=False, sep='|')
%more data/ex5_out.csv
df = pd.read_csv('data/ex5.csv')
df.to_csv('data/ex5_out.csv', index=False, columns=['a', 'b', 'message'])
%more data/ex5_out.csv
Other formats
HDF5: a library implemented in C for efficiently reading binary scientific data stored on disk
Excel files: pd.read_excel / pd.ExcelFile / pd.ExcelWriter
JSON: convert to a dict via the json module, then to a DataFrame
SQL databases: read via the pd.io.sql module
NoSQL (MongoDB) databases: use the corresponding driver module, such as pymongo, read the data out through a cursor, and convert it to a DataFrame
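pandas also ships built-in JSON helpers; a round trip through a JSON string looks like this (a sketch):

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# Serialize to a JSON string, then parse it back into a DataFrame.
json_text = df.to_json()
restored = pd.read_json(io.StringIO(json_text))
```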
Dates and times
Timestamp: a fixed instant in time -> pd.Timestamp. Period: a fixed span, such as March 2016 or 2015 annual sales -> pd.Period. Interval: described by a start time and an end time; a period is a special case of an interval.
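The first two concepts map directly onto pandas objects (a sketch):

```python
import pandas as pd

t = pd.Timestamp('2016-03-20 08:30')  # a fixed instant
p = pd.Period('2016-03', freq='M')    # the whole month of March 2016

# A period knows the span of time it covers.
start, end = p.start_time, p.end_time
```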
How dates and times are used in pandas
Python datetime
The Python standard library provides date and time handling; it is the foundation for everything in this section.
from datetime import datetime
from datetime import timedelta

now = datetime.now()
now
datetime.datetime(2016, 4, 28, 14, 49, 51, 307000)
now.year, now.month, now.day
(2016, 4, 28)
date1 = datetime(2016, 3, 20)
date2 = datetime(2016, 3, 16)
delta = date1 - date2
delta
datetime.timedelta(4)
delta.days
4
delta.total_seconds()
345600.0
date2 + delta
datetime.datetime(2016, 3, 20, 0, 0)
date2 + timedelta(4.5)
datetime.datetime(2016, 3, 20, 12, 0)
Converting between strings and datetime
See the official Python documentation for the datetime format codes.
date = datetime(2016, 3, 20, 8, 30)
date
datetime.datetime(2016, 3, 20, 8, 30)
str(date)
'2016-03-20 08:30:00'
strftime takes a format string:
date.strftime('%Y-%m-%d %H:%M:%S')
'2016-03-20 08:30:00'
datetime.strptime('2016-03-20 09:30', '%Y-%m-%d %H:%M')
datetime.datetime(2016, 3, 20, 9, 30)
Time series in pandas
pandas uses Timestamp to represent points in time.
The most typical pattern is to create a Series indexed by a list of datetimes.
dates = [datetime(2016, 3, 1), datetime(2016, 3, 2), datetime(2016, 3, 3), datetime(2016, 3, 4)]
s = pd.Series(np.random.randn(4), index=dates)
s
2016-03-01    1.650889
2016-03-02   -0.328463
2016-03-03    1.674872
2016-03-04   -0.310849
dtype: float64
type(s.index)
pandas.tseries.index.DatetimeIndex
type(s.index[0])
pandas.tslib.Timestamp
pandas.tslib.Timestamp is the type that represents a single timestamp.
Date ranges
Generating date ranges
pandas has a convenient way of generating time series:
pd.date_range('20160320', '20160331')
DatetimeIndex(['2016-03-20', '2016-03-21', '2016-03-22', '2016-03-23',
               '2016-03-24', '2016-03-25', '2016-03-26', '2016-03-27',
               '2016-03-28', '2016-03-29', '2016-03-30', '2016-03-31'],
              dtype='datetime64[ns]', freq='D')
If you don't give both a start and an end, pass start explicitly together with periods:
pd.date_range(start='20160320', periods=10)
DatetimeIndex(['2016-03-20', '2016-03-21', '2016-03-22', '2016-03-23',
               '2016-03-24', '2016-03-25', '2016-03-26', '2016-03-27',
               '2016-03-28', '2016-03-29'],
              dtype='datetime64[ns]', freq='D')
normalize=True strips the time-of-day part from each timestamp:
pd.date_range(start='2016-03-20 16:23:32', periods=10, normalize=True)
DatetimeIndex(['2016-03-20', '2016-03-21', '2016-03-22', '2016-03-23',
               '2016-03-24', '2016-03-25', '2016-03-26', '2016-03-27',
               '2016-03-28', '2016-03-29'],
              dtype='datetime64[ns]', freq='D')
Frequencies
pd.date_range(start='20160320', periods=10, freq='W')
DatetimeIndex(['2016-03-20', '2016-03-27', '2016-04-03', '2016-04-10',
'2016-04-17', '2016-04-24', '2016-05-01', '2016-05-08',
'2016-05-15', '2016-05-22'],
dtype='datetime64[ns]', freq='W-SUN')
pd.date_range(start='20160320', periods=10, freq='M')
DatetimeIndex(['2016-03-31', '2016-04-30', '2016-05-31', '2016-06-30',
'2016-07-31', '2016-08-31', '2016-09-30', '2016-10-31',
'2016-11-30', '2016-12-31'],
dtype='datetime64[ns]', freq='M')
pd.date_range(start='20160320', periods=10, freq='BM')
DatetimeIndex(['2016-03-31', '2016-04-29', '2016-05-31', '2016-06-30',
'2016-07-29', '2016-08-31', '2016-09-30', '2016-10-31',
'2016-11-30', '2016-12-30'],
dtype='datetime64[ns]', freq='BM')
pd.date_range(start='20160320', periods=10, freq='4H')
DatetimeIndex(['2016-03-20 00:00:00', '2016-03-20 04:00:00',
'2016-03-20 08:00:00', '2016-03-20 12:00:00',
'2016-03-20 16:00:00', '2016-03-20 20:00:00',
'2016-03-21 00:00:00', '2016-03-21 04:00:00',
'2016-03-21 08:00:00', '2016-03-21 12:00:00'],
dtype='datetime64[ns]', freq='4H')
pd.Period represents a period: a day, a month, several months, and so on. Monthly sales totals, for example, are naturally expressed with periods.
p1 = pd.Period(2010)
p1
Period('2010', 'A-DEC')
p2 = p1 + 2
p2
Period('2012', 'A-DEC')
In A-DEC, A means an annual period and DEC means the year ends in December.
p2 - p1
2L
p1 = pd.Period(2016, freq='M')
p1
Period('2016-01', 'M')
p1 + 3
Period('2016-04', 'M')
Period ranges
pd.period_range(start='2016-01', periods=12, freq='M')
PeriodIndex(['2016-01', '2016-02', '2016-03', '2016-04', '2016-05', '2016-06',
             '2016-07', '2016-08', '2016-09', '2016-10', '2016-11', '2016-12'],
            dtype='int64', freq='M')
pd.period_range(start='2016-01', end='2016-10', freq='M')
PeriodIndex(['2016-01', '2016-02', '2016-03', '2016-04', '2016-05', '2016-06',
             '2016-07', '2016-08', '2016-09', '2016-10'],
            dtype='int64', freq='M')
Note: this returns a PeriodIndex, which is different from a DatetimeIndex.
index = pd.PeriodIndex(['2016Q1', '2016Q2', '2016Q3'], freq='Q-DEC')
index
PeriodIndex(['2016Q1', '2016Q2', '2016Q3'], dtype='int64', freq='Q-DEC')
Q denotes a quarter; freq can also simply be 'Q'.
Frequency conversion for periods
A-DEC: an annual period ending in December. A-NOV: an annual period ending in November. Q-DEC: a quarterly period ending in December.
p = pd.Period('2016', freq='A-DEC')
p
Period('2016', 'A-DEC')
p.asfreq('M', how='start')
Period('2016-01', 'M')
p.asfreq('M', how='end')
Period('2016-12', 'M')
The default is how='end', so converting to monthly frequency yields December 2016.
p = pd.Period('2016-04', freq='M')
p
Period('2016-04', 'M')
p.asfreq('A-DEC')
Period('2016', 'A-DEC')
p.asfreq('A-MAR')
Period('2017', 'A-MAR')
pandas supports 12 quarterly frequencies, Q-JAN through Q-DEC.
p = pd.Period('2016Q4', 'Q-JAN')
p
Period('2016Q4', 'Q-JAN')
p.asfreq('D', how='start'), p.asfreq('D', how='end')
(Period('2015-11-01', 'D'), Period('2016-01-31', 'D'))
p4pm = (p.asfreq('B', how='end') - 1).asfreq('T', 'start') + 16 * 60
p4pm
Period('2016-01-28 16:00', 'T')
p4pm.to_timestamp()
Timestamp('2016-01-28 16:00:00')
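The p4pm chain above packs several steps into one line; unpacked, it reads as follows (a sketch; 'min' is the minute alias, written 'T' in older pandas):

```python
import pandas as pd

p = pd.Period('2016Q4', 'Q-JAN')            # quarter ending in January 2016

last_bday = p.asfreq('B', how='end')        # last business day of the quarter
prev_bday = last_bday - 1                   # one business day earlier
minute0 = prev_bday.asfreq('min', 'start')  # 00:00 that day, minute frequency
p4pm = minute0 + 16 * 60                    # advance 960 minutes to 16:00
```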
Converting between Timestamp and Period
ts = pd.Series(np.random.randn(5), index=pd.date_range('2016-01-01', periods=5, freq='M'))
ts
2016-01-31 -0.773323
2016-02-29 0.215953
2016-03-31 1.301631
2016-04-30 -0.066134
2016-05-31 1.651792
Freq: M, dtype: float64
ts.to_period()
2016-01 -0.773323
2016-02 0.215953
2016-03 1.301631
2016-04 -0.066134
2016-05 1.651792
Freq: M, dtype: float64
With monthly frequency, each date above is the last day of its month, which also shows how many days that month has.
ts = pd.Series(np.random.randn(5), index=pd.date_range('2016-12-29', periods=5, freq='D'))
ts
2016-12-29 -0.110462
2016-12-30 -0.265792
2016-12-31 -0.382456
2017-01-01 -0.036111
2017-01-02 -1.029658
Freq: D, dtype: float64
pts = ts.to_period(freq='M')
pts
2016-12 -0.110462
2016-12 -0.265792
2016-12 -0.382456
2017-01 -0.036111
2017-01 -1.029658
Freq: M, dtype: float64
pts.groupby(level=0).sum()
2016-12 -0.758711
2017-01 -1.065769
Freq: M, dtype: float64
pts.to_timestamp(how='end')
2016-12-31 -0.110462
2016-12-31 -0.265792
2016-12-31 -0.382456
2017-01-31 -0.036111
2017-01-31 -1.029658
dtype: float64
Converting back like this can lose precision; by default a period is converted to the timestamp at its start. Here we go from a coarser unit to a finer one.
Resampling
Downsampling (high frequency -> low frequency): e.g. turning 5-minute stock trades into daily data. Upsampling (low frequency -> high frequency). Other resampling: e.g. every Wednesday (W-WED) to every Friday (W-FRI).
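Both directions in one sketch (note that in newer pandas the aggregation is a method on the resampler rather than a how= argument, and the minute alias is 'min'):

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.arange(6),
               index=pd.date_range('2016-04-25 09:30', periods=6, freq='min'))

# Downsample: 1-minute values -> 3-minute sums.
down = ts.resample('3min').sum()

# Upsample: 3-minute sums -> 1-minute rows, forward-filling the gaps.
up = down.resample('min').ffill()
```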
ts = pd.Series(np.random.randint(0, 50, 60), index=pd.date_range('2016-04-25 09:30', periods=60, freq='T'))
ts
2016-04-25 09:30:00 18
2016-04-25 09:31:00 41
2016-04-25 09:32:00 49
2016-04-25 09:33:00 26
2016-04-25 09:34:00 5
2016-04-25 09:35:00 12
2016-04-25 09:36:00 6
2016-04-25 09:37:00 47
2016-04-25 09:38:00 16
2016-04-25 09:39:00 37
2016-04-25 09:40:00 44
2016-04-25 09:41:00 8
2016-04-25 09:42:00 22
2016-04-25 09:43:00 24
2016-04-25 09:44:00 12
2016-04-25 09:45:00 26
2016-04-25 09:46:00 30
2016-04-25 09:47:00 38
2016-04-25 09:48:00 5
2016-04-25 09:49:00 26
2016-04-25 09:50:00 39
2016-04-25 09:51:00 7
2016-04-25 09:52:00 6
2016-04-25 09:53:00 12
2016-04-25 09:54:00 24
2016-04-25 09:55:00 0
2016-04-25 09:56:00 12
2016-04-25 09:57:00 27
2016-04-25 09:58:00 10
2016-04-25 09:59:00 26
2016-04-25 10:00:00 27
2016-04-25 10:01:00 18
2016-04-25 10:02:00 27
2016-04-25 10:03:00 25
2016-04-25 10:04:00 25
2016-04-25 10:05:00 35
2016-04-25 10:06:00 28
2016-04-25 10:07:00 3
2016-04-25 10:08:00 20
2016-04-25 10:09:00 48
2016-04-25 10:10:00 5
2016-04-25 10:11:00 48
2016-04-25 10:12:00 30
2016-04-25 10:13:00 2
2016-04-25 10:14:00 11
2016-04-25 10:15:00 18
2016-04-25 10:16:00 21
2016-04-25 10:17:00 32
2016-04-25 10:18:00 43
2016-04-25 10:19:00 10
2016-04-25 10:20:00 5
2016-04-25 10:21:00 45
2016-04-25 10:22:00 3
2016-04-25 10:23:00 30
2016-04-25 10:24:00 3
2016-04-25 10:25:00 24
2016-04-25 10:26:00 46
2016-04-25 10:27:00 2
2016-04-25 10:28:00 33
2016-04-25 10:29:00 25
Freq: T, dtype: int32
T stands for minutes, since M is already taken by months.
ts.resample('5min', how='sum')
2016-04-25 09:30:00 139
2016-04-25 09:35:00 118
2016-04-25 09:40:00 110
2016-04-25 09:45:00 125
2016-04-25 09:50:00 88
2016-04-25 09:55:00 75
2016-04-25 10:00:00 122
2016-04-25 10:05:00 134
2016-04-25 10:10:00 96
2016-04-25 10:15:00 124
2016-04-25 10:20:00 86
2016-04-25 10:25:00 130
Freq: 5T, dtype: int32
The aggregation used here is summation.
ts.resample('5min', how='sum', label='right')
2016-04-25 09:35:00 139
2016-04-25 09:40:00 118
2016-04-25 09:45:00 110
2016-04-25 09:50:00 125
2016-04-25 09:55:00 88
2016-04-25 10:00:00 75
2016-04-25 10:05:00 122
2016-04-25 10:10:00 134
2016-04-25 10:15:00 96
2016-04-25 10:20:00 124
2016-04-25 10:25:00 86
2016-04-25 10:30:00 130
Freq: 5T, dtype: int32
label='right' labels each 5-minute bin with its right (end) edge.
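label= controls which bin edge names each bin (a sketch, using the newer method-call resample API):

```python
import pandas as pd

ts = pd.Series([1, 1, 1, 1],
               index=pd.date_range('2016-04-25 09:30', periods=4, freq='min'))

left = ts.resample('2min').sum()                  # bins named by left edge
right = ts.resample('2min', label='right').sum()  # same bins, right-edge names
```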
OHLC resampling
Open/High/Low/Close, commonly used for financial data:
ts.resample('5min', how='ohlc')
                     open  high  low  close
2016-04-25 09:30:00    18    49    5      5
2016-04-25 09:35:00    12    47    6     37
2016-04-25 09:40:00    44    44    8     12
2016-04-25 09:45:00    26    38    5     26
2016-04-25 09:50:00    39    39    6     24
2016-04-25 09:55:00     0    27    0     26
2016-04-25 10:00:00    27    27   18     25
2016-04-25 10:05:00    35    48    3     48
2016-04-25 10:10:00     5    48    2     11
2016-04-25 10:15:00    18    43   10     10
2016-04-25 10:20:00     5    45    3      3
2016-04-25 10:25:00    24    46    2     25
ts = pd.Series(np.random.randint(0, 50, 100), index=pd.date_range('2016-03-01', periods=100, freq='D'))
ts
2016-03-01 13
2016-03-02 21
2016-03-03 26
2016-03-04 3
2016-03-05 31
2016-03-06 29
2016-03-07 42
2016-03-08 24
2016-03-09 10
2016-03-10 42
2016-03-11 42
2016-03-12 7
2016-03-13 10
2016-03-14 48
2016-03-15 12
2016-03-16 15
2016-03-17 16
2016-03-18 34
2016-03-19 45
2016-03-20 40
2016-03-21 45
2016-03-22 46
2016-03-23 21
2016-03-24 27
2016-03-25 10
2016-03-26 47
2016-03-27 8
2016-03-28 9
2016-03-29 0
2016-03-30 20
..
2016-05-10 38
2016-05-11 46
2016-05-12 8
2016-05-13 15
2016-05-14 13
2016-05-15 30
2016-05-16 25
2016-05-17 15
2016-05-18 3
2016-05-19 5
2016-05-20 21
2016-05-21 18
2016-05-22 11
2016-05-23 47
2016-05-24 14
2016-05-25 33
2016-05-26 37
2016-05-27 40
2016-05-28 5
2016-05-29 27
2016-05-30 2
2016-05-31 31
2016-06-01 31
2016-06-02 41
2016-06-03 28
2016-06-04 2
2016-06-05 21
2016-06-06 10
2016-06-07 21
2016-06-08 18
Freq: D, dtype: int32
ts.groupby(lambda x: x.month).sum()
3 759
4 648
5 748
6 172
dtype: int32
ts.groupby(ts.index.to_period('M')).sum()
2016-03 759
2016-04 648
2016-05 748
2016-06 172
Freq: M, dtype: int32
Upsampling and interpolation
df = pd.DataFrame(np.random.randint(1, 50, 2), index=pd.date_range('2016-04-22', periods=2, freq='W-FRI'))
df
df.resample('D')
             0
2016-04-22  10
2016-04-23 NaN
2016-04-24 NaN
2016-04-25 NaN
2016-04-26 NaN
2016-04-27 NaN
2016-04-28 NaN
2016-04-29   6
df.resample('D', fill_method='ffill', limit=3)
             0
2016-04-22  10
2016-04-23  10
2016-04-24  10
2016-04-25  10
2016-04-26 NaN
2016-04-27 NaN
2016-04-28 NaN
2016-04-29   6
df.resample('W-MON', fill_method='ffill')
Because the 22nd is a Friday, resampling to W-MON produces NaN unless the values are filled.
Resampling periods
df = pd.DataFrame(np.random.randint(2, 30, (24, 4)),
                  index=pd.period_range('2015-01', '2016-12', freq='M'),
                  columns=list('ABCD'))
df
          A   B   C   D
2015-01  20   7  22  18
2015-02   2  28  21  19
2015-03  13  17  12   7
2015-04  24  17  20  14
2015-05  15  13  15  20
2015-06  19  28   2  22
2015-07  20   7   2  27
2015-08  10  18   2  16
2015-09  17  24  11   9
2015-10  23   2  21  25
2015-11  24   3  19   8
2015-12   7  16   6  12
2016-01  18  13   8  15
2016-02  17  14   2  21
2016-03  17   6   5  24
2016-04  24  14  22  14
2016-05  16  14  20  14
2016-06  26  29  14  15
2016-07   2  11  11   2
2016-08  12  11  17  18
2016-09  19  21   4  16
2016-10  21  16  11   7
2016-11  16  23   2  22
2016-12  21   9  27  11
adf = df.resample('A-DEC', how='mean')
adf
              A          B          C          D
2015  16.166667  15.000000  12.750000  16.416667
2016  17.416667  15.083333  11.916667  14.916667
df.resample('A-MAY', how='mean')
              A          B          C      D
2015  14.800000  16.400000  18.000000  15.60
2016  17.666667  13.250000  10.000000  17.25
2017  16.714286  17.142857  12.285714  13.00
adf.resample('Q-DEC')
                A          B          C          D
2015Q1  16.166667  15.000000  12.750000  16.416667
2015Q2        NaN        NaN        NaN        NaN
2015Q3        NaN        NaN        NaN        NaN
2015Q4        NaN        NaN        NaN        NaN
2016Q1  17.416667  15.083333  11.916667  14.916667
2016Q2        NaN        NaN        NaN        NaN
2016Q3        NaN        NaN        NaN        NaN
2016Q4        NaN        NaN        NaN        NaN
adf.resample('Q-DEC', fill_method='ffill')
                A          B          C          D
2015Q1  16.166667  15.000000  12.750000  16.416667
2015Q2  16.166667  15.000000  12.750000  16.416667
2015Q3  16.166667  15.000000  12.750000  16.416667
2015Q4  16.166667  15.000000  12.750000  16.416667
2016Q1  17.416667  15.083333  11.916667  14.916667
2016Q2  17.416667  15.083333  11.916667  14.916667
2016Q3  17.416667  15.083333  11.916667  14.916667
2016Q4  17.416667  15.083333  11.916667  14.916667
从文件中读取日期序列
df = pd. read_csv( 'data/002001.csv' , index_col= 'Date' )
df
             Open   High    Low  Close    Volume  Adj Close
Date
2015-12-22  16.86  17.13  16.48  16.95  13519900      16.95
2015-12-21  16.31  17.00  16.20  16.85  14132200      16.85
2015-12-18  16.59  16.70  16.21  16.31  10524300      16.31
2015-12-17  16.28  16.75  16.16  16.60  12326500      16.60
2015-12-16  16.23  16.42  16.05  16.28   8026000      16.28
...           ...    ...    ...    ...       ...        ...
2015-10-07  13.41  13.41  13.41  13.41         0      13.41
2015-10-06  13.41  13.41  13.41  13.41         0      13.41
2015-10-05  13.41  13.41  13.41  13.41         0      13.41
2015-10-02  13.41  13.41  13.41  13.41         0      13.41
2015-10-01  13.41  13.41  13.41  13.41         0      13.41

[59 rows x 6 columns]
df.index
Index([u'2015-12-22', u'2015-12-21', u'2015-12-18', u'2015-12-17',
u'2015-12-16', u'2015-12-15', u'2015-12-14', u'2015-12-11',
u'2015-12-10', u'2015-12-09', u'2015-12-08', u'2015-12-07',
u'2015-12-04', u'2015-12-03', u'2015-12-02', u'2015-12-01',
u'2015-11-30', u'2015-11-27', u'2015-11-26', u'2015-11-25',
u'2015-11-24', u'2015-11-23', u'2015-11-20', u'2015-11-19',
u'2015-11-18', u'2015-11-17', u'2015-11-16', u'2015-11-13',
u'2015-11-12', u'2015-11-11', u'2015-11-10', u'2015-11-09',
u'2015-11-06', u'2015-11-05', u'2015-11-04', u'2015-11-03',
u'2015-11-02', u'2015-10-30', u'2015-10-29', u'2015-10-28',
u'2015-10-27', u'2015-10-26', u'2015-10-23', u'2015-10-22',
u'2015-10-21', u'2015-10-20', u'2015-10-19', u'2015-10-16',
u'2015-10-15', u'2015-10-14', u'2015-10-13', u'2015-10-12',
u'2015-10-09', u'2015-10-08', u'2015-10-07', u'2015-10-06',
u'2015-10-05', u'2015-10-02', u'2015-10-01'],
dtype='object', name=u'Date')
df = pd.read_csv('data/002001.csv', index_col='Date', parse_dates=True)
df.index
DatetimeIndex(['2015-12-22', '2015-12-21', '2015-12-18', '2015-12-17',
'2015-12-16', '2015-12-15', '2015-12-14', '2015-12-11',
'2015-12-10', '2015-12-09', '2015-12-08', '2015-12-07',
'2015-12-04', '2015-12-03', '2015-12-02', '2015-12-01',
'2015-11-30', '2015-11-27', '2015-11-26', '2015-11-25',
'2015-11-24', '2015-11-23', '2015-11-20', '2015-11-19',
'2015-11-18', '2015-11-17', '2015-11-16', '2015-11-13',
'2015-11-12', '2015-11-11', '2015-11-10', '2015-11-09',
'2015-11-06', '2015-11-05', '2015-11-04', '2015-11-03',
'2015-11-02', '2015-10-30', '2015-10-29', '2015-10-28',
'2015-10-27', '2015-10-26', '2015-10-23', '2015-10-22',
'2015-10-21', '2015-10-20', '2015-10-19', '2015-10-16',
'2015-10-15', '2015-10-14', '2015-10-13', '2015-10-12',
'2015-10-09', '2015-10-08', '2015-10-07', '2015-10-06',
'2015-10-05', '2015-10-02', '2015-10-01'],
dtype='datetime64[ns]', name=u'Date', freq=None)
By default pandas does not parse the dates, so the index comes back as a plain object (string) index. Passing `parse_dates=True` makes pandas parse the index into datetimes, producing a `DatetimeIndex`. Once the index holds real datetimes, time-aware operations such as sorting by date and resampling become available.
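The difference is easy to see on a tiny inline CSV standing in for `data/002001.csv` (the sample rows below are illustrative):

```python
import io
import pandas as pd

# Two sample rows mimicking the stock file's layout.
raw = "Date,Open,Close\n2015-12-22,16.86,16.95\n2015-12-21,16.31,16.85\n"

df = pd.read_csv(io.StringIO(raw), index_col='Date')
print(df.index.dtype)   # object -- the dates stayed as plain strings

df = pd.read_csv(io.StringIO(raw), index_col='Date', parse_dates=True)
print(df.index.dtype)   # datetime64[ns] -- a real DatetimeIndex
```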
wdf = df['Adj Close'].resample('W-FRI').ohlc()
wdf
             open   high    low  close
Date
2015-10-02  13.41  13.41  13.41  13.41
2015-10-09  13.41  14.75  13.41  14.62
2015-10-16  15.30  15.30  14.73  15.25
2015-10-23  15.03  15.22  14.26  15.20
2015-10-30  15.18  15.30  15.02  15.22
2015-11-06  14.74  15.86  14.62  15.86
2015-11-13  16.02  16.59  15.95  15.95
2015-11-20  16.21  16.22  15.75  16.08
2015-11-27  16.05  16.94  15.54  15.54
2015-12-04  15.70  16.62  15.70  16.62
2015-12-11  16.63  16.63  15.56  15.62
2015-12-18  16.06  16.60  16.06  16.31
2015-12-25  16.85  16.95  16.85  16.95
wdf['Volume'] = df['Volume'].resample('W-FRI').sum()
wdf
             open   high    low  close    Volume
Date
2015-10-02  13.41  13.41  13.41  13.41         0
2015-10-09  13.41  14.75  13.41  14.62  42135700
2015-10-16  15.30  15.30  14.73  15.25  73234300
2015-10-23  15.03  15.22  14.26  15.20  69848500
2015-10-30  15.18  15.30  15.02  15.22  64253700
2015-11-06  14.74  15.86  14.62  15.86  67429500
2015-11-13  16.02  16.59  15.95  15.95  87379300
2015-11-20  16.21  16.22  15.75  16.08  41097000
2015-11-27  16.05  16.94  15.54  15.54  64976300
2015-12-04  15.70  16.62  15.70  16.62  53552100
2015-12-11  16.63  16.63  15.56  15.62  42382600
2015-12-18  16.06  16.60  16.06  16.31  45879500
2015-12-25  16.85  16.95  16.85  16.95  27652100
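The weekly OHLC logic can be checked on synthetic data, where the expected buckets are easy to compute by hand (dates and values below are made up for the demonstration):

```python
import numpy as np
import pandas as pd

# Ten business days of synthetic "prices" 0..9.
idx = pd.date_range('2015-10-01', periods=10, freq='B')
s = pd.Series(np.arange(10.0), index=idx)

# Group into weeks ending on Friday and compute open/high/low/close per week.
weekly = s.resample('W-FRI').ohlc()
print(weekly)
# The week ending 2015-10-09 covers values 2..6, so open=2, high=6, low=2, close=6.
```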
Data Visualization
Pandas builds its data visualization on top of matplotlib; see the matplotlib documentation for the underlying details. This section covers the plotting conveniences pandas adds on top of matplotlib. To render plots inline in a Jupyter notebook, enable `%matplotlib inline` first.
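Outside a notebook there is no `%matplotlib inline`; a rough script-mode equivalent saves the figure to a file instead (the filename `line.png` is just an example):

```python
import matplotlib
matplotlib.use('Agg')          # file-only backend; in Jupyter use %matplotlib inline instead
import numpy as np
import pandas as pd

ts = pd.Series(np.random.randn(100)).cumsum()
ax = ts.plot()                 # pandas returns the matplotlib Axes it drew on
ax.figure.savefig('line.png')  # scripts save figures rather than displaying them inline
```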
Line Plots
Both Series and DataFrame provide a `plot` method that draws a line plot directly.
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.describe()
count 1000.000000
mean 3.526470
std 16.243923
min -20.683881
25% -9.300320
50% -1.758149
75% 13.224696
max 42.878495
dtype: float64
Appending a semicolon suppresses the `<matplotlib.axes...>` text output, so only the figure is shown:
ts.plot();
ts.plot(title='cumsum', style='r-', ylim=[-30, 30], figsize=(4, 3));
`title` sets the plot title, `style` the line style, `figsize` the figure size, and `ylim` the y-axis limits.
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))
df = df.cumsum()
df.describe()
                 A            B            C            D
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -9.152453    28.751105     6.435306     4.764281
std      19.177143    11.089815     7.502915     5.644176
min     -52.401351    -2.245964    -9.804085    -7.966805
25%     -22.831177    21.594373    -0.634420     0.299640
50%      -3.333172    28.531865     8.477188     4.178048
75%       5.859582    37.280177    12.433584     9.672910
max      22.480837    50.740147    21.215485    19.161365
df.plot(title='DataFrame cumsum', figsize=(4, 12), subplots=True, sharex=True, sharey=True);
`subplots=True` draws each column in its own subplot; `sharex` and `sharey` make the subplots share the x- and y-axes.
df['I'] = np.arange(len(df))
df.plot(x='I', y=['A', 'C'])
You can choose which columns to use as the x- and y-axes.
Bar Charts
df = pd.DataFrame(np.random.rand(10, 4), columns=['A', 'B', 'C', 'D'])
df
          A         B         C         D
0  0.251322  0.710449  0.677939  0.187400
1  0.082089  0.793091  0.021801  0.287124
2  0.911240  0.802222  0.429064  0.956067
3  0.477370  0.234396  0.303595  0.945137
4  0.471124  0.205507  0.499160  0.887754
5  0.389382  0.062298  0.110157  0.705081
6  0.068719  0.325747  0.951594  0.501131
7  0.365177  0.950995  0.898784  0.822838
8  0.022807  0.721682  0.651013  0.212020
9  0.497862  0.596736  0.486710  0.672207
df.iloc[1].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x8cb5030>
Select the second row (position 1) and draw it as a bar chart. (`df.ix` from older pandas has been removed; use `df.iloc` for positional selection.)
df.iloc[1].plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x8cec490>
df.plot.bar(stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0x8f2bf70>
df.plot.barh(stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0x9676070>
Histograms
A histogram is a bar chart of discretized value frequencies: the data points are sorted into discrete, evenly spaced bins, and the count of points in each bin is plotted.
df = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),
                   'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
df['a'].plot.hist(bins=10)
The histogram splits the range between the minimum and maximum into 10 equal-width bins and counts the points in each bin.
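The same bucketing can be reproduced with `np.histogram`, which is what makes the "10 equal bins between min and max" claim easy to verify:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(1, 1, 1000)  # same shape of data as column 'a' above

# 10 equal-width bins spanning [data.min(), data.max()], as plot.hist(bins=10) uses.
counts, edges = np.histogram(data, bins=10)
print(len(edges))    # 11 edges delimit 10 bins
print(counts.sum())  # 1000 -- every point lands in exactly one bin
```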
df.plot.hist(subplots=True, sharex=True, sharey=True)
array([<matplotlib.axes._subplots.AxesSubplot object at 0x087FF030>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0878F5B0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x086AFFB0>], dtype=object)
`alpha` sets the transparency; without it the overlapping histograms would hide one another.
df.plot.hist(alpha=0.5)
<matplotlib.axes._subplots.AxesSubplot at 0x85ed350>
`stacked=True` stacks the histograms on top of one another, and `grid=True` adds a grid to the figure.
df.plot.hist(stacked=True, bins=20, grid=True)
<matplotlib.axes._subplots.AxesSubplot at 0x9712bb0>
Density Plots
The normal (Gaussian) distribution is one of the most widely occurring densities in nature; quantities such as human height and IQ scores are often modeled as approximately normal.
Plot the estimated probability density:
df['a'].plot.kde()
df.plot.kde()
df.mean()
a 0.947151
b -0.026500
c -1.039657
dtype: float64
df.std()
a 1.014101
b 1.028685
c 1.030285
dtype: float64
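The means and standard deviations printed above sit close to the parameters the columns were built from (means 1, 0, -1 and std 1). With a larger sample the agreement tightens; a quick check, with an arbitrary seed and sample size:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
big = pd.DataFrame({'a': rng.normal(1, 1, 100_000),
                    'b': rng.normal(0, 1, 100_000),
                    'c': rng.normal(-1, 1, 100_000)})

# Sample statistics converge to the construction parameters as n grows.
print(big.mean().round(2))
print(big.std().round(2))
```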
Scatter Plots
A scatter plot draws every observation as a point in a shared coordinate system. It is an effective way to inspect the relationship between two one-dimensional datasets.
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot.scatter(x='a', y='b');
df = pd.DataFrame({'a': np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)]),
                   'b': np.concatenate([np.random.normal(10, 2, 200), np.random.normal(0, 2, 200)]),
                   'c': np.concatenate([np.random.normal(10, 4, 200), np.random.normal(0, 4, 200)])})
df.describe()
                a           b           c
count  400.000000  400.000000  400.000000
mean     2.974366    5.035267    5.011259
std      3.158428    5.492017    6.572784
min     -3.149143   -5.022496  -13.671181
25%      0.079920   -0.130532   -0.128582
50%      2.800326    5.587717    4.901682
75%      5.923731   10.139781   10.002299
max      8.688234   16.233516   19.855104
df.plot.scatter(x='a', y='b')
Pie Charts
s = pd.Series(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], name='series')
s
a 0.505096
b 0.295620
c 2.855217
d 2.716670
Name: series, dtype: float64
s.plot.pie(figsize=(6, 6))
`autopct` controls how the percentage labels are formatted (here, two decimal places), and `fontsize` sets the label font size.
s.plot.pie(labels=['AA', 'BB', 'CC', 'DD'], colors=['r', 'g', 'b', 'c'],
           autopct='%.2f', fontsize=20, figsize=(6, 6))
Advanced Plotting
The higher-level plotting helpers live in the `pandas.tools.plotting` package (in recent pandas releases they moved to `pandas.plotting`).
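One example of these helpers is `scatter_matrix`, which draws pairwise scatter plots of all columns at once. A minimal sketch, using the modern `pandas.plotting` location and a headless matplotlib backend so it runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed inside Jupyter
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix  # pandas.tools.plotting in old releases

df = pd.DataFrame(np.random.randn(200, 3), columns=['a', 'b', 'c'])

# Off-diagonal panels are pairwise scatter plots; the diagonal shows each
# column's KDE instead of the default histogram.
axes = scatter_matrix(df, figsize=(6, 6), diagonal='kde')
print(axes.shape)  # a 3x3 grid of subplots for 3 columns
```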