这里是用jupyter notebook写的pandas的基本用法,使用的环境是python3+windows,代码上传到csdn资源啦:ABC of Pandas
关于pandas学习还是强烈建议常去官方http://pandas.pydata.org/pandas-docs/stable/里查一查各种用法和toturial等。
下面是jupyter notebook代码导出的md文件。
数据结构 Series
from pandas import Series
print('用数组生成Series')
obj = Series([4, 7, -5, 3])
print(obj)
print(obj.values)
print(obj.index)
print()
print('指定Series的index')
obj2 = Series([4, 7, -5, 3], index = ['d', 'b', 'a', 'c'])
print(obj2)
print(obj2.index)
print(obj2['a'])
obj2['d'] = 6
print(obj2[['c', 'a', 'd']])
print(obj2[obj2 > 0]) # 找出大于0的元素
print('b' in obj2) # 判断索引是否存在
print('e' in obj2)
print()
print('使用字典生成Series')
sdata = {'Ohio':45000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
obj3 = Series(sdata)
print(obj3)
print()
print('使用字典生成Series,并额外指定index,不匹配部分为NaN。')
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index = states)
print(obj4)
print()
print('Series相加,相同索引部分相加。')
print(obj3 + obj4)
print()
print('指定Series及其索引的名字')
obj4.name = 'population'
obj4.index.name = 'state'
print(obj4)
print()
print('替换index')
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
print(obj)
用数组生成Series
0 4
1 7
2 -5
3 3
dtype: int64
[ 4 7 -5 3]
RangeIndex(start=0, stop=4, step=1)
指定Series的index
d 4
b 7
a -5
c 3
dtype: int64
Index(['d', 'b', 'a', 'c'], dtype='object')
-5
c 3
a -5
d 6
dtype: int64
d 6
b 7
c 3
dtype: int64
True
False
使用字典生成Series
Ohio 45000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
使用字典生成Series,并额外指定index,不匹配部分为NaN。
California NaN
Ohio 45000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
Series相加,相同索引部分相加。
California NaN
Ohio 90000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
指定Series及其索引的名字
state
California NaN
Ohio 45000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
替换index
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
数据结构 DataFrame
import numpy as np
from pandas import Series, DataFrame
print('用字典生成DataFrame,key为列的名字。')
data = {'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year':[2000, 2001, 2002, 2001, 2002],
'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}
print(DataFrame(data))
print(DataFrame(data, columns = ['year', 'state', 'pop'])) # 指定列顺序
print()
print('指定索引,在列中指定不存在的列,默认数据用NaN。')
frame2 = DataFrame(data,
columns = ['year', 'state', 'pop', 'debt'],
index = ['one', 'two', 'three', 'four', 'five'])
print(frame2)
print(frame2['state'])
print(frame2.year)
print(frame2.ix['three'])
frame2['debt'] = 16.5 # 修改一整列
print(frame2)
frame2.debt = np.arange(5) # 用numpy数组修改元素
print(frame2)
print()
print('用Series指定要修改的索引及其对应的值,没有指定的默认数据用NaN。')
val = Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])
frame2['debt'] = val
print(frame2)
print()
print('赋值给新列')
frame2['eastern'] = (frame2.state == 'Ohio') # 如果state等于Ohio为True
print(frame2)
print(frame2.columns)
print()
print('DataFrame转置')
pop = {'Nevada':{2001:2.4, 2002:2.9},
'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}}
frame3 = DataFrame(pop)
print(frame3)
print(frame3.T)
print()
print('指定索引顺序,以及使用切片初始化数据。')
print(DataFrame(pop, index = [2001, 2002, 2003]))
pdata = {'Ohio':frame3['Ohio'][:-1], 'Nevada':frame3['Nevada'][:2]}
print(DataFrame(pdata))
print()
print('指定索引和列的名称')
frame3.index.name = 'year'
frame3.columns.name = 'state'
print(frame3)
print(frame3.values)
print(frame2.values)
用字典生成DataFrame,key为列的名字。
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
指定索引,在列中指定不存在的列,默认数据用NaN。
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
Name: state, dtype: object
one 2000
two 2001
three 2002
four 2001
five 2002
Name: year, dtype: int64
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
year state pop debt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4
用Series指定要修改的索引及其对应的值,没有指定的默认数据用NaN。
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
赋值给新列
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
Index(['year', 'state', 'pop', 'debt', 'eastern'], dtype='object')
DataFrame转置
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6
指定索引顺序,以及使用切片初始化数据。
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
指定索引和列的名称
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
[[ nan 1.5]
[ 2.4 1.7]
[ 2.9 3.6]]
[[2000 'Ohio' 1.5 nan True]
[2001 'Ohio' 1.7 -1.2 True]
[2002 'Ohio' 3.6 nan True]
[2001 'Nevada' 2.4 -1.5 False]
[2002 'Nevada' 2.9 -1.7 False]]
数据结构 索引对象
import numpy as np
import pandas as pd
import sys
from pandas import Series, DataFrame, Index
print('获取index')
obj = Series(range(3), index = ['a', 'b', 'c'])
index = obj.index
print(index[1:])
try:
index[1] = 'd' # index对象read only
except:
print(sys.exc_info()[0])
print()
print('使用Index对象')
index = Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index = index)
print(obj2)
print(obj2.index is index)
print()
print('判断列和索引是否存在')
pop = {'Nevada':{20001:2.4, 2002:2.9},
'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}}
frame3 = DataFrame(pop)
print('Ohio' in frame3.columns)
print('2003' in frame3.index)
获取index
Index(['b', 'c'], dtype='object')
<class 'TypeError'>
使用Index对象
0 1.5
1 -2.5
2 0.0
dtype: float64
True
判断列和索引是否存在
True
False
基本功能 重新索引
import numpy as np
from pandas import DataFrame, Series
print('重新指定索引及顺序')
obj = Series([4.5, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])
print(obj)
obj2 = obj.reindex(['a', 'b', 'd', 'c', 'e'])
print(obj2)
print(obj.reindex(['a', 'b', 'd', 'c', 'e'], fill_value = 0)) # 指定不存在元素的默认值
print()
print('重新指定索引并指定填元素充方法')
obj3 = Series(['blue', 'purple', 'yellow'], index = [0, 2, 4])
print(obj3)
print(obj3.reindex(range(6), method = 'ffill'))
print()
print('对DataFrame重新指定索引')
frame = DataFrame(np.arange(9).reshape(3, 3),
index = ['a', 'c', 'd'],
columns = ['Ohio', 'Texas', 'California'])
print(frame)
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)
print()
print('重新指定column')
states = ['Texas', 'Utah', 'California']
print(frame.reindex(columns = states))
print()
print('对DataFrame重新指定索引并指定填元素充方法')
print(frame.reindex(index = ['a', 'b', 'c', 'd'],
method = 'ffill',
columns = states))
print(frame.ix[['a', 'b', 'd', 'c'], states])
重新指定索引及顺序
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
a -5.3
b 7.2
d 4.5
c 3.6
e NaN
dtype: float64
a -5.3
b 7.2
d 4.5
c 3.6
e 0.0
dtype: float64
重新指定索引并指定填元素充方法
0 blue
2 purple
4 yellow
dtype: object
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
对DataFrame重新指定索引
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
重新指定column
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
对DataFrame重新指定索引并指定填元素充方法
Texas Utah California
a 1 NaN 2
b 1 NaN 2
c 4 NaN 5
d 7 NaN 8
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
d 7.0 NaN 8.0
c 4.0 NaN 5.0
基本功能 丢弃指定轴上的项
import numpy as np
from pandas import Series, DataFrame
print('Series根据索引删除元素')
obj = Series(np.arange(5.), index = ['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
print(new_obj)
print(obj.drop(['d', 'c']))
print()
print('DataFrame删除元素,可指定索引或列。')
data = DataFrame(np.arange(16).reshape((4, 4)),
index = ['Ohio', 'Colorado', 'Utah', 'New York'],
columns = ['one', 'two', 'three', 'four'])
print(data)
print(data.drop(['Colorado', 'Ohio']))
print(data.drop('two', axis = 1))
print(data.drop(['two', 'four'], axis = 1))
Series根据索引删除元素
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
a 0.0
b 1.0
e 4.0
dtype: float64
DataFrame删除元素,可指定索引或列。
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
one two three four
Utah 8 9 10 11
New York 12 13 14 15
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
基本功能 索引、选取和过滤
import numpy as np
from pandas import Series, DataFrame
print('Series的索引,默认数字索引可以工作。')
obj = Series(np.arange(4.), index = ['a', 'b', 'c', 'd'])
print(obj['b'])
print(obj[3])
print(obj[[1, 3]])
print(obj[obj < 2])
print()
print('Series的数组切片')
print(obj['b':'c']) # 闭区间
obj['b':'c'] = 5
print(obj)
print()
print('DataFrame的索引')
data = DataFrame(np.arange(16).reshape((4, 4)),
index = ['Ohio', 'Colorado', 'Utah', 'New York'],
columns = ['one', 'two', 'three', 'four'])
print(data)
print(data['two']) # 打印列
print(data[['three', 'one']])
print(data[:2])
print(data.ix['Colorado', ['two', 'three']]) # 指定索引和列
print(data.ix[['Colorado', 'Utah'], [3, 0, 1]])
print(data.ix[2]) # 打印第2行(从0开始)
print(data.ix[:'Utah', 'two']) # 从开始到Utah,第2列。
print()
print('根据条件选择')
print(data[data.three > 5])
print(data < 5) # 打印True或者False
data[data < 5] = 0
print(data)
Series的索引,默认数字索引可以工作。
1.0
3.0
b 1.0
d 3.0
dtype: float64
a 0.0
b 1.0
dtype: float64
Series的数组切片
b 1.0
c 2.0
dtype: float64
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
DataFrame的索引
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
two 5
three 6
Name: Colorado, dtype: int32
four one two
Colorado 7 4 5
Utah 11 8 9
one 8
two 9
three 10
four 11
Name: Utah, dtype: int32
Ohio 1
Colorado 5
Utah 9
Name: two, dtype: int32
根据条件选择
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
基本功能 算数运算和数据对齐
import numpy as np
from pandas import Series, DataFrame
print('加法')
s1 = Series([7.3, -2.5, 3.4, 1.5], index = ['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index = ['a', 'c', 'e', 'f', 'g'])
print(s1)
print(s2)
print(s1 + s2)
print()
print('DataFrame加法,索引和列都必须匹配。')
df1 = DataFrame(np.arange(9.).reshape((3, 3)),
columns = list('bcd'),
index = ['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12).reshape((4, 3)),
columns = list('bde'),
index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df1)
print(df2)
print(df1 + df2)
print()
print('数据填充')
df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns = list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns = list('abcde'))
print(df1)
print(df2)
print(df1.add(df2, fill_value = 0))
print(df1.reindex(columns = df2.columns, fill_value = 0))
print()
print('DataFrame与Series之间的操作')
arr = np.arange(12.).reshape((3, 4))
print(arr)
print(arr[0])
print(arr - arr[0])
frame = DataFrame(np.arange(12).reshape((4, 3)),
columns = list('bde'),
index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.ix[0]
print(frame)
print(series)
print(frame - series)
series2 = Series(range(3), index = list('bef'))
print(frame + series2)
series3 = frame['d']
print(frame.sub(series3, axis = 0)) # 按列减
加法
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
DataFrame加法,索引和列都必须匹配。
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
数据填充
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 11.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
a b c d e
0 0.0 1.0 2.0 3.0 0
1 4.0 5.0 6.0 7.0 0
2 8.0 9.0 10.0 11.0 0
DataFrame与Series之间的操作
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]]
[ 0. 1. 2. 3.]
[[ 0. 0. 0. 0.]
[ 4. 4. 4. 4.]
[ 8. 8. 8. 8.]]
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
b 0
d 1
e 2
Name: Utah, dtype: int32
b d e
Utah 0 0 0
Ohio 3 3 3
Texas 6 6 6
Oregon 9 9 9
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
b d e
Utah -1 0 1
Ohio -1 0 1
Texas -1 0 1
Oregon -1 0 1
基本功能 函数应用和映射
应用示例中有“取出最大值和最小值”和“所有数据保留为两位数”,写的挺好
import numpy as np
from pandas import Series, DataFrame
print('函数')
frame = DataFrame(np.random.randn(4, 3),
columns = list('bde'),
index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
print(np.abs(frame))
print()
print('lambda以及应用')
f = lambda x: x.max() - x.min()
print(frame.apply(f))
print(frame.apply(f, axis = 1))
def f(x):
return Series([x.min(), x.max()], index = ['min', 'max'])
print(frame.apply(f))
print()
print('applymap和map')
_format = lambda x: '%.2f' % x
print(frame.applymap(_format))
print(frame['e'].map(_format))
函数
b d e
Utah -1.902941 2.046163 0.685950
Ohio -0.377962 1.771554 -0.885729
Texas -0.247769 -0.918187 -3.228982
Oregon -0.039260 2.129338 -0.077359
b d e
Utah 1.902941 2.046163 0.685950
Ohio 0.377962 1.771554 0.885729
Texas 0.247769 0.918187 3.228982
Oregon 0.039260 2.129338 0.077359
lambda以及应用
b 1.863681
d 3.047524
e 3.914932
dtype: float64
Utah 3.949104
Ohio 2.657283
Texas 2.981213
Oregon 2.206697
dtype: float64
b d e
min -1.902941 -0.918187 -3.228982
max -0.039260 2.129338 0.685950
applymap和map
b d e
Utah -1.90 2.05 0.69
Ohio -0.38 1.77 -0.89
Texas -0.25 -0.92 -3.23
Oregon -0.04 2.13 -0.08
Utah 0.69
Ohio -0.89
Texas -3.23
Oregon -0.08
Name: e, dtype: object
基本功能 排序和排名
import numpy as np
from pandas import Series, DataFrame
print('根据索引排序,对于DataFrame可以指定轴。')
obj = Series(range(4), index = ['d', 'a', 'b', 'c'])
print(obj.sort_index())
frame = DataFrame(np.arange(8).reshape((2, 4)),
index = ['three', 'one'],
columns = list('dabc'))
print(frame.sort_index())
print(frame.sort_index(axis = 1))
print(frame.sort_index(axis = 1, ascending = False)) # 降序
print()
print('根据值排序')
obj = Series([4, 7, -3, 2])
print(obj.sort_values()) # order已淘汰
print()
print('DataFrame指定列排序')
frame = DataFrame({'b':[4, 7, -3, 2], 'a':[0, 1, 0, 1]})
print(frame)
print(frame.sort_values(by = 'b')) # sort_index(by = ...)已淘汰
print(frame.sort_values(by = ['a', 'b']))
print()
print('rank,求排名的平均位置(从1开始)')
obj = Series([7, -5, 7, 4, 2, 0, 4])
# 对应排名:-5(1), 0(2), 2(3), 4(4), 4(5), 7(6), 7(7)
print(obj.rank())
print(obj.rank(method = 'first')) # 取第一次出现,不求平均值。
print(obj.rank(ascending = False, method = 'max')) # 逆序,并取最大值。所以-5的rank是7.
frame = DataFrame({'b':[4.3, 7, -3, 2],
'a':[0, 1, 0, 1],
'c':[-2, 5, 8, -2.5]})
print(frame)
print(frame.rank(axis = 1)) #按行排序
根据索引排序,对于DataFrame可以指定轴。
a 1
b 2
c 3
d 0
dtype: int32
d a b c
one 4 5 6 7
three 0 1 2 3
a b c d
three 1 2 3 0
one 5 6 7 4
d c b a
three 0 3 2 1
one 4 7 6 5
根据值排序
2 -3
3 2
0 4
1 7
dtype: int64
DataFrame指定列排序
a b
0 0 4
1 1 7
2 0 -3
3 1 2
a b
2 0 -3
3 1 2
0 0 4
1 1 7
a b
2 0 -3
0 0 4
3 1 2
1 1 7
rank,求排名的平均位置(从1开始)
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
0 2.0
1 7.0
2 2.0
3 4.0
4 5.0
5 6.0
6 4.0
dtype: float64
a b c
0 0 4.3 -2.0
1 1 7.0 5.0
2 0 -3.0 8.0
3 1 2.0 -2.5
a b c
0 2.0 3.0 1.0
1 1.0 3.0 2.0
2 2.0 1.0 3.0
3 2.0 3.0 1.0
基本功能 带有重复值的索引
import numpy as np
from pandas import Series, DataFrame
print('重复的索引')
obj = Series(range(5), index = ['a', 'a', 'b', 'b', 'c'])
print(obj.index.is_unique) # 判断是非有重复索引
print(obj)
print(obj.a)
df = DataFrame(np.random.randn(4, 3), index = ['a', 'a', 'b', 'b'])
print(df)
print(df.ix['b'].ix[0])
print(df.ix['b'].ix[1])
重复的索引
False
a 0
a 1
b 2
b 3
c 4
dtype: int32
a 0
a 1
dtype: int32
0 1 2
a 0.986532 0.665410 -1.075305
a -0.060391 1.298336 0.013000
b -1.783654 -0.596269 0.632322
b 0.911897 -0.865876 -0.951782
0 -1.783654
1 -0.596269
2 0.632322
Name: b, dtype: float64
0 0.911897
1 -0.865876
2 -0.951782
Name: b, dtype: float64
汇总和计算描述统计
import numpy as np
from pandas import Series, DataFrame
print('求和')
df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
index = ['a', 'b', 'c', 'd'],
columns = ['one', 'two'])
print(df)
print(df.sum()) # 按列求和
print(df.sum(axis = 1)) # 按行求和
print()
print('平均数')
print(df.mean(axis = 1, skipna = False))
print(df.mean(axis = 1))
print()
print('其它')
print(df.idxmax())
print(df.cumsum())
print(df.describe())
obj = Series(['a', 'a', 'b', 'c'] * 4)
print(obj)
print(obj.describe())
求和
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
one 9.25
two -5.80
dtype: float64
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
平均数
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
a 1.400
b 1.300
c NaN
d -0.275
dtype: float64
其它
one b
two d
dtype: object
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% NaN NaN
50% NaN NaN
75% NaN NaN
max 7.100000 -1.300000
0 a
1 a
2 b
3 c
4 a
5 a
6 b
7 c
8 a
9 a
10 b
11 c
12 a
13 a
14 b
15 c
dtype: object
count 16
unique 3
top a
freq 8
dtype: object
C:\Program Files\Anaconda3\lib\site-packages\numpy\lib\function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
RuntimeWarning)
汇总和计算描述统计 相关系数与协方差
链接失效时改一下日期即可
import numpy as np
import pandas_datareader.data as web
# import pandas.io.data as web
from pandas import DataFrame
print('相关性与协方差') # 协方差:https://zh.wikipedia.org/wiki/%E5%8D%8F%E6%96%B9%E5%B7%AE
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
all_data[ticker] = web.get_data_yahoo(ticker, '4/1/2016', '7/15/2016')
price = DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume'] for tic, data in all_data.items()})
returns = price.pct_change()
print(returns.tail())
print(returns.MSFT.corr(returns.IBM))
print(returns.corr()) # 相关性,自己和自己的相关性总是1
print(returns.cov()) # 协方差
print(returns.corrwith(returns.IBM))
相关性与协方差
AAPL GOOG IBM MSFT
Date
2016-07-11 0.003103 0.013406 0.005632 0.005545
2016-07-12 0.004537 0.007761 0.011009 0.011789
2016-07-13 -0.005646 -0.005079 0.006241 0.005638
2016-07-14 0.019820 0.005537 0.014302 0.004298
2016-07-15 -0.000101 -0.001526 -0.003120 -0.000744
0.546959627072
AAPL GOOG IBM MSFT
AAPL 1.000000 0.389326 0.368959 0.389082
GOOG 0.389326 1.000000 0.624841 0.801473
IBM 0.368959 0.624841 1.000000 0.546960
MSFT 0.389082 0.801473 0.546960 1.000000
AAPL GOOG IBM MSFT
AAPL 0.000198 0.000068 0.000071 0.000083
GOOG 0.000068 0.000155 0.000107 0.000151
IBM 0.000071 0.000107 0.000188 0.000114
MSFT 0.000083 0.000151 0.000114 0.000229
AAPL 0.368959
GOOG 0.624841
IBM 1.000000
MSFT 0.546960
dtype: float64
汇总和计算描述统计 唯一值以及成员资格
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
print('去重')
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
print(obj.unique())
print(obj.value_counts())
print()
print('判断元素存在')
mask = obj.isin(['b', 'c'])
print(mask)
print(obj[mask]) #只打印元素b和c
data = DataFrame({'Qu1':[1, 3, 4, 3, 4],
'Qu2':[2, 3, 1, 2, 3],
'Qu3':[1, 5, 2, 4, 4]})
print(data)
print(data.apply(pd.value_counts).fillna(0))
print(data.apply(pd.value_counts, axis = 1).fillna(0))
去重
['c' 'a' 'd' 'b']
a 3
c 3
b 2
d 1
dtype: int64
判断元素存在
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
0 c
5 b
6 b
7 c
8 c
dtype: object
Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 3 2 4
4 4 3 4
Qu1 Qu2 Qu3
1 1.0 1.0 1.0
2 0.0 2.0 1.0
3 2.0 2.0 0.0
4 2.0 0.0 2.0
5 0.0 0.0 1.0
1 2 3 4 5
0 2.0 1.0 0.0 0.0 0.0
1 0.0 0.0 2.0 0.0 1.0
2 1.0 1.0 0.0 1.0 0.0
3 0.0 1.0 1.0 1.0 0.0
4 0.0 0.0 1.0 2.0 0.0
处理缺失数据
import numpy as np
from pandas import Series
print('作为null处理的值')
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
print(string_data)
print(string_data.isnull())
string_data[0] = None
print(string_data.isnull())
作为null处理的值
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
0 False
1 False
2 True
3 False
dtype: bool
0 True
1 False
2 True
3 False
dtype: bool
处理缺失数据 滤除缺失数据
import numpy as np
from numpy import nan as NA
from pandas import Series, DataFrame
print('丢弃NA')
data = Series([1, NA, 3.5, NA, 7])
print(data.dropna())
print(data[data.notnull()])
print()
print('DataFrame对丢弃NA的处理')
data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, NA, NA], [NA, 6.5, 3.]])
print(data.dropna()) # 默认只要某行有NA就全部删除
print(data.dropna(how = 'all')) # 全部为NA才删除
data[4] = NA # 新增一列
print(data)
print(data.dropna(axis = 1, how = 'all'))
data = DataFrame(np.random.randn(7, 3))
data.ix[:4, 1] = NA
data.ix[:2, 2] = NA
print(data)
print(data.dropna(thresh = 2)) # 每行至少要有2个非NA元素
丢弃NA
0 1.0
2 3.5
4 7.0
dtype: float64
0 1.0
2 3.5
4 7.0
dtype: float64
DataFrame对丢弃NA的处理
0 1 2
0 1.0 6.5 3.0
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
0 1 2
0 1.943456 NaN NaN
1 0.614118 NaN NaN
2 0.557657 NaN NaN
3 1.405687 NaN -0.041758
4 -1.427269 NaN 0.387618
5 0.622559 0.160317 0.532953
6 0.501193 -0.624722 0.915884
0 1 2
3 1.405687 NaN -0.041758
4 -1.427269 NaN 0.387618
5 0.622559 0.160317 0.532953
6 0.501193 -0.624722 0.915884
处理缺失数据 填充缺失数据
import numpy as np
from numpy import nan as NA
import pandas as pd
from pandas import Series, DataFrame, Index
print('填充0')
df = DataFrame(np.random.randn(7, 3))
df.ix[:4, 1] = NA
df.ix[:2, 2] = NA
print(df.fillna(0))
df.fillna(0, inplace = True)
print(df)
print()
print('不同行列填充不同的值')
df.ix[:4, 1] = NA
df.ix[:2, 2] = NA
print(df.fillna({1:0.5, 3:-1})) # 第3列不存在
print()
print('不同的填充方式')
df = DataFrame(np.random.randn(6, 3))
df.ix[2:, 1] = NA
df.ix[4:, 2] = NA
print(df)
print(df.fillna(method = 'ffill'))
print(df.fillna(method = 'ffill', limit = 2))
print()
print('用统计数据填充')
data = Series([1., NA, 3.5, NA, 7])
print(data.fillna(data.mean()))
填充0
0 1 2
0 -0.675293 0.000000 0.000000
1 -0.930013 0.000000 0.000000
2 -1.873707 0.000000 0.000000
3 -0.917852 0.000000 -0.978845
4 1.353961 0.000000 0.308181
5 0.133550 -0.566566 0.357234
6 0.987223 -0.633359 0.408085
0 1 2
0 -0.675293 0.000000 0.000000
1 -0.930013 0.000000 0.000000
2 -1.873707 0.000000 0.000000
3 -0.917852 0.000000 -0.978845
4 1.353961 0.000000 0.308181
5 0.133550 -0.566566 0.357234
6 0.987223 -0.633359 0.408085
不同行列填充不同的值
0 1 2
0 -0.675293 0.500000 NaN
1 -0.930013 0.500000 NaN
2 -1.873707 0.500000 NaN
3 -0.917852 0.500000 -0.978845
4 1.353961 0.500000 0.308181
5 0.133550 -0.566566 0.357234
6 0.987223 -0.633359 0.408085
不同的填充方式
0 1 2
0 -0.193422 -0.716321 -0.607555
1 -1.430887 -1.012584 3.469360
2 1.425113 NaN -0.079921
3 -0.668439 NaN -1.251606
4 -0.032895 NaN NaN
5 1.316025 NaN NaN
0 1 2
0 -0.193422 -0.716321 -0.607555
1 -1.430887 -1.012584 3.469360
2 1.425113 -1.012584 -0.079921
3 -0.668439 -1.012584 -1.251606
4 -0.032895 -1.012584 -1.251606
5 1.316025 -1.012584 -1.251606
0 1 2
0 -0.193422 -0.716321 -0.607555
1 -1.430887 -1.012584 3.469360
2 1.425113 -1.012584 -0.079921
3 -0.668439 -1.012584 -1.251606
4 -0.032895 NaN -1.251606
5 1.316025 NaN -1.251606
用统计数据填充
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
层次化索引
import numpy as np
from pandas import Series, DataFrame, MultiIndex
print('Series的层次索引')
data = Series(np.random.randn(10),
index = [['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
[1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
print(data)
print(data.index)
print(data.b)
print(data['b':'c'])
print(data[:2])
print(data.unstack())
print(data.unstack().stack())
print()
print('DataFrame的层次索引')
frame = DataFrame(np.arange(12).reshape((4, 3)),
index = [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns = [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
print(frame)
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
print(frame)
print(frame.ix['a', 1])
print(frame.ix['a', 2]['Colorado'])
print(frame.ix['a', 2]['Ohio']['Red'])
print()
print('直接用MultiIndex创建层次索引结构')
print(MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Gree', 'Red', 'Green']],
names = ['state', 'color']))
Series的层次索引
a 1 -0.926187
2 -0.389814
3 -0.193285
b 1 0.867688
2 1.044033
3 1.010972
c 1 -1.268897
2 -0.659126
d 2 1.579303
3 0.167862
dtype: float64
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
1 0.867688
2 1.044033
3 1.010972
dtype: float64
b 1 0.867688
2 1.044033
3 1.010972
c 1 -1.268897
2 -0.659126
dtype: float64
a 1 -0.926187
2 -0.389814
dtype: float64
1 2 3
a -0.926187 -0.389814 -0.193285
b 0.867688 1.044033 1.010972
c -1.268897 -0.659126 NaN
d NaN 1.579303 0.167862
a 1 -0.926187
2 -0.389814
3 -0.193285
b 1 0.867688
2 1.044033
3 1.010972
c 1 -1.268897
2 -0.659126
d 2 1.579303
3 0.167862
dtype: float64
DataFrame的层次索引
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
state color
Ohio Green 0
Red 1
Colorado Green 2
Name: (a, 1), dtype: int32
color
Green 5
Name: (a, 2), dtype: int32
4
直接用MultiIndex创建层次索引结构
MultiIndex(levels=[['Colorado', 'Ohio'], ['Gree', 'Green', 'Red']],
labels=[[1, 1, 0], [0, 2, 1]],
names=['state', 'color'])
层次化索引 重新分级顺序
import numpy as np
from pandas import Series, DataFrame
print('索引层级交换')
frame = DataFrame(np.arange(12).reshape((4, 3)),
index = [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns = [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame_swapped = frame.swaplevel('key1', 'key2')
print(frame_swapped)
print(frame_swapped.swaplevel(0, 1))
print()
print('根据索引排序')
print(frame.sortlevel('key2'))
print(frame.swaplevel(0, 1).sortlevel(0))
索引层级交换
Ohio Colorado
Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
Ohio Colorado
Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
根据索引排序
Ohio Colorado
Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
Ohio Colorado
Green Red Green
key2 key1
1 a 0 1 2
b 6 7 8
2 a 3 4 5
b 9 10 11
层次化索引 根据级别汇总统计
import numpy as np
from pandas import DataFrame
print('根据指定的key计算统计信息')
frame = DataFrame(np.arange(12).reshape((4, 3)),
index = [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns = [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
print(frame)
print(frame.sum(level = 'key2'))
根据指定的key计算统计信息
Ohio Colorado
Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
Ohio Colorado
Green Red Green
key2
1 6 8 10
2 12 14 16
层次化索引 使用DataFrame的列
import numpy as np
from pandas import DataFrame
print('使用列生成层次索引')
frame = DataFrame({'a':range(7),
'b':range(7, 0, -1),
'c':['one', 'one', 'one', 'two', 'two', 'two', 'two'],
'd':[0, 1, 2, 0, 1, 2, 3]})
print(frame)
print(frame.set_index(['c', 'd'])) # 把c/d列变成索引
print(frame.set_index(['c', 'd'], drop = False)) # 列依然保留
frame2 = frame.set_index(['c', 'd'])
print(frame2.reset_index()) #reset_index恢复
使用列生成层次索引
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two 0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1
其它话题 整数索引
import numpy as np
import sys
from pandas import Series, DataFrame
print('整数索引')
ser = Series(np.arange(3.))
print(ser)
try:
print(ser[-1]) # 这里会有歧义
except:
print(sys.exc_info()[0])
ser2 = Series(np.arange(3.), index = ['a', 'b', 'c'])
print(ser2[-1])
ser3 = Series(range(3), index = [-5, 1, 3])
print(ser3.iloc[2]) # 避免直接用[2]产生的歧义
print()
print('对DataFrame使用整数索引')
frame = DataFrame(np.arange(6).reshape((3, 2)), index = [2, 0, 1])
print(frame)
print(frame.iloc[0])
print(frame.iloc[:, 1])
整数索引
0 0.0
1 1.0
2 2.0
dtype: float64
<class 'KeyError'>
2.0
2
对DataFrame使用整数索引
0 1
2 0 1
0 2 3
1 4 5
0 0
1 1
Name: 2, dtype: int32
2 1
0 3
1 5
Name: 1, dtype: int32
其它话题 面板(Pannel)数据
items axis:列条款轴
major axis:行主轴
minor axis:行副轴
import numpy as np
import pandas as pd
import pandas_datareader.data as web
from pandas import Series, DataFrame, Index, Panel
pdata = Panel(dict((stk, web.get_data_yahoo(stk, '1/1/2016', '1/15/2016')) for stk in ['AAPL', 'GOOG', 'BIDU', 'MSFT']))
print(pdata)
pdata = pdata.swapaxes('items', 'minor')
print(pdata)
print()
print("访问顺序:# Item -> Major -> Minor")
print(pdata['Adj Close'])
print(pdata[:, '1/5/2016', :])
print(pdata['Adj Close', '1/6/2016', :])
print()
print('Panel与DataFrame相互转换')
stacked = pdata.ix[:, '1/7/2016':, :].to_frame()
print(stacked)
print(stacked.to_panel())
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 10 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2016-01-04 00:00:00 to 2016-01-15 00:00:00
Minor_axis axis: Open to Adj Close
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 10 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2016-01-04 00:00:00 to 2016-01-15 00:00:00
Minor_axis axis: AAPL to MSFT
访问顺序:# Item -> Major -> Minor
AAPL BIDU GOOG MSFT
Date
2016-01-04 103.057063 184.029999 741.840027 53.336436
2016-01-05 100.474523 187.369995 742.580017 53.579760
2016-01-06 98.508268 185.610001 743.619995 52.606467
2016-01-07 94.350769 174.369995 726.390015 50.776676
2016-01-08 94.849671 171.339996 714.469971 50.932406
2016-01-11 96.385500 170.369995 716.030029 50.903205
2016-01-12 97.784376 172.910004 726.070007 51.370385
2016-01-13 95.270312 166.039993 700.559998 50.260832
2016-01-14 97.353950 172.229996 714.719971 51.691573
2016-01-15 95.015969 163.919998 694.450012 49.628194
Open High Low Close Volume Adj Close
AAPL 105.750000 105.849998 102.410004 102.709999 55791000.0 100.474523
BIDU 186.710007 189.839996 185.339996 187.369995 3367900.0 187.369995
GOOG 746.450012 752.000000 738.640015 742.580017 1950700.0 742.580017
MSFT 54.930000 55.389999 54.540001 55.049999 34079700.0 53.579760
AAPL 98.508268
BIDU 185.610001
GOOG 743.619995
MSFT 52.606467
Name: 2016-01-06 00:00:00, dtype: float64
Panel与DataFrame相互转换
Open High Low Close Volume \
Date minor
2016-01-07 AAPL 98.680000 100.129997 96.430000 96.449997 81094400.0
BIDU 176.300003 180.979996 171.460007 174.369995 7009100.0
GOOG 730.309998 738.500000 719.059998 726.390015 2963700.0
MSFT 52.700001 53.490002 52.070000 52.169998 56564900.0
2016-01-08 AAPL 98.550003 99.110001 96.760002 96.959999 70798000.0
BIDU 177.589996 178.949997 170.720001 171.339996 4682800.0
GOOG 731.450012 733.229980 713.000000 714.469971 2450900.0
MSFT 52.369999 53.279999 52.150002 52.330002 48754000.0
2016-01-11 AAPL 98.970001 99.059998 97.339996 98.529999 49739400.0
BIDU 172.000000 172.839996 167.710007 170.369995 3202300.0
GOOG 716.609985 718.854980 703.539978 716.030029 2090600.0
MSFT 52.509998 52.849998 51.459999 52.299999 36943800.0
2016-01-12 AAPL 100.550003 100.690002 98.839996 99.959999 49154200.0
BIDU 172.399994 173.490005 169.820007 172.910004 3039600.0
GOOG 721.679993 728.750000 717.317017 726.070007 2024500.0
MSFT 52.759998 53.099998 52.060001 52.779999 36095500.0
2016-01-13 AAPL 100.320000 101.190002 97.300003 97.389999 62439600.0
BIDU 174.250000 175.360001 164.000000 166.039993 3961500.0
GOOG 730.849976 734.739990 698.609985 700.559998 2501700.0
MSFT 53.799999 54.070000 51.299999 51.639999 66883600.0
2016-01-14 AAPL 97.959999 100.480003 95.739998 99.519997 63170100.0
BIDU 166.300003 174.880005 163.009995 172.229996 4082200.0
GOOG 705.380005 721.924988 689.099976 714.719971 2225800.0
MSFT 52.000000 53.419998 51.570000 53.110001 52381900.0
2016-01-15 AAPL 96.199997 97.709999 95.360001 97.129997 79010000.0
BIDU 163.759995 165.309998 158.119995 163.919998 6342800.0
GOOG 692.289978 706.739990 685.369995 694.450012 3592400.0
MSFT 51.310001 51.970001 50.340000 50.990002 70739100.0
Adj Close
Date minor
2016-01-07 AAPL 94.350769
BIDU 174.369995
GOOG 726.390015
MSFT 50.776676
2016-01-08 AAPL 94.849671
BIDU 171.339996
GOOG 714.469971
MSFT 50.932406
2016-01-11 AAPL 96.385500
BIDU 170.369995
GOOG 716.030029
MSFT 50.903205
2016-01-12 AAPL 97.784376
BIDU 172.910004
GOOG 726.070007
MSFT 51.370385
2016-01-13 AAPL 95.270312
BIDU 166.039993
GOOG 700.559998
MSFT 50.260832
2016-01-14 AAPL 97.353950
BIDU 172.229996
GOOG 714.719971
MSFT 51.691573
2016-01-15 AAPL 95.015969
BIDU 163.919998
GOOG 694.450012
MSFT 49.628194
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 7 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2016-01-07 00:00:00 to 2016-01-15 00:00:00
Minor_axis axis: AAPL to MSFT
例题
08年的一个股票数据
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
names = ['date',
'time',
'opening_price', #开盘价
'ceiling_price', #最高价
'floor_price', #最低价
'closing_price', #收盘价
'volume', #成交量
'amount'] #成交金额
raw = pd.read_csv('SH600690.csv', names = names, header = None, index_col='date', parse_dates=True)
print(raw.head())
print()
time opening_price ceiling_price floor_price closing_price \
date
2008-01-02 09:35 22.50 22.63 22.50 22.51
2008-01-02 09:40 22.51 22.51 22.29 22.37
2008-01-02 09:45 22.39 22.62 22.38 22.62
2008-01-02 09:50 22.60 23.00 22.60 22.95
2008-01-02 09:55 22.98 23.20 22.89 23.20
volume amount
date
2008-01-02 2042.50 4604723.0
2008-01-02 1545.17 3460503.0
2008-01-02 1744.76 3921443.0
2008-01-02 5339.00 12225939.0
2008-01-02 12577.73 28947824.0
# 根据涨跌幅判断数据是否有效
print("根据涨跌幅判断数据是否有效")
def _valid_price(prices):
return (((prices.max() - prices.min()) / prices.min()) < 0.223).all()
根据涨跌幅判断数据是否有效
# 按日期分组
print("按日期分组")
days = raw.groupby(level = 0).agg(
{'opening_price':lambda prices: _valid_price(prices) and prices[0] or 0,
'ceiling_price':lambda prices: _valid_price(prices) and np.max(prices) or 0,
'floor_price':lambda prices: _valid_price(prices) and np.min(prices) or 0,
'closing_price':lambda prices: _valid_price(prices) and prices[-1] or 0,
'volume':'sum',
'amount':'sum'})
print(days.head())
print()
按日期分组
floor_price ceiling_price volume amount closing_price \
date
2008-01-02 22.29 24.50 200809.34 476179680.0 24.03
2008-01-03 23.81 25.20 166037.98 406906304.0 24.54
2008-01-04 23.68 24.76 149078.64 358418560.0 24.17
2008-01-07 23.75 24.75 93950.43 227289136.0 24.38
2008-01-08 23.49 24.38 149056.24 355752416.0 23.53
opening_price
date
2008-01-02 22.50
2008-01-03 24.03
2008-01-04 24.53
2008-01-07 24.03
2008-01-08 24.38
# 缺少数据处理,因为周末没有交易。
print("缺少数据处理,因为周末没有交易")
start = days.iloc[0:1].index.tolist()[0]
end = days.iloc[-2:-1].index.tolist()[0]
new_idx = pd.date_range(start = start, end = end)
print(new_idx)
data = days.reindex(new_idx) # 重新索引
zero_values = data.loc[~(data.volume > 0)].loc[:, ['volume', 'amount']]
data.update(zero_values.fillna(0)) # 交易量和金额填0
data.fillna(method = 'ffill', inplace = True) # 价格用前一天的填充
print(data.head())
print()
缺少数据处理,因为周末没有交易
DatetimeIndex(['2008-01-02', '2008-01-03', '2008-01-04', '2008-01-05',
'2008-01-06', '2008-01-07', '2008-01-08', '2008-01-09',
'2008-01-10', '2008-01-11',
...
'2008-12-21', '2008-12-22', '2008-12-23', '2008-12-24',
'2008-12-25', '2008-12-26', '2008-12-27', '2008-12-28',
'2008-12-29', '2008-12-30'],
dtype='datetime64[ns]', length=364, freq='D')
floor_price ceiling_price volume amount closing_price \
2008-01-02 22.29 24.50 200809.34 476179680.0 24.03
2008-01-03 23.81 25.20 166037.98 406906304.0 24.54
2008-01-04 23.68 24.76 149078.64 358418560.0 24.17
2008-01-05 23.68 24.76 0.00 0.0 24.17
2008-01-06 23.68 24.76 0.00 0.0 24.17
opening_price
2008-01-02 22.50
2008-01-03 24.03
2008-01-04 24.53
2008-01-05 24.53
2008-01-06 24.53
# 计算30各自然日里的股票平均波动周率
print("计算30各自然日里的股票平均波动周率")
def gen_item_group_index(total, group_len):
group_count = total / group_len
group_index = np.arange(total)
for i in range(int(group_count)):
group_index[i * group_len: (i+ 1) * group_len] = i
group_index[(i + 1) * group_len:] = i +1
return group_index.tolist()
period = 30
group_index = gen_item_group_index(len(data), period)
data['group_index'] = group_index
print(data.head().append(data.tail()))
计算30各自然日里的股票平均波动周率
floor_price ceiling_price volume amount closing_price \
2008-01-02 22.29 24.50 200809.34 476179680.0 24.03
2008-01-03 23.81 25.20 166037.98 406906304.0 24.54
2008-01-04 23.68 24.76 149078.64 358418560.0 24.17
2008-01-05 23.68 24.76 0.00 0.0 24.17
2008-01-06 23.68 24.76 0.00 0.0 24.17
2008-12-26 8.97 9.23 35093.86 31741640.0 9.08
2008-12-27 8.97 9.23 0.00 0.0 9.08
2008-12-28 8.97 9.23 0.00 0.0 9.08
2008-12-29 8.73 9.15 38576.07 34625144.0 9.11
2008-12-30 8.95 9.14 62983.38 56876600.0 8.96
opening_price group_index
2008-01-02 22.50 0
2008-01-03 24.03 0
2008-01-04 24.53 0
2008-01-05 24.53 0
2008-01-06 24.53 0
2008-12-26 9.15 11
2008-12-27 9.15 12
2008-12-28 9.15 12
2008-12-29 9.04 12
2008-12-30 9.14 12
# 为负表示先出现最高价再出现最低价,即下跌波动。
print("为负表示先出现最高价再出现最低价,即下跌波动")
def _ceiling_price(prices):
return prices.idxmin() < prices.idxmax() and np.max(prices) or (-np.max(prices))
group = data.groupby('group_index').agg(
{'volume': 'sum',
'floor_price': 'min',
'ceiling_price': _ceiling_price})
print(group.head())
date_col = pd.DataFrame({'group_index': group_index, 'date': new_idx})
print(date_col)
group['date'] = date_col.groupby('group_index').agg('first') # 为每个索引添加开始日期
print(group.head())
group['ripples_ratio'] = group.ceiling_price / group.floor_price # 计算并添加波动率
print(group.head())
print()
为负表示先出现最高价再出现最低价,即下跌波动
floor_price ceiling_price volume
group_index
0 20.20 -25.46 3200204.74
1 18.30 23.23 1310744.12
2 13.31 -21.40 1749185.77
3 9.18 -14.60 3978126.81
4 11.58 -14.85 2642320.68
date group_index
0 2008-01-02 0
1 2008-01-03 0
2 2008-01-04 0
3 2008-01-05 0
4 2008-01-06 0
5 2008-01-07 0
6 2008-01-08 0
7 2008-01-09 0
8 2008-01-10 0
9 2008-01-11 0
10 2008-01-12 0
11 2008-01-13 0
12 2008-01-14 0
13 2008-01-15 0
14 2008-01-16 0
15 2008-01-17 0
16 2008-01-18 0
17 2008-01-19 0
18 2008-01-20 0
19 2008-01-21 0
20 2008-01-22 0
21 2008-01-23 0
22 2008-01-24 0
23 2008-01-25 0
24 2008-01-26 0
25 2008-01-27 0
26 2008-01-28 0
27 2008-01-29 0
28 2008-01-30 0
29 2008-01-31 0
.. ... ...
334 2008-12-01 11
335 2008-12-02 11
336 2008-12-03 11
337 2008-12-04 11
338 2008-12-05 11
339 2008-12-06 11
340 2008-12-07 11
341 2008-12-08 11
342 2008-12-09 11
343 2008-12-10 11
344 2008-12-11 11
345 2008-12-12 11
346 2008-12-13 11
347 2008-12-14 11
348 2008-12-15 11
349 2008-12-16 11
350 2008-12-17 11
351 2008-12-18 11
352 2008-12-19 11
353 2008-12-20 11
354 2008-12-21 11
355 2008-12-22 11
356 2008-12-23 11
357 2008-12-24 11
358 2008-12-25 11
359 2008-12-26 11
360 2008-12-27 12
361 2008-12-28 12
362 2008-12-29 12
363 2008-12-30 12
[364 rows x 2 columns]
floor_price ceiling_price volume date
group_index
0 20.20 -25.46 3200204.74 2008-01-02
1 18.30 23.23 1310744.12 2008-02-01
2 13.31 -21.40 1749185.77 2008-03-02
3 9.18 -14.60 3978126.81 2008-04-01
4 11.58 -14.85 2642320.68 2008-05-01
floor_price ceiling_price volume date ripples_ratio
group_index
0 20.20 -25.46 3200204.74 2008-01-02 -1.260396
1 18.30 23.23 1310744.12 2008-02-01 1.269399
2 13.31 -21.40 1749185.77 2008-03-02 -1.607814
3 9.18 -14.60 3978126.81 2008-04-01 -1.590414
4 11.58 -14.85 2642320.68 2008-05-01 -1.282383
# 波动率排序
print("波动率排序")
ripples = group.sort_values('ripples_ratio', ascending = False)
print(ripples)
print(ripples.head(10).ripples_ratio.mean())
print(ripples.tail(10).ripples_ratio.mean())
print()
波动率排序
floor_price ceiling_price volume date ripples_ratio
group_index
6 8.55 11.48 2832764.49 2008-06-30 1.342690
10 7.30 9.52 1405050.39 2008-10-28 1.304110
1 18.30 23.23 1310744.12 2008-02-01 1.269399
11 8.83 10.50 2247432.74 2008-11-27 1.189128
12 8.73 -9.23 101559.45 2008-12-27 -1.057274
0 20.20 -25.46 3200204.74 2008-01-02 -1.260396
8 7.70 -9.85 1193866.98 2008-08-29 -1.279221
4 11.58 -14.85 2642320.68 2008-05-01 -1.282383
9 6.92 -9.28 748169.86 2008-09-28 -1.341040
7 8.02 -11.11 1297577.05 2008-07-30 -1.385287
5 8.02 -12.44 1285826.32 2008-05-31 -1.551122
3 9.18 -14.60 3978126.81 2008-04-01 -1.590414
2 13.31 -21.40 1749185.77 2008-03-02 -1.607814
-0.2500274725153588
-1.1165823091576255
# 计算涨跌幅
print("计算涨跌幅")
rise = data.closing_price.diff()
data['rise'] = rise
print(data.head())
计算涨跌幅
floor_price ceiling_price volume amount closing_price \
2008-01-02 22.29 24.50 200809.34 476179680.0 24.03
2008-01-03 23.81 25.20 166037.98 406906304.0 24.54
2008-01-04 23.68 24.76 149078.64 358418560.0 24.17
2008-01-05 23.68 24.76 0.00 0.0 24.17
2008-01-06 23.68 24.76 0.00 0.0 24.17
opening_price group_index rise
2008-01-02 22.50 0 NaN
2008-01-03 24.03 0 0.51
2008-01-04 24.53 0 -0.37
2008-01-05 24.53 0 0.00
2008-01-06 24.53 0 0.00