Data Discretization
Introduction
Discretization converts a continuous variable into a small number of intervals (bins), turning it into a categorical feature that can then be one-hot encoded.
Data Preparation
import pandas as pd
data = pd.read_csv("../data/stock_day.csv")
p_change = data["p_change"]
p_change
2018-02-27 2.68
2018-02-26 3.02
2018-02-23 2.42
2018-02-22 1.64
2018-02-14 2.05
...
2015-03-06 8.51
2015-03-05 2.02
2015-03-04 1.57
2015-03-03 1.44
2015-03-02 2.62
Name: p_change, Length: 643, dtype: float64
Binning the Data
Tools used:
pd.qcut(data, q): quantile-based binning — pass the data to bin and the number of groups; each bin receives roughly the same number of records.
pd.cut(data, bins): binning with custom interval edges, e.g. [-100, -50, 0, 50, 100].
Series.value_counts(): counts how many records fall into each bin.
qcut = pd.qcut(p_change, 10)
qcut
2018-02-27 (1.738, 2.938]
2018-02-26 (2.938, 5.27]
2018-02-23 (1.738, 2.938]
2018-02-22 (0.94, 1.738]
2018-02-14 (1.738, 2.938]
...
2015-03-06 (5.27, 10.03]
2015-03-05 (1.738, 2.938]
2015-03-04 (0.94, 1.738]
2015-03-03 (0.94, 1.738]
2015-03-02 (1.738, 2.938]
Name: p_change, Length: 643, dtype: category
Categories (10, interval[float64, right]): [(-10.030999999999999, -4.836] < (-4.836, -2.444] < (-2.444, -1.352] < (-1.352, -0.462] ... (0.94, 1.738] < (1.738, 2.938] < (2.938, 5.27] < (5.27, 10.03]]
qcut.value_counts()
(-10.030999999999999, -4.836] 65
(-0.462, 0.26] 65
(0.26, 0.94] 65
(5.27, 10.03] 65
(-4.836, -2.444] 64
(-2.444, -1.352] 64
(-1.352, -0.462] 64
(1.738, 2.938] 64
(2.938, 5.27] 64
(0.94, 1.738] 63
Name: p_change, dtype: int64
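The near-equal counts above are the point of quantile binning. A minimal sketch on synthetic data (the Series here is made up, not the stock data) shows this, and uses retbins=True to also get back the quantile edges pandas computed:

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for a continuous column.
s = pd.Series(np.arange(100))

# retbins=True additionally returns the computed quantile edges.
binned, edges = pd.qcut(s, 4, retbins=True)

# Quantile binning puts (roughly) the same number of values in each bin.
print(binned.value_counts())
print(edges)  # 5 edges delimiting the 4 quartile bins
```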
bins = [-100, -7, -5, -3, 0, 3, 5, 7, 100]
cut = pd.cut(p_change, bins=bins)
cut.value_counts()
(0, 3] 215
(-3, 0] 188
(3, 5] 57
(-5, -3] 51
(5, 7] 35
(7, 100] 35
(-100, -7] 34
(-7, -5] 28
Name: p_change, dtype: int64
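With pd.cut you can also pass labels= to replace the interval notation with readable bin names. A small sketch with made-up values and hypothetical label names:

```python
import pandas as pd

# Hypothetical sample of percent changes.
s = pd.Series([-8.0, -4.0, -1.0, 2.0, 4.0, 6.0, 9.0])
bins = [-100, -7, -5, -3, 0, 3, 5, 7, 100]

# One label per interval, in order; names are illustrative.
labels = ["big drop", "small drop", "slight drop",
          "slight rise", "small rise", "rise", "big rise"]
named = pd.cut(s, bins=[-100, -5, -3, 0, 3, 5, 7, 100], labels=labels)
print(named)
```

Note that len(labels) must equal the number of intervals (len(bins) - 1).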
Converting to One-Hot Encoding
Use pd.get_dummies(data, prefix), where data is the already-binned data and prefix is prepended to the generated column names.
pd.get_dummies(cut)
           (-100, -7] (-7, -5] (-5, -3] (-3, 0] (0, 3] (3, 5] (5, 7] (7, 100]
2018-02-27          0        0        0       0      1      0      0        0
2018-02-26          0        0        0       0      0      1      0        0
2018-02-23          0        0        0       0      1      0      0        0
2018-02-22          0        0        0       0      1      0      0        0
2018-02-14          0        0        0       0      1      0      0        0
...               ...      ...      ...     ...    ...    ...    ...      ...
2015-03-06          0        0        0       0      0      0      0        1
2015-03-05          0        0        0       0      1      0      0        0
2015-03-04          0        0        0       0      1      0      0        0
2015-03-03          0        0        0       0      1      0      0        0
2015-03-02          0        0        0       0      1      0      0        0
643 rows × 8 columns
dummies = pd.get_dummies(cut, prefix="rise")
dummies
           rise_(-100, -7] rise_(-7, -5] rise_(-5, -3] rise_(-3, 0] rise_(0, 3] rise_(3, 5] rise_(5, 7] rise_(7, 100]
2018-02-27               0             0             0            0           1           0           0             0
2018-02-26               0             0             0            0           0           1           0             0
2018-02-23               0             0             0            0           1           0           0             0
2018-02-22               0             0             0            0           1           0           0             0
2018-02-14               0             0             0            0           1           0           0             0
...                    ...           ...           ...          ...         ...         ...         ...           ...
2015-03-06               0             0             0            0           0           0           0             1
2015-03-05               0             0             0            0           1           0           0             0
2015-03-04               0             0             0            0           1           0           0             0
2015-03-03               0             0             0            0           1           0           0             0
2015-03-02               0             0             0            0           1           0           0             0
643 rows × 8 columns
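The same pattern works for any categorical column, not just binned intervals. A minimal sketch with a made-up column (note that in pandas 2.x the dummy columns are bool; older versions produced uint8):

```python
import pandas as pd

# A small categorical column; generated columns get the "color_" prefix.
s = pd.Series(["red", "green", "red", "blue"])
dummies = pd.get_dummies(s, prefix="color")
print(dummies)

# Exactly one column is set per row.
print(dummies.sum(axis=1))
```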
Combining Data
Combining data with pd.concat
pd.concat([data1, data2, …], axis) concatenates along rows or columns: axis=1 aligns on the row index (placing the frames side by side), axis=0 stacks along the column index (one frame on top of the other).
pd.concat([data, dummies], axis=1)
            open  high close   low    volume price_change p_change    ma5    ma10    ma20 ...    v_ma20 turnover rise_(-100, -7] rise_(-7, -5] rise_(-5, -3] rise_(-3, 0] rise_(0, 3] rise_(3, 5] rise_(5, 7] rise_(7, 100]
2018-02-27 23.53 25.88 24.16 23.53  95578.03         0.63     2.68 22.942  22.142  22.875 ...  55576.11     2.39               0             0             0            0           1           0           0             0
2018-02-26 22.80 23.78 23.53 22.80  60985.11         0.69     3.02 22.406  21.955  22.942 ...  56007.50     1.53               0             0             0            0           0           1           0             0
2018-02-23 22.88 23.37 22.82 22.71  52914.01         0.54     2.42 21.938  21.929  23.022 ...  56372.85     1.32               0             0             0            0           1           0           0             0
2018-02-22 22.25 22.76 22.28 22.02  36105.01         0.36     1.64 21.446  21.909  23.137 ...  60149.60     0.90               0             0             0            0           1           0           0             0
2018-02-14 21.49 21.99 21.92 21.48  23331.04         0.44     2.05 21.366  21.923  23.253 ...  61716.11     0.58               0             0             0            0           1           0           0             0
...          ...   ...   ...   ...       ...          ...      ...    ...     ...     ... ...       ...      ...             ...           ...           ...          ...         ...         ...         ...           ...
2015-03-06 13.17 14.48 14.28 13.13 179831.72         1.12     8.51 13.112  13.112  13.112 ... 115090.18     6.16               0             0             0            0           0           0           0             1
2015-03-05 12.88 13.45 13.16 12.87  93180.39         0.26     2.02 12.820  12.820  12.820 ...  98904.79     3.19               0             0             0            0           1           0           0             0
2015-03-04 12.80 12.92 12.90 12.61  67075.44         0.20     1.57 12.707  12.707  12.707 ... 100812.93     2.30               0             0             0            0           1           0           0             0
2015-03-03 12.52 13.06 12.70 12.52 139071.61         0.18     1.44 12.610  12.610  12.610 ... 117681.67     4.76               0             0             0            0           1           0           0             0
2015-03-02 12.25 12.67 12.52 12.20  96291.73         0.32     2.62 12.520  12.520  12.520 ...  96291.73     3.30               0             0             0            0           1           0           0             0
643 rows × 22 columns
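The effect of the axis argument can be illustrated on two tiny frames (the names here are made up):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
b = pd.DataFrame({"y": [3, 4]}, index=["r1", "r2"])

# axis=1 aligns on the row index and places the frames side by side.
wide = pd.concat([a, b], axis=1)
print(wide)        # columns x and y, rows r1 and r2

# axis=0 stacks the frames vertically; columns are unioned.
tall = pd.concat([a, b], axis=0)
print(tall.shape)  # (4, 2), with NaN where a frame lacks a column
```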
Combining data with pd.merge
pd.merge(left, right, on, how) performs an SQL-style join of two DataFrames on one or more key columns; how selects inner (the default), outer, left, or right joins.
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
left
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3
right
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
pd.merge(left, right, on=["key1", "key2"])
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2
pd.merge(left, right, on=["key1", "key2"], how="outer")
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3
pd.merge(left, right, on=["key1", "key2"], how="left")
  key1 key2   A   B    C    D
0   K0   K0  A0  B0   C0   D0
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN
pd.merge(left, right, on=["key1", "key2"], how="right")
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3
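When debugging joins it can help to know where each row came from; merge accepts indicator=True for this (not used above). A small sketch with made-up single-key frames:

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K0", "K1", "K3"], "B": [4, 5, 6]})

# The default is an inner join: only keys present on both sides survive.
inner = pd.merge(left, right, on="key")
print(inner)  # rows for K0 and K1 only

# indicator=True adds a _merge column marking each row's origin.
outer = pd.merge(left, right, on="key", how="outer", indicator=True)
print(outer[["key", "_merge"]])
```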