import numpy as np
import pandas as pd
Data Discretization

What is data discretization?

- Discretizing a continuous attribute means partitioning its value range into a number of discrete intervals, then representing the attribute values that fall into each sub-interval with a distinct symbol or integer.
- There are many discretization methods; here we use the simplest one. For example:
- Raw height data: 165, 174, 160, 180, 159, 163, 192, 184
- Suppose we split the heights into intervals: 150~165, 165~180, 180~195
- The data now fall into three intervals, which we can label short, medium, and tall; the final step is to encode them as a "dummy variable" matrix.
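The height example above can be sketched end to end with pd.cut and pd.get_dummies (the label names short/medium/tall are illustrative, not fixed by the method):

```python
import pandas as pd

# Height data from the example above
heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])

# Bin edges for the three intervals: (150, 165], (165, 180], (180, 195]
bins = [150, 165, 180, 195]
groups = pd.cut(heights, bins, labels=["short", "medium", "tall"])

# One-hot ("dummy variable") matrix: one boolean column per category
dummies = pd.get_dummies(groups)
print(dummies)
```

Note that pd.cut uses right-closed intervals by default, so 165 lands in (150, 165] and 180 in (165, 180].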
# Discretizing stock price changes
Read the stock data:
data = pd.read_csv("./data/stock_day.csv")
p_change = data["p_change"]
p_change
2018-02-27 2.68
2018-02-26 3.02
2018-02-23 2.42
2018-02-22 1.64
2018-02-14 2.05
...
2015-03-06 8.51
2015-03-05 2.02
2015-03-04 1.57
2015-03-03 1.44
2015-03-02 2.62
Name: p_change, Length: 643, dtype: float64
Group the stock price-change data

Automatic binning with pd.qcut(data, q):
splits the data into q quantile-based groups of roughly equal size; it is usually paired with value_counts to count how many values fall into each group.
- series.value_counts(): counts the values in each group
# Automatic quantile binning into 10 roughly equal-sized groups
qcut = pd.qcut(p_change, 10)
# Number of values in each group
qcut.value_counts()
(5.27, 10.03] 65
(0.26, 0.94] 65
(-0.462, 0.26] 65
(-10.030999999999999, -4.836] 65
(2.938, 5.27] 64
(1.738, 2.938] 64
(-1.352, -0.462] 64
(-2.444, -1.352] 64
(-4.836, -2.444] 64
(0.94, 1.738] 63
Name: p_change, dtype: int64
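Since the stock file may not be at hand, here is a runnable sketch of the same idea on synthetic data (the series below is made up): pd.qcut picks the bin edges from quantiles, so all groups come out with (roughly) equal counts, just as in the output above.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for p_change
s = pd.Series(np.arange(100))

# qcut chooses edges from quantiles, so the 4 groups are equally sized
quartiles = pd.qcut(s, 4)
print(quartiles.value_counts())  # 25 values per group
```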
Custom interval binning with pd.cut(data, bins)
# Custom bin edges
bins = [-100, -7, -5, -3, 0, 3, 5, 7, 100]
p_counts = pd.cut(p_change, bins)
# Count the values in each interval
p_counts.value_counts()
(0, 3] 215
(-3, 0] 188
(3, 5] 57
(-5, -3] 51
(7, 100] 35
(5, 7] 35
(-100, -7] 34
(-7, -5] 28
Name: p_change, dtype: int64
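The raw interval objects above can be hard to read; pd.cut also accepts a labels parameter to name the bins. A small sketch on sample values (both the values and the label names below are hypothetical, not from the stock file):

```python
import pandas as pd

# Sample price-change values, not the real data
p_change = pd.Series([2.68, 3.02, -8.5, 0.5, 6.1, -4.2])
bins = [-100, -7, -5, -3, 0, 3, 5, 7, 100]

# One label per interval (8 bins -> 8 labels); names are illustrative
names = ["big drop", "drop 5-7", "drop 3-5", "small drop",
         "small rise", "rise 3-5", "rise 5-7", "big rise"]
labeled = pd.cut(p_change, bins, labels=names)
print(labeled)
```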
Convert the grouped price-change data to one-hot encoding

What is one-hot encoding?
Each category becomes a boolean column, and for any given sample exactly one of these columns takes the value 1; hence the name "one-hot".

pd.get_dummies(data, prefix=None)
- data: array-like, Series, or DataFrame
- prefix: prefix for the generated column names
dummies = pd.get_dummies(p_counts, prefix='rise')
dummies.head()
| | rise_(-100, -7] | rise_(-7, -5] | rise_(-5, -3] | rise_(-3, 0] | rise_(0, 3] | rise_(3, 5] | rise_(5, 7] | rise_(7, 100] |
|---|---|---|---|---|---|---|---|---|
| 2018-02-27 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2018-02-26 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2018-02-23 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2018-02-22 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2018-02-14 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Data Merging

Merging data with pd.concat
- pd.concat([data1, data2], axis=1)
- Concatenates along rows or columns: axis=0 stacks the DataFrames vertically (row-wise); axis=1 aligns on the index and places them side by side (column-wise)
# Concatenate the one-hot columns with the original data column-wise
result = pd.concat([data, dummies], axis=1)
result
| | open | high | close | low | volume | price_change | p_change | ma5 | ma10 | ma20 | ... | v_ma20 | turnover | rise_(-100, -7] | rise_(-7, -5] | rise_(-5, -3] | rise_(-3, 0] | rise_(0, 3] | rise_(3, 5] | rise_(5, 7] | rise_(7, 100] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-02-27 | 23.53 | 25.88 | 24.16 | 23.53 | 95578.03 | 0.63 | 2.68 | 22.942 | 22.142 | 22.875 | ... | 55576.11 | 2.39 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2018-02-26 | 22.80 | 23.78 | 23.53 | 22.80 | 60985.11 | 0.69 | 3.02 | 22.406 | 21.955 | 22.942 | ... | 56007.50 | 1.53 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2018-02-23 | 22.88 | 23.37 | 22.82 | 22.71 | 52914.01 | 0.54 | 2.42 | 21.938 | 21.929 | 23.022 | ... | 56372.85 | 1.32 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2018-02-22 | 22.25 | 22.76 | 22.28 | 22.02 | 36105.01 | 0.36 | 1.64 | 21.446 | 21.909 | 23.137 | ... | 60149.60 | 0.90 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2018-02-14 | 21.49 | 21.99 | 21.92 | 21.48 | 23331.04 | 0.44 | 2.05 | 21.366 | 21.923 | 23.253 | ... | 61716.11 | 0.58 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2015-03-06 | 13.17 | 14.48 | 14.28 | 13.13 | 179831.72 | 1.12 | 8.51 | 13.112 | 13.112 | 13.112 | ... | 115090.18 | 6.16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2015-03-05 | 12.88 | 13.45 | 13.16 | 12.87 | 93180.39 | 0.26 | 2.02 | 12.820 | 12.820 | 12.820 | ... | 98904.79 | 3.19 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2015-03-04 | 12.80 | 12.92 | 12.90 | 12.61 | 67075.44 | 0.20 | 1.57 | 12.707 | 12.707 | 12.707 | ... | 100812.93 | 2.30 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2015-03-03 | 12.52 | 13.06 | 12.70 | 12.52 | 139071.61 | 0.18 | 1.44 | 12.610 | 12.610 | 12.610 | ... | 117681.67 | 4.76 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2015-03-02 | 12.25 | 12.67 | 12.52 | 12.20 | 96291.73 | 0.32 | 2.62 | 12.520 | 12.520 | 12.520 | ... | 96291.73 | 3.30 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

643 rows × 22 columns
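The axis argument is easy to mix up, so here is a minimal sketch on two tiny DataFrames (the frames a and b are made up for illustration):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"y": [3, 4]})

# axis=1: align on the row index, columns placed side by side
wide = pd.concat([a, b], axis=1)   # 2 rows, columns x and y

# axis=0: stack vertically; non-shared columns are filled with NaN
tall = pd.concat([a, b], axis=0)   # 4 rows, columns x and y

print(wide)
print(tall)
```

This is why the stock example uses axis=1: the one-hot columns share the same date index as the original data and should sit beside it, not below it.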
Merging with pd.merge
- pd.merge(left, right, how='inner', on=None)
- Joins two DataFrames on their shared key column(s), or keeps all rows from one side, depending on how
- left: a DataFrame
- right: another DataFrame
- on: the shared key column(s) to join on
- how: the type of join to perform
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
left

| | key1 | key2 | A | B |
|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 |
| 1 | K0 | K1 | A1 | B1 |
| 2 | K1 | K0 | A2 | B2 |
| 3 | K2 | K1 | A3 | B3 |
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
right

| | key1 | key2 | C | D |
|---|---|---|---|---|
| 0 | K0 | K0 | C0 | D0 |
| 1 | K1 | K0 | C1 | D1 |
| 2 | K1 | K0 | C2 | D2 |
| 3 | K2 | K0 | C3 | D3 |
# Default inner join: keep only the key pairs present in both (the intersection)
result = pd.merge(left, right, on=['key1', 'key2'])
result

| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |
# Outer join: keep the union of key pairs from both sides
result = pd.merge(left, right, on=['key1', 'key2'], how='outer')
result

| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN |
| 5 | K2 | K0 | NaN | NaN | C3 | D3 |
# Left join: keep all key pairs from the left DataFrame
result = pd.merge(left, right, on=['key1', 'key2'], how='left')
result

| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN |
# Right join: keep all key pairs from the right DataFrame
result = pd.merge(left, right, on=['key1', 'key2'], how='right')
result

| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |
| 3 | K2 | K0 | NaN | NaN | C3 | D3 |
Summary:
- pd.concat([data1, data2], axis=...)
- pd.merge(left, right, how=..., on=...)
- how: the type of join to perform
- on: the key column(s) to join on
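The four join types can be compared side by side by counting output rows; a small sketch on toy DataFrames with a single key column (the data are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "B": [4, 5, 6]})

# Row counts illustrate the four join types:
# inner keeps the intersection of keys, outer the union,
# left/right keep all keys from that respective side.
results = {}
for how in ["inner", "outer", "left", "right"]:
    merged = pd.merge(left, right, on="key", how=how)
    results[how] = len(merged)
    print(how, len(merged))
```

Here the shared keys are K1 and K2, so inner yields 2 rows, outer 4, and left and right 3 each.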