Pandas Study Notes: Data Discretization and Data Merging, One-Hot Encoding, pd.concat, pd.merge

Data Discretization

Introduction

Discretization converts a continuous attribute into a small set of discrete intervals (bins): the value range is split into intervals and each observation is mapped to the bin it falls into. It reduces the number of distinct values and is a common preprocessing step before one-hot encoding.

Data Preparation

import pandas as pd

# Read the daily stock data and take the daily percentage-change column
data = pd.read_csv("../data/stock_day.csv")
p_change = data["p_change"]
p_change
2018-02-27    2.68
2018-02-26    3.02
2018-02-23    2.42
2018-02-22    1.64
2018-02-14    2.05
              ... 
2015-03-06    8.51
2015-03-05    2.02
2015-03-04    1.57
2015-03-03    1.44
2015-03-02    2.62
Name: p_change, Length: 643, dtype: float64

Binning the Data

Tools used (a self-contained sketch contrasting the two follows the list):

  1. pd.qcut(data, q): data is the series to bin and q is the number of quantile-based groups, so the edges are chosen such that each bin holds roughly the same number of observations; Series.value_counts() then counts how many values fall into each bin.
  2. pd.cut(data, bins): data is the series to bin and bins is a list of user-defined bin edges (e.g. [-100, -50, 0, 50, 100]).
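Before applying these to the stock data, here is a minimal self-contained sketch (the Series s and its values are made up for illustration, no CSV needed) showing the difference: qcut picks the edges so each bin gets roughly the same number of values, while cut uses exactly the edges supplied.

import pandas as pd

# Ten synthetic percentage changes (illustrative values, not from the stock file)
s = pd.Series([-8.2, -4.1, -2.5, -0.3, 0.4, 1.1, 2.7, 3.9, 6.5, 9.8])

# qcut: pandas picks the edges so each of the 5 bins gets ~2 values
pd.qcut(s, 5).value_counts()

# cut: the edges are fixed by the caller, so the counts per bin can differ
pd.cut(s, bins=[-10, -5, 0, 5, 10]).value_counts()
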
# Bin p_change into 10 quantile-based groups
qcut = pd.qcut(p_change, 10)
# Inspect the result
qcut
2018-02-27    (1.738, 2.938]
2018-02-26     (2.938, 5.27]
2018-02-23    (1.738, 2.938]
2018-02-22     (0.94, 1.738]
2018-02-14    (1.738, 2.938]
                   ...      
2015-03-06     (5.27, 10.03]
2015-03-05    (1.738, 2.938]
2015-03-04     (0.94, 1.738]
2015-03-03     (0.94, 1.738]
2015-03-02    (1.738, 2.938]
Name: p_change, Length: 643, dtype: category
Categories (10, interval[float64, right]): [(-10.030999999999999, -4.836] < (-4.836, -2.444] < (-2.444, -1.352] < (-1.352, -0.462] ... (0.94, 1.738] < (1.738, 2.938] < (2.938, 5.27] < (5.27, 10.03]]
# Count the observations in each bin
qcut.value_counts()
(-10.030999999999999, -4.836]    65
(-0.462, 0.26]                   65
(0.26, 0.94]                     65
(5.27, 10.03]                    65
(-4.836, -2.444]                 64
(-2.444, -1.352]                 64
(-1.352, -0.462]                 64
(1.738, 2.938]                   64
(2.938, 5.27]                    64
(0.94, 1.738]                    63
Name: p_change, dtype: int64
# Bin with user-defined interval edges via cut
bins = [-100, -7, -5, -3, 0, 3, 5, 7, 100]
cut = pd.cut(p_change, bins=bins)
cut.value_counts()
(0, 3]        215
(-3, 0]       188
(3, 5]         57
(-5, -3]       51
(5, 7]         35
(7, 100]       35
(-100, -7]     34
(-7, -5]       28
Name: p_change, dtype: int64

Converting to One-Hot Encoding

One-hot encoding turns a categorical column with k categories into k binary (0/1) columns: each row gets a 1 in the column of its own category and 0 everywhere else.

Use pd.get_dummies(data, prefix): data is the already-binned categorical data, and prefix is the prefix added to the generated column names.

pd.get_dummies(cut)
            (-100, -7]  (-7, -5]  (-5, -3]  (-3, 0]  (0, 3]  (3, 5]  (5, 7]  (7, 100]
2018-02-27           0         0         0        0       1       0       0         0
2018-02-26           0         0         0        0       0       1       0         0
2018-02-23           0         0         0        0       1       0       0         0
2018-02-22           0         0         0        0       1       0       0         0
2018-02-14           0         0         0        0       1       0       0         0
...                ...       ...       ...      ...     ...     ...     ...       ...
2015-03-06           0         0         0        0       0       0       0         1
2015-03-05           0         0         0        0       1       0       0         0
2015-03-04           0         0         0        0       1       0       0         0
2015-03-03           0         0         0        0       1       0       0         0
2015-03-02           0         0         0        0       1       0       0         0

643 rows × 8 columns

dummies = pd.get_dummies(cut, prefix="rise")
dummies
            rise_(-100, -7]  rise_(-7, -5]  rise_(-5, -3]  rise_(-3, 0]  rise_(0, 3]  rise_(3, 5]  rise_(5, 7]  rise_(7, 100]
2018-02-27                0              0              0             0            1            0            0              0
2018-02-26                0              0              0             0            0            1            0              0
2018-02-23                0              0              0             0            1            0            0              0
2018-02-22                0              0              0             0            1            0            0              0
2018-02-14                0              0              0             0            1            0            0              0
...                     ...            ...            ...           ...          ...          ...          ...            ...
2015-03-06                0              0              0             0            0            0            0              1
2015-03-05                0              0              0             0            1            0            0              0
2015-03-04                0              0              0             0            1            0            0              0
2015-03-03                0              0              0             0            1            0            0              0
2015-03-02                0              0              0             0            1            0            0              0

643 rows × 8 columns
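Note: on recent pandas releases (2.x), get_dummies returns boolean columns rather than 0/1 integers. If numeric indicators are preferred, the dtype argument can be passed; a minimal sketch, reusing the binned cut data from above:

# Ask get_dummies for integer 0/1 columns explicitly
dummies = pd.get_dummies(cut, prefix="rise", dtype=int)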

Data Merging

Merging Data with pd.concat

  • pd.concat([data1, data2, …], axis=1) concatenates DataFrames along rows or columns: axis=1 places them side by side, aligning on the row index, while axis=0 stacks them vertically, aligning on the column names (a small sketch of both appears below).
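As a small illustration of the two axis values (the frames df1 and df2 below are made up, not part of the stock example):

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
df2 = pd.DataFrame({"b": [3, 4]}, index=["x", "y"])

# axis=1: side by side, rows matched on the index -> columns a and b
pd.concat([df1, df2], axis=1)

# axis=0: stacked vertically, columns matched by name -> missing cells become NaN
pd.concat([df1, df2], axis=0)
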
# Concatenate the one-hot columns with the original data
pd.concat([data, dummies], axis=1)
             open   high  close    low     volume  price_change  p_change     ma5    ma10    ma20  ...     v_ma20  turnover  rise_(-100, -7]  rise_(-7, -5]  rise_(-5, -3]  rise_(-3, 0]  rise_(0, 3]  rise_(3, 5]  rise_(5, 7]  rise_(7, 100]
2018-02-27  23.53  25.88  24.16  23.53   95578.03          0.63      2.68  22.942  22.142  22.875  ...   55576.11      2.39                0              0              0             0            1            0            0              0
2018-02-26  22.80  23.78  23.53  22.80   60985.11          0.69      3.02  22.406  21.955  22.942  ...   56007.50      1.53                0              0              0             0            0            1            0              0
2018-02-23  22.88  23.37  22.82  22.71   52914.01          0.54      2.42  21.938  21.929  23.022  ...   56372.85      1.32                0              0              0             0            1            0            0              0
2018-02-22  22.25  22.76  22.28  22.02   36105.01          0.36      1.64  21.446  21.909  23.137  ...   60149.60      0.90                0              0              0             0            1            0            0              0
2018-02-14  21.49  21.99  21.92  21.48   23331.04          0.44      2.05  21.366  21.923  23.253  ...   61716.11      0.58                0              0              0             0            1            0            0              0
...           ...    ...    ...    ...        ...           ...       ...     ...     ...     ...  ...        ...       ...              ...            ...            ...           ...          ...          ...          ...            ...
2015-03-06  13.17  14.48  14.28  13.13  179831.72          1.12      8.51  13.112  13.112  13.112  ...  115090.18      6.16                0              0              0             0            0            0            0              1
2015-03-05  12.88  13.45  13.16  12.87   93180.39          0.26      2.02  12.820  12.820  12.820  ...   98904.79      3.19                0              0              0             0            1            0            0              0
2015-03-04  12.80  12.92  12.90  12.61   67075.44          0.20      1.57  12.707  12.707  12.707  ...  100812.93      2.30                0              0              0             0            1            0            0              0
2015-03-03  12.52  13.06  12.70  12.52  139071.61          0.18      1.44  12.610  12.610  12.610  ...  117681.67      4.76                0              0              0             0            1            0            0              0
2015-03-02  12.25  12.67  12.52  12.20   96291.73          0.32      2.62  12.520  12.520  12.520  ...   96291.73      3.30                0              0              0             0            1            0            0              0

643 rows × 22 columns

Merging Data with pd.merge

pd.merge(left, right, on=..., how=...) joins two DataFrames on one or more key columns; how selects the join type and can be "inner" (the default), "outer", "left", or "right".

# Prepare two small DataFrames
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

left
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3
right
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
# Join on key1 and key2; the default join type is inner
pd.merge(left, right, on=["key1", "key2"])
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2
# Outer join: matching key pairs are joined (a row may be matched more than once); rows without a match on the other side are filled with NaN
pd.merge(left, right, on=["key1", "key2"], how="outer")
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3
# Left join: matching rows are joined and every row of the left frame is kept
pd.merge(left, right, on=["key1", "key2"], how="left")
  key1 key2   A   B    C    D
0   K0   K0  A0  B0   C0   D0
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN
# Right join: every row of the right frame is kept
pd.merge(left, right, on=["key1", "key2"], how="right")
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3
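
When the two frames name their key columns differently, pd.merge also accepts left_on/right_on in place of on. A minimal sketch (users, orders, and their columns are invented for illustration, not part of the example above):

users = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cal']})
orders = pd.DataFrame({'uid': [1, 1, 3], 'amount': [10, 20, 30]})

# Join on differently named key columns; both key columns are kept in the result
pd.merge(users, orders, left_on='user_id', right_on='uid', how='left')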