Learning Python: replacing values in pandas with a function or mapping; discretizing and binning data in pandas

Replacing values with a function or mapping

import pandas as pd

df = pd.DataFrame([['jeff',18]
                 ,['herry',20]
                 ,['chris',25]
                 ,['culry',38]],columns=['name','age'])
df
    name  age
0   jeff   18
1  herry   20
2  chris   25
3  culry   38
# Use Series.map() to transform a Series; the result can overwrite an existing column or be added as a new one
info = {'jeff':['dog',3]
        ,'herry':['cat',2]
        ,'chris':['cat',3]
        ,'culry':['cat',1]
       }

df['pet'] = df['name'].map(lambda k:info[k][0])
df['pet_name'] = df['name'].map(lambda k: info[k][1])
df
    name  age  pet  pet_name
0   jeff   18  dog         3
1  herry   20  cat         2
2  chris   25  cat         3
3  culry   38  cat         1
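The lambda above raises a KeyError as soon as a name is missing from info. A minimal sketch of one alternative, assuming missing values are preferable to an exception: flatten info into plain dictionaries and pass them to Series.map() directly. The names pet_by_name and number_by_name and the extra entry 'nobody' are made up for illustration.

```python
import pandas as pd

info = {'jeff': ['dog', 3], 'herry': ['cat', 2]}
names = pd.Series(['jeff', 'herry', 'nobody'])  # 'nobody' is a hypothetical name absent from info

# Build one flat dict per target column from the nested lists
pet_by_name = {name: values[0] for name, values in info.items()}
number_by_name = {name: values[1] for name, values in info.items()}

# When map() receives a dict, keys that are not found become NaN instead of raising
pets = names.map(pet_by_name)
numbers = names.map(number_by_name)

print(pets.tolist())
print(numbers.tolist())
```

Rows that come back as NaN can then be filled or dropped explicitly, for example with fillna().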

Binning

import numpy as np
import pandas as pd
ages = np.random.randint(4,100,30)
ages
array([60, 64, 98, 63, 73, 75, 62, 42, 43, 18, 70, 35, 32, 87,  4, 78, 78,
       37, 61, 47, 95, 62, 54, 90, 41, 48, 29, 27, 61, 91])

Binning by specified bin edges

  • pd.cut()
bins = [10,20,30,40,50,60,70,80,90,100]
cutdata = pd.cut(ages
                 ,bins
                 ,right=False
#                  ,labels=[str(i) for i in range(9)]
                )
# The result of binning is a Categorical object
cutdata
[[60, 70), [60, 70), [90, 100), [60, 70), [70, 80), ..., [40, 50), [20, 30), [20, 30), [60, 70), [90, 100)]
Length: 30
Categories (9, interval[int64]): [[10, 20) < [20, 30) < [30, 40) < [40, 50) ... [60, 70) < [70, 80) < [80, 90) < [90, 100)]
# Show all the intervals; because right=False was passed, they are closed on the left and open on the right
cutdata.categories
IntervalIndex([[10, 20), [20, 30), [30, 40), [40, 50), [50, 60), [60, 70), [70, 80), [80, 90), [90, 100)],
              closed='left',
              dtype='interval[int64]')
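The effect of the right parameter is easiest to see on values that sit exactly on a bin edge. A minimal sketch with made-up edges and values (not the ages above):

```python
import pandas as pd

edges = [10, 20, 30]
values = [10, 20, 25]

# Default right=True: bins are (10, 20] and (20, 30]
right_closed = pd.cut(values, edges)
# right=False: bins are [10, 20) and [20, 30)
left_closed = pd.cut(values, edges, right=False)

print(right_closed.categories.closed, list(right_closed.codes))  # right [-1, 0, 1]
print(left_closed.categories.closed, list(left_closed.codes))    # left [0, 1, 1]
```

With the default, 10 falls outside the first bin and gets no category, while 20 belongs to the first bin. With right=False, 10 is included and 20 moves to the second bin.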
cutdata = pd.cut(ages
                 ,bins
                 ,right=False  # left-closed, right-open intervals
                 ,labels=[str(i) for i in range(9)]  # one label per interval
                )
cutdata
['5', '5', '8', '5', '6', ..., '3', '1', '1', '5', '8']
Length: 30
Categories (9, object): ['0' < '1' < '2' < '3' ... '5' < '6' < '7' < '8']
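# codes holds the integer position of each value's bin; -1 marks a value that falls outside every bin (here the age 4)
cutdata.codes
array([ 5,  5,  8,  5,  6,  6,  5,  3,  3,  0,  6,  2,  2,  7, -1,  6,  6,
        2,  5,  3,  8,  5,  4,  8,  3,  3,  1,  1,  5,  8], dtype=int8)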
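A code of -1 means the value lies outside every bin, and pd.cut() stores it as a missing value. A minimal sketch of how this can be detected and avoided, using a shortened copy of the ages above; adding a lower edge of 0 is just one possible fix and is an assumption of this example:

```python
import numpy as np
import pandas as pd

ages = np.array([4, 18, 60, 98])
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

binned = pd.cut(ages, bins, right=False)
print(list(binned.codes))   # [-1, 0, 5, 8]: the age 4 is below the first edge
print(list(pd.isna(binned)))  # the same position is reported as missing

# One possible fix: prepend a lower edge so that every age falls into some bin
wider = pd.cut(ages, [0] + bins, right=False)
print(list(wider.codes))    # [0, 1, 6, 9]
```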
# Use the built-in value_counts() to count how many values fall into each bin
cutdata.value_counts()
0    1
1    2
2    3
3    5
4    1
5    7
6    5
7    1
8    4
dtype: int64
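Calling value_counts() on a Categorical reports every category in bin order, including bins that received no values. A minimal sketch with made-up ages and edges:

```python
import pandas as pd

ages = [18, 27, 29, 95]
bins = [10, 20, 30, 40, 100]

# Bins are [10, 20), [20, 30), [30, 40), [40, 100)
counts = pd.cut(ages, bins, right=False).value_counts()
print(counts)
print(counts.tolist())  # [1, 2, 0, 1]: the empty bin [30, 40) is kept with a count of 0
```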

Binning by specified quantiles

# Split at the quartiles so that each bin holds roughly the same number of values
qcutdata = pd.qcut(ages,q=[0,0.25,0.5,0.75,1])
qcutdata
[(41.25, 61.0], (61.0, 74.5], (74.5, 98.0], (61.0, 74.5], (61.0, 74.5], ..., (41.25, 61.0], (3.999, 41.25], (3.999, 41.25], (41.25, 61.0], (74.5, 98.0]]
Length: 30
Categories (4, interval[float64]): [(3.999, 41.25] < (41.25, 61.0] < (61.0, 74.5] < (74.5, 98.0]]
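To check that quantile binning really produces bins of similar size, the counts per bin and the computed edges can be inspected. A minimal sketch that reuses the 30 ages printed above; passing q=4 is shorthand for the quartile list used above, and retbins=True additionally returns the edges:

```python
import numpy as np
import pandas as pd

ages = np.array([60, 64, 98, 63, 73, 75, 62, 42, 43, 18, 70, 35, 32, 87, 4, 78, 78,
                 37, 61, 47, 95, 62, 54, 90, 41, 48, 29, 27, 61, 91])

# q=4 splits at the 25%, 50% and 75% quantiles; retbins=True also returns the bin edges
quartiles, edges = pd.qcut(ages, q=4, retbins=True)

counts = quartiles.value_counts()
print(counts.tolist())  # [8, 8, 6, 8]
print(edges)
```

The bins are not exactly equal in size here because several ages are tied at 61 and 62, right around the median edge.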