Python categorical variable encoding: speeding up the conversion of categorical variables to numeric indices

Use factorize:

df['col'] = pd.factorize(df.col)[0]

print(df)

   col
0    0
1    1
2    0
3    0
4    1
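pd.factorize returns a pair: the integer codes and the unique values, numbered in order of first appearance. The line above keeps only the codes. The following is a minimal sketch, with illustrative variable names (codes, uniques, restored) that are not part of the original answer, showing how keeping the uniques lets the codes be mapped back to the labels:

```python
import pandas as pd

df = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})

# factorize returns (codes, uniques); each code is a position in uniques
codes, uniques = pd.factorize(df["col"])

print(codes.tolist())   # [0, 1, 0, 0, 1]
print(list(uniques))    # ['baked', 'beans']

# Map the codes back to the original labels
restored = uniques.take(codes)
print(list(restored))   # ['baked', 'beans', 'baked', 'baked', 'beans']
```

By default, missing values are encoded as -1, so a round trip like the one above is only safe on a column without NaN.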

Edit:

As Jeff mentioned in the comments, the best option is to convert the column to the categorical dtype, mainly because it uses less memory:

df['col'] = df['col'].astype("category")
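A categorical column keeps the labels and stores small integer codes underneath, so the numeric index is still available through .cat.codes. The sketch below uses made-up sample data repeated 1000 times and compares memory with memory_usage(deep=True); the exact byte counts depend on the pandas version, so none are quoted here:

```python
import pandas as pd

df = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"] * 1000})

as_strings = df["col"]
as_category = df["col"].astype("category")

# Integer codes behind the categorical (categories are sorted here)
codes = as_category.cat.codes
print(codes.head().tolist())               # [0, 1, 0, 0, 1]
print(list(as_category.cat.categories))    # ['baked', 'beans']

# deep=True also counts the memory held by the string values themselves
mem_strings = as_strings.memory_usage(deep=True)
mem_category = as_category.memory_usage(deep=True)
print(mem_strings, mem_category)
```

The saving grows with the number of rows per distinct value: each row costs one small integer, and each label is stored only once.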

Timings:

Interestingly, for a large DataFrame pandas is faster than numpy. I could hardly believe it.
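One likely reason, stated here as an assumption rather than something measured in this post, is that pd.factorize is hash-based while np.unique sorts its input. Note also that the functions a, a1, b and b1 in the test code overwrite the column in place, so every %timeit loop after the first runs on integers that are already encoded. The sketch below is one way to time the two approaches without mutating the input; the helper names encode_factorize and encode_unique are hypothetical and no expected timings are given:

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})
df = pd.concat([base] * 1000).reset_index(drop=True)  # 5k rows


def encode_factorize(frame):
    # Returns the codes only; the input frame is left untouched
    return pd.factorize(frame["col"])[0]


def encode_unique(frame):
    # return_inverse=True yields the position of each value in the sorted uniques
    return np.unique(frame["col"].to_numpy(), return_inverse=True)[1]


for fn in (encode_factorize, encode_unique):
    # Best of 3 repeats, 20 calls each
    seconds = min(timeit.repeat(lambda: fn(df), number=20, repeat=3))
    print(f"{fn.__name__}: {seconds / 20 * 1e6:.0f} µs per call")
```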

len(df) = 500k:

In [29]: %timeit (a(df1))

100 loops, best of 3: 9.27 ms per loop

In [30]: %timeit (a1(df2))

100 loops, best of 3: 9.32 ms per loop

In [31]: %timeit (b(df3))

10 loops, best of 3: 24.6 ms per loop

In [32]: %timeit (b1(df4))

10 loops, best of 3: 24.6 ms per loop

len(df) = 5k:

In [38]: %timeit (a(df1))

1000 loops, best of 3: 274 µs per loop

In [39]: %timeit (a1(df2))

The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.

1000 loops, best of 3: 273 µs per loop

In [40]: %timeit (b(df3))

The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.

1000 loops, best of 3: 295 µs per loop

In [41]: %timeit (b1(df4))

1000 loops, best of 3: 294 µs per loop

len(df) = 5:

In [46]: %timeit (a(df1))

1000 loops, best of 3: 206 µs per loop

In [47]: %timeit (a1(df2))

1000 loops, best of 3: 204 µs per loop

In [48]: %timeit (b(df3))

The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.

10000 loops, best of 3: 164 µs per loop

In [49]: %timeit (b1(df4))

The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.

10000 loops, best of 3: 164 µs per loop

Test code:

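import numpy as np
import pandas as pd

d = {'col': ["baked", "beans", "baked", "baked", "beans"]}
df = pd.DataFrame(data=d)
print(df)

# 5 rows repeated 100000 times = 500k rows
df = pd.concat([df] * 100000).reset_index(drop=True)
# test for 5k
# df = pd.concat([df] * 1000).reset_index(drop=True)

# One copy per function, because each function overwrites 'col' in place
df1, df2, df3, df4 = df.copy(), df.copy(), df.copy(), df.copy()

# pandas: take the codes by indexing the result of factorize
def a(df):
    df['col'] = pd.factorize(df.col)[0]
    return df

# pandas: take the codes by tuple unpacking
def a1(df):
    idx, _ = pd.factorize(df.col)
    df['col'] = idx
    return df

# numpy: take the inverse indices by indexing the result of np.unique
def b(df):
    df['col'] = np.unique(df['col'], return_inverse=True)[1]
    return df

# numpy: take the inverse indices by tuple unpacking
def b1(df):
    _, idx = np.unique(df['col'], return_inverse=True)
    df['col'] = idx
    return df

print(a(df1))
print(a1(df2))
print(b(df3))
print(b1(df4))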
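The four functions give the same codes on this data only because the order of first appearance ("baked", then "beans") happens to match the sorted order. pd.factorize numbers values by first appearance, while np.unique numbers them in sorted order. A small sketch with a made-up Series that starts with "beans" makes the difference visible:

```python
import numpy as np
import pandas as pd

s = pd.Series(["beans", "baked", "beans"])

# factorize numbers values in order of first appearance
f_codes, f_uniques = pd.factorize(s)

# np.unique numbers values in sorted order
u_uniques, u_codes = np.unique(s.to_numpy(), return_inverse=True)

print(f_codes.tolist(), list(f_uniques))      # [0, 1, 0] ['beans', 'baked']
print(u_codes.tolist(), u_uniques.tolist())   # [1, 0, 1] ['baked', 'beans']

# sort=True makes factorize use the sorted numbering as well
s_codes, s_uniques = pd.factorize(s, sort=True)
print(s_codes.tolist(), list(s_uniques))      # [1, 0, 1] ['baked', 'beans']
```

If the codes need to be comparable across the two approaches, pass sort=True to factorize or keep the uniques alongside the codes.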
