Python categorical variable encoding: speeding up the conversion of categorical variables to numeric indices


weixin_39752880


Tags: python, categorical variable encoding

Use factorize:

df['col'] = pd.factorize(df.col)[0]
print(df)

   col
0    0
1    1
2    0
3    0
4    1
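The one-liner above keeps only the first element of what pd.factorize returns. As a minimal sketch (the names codes, uniques and restored are illustrative and not from the original answer), the following shows both return values and how the original labels can be recovered from the integer codes:

```python
import pandas as pd

df = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})

# factorize returns a pair: integer codes and the unique labels.
# Codes are assigned in order of first appearance.
codes, uniques = pd.factorize(df["col"])
print(codes.tolist())   # [0, 1, 0, 0, 1]
print(list(uniques))    # ['baked', 'beans']

# Map the integer codes back to the original labels.
restored = uniques.take(codes)
print(list(restored) == df["col"].tolist())   # True
```

Keeping uniques around is what makes the encoding reversible; the [0] indexing in the answer discards it.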

Edit:

As Jeff mentioned in the comments, the best option is to convert the column to a categorical dtype, mainly because it uses less memory:

df['col'] = df['col'].astype("category")
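Note that astype("category") keeps the labels visible; the numeric indices live in the .cat.codes accessor. The sketch below, which assumes a 50,000-row frame built only for illustration, reads those codes and compares memory usage before and after the conversion. Exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Illustrative data: the same five labels repeated to 50,000 rows.
labels = ["baked", "beans", "baked", "baked", "beans"] * 10_000
df = pd.DataFrame({"col": labels})

as_category = df["col"].astype("category")

# Integer code per row; inferred categories are sorted,
# so 'baked' -> 0 and 'beans' -> 1.
codes = as_category.cat.codes
print(codes.head().tolist())              # [0, 1, 0, 0, 1]
print(list(as_category.cat.categories))   # ['baked', 'beans']

# deep=True includes the string payloads in the measurement.
bytes_before = df["col"].memory_usage(deep=True)
bytes_after = as_category.memory_usage(deep=True)
print(bytes_before, bytes_after)
```

If only the integers are needed downstream, df["col"].astype("category").cat.codes gives them in one expression.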

Timings:

Interestingly, on a large DataFrame pandas is faster than NumPy. I can hardly believe it.

len(df) = 500k:

In [29]: %timeit (a(df1))

100 loops, best of 3: 9.27 ms per loop

In [30]: %timeit (a1(df2))

100 loops, best of 3: 9.32 ms per loop

In [31]: %timeit (b(df3))

10 loops, best of 3: 24.6 ms per loop

In [32]: %timeit (b1(df4))

10 loops, best of 3: 24.6 ms per loop

len(df) = 5k:

In [38]: %timeit (a(df1))

1000 loops, best of 3: 274 µs per loop

In [39]: %timeit (a1(df2))

The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.

1000 loops, best of 3: 273 µs per loop

In [40]: %timeit (b(df3))

The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.

1000 loops, best of 3: 295 µs per loop

In [41]: %timeit (b1(df4))

1000 loops, best of 3: 294 µs per loop

len(df) = 5:

In [46]: %timeit (a(df1))

1000 loops, best of 3: 206 µs per loop

In [47]: %timeit (a1(df2))

1000 loops, best of 3: 204 µs per loop

In [48]: %timeit (b(df3))

The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.

10000 loops, best of 3: 164 µs per loop

In [49]: %timeit (b1(df4))

The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.

10000 loops, best of 3: 164 µs per loop
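The two approaches being timed do not always produce the same codes. A small sketch with a three-element series (chosen here for illustration, not taken from the answer) shows the difference: pd.factorize numbers labels in order of first appearance, while np.unique returns sorted unique values and an inverse index into that sorted array. The speed gap is plausibly related, since factorize is hash-table based whereas np.unique has traditionally relied on sorting, but that explanation is my assumption and is not something the timings above demonstrate.

```python
import numpy as np
import pandas as pd

s = pd.Series(["beans", "baked", "beans"])

# pandas: codes follow the order of first appearance.
f_codes, f_uniques = pd.factorize(s)

# NumPy: unique values are sorted; the inverse indexes into the sorted array.
n_uniques, n_codes = np.unique(s.to_numpy(), return_inverse=True)

print(f_codes.tolist(), list(f_uniques))      # [0, 1, 0] ['beans', 'baked']
print(n_codes.tolist(), n_uniques.tolist())   # [1, 0, 1] ['baked', 'beans']

# sort=True makes factorize agree with np.unique.
s_codes, s_uniques = pd.factorize(s, sort=True)
print(s_codes.tolist(), list(s_uniques))      # [1, 0, 1] ['baked', 'beans']
```

In the answer's own data "baked" happens to appear first and also sorts first, which is why all four functions print the same result there.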

Test code:

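import numpy as np
import pandas as pd

d = {'col': ["baked", "beans", "baked", "baked", "beans"]}
df = pd.DataFrame(data=d)
print(df)

# 5 rows * 100000 = 500k rows
df = pd.concat([df]*100000).reset_index(drop=True)
# for the 5k test, use this line instead of the one above
#df = pd.concat([df]*1000).reset_index(drop=True)

# one independent copy per function, since each one overwrites 'col'
df1, df2, df3, df4 = df.copy(), df.copy(), df.copy(), df.copy()

def a(df):
    # pandas: keep only the codes returned by factorize
    df['col'] = pd.factorize(df.col)[0]
    return df

def a1(df):
    # pandas: same as a(), with tuple unpacking
    idx, _ = pd.factorize(df.col)
    df['col'] = idx
    return df

def b(df):
    # NumPy: the inverse index of unique serves as the codes
    df['col'] = np.unique(df['col'], return_inverse=True)[1]
    return df

def b1(df):
    # NumPy: same as b(), with tuple unpacking
    _, idx = np.unique(df['col'], return_inverse=True)
    df['col'] = idx
    return df

print(a(df1))
print(a1(df2))
print(b(df3))
print(b1(df4))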
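One caveat about the test code: each function writes its result back into the frame it receives, so after the first call the column already holds integers, and any repeated timing loop then encodes integers rather than the original strings. The sketch below is one possible way to time the string case repeatedly, under the assumption that returning the codes without assigning them is acceptable. The function names encode_factorize and encode_unique are hypothetical, and no timings are quoted because they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})
df = pd.concat([base] * 1000).reset_index(drop=True)   # 5k rows

def encode_factorize(frame):
    # Return the codes only; the frame itself is left unchanged.
    return pd.factorize(frame["col"])[0]

def encode_unique(frame):
    # Inverse index from np.unique, again without modifying the frame.
    return np.unique(frame["col"].to_numpy(), return_inverse=True)[1]

for fn in (encode_factorize, encode_unique):
    # Best of 3 repeats, 20 calls each, reported per call.
    best = min(timeit.repeat(lambda: fn(df), number=20, repeat=3)) / 20
    print(f"{fn.__name__}: {best * 1e6:.0f} µs per call")
```

Because df is never modified, every call sees the same string column, so the numbers are comparable across runs.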
