Python categorical variable encoding: speeding up the conversion of categorical variables to numeric indices

Latest recommended article published 2022-09-28 08:52:59

weixin_39607474

Tags: python, categorical variable encoding

I need to convert a column of categorical variables in a Pandas data frame into numerical values, where each value is the index of that category in the array of unique categories in the column (long story!). Here's a code snippet that accomplishes that:

import pandas as pd

import numpy as np

d = {'col': ["baked","beans","baked","baked","beans"]}

df = pd.DataFrame(data=d)

uniq_lab = np.unique(df['col'])

for lab in uniq_lab:

    df['col'].replace(lab,np.where(uniq_lab == lab)[0][0].astype(float),inplace=True)

which converts the data frame:

col

0 baked

1 beans

2 baked

3 baked

4 beans

into the data frame:

col

0 0.0

1 1.0

2 0.0

3 0.0

4 1.0

as desired. But my problem is that my dumb little for loop (the only way I've thought of to do this) is slow as molasses when I try to run similar code on big data files. I was just curious as to whether anyone had any thoughts on whether there were any ways to do this more efficiently. Thanks in advance for any thoughts.

Solution:

df['col'] = pd.factorize(df.col)[0]

print (df)

col

0 0

1 1

2 0

3 0

4 1
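One thing worth checking against the question: by default `pd.factorize` numbers labels in order of first appearance, whereas the question asks for the index into `np.unique`, which is sorted. On the sample data the two agree because "baked" happens to come first. The sketch below uses hypothetical data (the Series `s` is not from the original post) where they differ, and shows that passing `sort=True` should bring the codes in line with `np.unique`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first label seen is NOT the alphabetically first one.
s = pd.Series(["beans", "baked", "beans", "baked", "baked"])

# Default: codes follow the order in which labels first appear.
codes_default, uniques_default = pd.factorize(s)

# sort=True: unique labels are sorted first, then codes index into them.
codes_sorted, uniques_sorted = pd.factorize(s, sort=True)

# The sorted array of unique labels that the question indexes into.
uniq_lab = np.unique(s.to_numpy())

print(codes_default)  # [0 1 0 1 1]
print(codes_sorted)   # [1 0 1 0 0]
```

If the numeric values only need to be distinct per category, the default is enough; `sort=True` matters only when the codes must match positions in the sorted unique array.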

EDIT:

As Jeff mentioned in a comment, the best option is to convert the column to categorical, mainly because it uses less memory:

df['col'] = df['col'].astype("category")
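Converting to categorical keeps the labels visible in the column, so the numeric indices still have to be read out separately. A minimal sketch, reusing the sample data from the question, of getting them through the `.cat` accessor (the variable names `cat_col`, `codes` and `categories` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})

# Convert to a categorical column; the categories are the sorted unique labels.
cat_col = df["col"].astype("category")

# Integer code of each row, i.e. its position in cat_col.cat.categories.
codes = cat_col.cat.codes
categories = cat_col.cat.categories

print(codes.tolist())    # [0, 1, 0, 0, 1]
print(list(categories))  # ['baked', 'beans']
```

Missing values get the code -1, so they are worth handling explicitly if the column can contain NaN.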

Timings:

Interestingly, on a large df pandas is faster than numpy (about 9 ms versus 25 ms at 500k rows), while on very small frames numpy is slightly faster. I can't believe it.

len(df)=500k:

In [29]: %timeit (a(df1))

100 loops, best of 3: 9.27 ms per loop

In [30]: %timeit (a1(df2))

100 loops, best of 3: 9.32 ms per loop

In [31]: %timeit (b(df3))

10 loops, best of 3: 24.6 ms per loop

In [32]: %timeit (b1(df4))

10 loops, best of 3: 24.6 ms per loop

len(df)=5k:

In [38]: %timeit (a(df1))

1000 loops, best of 3: 274 µs per loop

In [39]: %timeit (a1(df2))

The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.

1000 loops, best of 3: 273 µs per loop

In [40]: %timeit (b(df3))

The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.

1000 loops, best of 3: 295 µs per loop

In [41]: %timeit (b1(df4))

1000 loops, best of 3: 294 µs per loop

len(df)=5:

In [46]: %timeit (a(df1))

1000 loops, best of 3: 206 µs per loop

In [47]: %timeit (a1(df2))

1000 loops, best of 3: 204 µs per loop

In [48]: %timeit (b(df3))

The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.

10000 loops, best of 3: 164 µs per loop

In [49]: %timeit (b1(df4))

The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.

10000 loops, best of 3: 164 µs per loop

Code for testing:

import numpy as np

import pandas as pd

d = {'col': ["baked","beans","baked","baked","beans"]}

df = pd.DataFrame(data=d)

print (df)

df = pd.concat([df]*100000).reset_index(drop=True)

#test for 5k

#df = pd.concat([df]*1000).reset_index(drop=True)

df1,df2,df3, df4 = df.copy(),df.copy(),df.copy(),df.copy()

def a(df):

    df['col'] = pd.factorize(df.col)[0]

    return df

def a1(df):

    idx,_ = pd.factorize(df.col)

    df['col'] = idx

    return df

def b(df):

    df['col'] = np.unique(df['col'],return_inverse=True)[1]

    return df

def b1(df):

    _,idx = np.unique(df['col'],return_inverse=True)

    df['col'] = idx

    return df

print (a(df1))

print (a1(df2))

print (b(df3))

print (b1(df4))
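Note that `a`, `a1`, `b` and `b1` overwrite `col` in place, so after the first call every repeated `%timeit` run encodes a column that already holds integers rather than strings. A rough sketch of a harness that avoids this by leaving the frame untouched, using the standard `timeit` module so it also runs outside IPython (the function names `with_factorize` and `with_unique` and the repeat counts are my own choices, not from the original answer):

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})
big = pd.concat([base] * 1000).reset_index(drop=True)  # 5k rows of strings

def with_factorize(frame):
    # Returns the codes without modifying the frame.
    return pd.factorize(frame["col"])[0]

def with_unique(frame):
    # Returns the inverse indices without modifying the frame.
    return np.unique(frame["col"].to_numpy(), return_inverse=True)[1]

runs = 20
# Best of 3 repeats, reported as seconds per call.
t_factorize = min(timeit.repeat(lambda: with_factorize(big), number=runs, repeat=3)) / runs
t_unique = min(timeit.repeat(lambda: with_unique(big), number=runs, repeat=3)) / runs

print(f"factorize: {t_factorize * 1e6:.0f} µs per call")
print(f"np.unique: {t_unique * 1e6:.0f} µs per call")
```

The absolute numbers will depend on the machine and the pandas/NumPy versions, so they should not be expected to match the timings above.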
