I like to use sklearn.preprocessing.LabelEncoder for the letter-to-digit conversion:
from sklearn.preprocessing import LabelEncoder
# Perform the groupby (before converting letters to digits).
df = df.groupby(["ID_0", "ID_1"]).size().rename("count").reset_index()
# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df[["ID_0", "ID_1"]].values.flat)
# Convert to digits.
df[["ID_0", "ID_1"]] = df[["ID_0", "ID_1"]].apply(le.transform)
Resulting output:
   ID_0  ID_1  count
0     0     2      2
1     1     3      1
2     2     0      3
3     3     4      1
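For anyone who wants to run the steps above end to end, here is a minimal sketch. The input frame `raw` is hypothetical: the original data is not shown in this answer, so it was chosen only so that the pair counts match the output above. It also uses `.to_numpy().ravel()` in place of `.values.flat`, which should give the encoder the same flattened letters.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical input, chosen so the pair counts match the output shown above.
raw = pd.DataFrame({
    "ID_0": ["a", "a", "b", "c", "c", "c", "f"],
    "ID_1": ["c", "c", "f", "a", "a", "a", "g"],
})

# Count each (ID_0, ID_1) pair while the IDs are still letters.
counts = raw.groupby(["ID_0", "ID_1"]).size().rename("count").reset_index()

# Fit one encoder on the letters of both columns so they share a single mapping.
le = LabelEncoder()
le.fit(counts[["ID_0", "ID_1"]].to_numpy().ravel())

# Encode both ID columns with the shared mapping.
encoded = counts.copy()
for col in ["ID_0", "ID_1"]:
    encoded[col] = le.transform(counts[col])

print(encoded)
```

Fitting on both columns at once matters: fitting a separate encoder per column would assign the same letter different digits in ID_0 and ID_1.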
If you want to convert back to letters later on, you can use le.inverse_transform:
df[["ID_0", "ID_1"]] = df[["ID_0", "ID_1"]].apply(le.inverse_transform)
which maps back as expected:
  ID_0 ID_1  count
0    a    c      2
1    b    f      1
2    c    a      3
3    f    g      1
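The round trip can also be checked on a plain array. This is a small sketch with a made-up `letters` array (an assumption, not the data from the question):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical letters to encode.
letters = np.array(["a", "c", "b", "f", "c", "a", "f", "g"])

le = LabelEncoder().fit(letters)

# Letters -> digits, then digits -> letters again.
codes = le.transform(letters)
restored = le.inverse_transform(codes)

print(codes)
print(restored)
```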
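If you just want to know which digit corresponds to which letter, you can look at the le.classes_ attribute. This gives you an array of letters in which each letter's position is the digit it is encoded as: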
le.classes_
['a' 'b' 'c' 'f' 'g']
For a more readable representation, you can convert it to a Series:
pd.Series(le.classes_)
0    a
1    b
2    c
3    f
4    g
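If a lookup table is more convenient than an array, the same information can be turned into dictionaries. This is only a sketch; the fitted letters and the names `letter_to_code` and `code_to_letter` are illustrative assumptions:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["a", "c", "b", "f", "c", "a", "f", "g"])  # hypothetical letters

# classes_ is sorted, and each letter's position in it is its code.
letter_to_code = {str(letter): code for code, letter in enumerate(le.classes_)}
code_to_letter = {code: str(letter) for code, letter in enumerate(le.classes_)}

print(letter_to_code)
print(code_to_letter)
```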
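Timings
Using a larger version of the example data and the following setup:
import numpy as np
import pandas as pd
# df here is the original, ungrouped example data.
df2 = pd.concat([df]*10**5, ignore_index=True)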
def root(df):
    df = df.groupby(["ID_0", "ID_1"]).size().rename("count").reset_index()
    le = LabelEncoder()
    le.fit(df[["ID_0", "ID_1"]].values.flat)
    df[["ID_0", "ID_1"]] = df[["ID_0", "ID_1"]].apply(le.transform)
    return df

def pir2(df):
    unq = np.unique(df)
    mapping = pd.Series(np.arange(unq.size), unq)
    return (df.stack().map(mapping).unstack()
              .groupby(df.columns.tolist()).size().reset_index(name="count"))
I get the following timings:
%timeit root(df2)
10 loops, best of 3: 101 ms per loop
%timeit pir2(df2)
1 loops, best of 3: 1.69 s per loop
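A timing comparison only means something if both functions return the same result, so it may be worth checking that first. The sketch below restates the two functions so it can run on its own and compares them on a small hypothetical frame (`sample` is an assumption, not the data from the question). Here `root` uses `.to_numpy().ravel()` instead of `.values.flat`, which should be equivalent for this purpose.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def root(df):
    # Group on the letters first, then encode the much smaller result.
    df = df.groupby(["ID_0", "ID_1"]).size().rename("count").reset_index()
    le = LabelEncoder()
    le.fit(df[["ID_0", "ID_1"]].to_numpy().ravel())
    df[["ID_0", "ID_1"]] = df[["ID_0", "ID_1"]].apply(le.transform)
    return df

def pir2(df):
    # Encode every cell first, then group on the digits.
    unq = np.unique(df)
    mapping = pd.Series(np.arange(unq.size), unq)
    return (df.stack().map(mapping).unstack()
              .groupby(df.columns.tolist()).size().reset_index(name="count"))

# Hypothetical sample data with the same two ID columns.
sample = pd.DataFrame({
    "ID_0": ["a", "a", "b", "c", "c", "c", "f"],
    "ID_1": ["c", "c", "f", "a", "a", "a", "g"],
})

left = root(sample)
right = pir2(sample)

# Compare plain values so integer dtype differences do not matter.
same = left.to_numpy().tolist() == right.to_numpy().tolist()
print(same)
```

The two agree because LabelEncoder and np.unique both assign codes in sorted order, so grouping before or after encoding yields the same rows in the same order.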