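Training set: df_train; test set: df_test
Categorical feature set: cat_features = [a, b, c, d, e, f, g]
Strategy for "unseen labels": any value of a categorical feature in df_test that never appears in df_train is replaced with the least frequent value of the same feature in df_train.
Implementation: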
import pandas as pd
import numpy as np
df_train = pd.DataFrame([['a', 'b', 'a', 'a', 'a', 'a'],
                         ['female', 'male', 'male', 'male', 'female', 'male']])
df_train = df_train.transpose()
df_train.columns = ['type', 'gender']
df_test = pd.DataFrame([['b', 'c', 'a'], ['boy', 'female', 'female']])
df_test = df_test.transpose()
df_test.columns = df_train.columns
df_train
df_test
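def replace_lfq(col, train, test):
    # Count how often each label of `col` occurs in the training set
    freq = train[[col]].groupby(col).size()
    labels = freq.index
    # Label with the smallest count (the first one if there is a tie)
    least_value = freq.index[np.argmin(freq.values)]
    # ~ negates the mask: select test rows whose value is not among the
    # training labels, and overwrite them in place
    test.loc[~test[col].isin(labels), col] = least_value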
for i in df_test.columns:
    replace_lfq(i, df_train, df_test)
df_test
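
The same strategy can also be written so that it leaves the test set untouched and takes the list of categorical columns as an argument. The following is only a minimal sketch under a few assumptions: the function name replace_unseen_with_least_frequent is hypothetical (it is not part of the original note), value_counts() is used in place of groupby().size(), and missing values in the test column are treated as unseen labels because they never match a training label.

```python
import pandas as pd


def replace_unseen_with_least_frequent(train, test, cat_features):
    """Return a copy of `test` in which every value not seen in `train`
    is replaced by the least frequent training value of the same column.

    Hypothetical helper for illustration; not from the original note.
    """
    result = test.copy()
    for col in cat_features:
        # Frequency of each training label (NaN is ignored by default)
        counts = train[col].value_counts()
        # Label with the smallest count; on a tie, the first one found
        least_value = counts.idxmin()
        # Rows whose value never occurs in the training column
        unseen = ~result[col].isin(counts.index)
        result.loc[unseen, col] = least_value
    return result


train = pd.DataFrame({
    "type": ["a", "b", "a", "a", "a", "a"],
    "gender": ["female", "male", "male", "male", "female", "male"],
})
test = pd.DataFrame({
    "type": ["b", "c", "a"],
    "gender": ["boy", "female", "female"],
})

fixed = replace_unseen_with_least_frequent(train, test, ["type", "gender"])
print(fixed)
```

With the sample data above, the unseen type 'c' becomes 'b' (the rarest training type) and the unseen gender 'boy' becomes 'female' (the rarer training gender), so type ends up as b, b, a and gender as female in every row. Returning a copy keeps the original test frame available for comparison.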