FM: Principles and Implementation
1 What is FM?
The Factorization Machine (FM) is, first of all, a supervised learning model, used mainly for CTR prediction, and it is well suited to high-dimensional sparse data. Its key advantage is that it learns crossed (pairwise) features automatically, replacing much of the manual feature engineering.
In many settings FM is also used to produce an initial embedding; compared with an unsupervised embedding such as word2vec, embeddings learned by a supervised FM tend to work better for the downstream task.
2 The Mathematics of FM
A few highlights:
- The FM model can be evaluated in linear time, so it is simple and efficient.
- Because the data is mostly zeros, the pairwise interaction weights cannot be estimated directly in the usual way. FM instead factorizes them through a matrix of per-feature latent vectors, which makes the parameters estimable; and once the latent-vector matrix is learned, it also yields an embedding vector for each user and feature.
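Concretely, FM predicts ŷ(x) = w₀ + Σᵢ wᵢxᵢ + Σᵢ<ⱼ ⟨vᵢ, vⱼ⟩ xᵢxⱼ, where each feature i gets a latent vector vᵢ ∈ ℝᵏ. The linear-time claim rests on an algebraic identity for the interaction term; the sketch below (random data, shapes assumed for illustration) checks numerically that the naive O(kn²) pairwise sum matches the O(kn) factorized form:

```python
# Sketch of the FM linear-time trick:
# sum_{i<j} <v_i, v_j> x_i x_j
#   == 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ]
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3               # n features, latent dimension k (arbitrary choices)
x = rng.normal(size=n)    # one input row
V = rng.normal(size=(n, k))  # latent-vector matrix, one row per feature

# Naive O(k n^2): explicit double loop over feature pairs
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Factorized O(k n): square-of-sum minus sum-of-squares, per latent dim
fast = 0.5 * (((V.T @ x) ** 2).sum() - ((V ** 2).T @ (x ** 2)).sum())

print(np.isclose(naive, fast))  # -> True
```

This identity is why FM training and scoring stay cheap even with a huge number of potential feature crosses.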
3 An Example of FM Feature Encoding
Note that the features are not limited to categorical ones; numeric features work just as well. The dataset must not contain missing values, however, or the conversion code below will raise an error.
Also, before the data goes into the model, it must first be converted into the libSVM-style "label index:value" format.
FFM extends FM by attaching a Field to each feature, splitting the feature space at a finer granularity.
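As a minimal sketch of that line format (the feature indices and values here are hypothetical), numeric features keep their raw value while a categorical feature becomes a one-hot index with value 1:

```python
# Build one libSVM-style line: "label index:value index:value ..."
def to_fm_line(label, pairs):
    # pairs: list of (feature_index, value) tuples
    return str(label) + " " + " ".join(f"{i}:{v}" for i, v in pairs)

# e.g. two numeric features (indices 0 and 1) and one one-hot category (index 4)
line = to_fm_line(1, [(0, 5849), (1, 1.0), (4, 1)])
print(line)  # -> "1 0:5849 1:1.0 4:1"
```

For FFM the same idea extends to "label field:index:value" triples, one field per original column.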
4 Implementing FM
4.1 Data Preparation
The data comes from a Kaggle competition dataset; see the linked page to download it.
```python
import pandas as pd
import xlearn as xl
from sklearn.model_selection import train_test_split
import json
import warnings
warnings.filterwarnings('ignore')

def convert_to_fm(df, type, numerics, categories, features, y_name):
    currentcode = len(numerics)
    catdict = {}
    catcodes = {}
    # Flag each column as numerical (0) or categorical (1)
    for x in numerics:
        catdict[x] = 0
    for x in categories:
        catdict[x] = 1
    nrows = df.shape[0]
    ncolumns = len(features)
    with open('./data/' + str(type) + "_fm.txt", "w") as text_file:
        # Loop over rows, converting each one to libSVM format
        for n, r in enumerate(range(nrows)):
            datastring = ""
            datarow = df.iloc[r].to_dict()
            datastring += str(int(datarow[y_name]))  # target variable goes first
            for i, x in enumerate(catdict.keys()):
                if catdict[x] == 0:
                    # Numerical field: index is the column position, value is raw
                    datastring = datastring + " " + str(i) + ":" + str(datarow[x])
                else:
                    # Categorical column seen for the first time
                    if x not in catcodes:
                        catcodes[x] = {}
                        currentcode += 1
                        catcodes[x][datarow[x]] = currentcode  # encode the value
                    # Known column, new value
                    elif datarow[x] not in catcodes[x]:
                        currentcode += 1
                        catcodes[x][datarow[x]] = currentcode  # encode the value
                    code = catcodes[x][datarow[x]]
                    # One-hot: a categorical feature always takes the value 1
                    datastring = datastring + " " + str(int(code)) + ":1"
            datastring += '\n'
            text_file.write(datastring)
    return catdict, catcodes
```
```python
# Read the data
df = pd.read_csv('train_u6lujuX_CVtuZ9i.csv')
print(df.shape)
df.head()

# Map the target to 0/1
Loan_Status_dic = {'Y': 1, 'N': 0}
df['Loan_Status'] = df['Loan_Status'].map(Loan_Status_dic)
df = df.fillna(0)

# Train/test split
X_train, X_test = train_test_split(df, test_size=0.3, random_state=5)
print(X_train.shape)
print(X_test.shape)

# Convert the training set
catdict, catcodes = convert_to_fm(
    X_train, type='Train',
    numerics=['ApplicantIncome', 'Credit_History', 'CoapplicantIncome'],
    categories=['Education', 'Property_Area'],
    features=['Education', 'ApplicantIncome', 'Credit_History',
              'Property_Area', 'CoapplicantIncome'],
    y_name='Loan_Status')

# Save the column names
all_cols = []
for k, v in catdict.items():
    all_cols.append(k)
df_all_cols = pd.DataFrame({'cols': all_cols})
df_all_cols.to_csv('df_all_cols.csv', index=False, encoding='gbk')

# Dump catcodes to JSON
with open("categ_dic.json", "w") as f:
    f.write(json.dumps(catcodes))

# Convert the test set. Caveat: this call uses a reduced feature list and
# builds a fresh encoding, so its indices will not line up with the training
# file; strictly, the test set should reuse the training features and catcodes.
convert_to_fm(X_test, type='Test',
              numerics=['ApplicantIncome', 'Credit_History'],
              categories=['Education'],
              features=['Education', 'ApplicantIncome', 'Credit_History'],
              y_name='Loan_Status')
```
```
(614, 13)
(429, 13)
(185, 13)
({'ApplicantIncome': 0, 'Credit_History': 0, 'Education': 1},
 {'Education': {'Not Graduate': 3, 'Graduate': 4}})
```
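The saved catcodes dictionary is what lets new rows be encoded consistently at inference time. A sketch of that lookup (the Education mapping is taken from the printed output above; the row values are hypothetical):

```python
# Apply a saved categorical encoding (as stored in categ_dic.json) to one row.
# catcodes maps column -> {raw value -> libSVM index}.
catcodes = {'Education': {'Not Graduate': 3, 'Graduate': 4}}

def encode_row(row, numerics, catcodes):
    pairs = []
    # Numeric columns keep their position as the index and their raw value
    for i, col in enumerate(numerics):
        pairs.append((i, row[col]))
    # Categorical columns look up the index learned at training time
    for col, mapping in catcodes.items():
        pairs.append((mapping[row[col]], 1))
    return pairs

row = {'ApplicantIncome': 5849, 'Credit_History': 1.0, 'Education': 'Graduate'}
print(encode_row(row, ['ApplicantIncome', 'Credit_History'], catcodes))
# -> [(0, 5849), (1, 1.0), (4, 1)]
```

A value unseen at training time would raise a KeyError here; handling that (e.g. an "unknown" bucket) is left as a design choice.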
4.2 The Main Code
4.2.1 Installing xlearn
Download the xlearn wheel (.whl) file from the official site, then install it with pip install.
4.2.2 Fitting the Model

```python
import xlearn as xl

fm_model = xl.create_fm()
fm_model.setTrain('./data/Train_fm.txt')

# Hyperparameters
param = {'task': 'binary',
         'lr': 0.1,
         'lambda': 0.02,
         'metric': 'auc',
         'epoch': 20}

# Fit the model
fm_model.fit(param, './cache_file/model.out')
```
4.2.3 Predicting and Computing KS

```python
# Training set
fm_model.setSigmoid()
fm_model.setTest('./data/Train_fm.txt')
fm_model.predict('./cache_file/model.out', './result/output_train.txt')

# Test set
fm_model.setSigmoid()
fm_model.setTest('./data/Test_fm.txt')
fm_model.predict('./cache_file/model.out', './result/output_test.txt')
```

```python
from sklearn import metrics
import pandas as pd

y_name = 'Loan_Status'

# KS on the training set
y_train_pred = pd.read_table('./result/output_train.txt', header=None, names=['v'])
y_train_pred = y_train_pred['v'].tolist()
X_train[y_name] = X_train[y_name].map(int)
y_train = X_train[y_name].tolist()
fpr2, tpr2, thresholds = metrics.roc_curve(y_train, y_train_pred)
ks_val = max(tpr2 - fpr2)
print('[Train] KS: %.4f' % ks_val)

# KS on the test set
y_test_pred = pd.read_table('./result/output_test.txt', header=None, names=['v'])
y_test_pred = y_test_pred['v'].tolist()
X_test[y_name] = X_test[y_name].map(int)
y_test = X_test[y_name].tolist()
fpr2, tpr2, thresholds = metrics.roc_curve(y_test, y_test_pred)
ks_val = max(tpr2 - fpr2)
print('[Test] KS: %.4f' % ks_val)
```
```
[Train] KS: 0.1165
[Test] KS: 0.0000
```

A test KS of exactly zero is a red flag: the test file was converted with a different feature list and a fresh encoding, so its feature indices do not match the ones the model was trained on.
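The KS statistic computed above is just the maximum gap between the TPR and FPR curves, i.e. max(TPR − FPR) over all thresholds. A self-contained sketch (with made-up labels and scores) to make that concrete:

```python
# KS statistic: max separation between the cumulative score distributions
# of positives and negatives, equivalently max(TPR - FPR) over thresholds.
def ks_stat(y_true, y_score):
    pairs = sorted(zip(y_score, y_true), reverse=True)  # highest score first
    pos = sum(y_true)
    neg = len(y_true) - pos
    tp = fp = 0
    best = 0.0
    for score, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        best = max(best, tp / pos - fp / neg)
    return best

y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print('KS = %.4f' % ks_stat(y_true, y_score))  # -> KS = 0.6667
```

This per-item sweep assumes distinct scores; with heavy ties, grouping by threshold first (as `roc_curve` does) is the safer route.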
5 Strengths and Weaknesses of FM
Strengths:
- It learns pairwise feature interactions automatically, avoiding manual feature engineering.
- The model is compact: explicit cross features need not be fed in, so the resulting model is much smaller than a comparable LR model.
- It scores quickly: online, no cross features need to be assembled, so serving speed is roughly on par with LR.
- Both categorical and numeric features can be used.
- By introducing latent (auxiliary) vectors, it solves the feature-interaction problem on sparse data.
Weaknesses:
- It cannot model interactions among three or more features, so some manual selection of cross features is still unavoidable.
- It may underperform deep networks.