Wide&Deep: Principles and Practice

Latest recommended article published on 2024-07-07 21:59:24

卓玛cug

Category: Recommender Systems

Article link: https://blog.csdn.net/qq_29153321/article/details/104040794

Background

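Recommendation algorithms can be grouped by the data they use, for example into user-behavior-based and content-based approaches. Mainstream recommender algorithms fall into three families: collaborative filtering recommendation, content-based recommendation, and hybrid recommendation. A hybrid recommender typically combines several strategies, such as UserCF, ItemCF, popularity-based, recency-based, reading-history-based, and user-interest-based recommendation.
Common ranking methods include GBRT+LR, Wide&Deep, DeepFM, and YouTube's deep recommendation model (listed roughly in order of development). The YouTube approach currently gets the most attention, but few teams have managed to apply it in practice with good results, whereas Wide&Deep and DeepFM are reported to be fairly mature in production. This article therefore focuses on applying Wide&Deep.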

Principles

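The Wide&Deep recommendation algorithm comes from the paper "Wide & Deep Learning for Recommender Systems",
which proposes the W&D model to combine the memorization ability of a wide model with the generalization ability of a deep model. In essence it is LR + DNN, trained jointly.
Memorization applies cross-product transformations to the raw features, a non-linear transformation whose input is a high-dimensional sparse vector. A large number of cross-product features lets the model "memorize" feature interactions. This is efficient and interpretable, but generalizing requires more feature engineering.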
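
To make the cross-product idea concrete, here is a minimal sketch in plain Python. The feature names and values (`gender`, `education`) and the helper functions are illustrative assumptions, not part of the referenced project. A crossed feature is 1 only when all of its constituent binary features are 1, which is how the wide part can memorize a specific co-occurrence.

```python
def one_hot(name, value, vocabulary):
    """Return binary features such as {'gender=female': 1, 'gender=male': 0}."""
    return {"{}={}".format(name, v): int(v == value) for v in vocabulary}


def cross_product(features, crossed_names):
    """Logical AND of the named binary features: 1 only if every one is 1."""
    return int(all(features[name] == 1 for name in crossed_names))


# Hypothetical toy sample (names and vocabularies are assumptions).
features = {}
features.update(one_hot("gender", "female", ["female", "male"]))
features.update(one_hot("education", "Masters", ["Bachelors", "Masters"]))

# AND(gender=female, education=Masters) matches this sample.
both = cross_product(features, ["gender=female", "education=Masters"])
# AND(gender=female, education=Bachelors) does not.
other = cross_product(features, ["gender=female", "education=Bachelors"])
```

Each crossed feature gets its own weight in the linear model, so the number of weights grows quickly with the number of crosses; that is the feature-engineering cost mentioned above.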

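Generalization needs only a small amount of feature engineering. A deep neural network uses embeddings to turn sparse features into low-dimensional dense inputs, so it generalizes better to feature combinations that never appeared in the training samples. However, when the user-item interaction matrix is sparse and high-rank, the model tends to over-generalize and recommend items with poor relevance.
Reference: https://blog.csdn.net/zhangbaoanhadoop/article/details/81608947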
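
The way the two parts combine can be sketched as follows. In the paper, the wide logit and the deep logit are summed and passed through a sigmoid: P(Y=1|x) = sigmoid(w_wide^T [x, phi(x)] + w_deep^T a + b), where phi(x) are the cross-product features and a is the last hidden activation. The sketch below is a toy forward pass in plain Python under stated assumptions: the sizes, the random weights, the wide weights, and the feature names are all made up for illustration, and nothing is trained.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

EMBED_DIM = 4   # assumed embedding size for this sketch
HIDDEN_DIM = 3  # assumed hidden-layer width for this sketch


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rand_vec(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]


# Deep part: an embedding table for one categorical feature (toy vocabulary).
embeddings = {"Sales": rand_vec(EMBED_DIM), "Tech-support": rand_vec(EMBED_DIM)}
w_hidden = [rand_vec(EMBED_DIM) for _ in range(HIDDEN_DIM)]  # HIDDEN_DIM x EMBED_DIM
w_deep = rand_vec(HIDDEN_DIM)

# Wide part: one weight per sparse feature, crossed features included.
w_wide = {"occupation=Sales": 0.3, "occupation=Sales&education=Masters": 0.8}
bias = -0.1


def deep_logit(occupation):
    """Embedding lookup -> one ReLU hidden layer -> scalar logit."""
    x = embeddings[occupation]
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w_hidden]
    return sum(w * h for w, h in zip(w_deep, hidden))


def wide_logit(active_features):
    """Sum the weights of the sparse features that are active (equal to 1)."""
    return sum(w_wide.get(name, 0.0) for name in active_features)


def predict(occupation, active_features):
    """Joint prediction: sigmoid of wide logit + deep logit + bias."""
    return sigmoid(wide_logit(active_features) + deep_logit(occupation) + bias)


p = predict("Sales", ["occupation=Sales", "occupation=Sales&education=Masters"])
```

Because both logits feed a single sigmoid, one loss drives the gradients of both parts at once, which is what distinguishes joint training from ensembling two separately trained models.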

Practice

This walkthrough follows https://github.com/tensorflow/models/tree/master/official/r1/wide_deep
I have to say, this Google project is a real pleasure to use.

1. Dataset preparation
Census Income Data Set

python census_dataset.py

The data is downloaded to /tmp/census_data by default; use --data_dir to set a different path.
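
As a rough illustration of what one record of this dataset looks like once parsed, the sketch below reads a single CSV line into a feature dict and a binary label using only the standard library. The column order is assumed to match the Census Income (Adult) data file, the helper `parse_row` is hypothetical, and the sample line is only an illustrative row in the dataset's format; the referenced project does its own parsing with TensorFlow.

```python
import csv
import io

# Column order assumed to match the Census Income (Adult) data file.
CSV_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "gender",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income_bracket",
]
NUMERIC_COLUMNS = {
    "age", "fnlwgt", "education_num",
    "capital_gain", "capital_loss", "hours_per_week",
}


def parse_row(line):
    """Parse one CSV line into (features, label); label is True for >50K."""
    # skipinitialspace drops the blank that follows each comma in the raw file.
    values = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    row = dict(zip(CSV_COLUMNS, values))
    features = {
        key: (float(value) if key in NUMERIC_COLUMNS else value)
        for key, value in row.items()
    }
    # startswith tolerates a trailing "." after the income bracket.
    label = features.pop("income_bracket").startswith(">50K")
    return features, label


# Illustrative row in the dataset's format.
sample = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
          "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
features, label = parse_row(sample)
```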

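Feature processing:
Categorical features are handled in one of two ways:
All distinct values are known and there are only a few of them: tf.feature_column.categorical_column_with_vocabulary_list
The distinct values are not all known, or there are very many of them: tf.feature_column.categorical_column_with_hash_bucket
Raw continuous features: tf.feature_column.numeric_column
Continuous features discretized into buckets by boundary values: tf.feature_column.bucketized_column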

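import tensorflow as tf

# Number of buckets for hashed categorical features
# (assumed here to be 1000, as in the referenced example).
_HASH_BUCKET_SIZE = 1000

# Continuous features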
age = tf.feature_column.numeric_column('age')
education_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')

# Categorical features with a known, small vocabulary
education = tf.feature_column.categorical_column_with_vocabulary_list(
    'education', [
        'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
        'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
        '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])

marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    'marital_status', [
        'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',
        'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])

relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    'relationship', [
        'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',
        'Other-relative'])

workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    'workclass', [
        'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',
        'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])

# Categorical feature mapped to IDs by hash bucketing
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    'occupation', hash_bucket_size=_HASH_BUCKET_SIZE)

# Feature transformation: discretize age into buckets by boundary values
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
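
To show what these three kinds of columns compute, here is a plain-Python emulation of their ID mappings. It is a sketch under assumptions, not TensorFlow's implementation: the helper names are hypothetical, and the hash function is MD5 as a stand-in, so the bucket IDs will not match the ones TensorFlow produces. The bucket convention (a value equal to a boundary falls into the bucket to its right) follows the documented behavior of `bucketized_column`.

```python
import bisect
import hashlib


def vocabulary_id(value, vocabulary, default_value=-1):
    """Index of value in a known vocabulary; default_value if it is unseen."""
    try:
        return vocabulary.index(value)
    except ValueError:
        return default_value


def hash_bucket_id(value, hash_bucket_size):
    """Map any string to [0, hash_bucket_size). MD5 is a stand-in hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


def bucketize(value, boundaries):
    """Bucket index for buckets (-inf, b0), [b0, b1), ..., [bn, +inf)."""
    return bisect.bisect_right(boundaries, value)


RELATIONSHIP_VOCAB = ['Husband', 'Not-in-family', 'Wife', 'Own-child',
                      'Unmarried', 'Other-relative']
AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

wife_id = vocabulary_id('Wife', RELATIONSHIP_VOCAB)
unseen_id = vocabulary_id('Cousin', RELATIONSHIP_VOCAB)  # not in the vocabulary
occupation_bucket = hash_bucket_id('Sales', 1000)
age_buckets = [bucketize(age, AGE_BOUNDARIES) for age in (17, 18, 24, 25, 70)]
```

Hashing trades exactness for convenience: distinct values can collide in the same bucket, so the bucket count should be chosen well above the expected number of distinct values.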

2. Training

python census_main.py

The model is saved to /tmp/census_model by default; use --model_dir to set a different path.

3. Visualization

tensorboard --logdir=/tmp/census_model

4. Prediction

python census_main.py --export_dir /tmp/wide_deep_saved_model

With --export_dir set, the training run also exports the model in TensorFlow SavedModel format.

Running prediction on Linux:

saved_model_cli run --dir /tmp/wide_deep_saved_model/${TIMESTAMP}/ \
--tag_set serve --signature_def="predict" \
--input_examples='examples=[{"age":[46.], "education_num":[10.], "capital_gain":[7688.], "capital_loss":[0.], "hours_per_week":[38.]}, {"age":[24.], "education_num":[13.], "capital_gain":[0.], "capital_loss":[0.], "hours_per_week":[50.]}]'

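Running prediction on Windows:
The Windows command prompt treats single quotes as part of the input rather than as quoting characters, so swap the quotes: wrap the whole argument in double quotes and use single quotes inside it.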

saved_model_cli run --dir "/My Directory/" --tag_set serve --signature_def="predict" --input_examples="examples=[{'age':[46.], 'education_num':[10.], 'capital_gain':[7688.], 'capital_loss':[0.], 'hours_per_week':[38.]}, {'age':[24.], 'education_num':[13.], 'capital_gain':[0.], 'capital_loss':[0.], 'hours_per_week':[50.]}]"
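
The quote swapping between the two shells is easy to get wrong by hand, so one option is to generate the `--input_examples` argument from a Python list. The helper below is a hypothetical convenience, not part of `saved_model_cli`; it assumes the feature values are numeric lists (no strings that themselves contain quotes) and only builds the argument text.

```python
import json


def build_input_examples(examples, key="examples", windows=False):
    """Return a quoted --input_examples argument for the given shell."""
    payload = json.dumps(examples)  # JSON uses double quotes inside
    if windows:
        # cmd.exe: double quotes outside, single quotes inside.
        return '"{}={}"'.format(key, payload.replace('"', "'"))
    # POSIX shells: single quotes outside, double quotes inside.
    return "'{}={}'".format(key, payload)


examples = [
    {"age": [46.0], "education_num": [10.0], "capital_gain": [7688.0],
     "capital_loss": [0.0], "hours_per_week": [38.0]},
    {"age": [24.0], "education_num": [13.0], "capital_gain": [0.0],
     "capital_loss": [0.0], "hours_per_week": [50.0]},
]

linux_arg = build_input_examples(examples)
windows_arg = build_input_examples(examples, windows=True)
```

Printing `linux_arg` or `windows_arg` gives the text to paste after `--input_examples=` in the corresponding command above.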