TensorFlow Feature Columns Explained

mingchen_peng

Published 2024-08-22 14:55:56

Article link: https://blog.csdn.net/mingchen_peng/article/details/141427806

This post summarizes how to use the following feature column types:
- tf.feature_column.numeric_column
- tf.feature_column.bucketized_column
- tf.feature_column.categorical_column_with_hash_bucket
- tf.feature_column.embedding_column
- tf.feature_column.shared_embedding_columns
- tf.feature_column.indicator_column
- tf.feature_column.crossed_column

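The sections below describe each feature column in detail: what it does, where it fits, sample code, and which side of a Wide & Deep model it is applied to.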

1. `tf.feature_column.numeric_column`

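Purpose: represents a numeric feature; suited to continuous values.
Side: usually the Deep side.
Use cases:
- Example 1: floor area in house price prediction (e.g., square feet).
- Example 2: browsing time in user behavior analysis (e.g., seconds).

Sample code: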

import tensorflow as tf

area = tf.feature_column.numeric_column("area")  # floor area
browse_time = tf.feature_column.numeric_column("browse_time")  # browsing time

2. `tf.feature_column.bucketized_column`

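Purpose: buckets (discretizes) a numeric feature, mapping continuous values to discrete intervals.
Side: can be used on both the Wide side and the Deep side.
Use cases:
- Example 1: splitting user age into ranges (e.g., 18-24, 25-34).
- Example 2: splitting income into bands (e.g., low, middle, high income).

Sample code: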

age = tf.feature_column.numeric_column("age")
# 6 boundaries produce 7 buckets: (-inf, 18), [18, 25), [25, 35), ..., [65, +inf)
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[18, 25, 35, 45, 55, 65])  # age ranges
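
To make the bucketing rule concrete, here is a minimal plain-Python sketch of what the column computes for a single value. It is not TensorFlow's implementation; the helper names `bucket_index` and `one_hot` are made up for illustration. Each bucket includes its lower boundary and excludes its upper one, so an age of exactly 25 falls into the [25, 35) bucket.

```python
from bisect import bisect_right

AGE_BOUNDARIES = [18, 25, 35, 45, 55, 65]  # same boundaries as the example above


def bucket_index(value, boundaries):
    """Return the index of the half-open bucket [lower, upper) that holds value."""
    # bisect_right counts how many boundaries are <= value,
    # which is exactly the bucket index.
    return bisect_right(boundaries, value)


def one_hot(index, depth):
    """Return a list of length depth with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


num_buckets = len(AGE_BOUNDARIES) + 1  # 6 boundaries -> 7 buckets
for age in (17, 18, 25, 70):
    idx = bucket_index(age, AGE_BOUNDARIES)
    print(age, idx, one_hot(idx, num_buckets))
```

The one-hot step mirrors how the bucket index is typically presented to a model: as a categorical value rather than a raw number.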

3. `tf.feature_column.categorical_column_with_hash_bucket`

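Purpose: represents a high-cardinality categorical feature by hashing each value into a fixed number of buckets. Different values can land in the same bucket, so a larger hash_bucket_size means fewer collisions.
Side: usually the Wide side, where it can be used directly. To use it on the Deep side, wrap it in an embedding column first (see section 4).
Use cases:
- Example 1: user IDs (e.g., "user_12345").
- Example 2: product IDs (e.g., "product_67890").

Sample code: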

user_id = tf.feature_column.categorical_column_with_hash_bucket("user_id", hash_bucket_size=10000)  # user ID
product_id = tf.feature_column.categorical_column_with_hash_bucket("product_id", hash_bucket_size=5000)  # product ID
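
The idea of hashing into buckets can be sketched in plain Python as follows. This is only an illustration: TensorFlow uses its own fingerprint hash, so the bucket numbers it produces will differ from these. SHA-256 is an assumption chosen here because it gives the same result on every run, and `hash_bucket` is a made-up helper name.

```python
import hashlib


def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket id in [0, hash_bucket_size)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then wrap into range.
    return int.from_bytes(digest[:8], "big") % hash_bucket_size


# 20000 distinct ids cannot fit into 10000 buckets without collisions.
user_ids = ["user_%d" % i for i in range(20000)]
buckets = [hash_bucket(u, 10000) for u in user_ids]
distinct = len(set(buckets))

print("bucket of user_12345:", hash_bucket("user_12345", 10000))
print("distinct buckets used:", distinct, "for", len(user_ids), "ids")
```

No vocabulary has to be built in advance, which is the main appeal for ID-like features; the price is that colliding IDs become indistinguishable to the model.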

4. `tf.feature_column.embedding_column`

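Purpose: maps a categorical feature to a low-dimensional dense vector (an embedding) that is learned during training; commonly used in deep learning models.
Side: usually the Deep side.
Use cases:
- Example 1: user embeddings in a recommender system.
- Example 2: word embeddings in natural language processing.

Sample code: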

user_id = tf.feature_column.categorical_column_with_hash_bucket("user_id", hash_bucket_size=10000)
user_id_embedding = tf.feature_column.embedding_column(user_id, dimension=16)  # user ID embedding
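
Conceptually, an embedding column is a lookup table with one row per bucket. The plain-Python sketch below shows only the lookup step, under a few assumptions: the table is filled with random values and never trained, the bucket count is reduced to 1000 to keep it fast, and multiple ids are combined by averaging. The names `make_embedding_table` and `embed` are hypothetical.

```python
import random

NUM_BUCKETS = 1000  # smaller than the 10000 used above, to keep the sketch fast
DIMENSION = 16


def make_embedding_table(num_buckets, dimension, seed=0):
    """Create a num_buckets x dimension table of small random values."""
    rng = random.Random(seed)
    stddev = 1.0 / dimension ** 0.5
    return [[rng.gauss(0.0, stddev) for _ in range(dimension)]
            for _ in range(num_buckets)]


def embed(bucket_ids, table):
    """Look up one row per id and average the rows element-wise."""
    rows = [table[i] for i in bucket_ids]
    return [sum(column) / len(rows) for column in zip(*rows)]


table = make_embedding_table(NUM_BUCKETS, DIMENSION)
single = embed([42], table)       # one id: the result is simply row 42
averaged = embed([42, 7], table)  # two ids: element-wise mean of rows 42 and 7
print(len(single), len(averaged))
```

In a real model the rows of the table are trainable weights, so ids that behave similarly end up with similar vectors.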

5. `tf.feature_column.shared_embedding_columns`

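Purpose: lets several categorical features share a single embedding space; suited to features whose values have similar semantics. All input columns must have the same number of buckets, because they index the same embedding table.
Side: usually the Deep side.
Use cases:
- Example 1: word embeddings in multilingual text processing.
- Example 2: a joint embedding for users and items in a recommender system.

Sample code: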

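# shared_embedding_columns requires every input column to have the same number of buckets.
user_id = tf.feature_column.categorical_column_with_hash_bucket("user_id", hash_bucket_size=10000)
product_id = tf.feature_column.categorical_column_with_hash_bucket("product_id", hash_bucket_size=10000)
# Returns a list with one embedding column per input, all backed by the same table.
shared_embedding = tf.feature_column.shared_embedding_columns([user_id, product_id], dimension=16)  # shared embedding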
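
What "sharing" means can be shown with a small plain-Python sketch: two features look up rows in one table instead of each owning a table. Everything here is illustrative. The feature names `clicked_item` and `purchased_item`, the helper names, and the tiny table size are assumptions, and the table holds untrained random values.

```python
import random


def make_table(num_buckets, dimension, seed=0):
    """Create a num_buckets x dimension table of small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dimension)]
            for _ in range(num_buckets)]


def lookup(feature_bucket_ids, table):
    """Map each feature name to the table row selected by its bucket id."""
    return {name: table[bucket] for name, bucket in feature_bucket_ids.items()}


NUM_BUCKETS, DIMENSION = 100, 4
shared_table = make_table(NUM_BUCKETS, DIMENSION)

# Two hypothetical features that happen to carry the same bucket id.
example = {"clicked_item": 42, "purchased_item": 42}
vectors = lookup(example, shared_table)
same_vector = vectors["clicked_item"] == vectors["purchased_item"]

# One shared table versus one table per feature.
shared_params = NUM_BUCKETS * DIMENSION
separate_params = 2 * NUM_BUCKETS * DIMENSION
print(same_vector, shared_params, separate_params)
```

Because both features read from the same rows, a bucket id means the same thing wherever it appears, and the number of embedding parameters does not grow with the number of features.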

6. `tf.feature_column.indicator_column`

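Purpose: converts a categorical feature into a dense one-hot vector (multi-hot when a feature has several values) that marks which categories are present.
Side: usually the Deep side, since a DNN needs dense input and cannot consume a categorical column directly. On the Wide side, the underlying categorical column can generally be passed to the linear model as is.
Use cases:
- Example 1: low-cardinality categorical features such as gender ("male", "female").
- Example 2: a color feature ("red", "green", "blue").

Sample code: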

gender = tf.feature_column.categorical_column_with_vocabulary_list("gender", ["male", "female"])  # gender
gender_indicator = tf.feature_column.indicator_column(gender)  # one-hot encoding of gender
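
The encoding itself is simple enough to write out by hand. The sketch below is a plain-Python illustration, not TensorFlow code; `indicator` is a made-up helper, and treating a value outside the vocabulary as contributing nothing is an assumption of this sketch.

```python
def indicator(values, vocabulary):
    """Multi-hot encode values: one slot per vocabulary entry, counting occurrences."""
    index = {token: i for i, token in enumerate(vocabulary)}
    encoded = [0.0] * len(vocabulary)
    for value in values:
        if value in index:  # values outside the vocabulary are ignored here
            encoded[index[value]] += 1.0
    return encoded


GENDERS = ["male", "female"]
COLORS = ["red", "green", "blue"]

print(indicator(["female"], GENDERS))      # single value -> one-hot
print(indicator(["red", "blue"], COLORS))  # several values -> multi-hot
print(indicator(["purple"], COLORS))       # unknown value -> all zeros
```

The vector length equals the vocabulary size, which is why this encoding is a good fit only for low-cardinality features; for something like user IDs, an embedding is the usual choice.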

7. `tf.feature_column.crossed_column`

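Purpose: creates a crossed feature that captures the interaction between several features. The combined value is hashed into hash_bucket_size buckets.
Side: usually the Wide side.
Use cases:
- Example 1: capturing the relationship between user features and product features (e.g., the user's location and the product purchased).
- Example 2: combining location and time of day (e.g., "Beijing-morning").

Sample code: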

location = tf.feature_column.categorical_column_with_hash_bucket("location", hash_bucket_size=1000)  # location, usable on its own
time_of_day = tf.feature_column.categorical_column_with_vocabulary_list("time_of_day", ["morning", "afternoon", "evening", "night"])  # time of day
# crossed_column rejects hashed categorical columns as keys, so the raw feature name "location" is passed instead.
location_time_cross = tf.feature_column.crossed_column(["location", time_of_day], hash_bucket_size=10000)  # crossed feature
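
As a rough picture of what a cross does, the plain-Python sketch below forms every combination of the input values and hashes each combination into a bucket. It is not TensorFlow's algorithm: the `_X_` separator, the SHA-256 hash, and the helper names are all assumptions made for illustration, so the bucket ids will not match what TensorFlow produces.

```python
import hashlib
from itertools import product


def stable_hash(text):
    """Return a non-negative integer hash that is the same on every run."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def cross(feature_values, hash_bucket_size):
    """Hash every combination of values, taking one value from each feature."""
    return [stable_hash("_X_".join(combo)) % hash_bucket_size
            for combo in product(*feature_values)]


# One value per feature -> one crossed bucket.
single = cross([["beijing"], ["morning"]], 10000)
# Two values per feature -> 2 x 2 = 4 crossed buckets.
multi = cross([["beijing", "shanghai"], ["morning", "night"]], 10000)
print(single, multi)
```

The Wide side then learns one weight per crossed bucket, which lets a linear model memorize that a specific combination such as "beijing" in the "morning" matters, something it cannot express from the two features separately.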