tf.feature_column API
A quick pass over the official API, organized so it can be read in one sitting. The content is largely taken from TensorFlow's official API docs; see those docs for more detail.
TensorFlow's feature-processing API
TensorFlow provides the feature_column API for handling features. It covers essentially every feature type we use in practice; each one is briefly listed below.
bucketized_column
tf.feature_column.bucketized_column(
source_column,
boundaries
)
Buckets a continuous feature into discrete intervals according to boundaries, which amounts to manually choosing the cut points for discretization. For example:
boundaries = [0, 10, 100]
input tensor = [[-5, 10000],
                [150, 10],
                [5, 100]]
output = [[0, 3],
          [3, 2],
          [1, 3]]
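A minimal usage sketch (the feature name 'age' and the boundaries are hypothetical); note that source_column must itself be a numeric_column:
import tensorflow as tf
age = tf.feature_column.numeric_column('age')
# boundaries [18, 35, 60] yield 4 buckets: (-inf, 18), [18, 35), [35, 60), [60, +inf)
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[18, 35, 60])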
categorical_column_with_hash_bucket
Discretizes a categorical feature by hashing into buckets. For example, user ids can run into the hundreds of millions; materializing every id would put a heavy load on the model, so we hash them into a fixed number of buckets instead. The trade-off is a potentially large number of hash collisions.
tf.feature_column.categorical_column_with_hash_bucket(
key,
hash_bucket_size,  # output_id = Hash(input_feature_string) % bucket_size
dtype=tf.string  # must be string or integer; for missing values, strings are hashed as '' and ints return -1
)
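A minimal sketch (the feature name 'user_id' and the bucket size are hypothetical); hashed columns are usually fed into the model via an embedding or indicator column:
user_id = tf.feature_column.categorical_column_with_hash_bucket(
    'user_id', hash_bucket_size=100000, dtype=tf.string)
user_id_emb = tf.feature_column.embedding_column(user_id, dimension=16)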
categorical_column_with_identity
A categorical feature whose values are already integer ids.
tf.feature_column.categorical_column_with_identity(
key,
num_buckets,
default_value=None
)
num_buckets must be specified here, and inputs must fall within range(0, num_buckets); any value outside that range is replaced by default_value. This is useful when we have already encoded the feature ourselves as consecutive integer ids.
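A minimal sketch, assuming a hypothetical 'city_id' feature already encoded as integers in [0, 100):
city_id = tf.feature_column.categorical_column_with_identity(
    'city_id', num_buckets=100, default_value=0)  # out-of-range ids fall back to 0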
categorical_column_with_vocabulary_file
Encodes a categorical feature using a vocabulary stored in a given file.
tf.feature_column.categorical_column_with_vocabulary_file(
key,
vocabulary_file,
vocabulary_size=None,
num_oov_buckets=0,
default_value=None,
dtype=tf.string
)
We are unlikely to use this one, so it is only touched on briefly; a sketch follows for completeness.
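A minimal sketch, assuming a hypothetical vocabulary file with one value per line; out-of-vocabulary values go into the extra OOV bucket:
state = tf.feature_column.categorical_column_with_vocabulary_file(
    'state', vocabulary_file='/path/to/states.txt', num_oov_buckets=1)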
categorical_column_with_vocabulary_list
tf.feature_column.categorical_column_with_vocabulary_list(
key,
vocabulary_list,
dtype=None,
default_value=-1,
num_oov_buckets=0
)
The previous column maps inputs with a vocabulary file; this one uses a vocabulary list instead, i.e. every possible value is loaded into memory (e.g. gender). It is also not used very often; a small sketch is below.
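A minimal sketch, assuming a hypothetical 'gender' feature with a small fixed vocabulary:
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    'gender', vocabulary_list=['male', 'female', 'unknown'])
gender_onehot = tf.feature_column.indicator_column(gender)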
crossed_column
Next is the crossed column, which crosses the categorical columns above to produce a new feature column.
tf.feature_column.crossed_column(
keys,
hash_bucket_size,
hash_key=None
)
Crossed columns are implemented by hashing the crossed values into buckets.
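A minimal sketch crossing two hypothetical features, a raw string key 'city' and a bucketized 'age' column; keys may be raw feature names or existing categorical/bucketized columns:
age = tf.feature_column.numeric_column('age')
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[18, 35, 60])
age_x_city = tf.feature_column.crossed_column(
    ['city', age_buckets], hash_bucket_size=10000)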
embedding_column
This is the embedding approach used very widely in neural networks to handle categorical features.
tf.feature_column.embedding_column(
categorical_column,  # the categorical column to embed
dimension,  # embedding dimension
combiner='mean',  # pooling method for multi-valued inputs ('mean', 'sqrtn' or 'sum')
initializer=None,  # defaults to tf.truncated_normal_initializer with mean 0.0 and standard deviation 1/sqrt(dimension)
ckpt_to_load_from=None,
tensor_name_in_ckpt=None,
max_norm=None,  # if set, embeddings are clipped to this l2 norm
trainable=True
)
video_id = categorical_column_with_identity(
    key='video_id', num_buckets=1000000, default_value=0)
columns = [embedding_column(video_id, 9), ...]
estimator = tf.estimator.DNNClassifier(feature_columns=columns, ...)
label_column = ...
def input_fn():
    features = tf.parse_example(
        ..., features=make_parse_example_spec(columns + [label_column]))
    labels = features.pop(label_column.name)
    return features, labels
estimator.train(input_fn=input_fn, steps=100)
indicator_column
tf.feature_column.indicator_column(categorical_column)
Converts the given categorical column into a multi-hot encoding.
name = indicator_column(categorical_column_with_vocabulary_list(
    'name', ['bob', 'george', 'wanda']))
columns = [name, ...]
features = tf.parse_example(..., features=make_parse_example_spec(columns))
dense_tensor = input_layer(features, columns)
dense_tensor == [[1, 0, 0]]  # If "name" bytes_list is ["bob"]
dense_tensor == [[1, 0, 1]]  # If "name" bytes_list is ["bob", "wanda"]
dense_tensor == [[2, 0, 0]]  # If "name" bytes_list is ["bob", "bob"]
input_layer
#Returns a dense Tensor as input layer based on given feature_columns.
tf.feature_column.input_layer(
features,  # a mapping from feature keys to tensors
feature_columns,  # must be dense columns such as numeric_column, embedding_column, indicator_column or bucketized_column; wrap categorical columns with embedding_column or indicator_column first
weight_collections=None,
trainable=True,
cols_to_vars=None
)
price = numeric_column('price')
keywords_embedded = embedding_column(
    categorical_column_with_hash_bucket("keywords", 10000), dimension=16)
columns = [price, keywords_embedded, ...]
features = tf.parse_example(..., features=make_parse_example_spec(columns))
dense_tensor = input_layer(features, columns)
for units in [128, 64, 32]:
    dense_tensor = tf.layers.dense(dense_tensor, units, tf.nn.relu)
prediction = tf.layers.dense(dense_tensor, 1)
numeric_column
tf.feature_column.numeric_column(
key,
shape=(1,),
default_value=None,
dtype=tf.float32,
normalizer_fn=None
)
Numeric features; the simplest case. normalizer_fn can be used to transform the raw value, as in the sketch below.
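A minimal sketch, assuming a hypothetical 'price' feature standardized with a made-up mean and standard deviation:
# normalizer_fn is applied to the raw input before it is fed to the model
price = tf.feature_column.numeric_column(
    'price', normalizer_fn=lambda x: (x - 10.0) / 5.0)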
shared_embedding_columns
tf.feature_column.shared_embedding_columns(
categorical_columns,
dimension,
combiner='mean',
initializer=None,
shared_embedding_collection_name=None,
ckpt_to_load_from=None,
tensor_name_in_ckpt=None,
max_norm=None,
trainable=True
)
Used when several columns should share one embedding, e.g. the ad id currently being scored and the list of ad ids in the user's history.
watched_video_id = categorical_column_with_vocabulary_file(
    'watched_video_id', video_vocabulary_file, video_vocabulary_size)
impression_video_id = categorical_column_with_vocabulary_file(
    'impression_video_id', video_vocabulary_file, video_vocabulary_size)
columns = shared_embedding_columns(
    [watched_video_id, impression_video_id], dimension=10)
estimator = tf.estimator.DNNClassifier(feature_columns=columns, ...)
label_column = ...
def input_fn():
    features = tf.parse_example(
        ..., features=make_parse_example_spec(columns + [label_column]))
    labels = features.pop(label_column.name)
    return features, labels
estimator.train(input_fn=input_fn, steps=100)