tf.feature_column API
A quick pass over the official API, organized so it can be read in one sitting. The content is largely taken from TensorFlow's official API docs; see those docs for more detail.
TensorFlow's feature-processing API
TensorFlow provides the feature_column API for handling features. It covers essentially every feature type we use in practice; each one is briefly listed below.
bucketized_column
tf.feature_column.bucketized_column(
source_column,
boundaries
)
Buckets a continuous feature into discrete intervals according to boundaries, which amounts to manually choosing the cut points for discretization. For example:
boundaries = [0, 10, 100]
input tensor = [[-5, 10000],
                [150, 10],
                [5, 100]]
output = [[0, 3],
          [3, 2],
          [1, 3]]
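A minimal usage sketch (the feature name 'age' and the boundaries are hypothetical); note that source_column must itself be a numeric_column:
import tensorflow as tf
age = tf.feature_column.numeric_column('age')
# boundaries [18, 35, 60] yield 4 buckets: (-inf, 18), [18, 35), [35, 60), [60, +inf)
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[18, 35, 60])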
categorical_column_with_hash_bucket
Discretizes a categorical feature by hashing into buckets. For example, user ids can run into the hundreds of millions; materializing every id would put a heavy load on the model, so we hash them into a fixed number of buckets instead. The trade-off is a potentially large number of hash collisions.
tf.feature_column.categorical_column_with_hash_bucket(
key,
hash_bucket_size,  # output_id = Hash(input_feature_string) % bucket_size
dtype=tf.string  # must be string or integer; for missing values, strings are hashed as '' and ints return -1
)
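A minimal sketch (the feature name 'user_id' and the bucket size are hypothetical); hashed columns are usually fed into the model via an embedding or indicator column:
user_id = tf.feature_column.categorical_column_with_hash_bucket(
    'user_id', hash_bucket_size=100000, dtype=tf.string)
user_id_emb = tf.feature_column.embedding_column(user_id, dimension=16)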
categorical_column_with_identity
A categorical feature whose values are already integer ids.
tf.feature_column.categorical_column_with_identity(
key,
num_buckets,
default_value=None
)
num_buckets must be specified here, and inputs must fall within range(0, num_buckets); any value outside that range is replaced by default_value. This is useful when we have already encoded the feature ourselves as consecutive integer ids.
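A minimal sketch, assuming a hypothetical 'city_id' feature already encoded as integers in [0, 100):
city_id = tf.feature_column.categorical_column_with_identity(
    'city_id', num_buckets=100, default_value=0)  # out-of-range ids fall back to 0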
categorical_column_with_vocabulary_file
Encodes a categorical feature using a vocabulary stored in a given file.
tf.feature_column.categorical_column_with_vocabulary_file(
key,
vocabulary_file,
vocabulary_size=None,
num_oov_buckets=0,
default_value=None,
dtype=tf.string
)
We are unlikely to use this one, so it is only touched on briefly; a sketch follows for completeness.
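A minimal sketch, assuming a hypothetical vocabulary file with one value per line; out-of-vocabulary values go into the extra OOV bucket:
state = tf.feature_column.categorical_column_with_vocabulary_file(
    'state', vocabulary_file='/path/to/states.txt', num_oov_buckets=1)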
categorical_column_with_vocabulary_list
tf.feature_column.categorical_column_with_vocabulary_list(
key,
vocabulary_list,
dtype=None,
default_value=-1,
num_oov_buckets=0
)
The previous column maps inputs with a vocabulary file; this one uses a vocabulary list instead, i.e. every possible value is loaded into memory (e.g. gender). It is also not used very often; a small sketch is below.
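A minimal sketch, assuming a hypothetical 'gender' feature with a small fixed vocabulary:
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    'gender', vocabulary_list=['male', 'female', 'unknown'])
gender_onehot = tf.feature_column.indicator_column(gender)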
crossed_column
Next is the crossed column, which crosses the categorical columns above to produce a new feature column.
tf.feature_column.crossed_column(
keys,
hash_bucket_size,
hash_key=None
)
Crossed columns are implemented by hashing the crossed values into buckets.
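A minimal sketch crossing two hypothetical features, a raw string key 'city' and a bucketized 'age' column; keys may be raw feature names or existing categorical/bucketized columns:
age = tf.feature_column.numeric_column('age')
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[18, 35, 60])
age_x_city = tf.feature_column.crossed_column(
    ['city', age_buckets], hash_bucket_size=10000)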
embedding_column
This is the embedding approach used very widely in neural networks to handle categorical features.
tf.feature_column.embedding_column(
categorical_column,  # the categorical column to embed
dimension,  # embedding dimension
combiner='mean',  # pooling method for multi-valued inputs ('mean', 'sqrtn' or 'sum')
initializer=None,  # defaults to tf.truncated_normal_initializer with mean 0.0 and standard deviation 1/sqrt(dimension)
ckpt_to_load_from=None,
tensor_name_in_ckpt=None,
max_norm=None,  # if set, embeddings are clipped to this l2 norm
trainable=True
)
video_id = categorical_column_with_identity(
    key='video_id', num_buckets=1000000, default_value=0)
columns = [embedding_column(video_id, 9), ...]
estimator = tf.estimator.DNNClassifier(feature_columns=columns, ...)
label_column = ...
def input_fn():
    features = tf.parse_example(
        ..., features=make_parse_example_spec(columns + [label_column]))
    labels = features.pop(label_column.name)
    return features, labels
estimator.train(input_fn=input_fn, steps=100)
indicator_column
tf.feature_column.indicator_column(categorical_column)
Converts the given categorical column into a multi-hot encoding.
name = indicator_column(categorical_column_with_vocabulary_list(
    'name', ['bob', 'george', 'wanda']))
columns = [name, ...]
features = tf.parse_example(..., features=make_parse_example_spec(columns))
dense_tensor = input_layer(features, columns)
dense_tensor == [[1, 0, 0]]  # If "name" bytes_list is ["bob"]
dense_tensor == [[1, 0, 1]]  # If "name" bytes_list is ["bob", "wanda"]
dense_tensor == [[2, 0, 0]]  # If "name" bytes_list is ["bob", "bob"]
input_layer
#Returns a dense Tensor as input layer based on given feature_columns.
tf.feature_column.input_layer(
features,  # a mapping from feature keys to tensors
feature_columns,  # must be dense columns such as numeric_column, embedding_column, indicator_column or bucketized_column; wrap categorical columns with embedding_column or indicator_column first
weight_collections=None,
trainable=True,
cols_to_vars=None
)
price = numeric_column('price')
keywords_embedded = embedding_column(
    categorical_column_with_hash_bucket("keywords", 10000), dimension=16)
columns = [price, keywords_embedded, ...]
features = tf.parse_example(..., features=make_parse_example_spec(columns))
dense_tensor = input_layer(features, columns)
for units in [128, 64, 32]:
    dense_tensor = tf.layers.dense(dense_tensor, units, tf.nn.relu)
prediction = tf.layers.dense(dense_tensor, 1)
numeric_column
tf.feature_column.numeric_column(
key,
shape=(1,),
default_value=None,
dtype=tf.float32,
normalizer_fn=None
)
Numeric features; the simplest case. normalizer_fn can be used to transform the raw value, as in the sketch below.
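A minimal sketch, assuming a hypothetical 'price' feature standardized with a made-up mean and standard deviation:
# normalizer_fn is applied to the raw input before it is fed to the model
price = tf.feature_column.numeric_column(
    'price', normalizer_fn=lambda x: (x - 10.0) / 5.0)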
shared_embedding_columns
tf.feature_column.shared_embedding_columns(
categorical_columns,
dimension,
combiner='mean',
initializer=None,
shared_embedding_collection_name=None,
ckpt_to_load_from=None,
tensor_name_in_ckpt=None,
max_norm=None,
trainable=True
)
Used when several columns should share one embedding, e.g. the ad id currently being scored and the list of ad ids in the user's history.
watched_video_id = categorical_column_with_vocabulary_file(
    'watched_video_id', video_vocabulary_file, video_vocabulary_size)
impression_video_id = categorical_column_with_vocabulary_file(
    'impression_video_id', video_vocabulary_file, video_vocabulary_size)
columns = shared_embedding_columns(
    [watched_video_id, impression_video_id], dimension=10)
estimator = tf.estimator.DNNClassifier(feature_columns=columns, ...)
label_column = ...
def input_fn():
    features = tf.parse_example(
        ..., features=make_parse_example_spec(columns + [label_column]))
    labels = features.pop(label_column.name)
    return features, labels
estimator.train(input_fn=input_fn, steps=100)