TensorFlow Feature Engineering

Latest recommended article published on 2021-07-22 22:14:06

Lzj000lzj


https://blog.csdn.net/u014021893/article/details/80423112
https://blog.csdn.net/u014061630/article/details/82937333
https://blog.csdn.net/qq_22238533/article/details/78980319
https://blog.csdn.net/cjopengler/article/details/78161748

import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers

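In TensorFlow, feature columns are created through the tf.feature_column module. They fall into two broad groups: Dense Columns, which produce dense tensors for continuous features, and Categorical Columns, which produce sparse tensors for discrete features.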

Numeric column

tf.feature_column.numeric_column(
    key,
    shape=(1,),
    default_value=None,
    dtype=tf.float32,
    normalizer_fn=None
)
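key: the name of the feature, i.e. the name of the corresponding input column.
shape: the shape of the feature stored under this key. The default is (1,), one scalar per example. If the feature is already a vector (for example a precomputed one-hot vector), set shape to its actual dimensions.
default_value: the value used when the feature is missing from the input.
normalizer_fn: a function applied to every value of this feature. It is typically used for normalization, but any transformation works, such as taking the logarithm or the exponential.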

How to process a feature with a numeric column, inspect the result,
and plug in a custom normalizer_fn

def test_numeric():
    price = {'price': [[1.], [2.], [3.], [4.]]}  # 4 samples

    def transform_fn(x):
        return x + 2

    # The key 'price' must match the column name in the input data;
    # otherwise the column to process cannot be found.
    price_column = feature_column.numeric_column('price', normalizer_fn=transform_fn)
    # Build a layer from the processed feature column.
    feature_layer = tf.keras.layers.DenseFeatures(price_column)
    # Calling the layer on the data returns the preprocessed values.
    print(feature_layer(price).numpy())

test_numeric()
Output:
[[3.]
 [4.]
 [5.]
 [6.]]
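Since normalizer_fn accepts any transformation, a common choice is z-score standardization. The following is a minimal sketch in plain Python, not code from the original post: `make_zscore_normalizer` and `normalize_price` are hypothetical names, and the statistics come from the four sample prices above. Because the returned function uses only `-` and `/`, it should also be usable as a `normalizer_fn` on tensors, but that usage is an assumption and is not exercised here.

```python
import statistics


def make_zscore_normalizer(training_values):
    """Return a function that standardizes x with the mean/std of training_values."""
    mean = statistics.mean(training_values)
    std = statistics.pstdev(training_values)  # population standard deviation
    if std == 0:
        std = 1.0  # avoid division by zero for a constant column

    def normalize(x):
        # Plain arithmetic only, so any object that overloads - and / works.
        return (x - mean) / std

    return normalize


# Hypothetical usage with the sample prices from the example above.
normalize_price = make_zscore_normalizer([1.0, 2.0, 3.0, 4.0])
standardized = [normalize_price(v) for v in [1.0, 2.0, 3.0, 4.0]]
print(standardized)
```

The statistics are computed once, outside the transformation, so the same mean and standard deviation are applied to training and test data alike.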

indicator_column and embedding_column


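Indicator columns and embedding columns cannot be applied to raw categorical features directly; they operate on the output of categorical columns.
indicator_column one-hot encodes a categorical column.
When a feature has a very large number of categories, using indicator_column to turn the raw data into neural-network input becomes impractical because the feature dimension grows too large. In that case an embedding column is usually used to map the raw feature to a low-dimensional dense vector of real numbers. The distance between the embedding vectors of two categories can often serve as a measure of how similar those categories are.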
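The idea behind an embedding column can be sketched as a table lookup: each category ID selects one row of a trainable table, and several IDs in one example are combined (embedding_column defaults to the 'mean' combiner). The sketch below is plain Python for illustration only; `embedding_lookup` and the 3x2 table values are made up, and returning zeros for an example with no valid ID is an assumption of this sketch.

```python
def embedding_lookup(ids, table):
    """Average the table rows selected by ids; negative ids are ignored."""
    dimension = len(table[0])
    rows = [table[i] for i in ids if i >= 0]
    if not rows:
        return [0.0] * dimension  # assumption: no valid id gives a zero vector
    return [sum(column) / len(rows) for column in zip(*rows)]


# Hypothetical table: 3 categories, embedding dimension 2.
# In a real model these values are learned during training.
table = [
    [0.5, -1.0],   # category 0
    [1.5, 0.0],    # category 1
    [-0.5, 2.0],   # category 2
]

print(embedding_lookup([0], table))      # a single category: its own row
print(embedding_lookup([0, 1], table))   # two categories: the mean of both rows
```

The output width is the embedding dimension, not the number of categories, which is why this scales better than one-hot encoding for large vocabularies.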

bucketized_column

tf.feature_column.bucketized_column(
    source_column,
    boundaries
)
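source_column: must be a numeric_column.
boundaries: the bucket boundaries, in ascending order. With boundaries=[0., 1., 2.] the buckets are (-inf, 0.), [0., 1.), [1., 2.), and [2., +inf), which map to the indices 0, 1, 2, and 3, so three boundaries produce four buckets.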
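The mapping from a value to its bucket index can be reproduced with a binary search. This is a minimal plain-Python sketch of the interval rule described above, not the library's implementation; `bucket_index` and `one_hot` are hypothetical helper names.

```python
import bisect


def bucket_index(value, boundaries):
    """Index of the bucket holding value; each bucket includes its left boundary."""
    return bisect.bisect_right(boundaries, value)


def one_hot(index, depth):
    """Dense one-hot vector of length depth with a 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


boundaries = [0.0, 1.0, 2.0]
num_buckets = len(boundaries) + 1  # three boundaries give four buckets

for value in [-0.5, 0.0, 0.5, 1.0, 2.0]:
    index = bucket_index(value, boundaries)
    print(value, index, one_hot(index, num_buckets))
```

`bisect_right` is used rather than `bisect_left` so that a value equal to a boundary falls into the bucket that starts at that boundary, matching the half-open intervals [0., 1.), [1., 2.), and so on.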

Categorical vocabulary column

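A categorical vocabulary column maps each string in a vocabulary to an integer category ID, which makes it a good starting point for one-hot encoding. TensorFlow offers two ways to supply the vocabulary, as a list or as a file, with one feature column for each:

tf.feature_column.categorical_column_with_vocabulary_list
tf.feature_column.categorical_column_with_vocabulary_file
Their definitions are as follows: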

tf.feature_column.categorical_column_with_vocabulary_list(
    key,
    vocabulary_list,
    dtype=None,
    default_value=-1,
    num_oov_buckets=0
)
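key: the name of the feature.
vocabulary_list: the list of categories; each value is mapped to its index in this list.
dtype: only string and integer types are supported.
default_value: the ID returned for values that are not in vocabulary_list. It can only be set when num_oov_buckets is 0.
num_oov_buckets: controls how values outside vocabulary_list are handled. If it is 0, they receive default_value. If it is greater than 0, they are hashed into an ID in the range [len(vocabulary_list), len(vocabulary_list)+num_oov_buckets).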
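The interaction between default_value and num_oov_buckets can be sketched as a small lookup function. This is plain Python for illustration; `vocabulary_lookup` is a hypothetical name, and `zlib.crc32` is only a stand-in for the hash, so the exact out-of-vocabulary IDs will differ from what TensorFlow assigns.

```python
import zlib


def vocabulary_lookup(value, vocabulary_list, default_value=-1, num_oov_buckets=0):
    """Map a string to its category ID following the rules described above."""
    if value in vocabulary_list:
        return vocabulary_list.index(value)
    if num_oov_buckets == 0:
        return default_value
    # Assumption: crc32 stands in for the library's own hash function.
    bucket = zlib.crc32(value.encode("utf-8")) % num_oov_buckets
    return len(vocabulary_list) + bucket


vocabulary = ['R', 'G', 'B']
print(vocabulary_lookup('G', vocabulary))                     # in vocabulary
print(vocabulary_lookup('A', vocabulary))                     # falls back to default_value
print(vocabulary_lookup('A', vocabulary, num_oov_buckets=2))  # hashed into bucket 3 or 4
```

With out-of-vocabulary buckets, unseen values still get a valid non-negative ID, so a downstream one-hot or embedding layer can learn something for them instead of dropping them.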


tf.feature_column.categorical_column_with_vocabulary_file(
    key,
    vocabulary_file,
    vocabulary_size=None,
    num_oov_buckets=0,
    default_value=None,
    dtype=tf.string
)
vocabulary_file: the name of the file that stores the vocabulary.
The remaining parameters have the same meaning as in tf.feature_column.categorical_column_with_vocabulary_list.
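A vocabulary file is a plain text file with one vocabulary entry per line. The sketch below writes such a file and reads it back with the standard library only; the file name `colors.txt` and both helper functions are hypothetical, and the resulting path is what would be passed as vocabulary_file.

```python
import os
import tempfile


def write_vocabulary_file(path, tokens):
    """Write one token per line."""
    with open(path, "w", encoding="utf-8") as f:
        for token in tokens:
            f.write(token + "\n")


def read_vocabulary_file(path):
    """Read the tokens back, skipping empty lines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    vocabulary_path = os.path.join(tmp_dir, "colors.txt")  # hypothetical file name
    write_vocabulary_file(vocabulary_path, ["R", "G", "B"])
    vocabulary = read_vocabulary_file(vocabulary_path)

# The line order defines the IDs: line 0 gets ID 0, line 1 gets ID 1, and so on.
vocabulary_size = len(vocabulary)
print(vocabulary, vocabulary_size)
```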

categorical_column_with_vocabulary_list

def test_categorical_column_with_vocabulary_list():
    color_data = {'color': [['R', 'R'], ['G', 'R'], ['B', 'G'], ['A', 'A']]}  # 4 samples, 2 values each
    color_column = feature_column.categorical_column_with_vocabulary_list(
        'color', ['R', 'G', 'B'], dtype=tf.string, default_value=-1
    )
    # Convert the sparse column to a dense one. With several values per sample
    # the result is multi-hot (per-category counts) rather than strictly one-hot.
    color_column_identity = feature_column.indicator_column(color_column)
    # Build a layer from the processed feature column.
    feature_layer = tf.keras.layers.DenseFeatures(color_column_identity)
    # Calling the layer on the data returns the preprocessed values.
    print(feature_layer(color_data).numpy())

test_categorical_column_with_vocabulary_list()

Output:
[[2. 0. 0.]
 [1. 1. 0.]
 [0. 1. 1.]
 [0. 0. 0.]]
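The output above contains a 2 in the first row and an all-zero last row because the indicator column counts how often each vocabulary entry occurs in a sample and ignores IDs outside the vocabulary. A plain-Python sketch of that counting follows; `multi_hot` is a hypothetical helper, not a library function.

```python
def multi_hot(ids, depth):
    """Count how often each valid ID occurs; IDs outside [0, depth) are ignored."""
    counts = [0.0] * depth
    for i in ids:
        if 0 <= i < depth:
            counts[i] += 1.0
    return counts


vocabulary = ['R', 'G', 'B']
rows = [['R', 'R'], ['G', 'R'], ['B', 'G'], ['A', 'A']]

encoded = []
for row in rows:
    # 'A' is not in the vocabulary, so it maps to the default ID -1.
    ids = [vocabulary.index(v) if v in vocabulary else -1 for v in row]
    encoded.append(multi_hot(ids, len(vocabulary)))

for vector in encoded:
    print(vector)
```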

Crossed column

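Feature crossing is another widely used feature-engineering technique, especially with LR (logistic regression) models. A crossed column can only be built from sparse (categorical) features, and its output is again a sparse feature.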

tf.feature_column.crossed_column(
    keys,
    hash_bucket_size,
    hash_key=None
)

def test_crossed_column():
    """Crossed column test."""
    features = {
        'price': [['A'], ['B'], ['C']],
        'color': [['R'], ['G'], ['B']]
    }
    price = feature_column.categorical_column_with_vocabulary_list('price', ['A', 'B', 'C', 'D'])
    color = feature_column.categorical_column_with_vocabulary_list('color', ['R', 'G', 'B'])
    # 16 is hash_bucket_size, the number of hash buckets and therefore the
    # output width; the crossed column is still a categorical (sparse) column.
    p_x_c = feature_column.crossed_column([price, color], 16)
    # Alternatively, cross raw feature keys that have not been wrapped in a column:
    # p_x_c = tf.feature_column.crossed_column(['price', 'color'], hash_bucket_size=16)
    # One-hot encode the categorical result.
    p_x_c_identity = feature_column.indicator_column(p_x_c)
    # Build a layer from the processed feature column.
    feature_layer = tf.keras.layers.DenseFeatures(p_x_c_identity)
    # Calling the layer on the data returns the preprocessed values.
    print(feature_layer(features).numpy())

test_crossed_column()
Output:
[[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]]
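Conceptually, a crossed column combines the values of its inputs into one key and hashes that key into one of hash_bucket_size buckets. The sketch below shows the idea in plain Python; `crossed_index` is a hypothetical name, and `zlib.crc32` with a "_X_" separator is a stand-in for the library's hash, so the bucket indices it prints will not match the output above.

```python
import zlib


def crossed_index(values, hash_bucket_size):
    """Hash a combination of categorical values into [0, hash_bucket_size)."""
    joined = "_X_".join(values)  # assumption: separator chosen for this sketch
    return zlib.crc32(joined.encode("utf-8")) % hash_bucket_size


hash_bucket_size = 16
pairs = [('A', 'R'), ('B', 'G'), ('C', 'B')]

for price, color in pairs:
    index = crossed_index([price, color], hash_bucket_size)
    one_hot = [1.0 if i == index else 0.0 for i in range(hash_bucket_size)]
    print((price, color), index, one_hot)
```

Because the mapping is a hash, two different combinations can collide in the same bucket; a larger hash_bucket_size lowers the collision rate at the cost of a wider output.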

A way to inspect feature-processing results one batch at a time

Data preprocessing

import pandas as pd
from sklearn.model_selection import train_test_split

URL = 'https://storage.googleapis.com/applied-dl/heart.csv'
dataframe = pd.read_csv(URL)
dataframe.head()
train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

# Build the input pipeline.
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe1 = dataframe.copy()  # work on a copy so pop() does not modify the original df
    labels = dataframe1.pop('target')  # extract the labels
    # from_tensor_slices takes (dict of feature columns, label column) and
    # yields one (features, label) pair per row.
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe1), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe1))  # shuffle the rows
    ds = ds.batch(batch_size)  # group the rows into batches of batch_size
    return ds

batch_size = 5
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)
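What the pipeline produces can be pictured without TensorFlow: rows are optionally shuffled, then grouped into batches where the features form a dict of columns and the labels a parallel list. The following plain-Python sketch mimics that structure; `iter_batches` and the toy data are made up for illustration and are not part of the heart dataset.

```python
import random


def iter_batches(features, labels, batch_size, shuffle=True, seed=0):
    """Yield (dict of feature columns, labels) batches, keeping rows aligned."""
    order = list(range(len(labels)))
    if shuffle:
        random.Random(seed).shuffle(order)  # shuffle row indices, not the columns
    for start in range(0, len(order), batch_size):
        indices = order[start:start + batch_size]
        batch_features = {name: [column[i] for i in indices]
                          for name, column in features.items()}
        batch_labels = [labels[i] for i in indices]
        yield batch_features, batch_labels


# Toy data (made up): the label is the parity of age so alignment is easy to check.
toy_features = {'age': [20, 31, 42, 53, 64, 75, 86],
                'thal': ['fixed', 'normal', 'normal', 'reversible',
                         'fixed', 'normal', 'reversible']}
toy_labels = [age % 2 for age in toy_features['age']]

batches = list(iter_batches(toy_features, toy_labels, batch_size=3))
for batch_features, batch_labels in batches:
    print(batch_features, batch_labels)
```

Seven rows with a batch size of 3 give batches of 3, 3, and 1 rows; the last batch is smaller because nothing is dropped.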

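Extracting example_batch

# A sample batch used to visualize part of the data.
example_batch = next(iter(train_ds))[0]
print(example_batch)

# Helper that shows the result of processing the batch with a feature column.
def demo(feature_column):
    # Building the layer does not depend on any data.
    feature_layer = layers.DenseFeatures(feature_column)
    # Calling the layer on the batch returns the processed values for that
    # batch only (the result depends solely on example_batch).
    print(feature_layer(example_batch).numpy())

Visualizing each preprocessing method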

# Numeric column
# The key "age" selects the matching column of example_batch inside demo().
age = feature_column.numeric_column("age")
demo(age)
# Bucketized column
age_buckets = feature_column.bucketized_column(age, boundaries=[
    18, 25, 30, 35, 40, 50])
demo(age_buckets)
# Categorical column
thal = feature_column.categorical_column_with_vocabulary_list('thal', ['fixed', 'normal', 'reversible'])
thal_one_hot = feature_column.indicator_column(thal)
demo(thal_one_hot)
# Embedding column
thal_embedding = feature_column.embedding_column(thal, dimension=8)
demo(thal_embedding)
# Crossed column, for sparse features: combining several features into one
# (a feature cross) lets the model learn a separate weight for each combination.
crossed_feature = feature_column.crossed_column([age_buckets, thal], hash_bucket_size=1000)  # using 1000 hash bins
demo(feature_column.indicator_column(crossed_feature))

Combining the features

feature_columns = []
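# Collect every kind of feature column as the model input. No data is passed in
# while the network is being defined, but the column names to process must be given.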
# numeric cols
for header in ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']:
    feature_columns.append(feature_column.numeric_column(header))

# bucketized cols
age_buckets = feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
feature_columns.append(age_buckets)

# indicator cols
thal = feature_column.categorical_column_with_vocabulary_list(
      'thal', ['fixed', 'normal', 'reversible'])
thal_one_hot = feature_column.indicator_column(thal)
feature_columns.append(thal_one_hot)

# embedding cols
thal_embedding = feature_column.embedding_column(thal, dimension=8)
feature_columns.append(thal_embedding)

# crossed cols
crossed_feature = feature_column.crossed_column([age_buckets, thal], hash_bucket_size=1000)
crossed_feature = feature_column.indicator_column(crossed_feature)
feature_columns.append(crossed_feature)

Building the feature layer

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

Preparing the data

batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

Building, training, and evaluating the model

model = tf.keras.Sequential([
    feature_layer,
    layers.Dense(128, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
             loss='binary_crossentropy',
             metrics=['accuracy'])
model.fit(train_ds, validation_data=val_ds,epochs=5)
loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)
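The model ends in a single sigmoid unit and is trained with binary_crossentropy. As a reminder of what that loss measures for one example, here is a small standard-library sketch; the function names are hypothetical, and the clipping constant 1e-7 is an assumption about the numerical safeguard, not a value taken from the original post.

```python
import math


def sigmoid(z):
    """Squash a real-valued logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a 0/1 label and a predicted probability."""
    p = min(max(y_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))


# A logit of 0 means the model is undecided (probability 0.5).
undecided = binary_crossentropy(1.0, sigmoid(0.0))
confident_right = binary_crossentropy(1.0, sigmoid(4.0))
confident_wrong = binary_crossentropy(0.0, sigmoid(4.0))
print(undecided, confident_right, confident_wrong)
```

The loss is small for a confident correct prediction and grows quickly for a confident wrong one, which is what pushes the sigmoid output toward the true label during training.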