DNNLinearCombinedClassifier in Practice

Training on the Census Income Data Set

Training set

The training data is the Census Income Data Set.
The dataset contains roughly 48,000 samples. Its attributes include age, occupation, education, and income, where income is a binary label: either >50K or <=50K. The data is split into roughly 32,000 training samples and 16,000 test samples.
The attributes are as follows:

| Field | Values | Description |
| --- | --- | --- |
| age | continuous | The person's age |
| fnlwgt | continuous | Final sampling weight assigned by the census (roughly, how many people this record represents) |
| education-num | continuous | Highest education level in numeric form |
| capital-gain | continuous | Capital gains record |
| capital-loss | continuous | Capital losses record |
| hours-per-week | continuous | Hours worked per week |
| workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked | Type of employer (government, military, private, etc.) |
| education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool | Highest education level attained |
| marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse | Marital status |
| occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces | Occupation |
| relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried | Role in the household (wife, child, husband, not in family, other relative, unmarried) |
| race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black | Race |
| sex | Female, Male | Sex |
| native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands | Native country |
| income | >50K, <=50K | Whether annual income exceeds $50,000 |

The data files the training code reads are comma-separated and can be downloaded from the dataset's homepage:
adult.data: training set.
adult.test: test set.
Opening adult.data, each row is one comma-separated record, and the last column (<=50K or >50K) is the label.
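
For reference, the first record of adult.data looks roughly like this (quoted from memory, so treat it as illustrative):

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K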

Features

The dataset contains raw features, some of them strings, which cannot be fed to the model directly; they need feature processing first.
For feature engineering of continuous and categorical data, see the Feature column section of 《DNNLinear组合分类器的使用 & Feature column》.
Put together, the code below builds, trains, and evaluates a complete model; if you run into problems, feel free to leave a comment. First, the imports and basic setup:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import sys
import tempfile

import pandas as pd
from six.moves import urllib
import tensorflow as tf

# The raw data file has no header row; these are the feature column names.
CSV_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "gender",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income_bracket"
]

The continuous numeric features in the dataset, such as age and education_num, can be used directly:

# Continuous base columns.
age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")

The categorical features are handled as follows:

gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])
education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", [
        "Bachelors", "HS-grad", "11th", "Masters", "9th",
        "Some-college", "Assoc-acdm", "Assoc-voc", "7th-8th",
        "Doctorate", "Prof-school", "5th-6th", "10th", "1st-4th",
        "Preschool", "12th"
    ])
marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    "marital_status", [
        "Married-civ-spouse", "Divorced", "Married-spouse-absent",
        "Never-married", "Separated", "Married-AF-spouse", "Widowed"
    ])
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    "relationship", [
        "Husband", "Not-in-family", "Wife", "Own-child", "Unmarried",
        "Other-relative"
    ])
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    "workclass", [
        "Self-emp-not-inc", "Private", "State-gov", "Federal-gov",
        "Local-gov", "?", "Self-emp-inc", "Without-pay", "Never-worked"
    ])
    
# To show an example of hashing:
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket(
    "native_country", hash_bucket_size=1000)

# Transformations.
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

Set up the wide and deep feature columns:

# Wide columns and deep columns.
base_columns = [
    gender, education, marital_status, relationship, workclass, occupation,
    native_country, age_buckets,
]

crossed_columns = [
    tf.feature_column.crossed_column(
        ["education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        [age_buckets, "education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        ["native_country", "occupation"], hash_bucket_size=1000)
]

deep_columns = [
    tf.feature_column.indicator_column(workclass),
    tf.feature_column.indicator_column(education),
    tf.feature_column.indicator_column(gender),
    tf.feature_column.indicator_column(relationship),
    # To show an example of embedding
    tf.feature_column.embedding_column(native_country, dimension=8),
    tf.feature_column.embedding_column(occupation, dimension=8),
    age,
    education_num,
    capital_gain,
    capital_loss,
    hours_per_week,
]
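
As an optional sanity check (not part of the original walkthrough), you can run one hand-made example through the deep columns with tf.feature_column.input_layer to see the dense tensor they produce. The feature values below are purely illustrative, and the snippet assumes TF 1.x graph mode:

example = {
    "age": [[25.0]],
    "education_num": [[13.0]],
    "capital_gain": [[0.0]],
    "capital_loss": [[0.0]],
    "hours_per_week": [[40.0]],
    "workclass": [["Private"]],
    "education": [["Bachelors"]],
    "gender": [["Male"]],
    "relationship": [["Husband"]],
    "native_country": [["United-States"]],
    "occupation": [["Tech-support"]],
}
net = tf.feature_column.input_layer(example, deep_columns)
with tf.Session() as sess:
    # Embedding weights and vocabulary lookup tables need initializing.
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(net).shape)  # (1, total width of the concatenated deep columns)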

input_fn

input_fn is a particularly important piece of the DNNLinearCombined classifier pipeline. This post shows two input_fn variants; I prefer the first. Since both share the same name, keep only one of them in your script (the later definition would silently override the earlier one).

def input_fn(data_file, num_epochs, shuffle):
  """Input builder function."""
  df_data = pd.read_csv(
      tf.gfile.Open(data_file),
      names=CSV_COLUMNS,
      skipinitialspace=True,
      engine="python",
      skiprows=1)
  # remove NaN elements
  df_data = df_data.dropna(how="any", axis=0)
  labels = df_data["income_bracket"].apply(lambda x: ">50K" in x).astype(int)
  return tf.estimator.inputs.pandas_input_fn(
      x=df_data,
      y=labels,
      batch_size=100,
      num_epochs=num_epochs,
      shuffle=shuffle,
      num_threads=5)
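
The first variant returns the callable produced by tf.estimator.inputs.pandas_input_fn, so the result can be handed straight to the Estimator. A minimal usage sketch, assuming the data file path used in train_and_eval below:

train_inpf = input_fn('./data/adult.data', num_epochs=None, shuffle=True)
# later: m.train(input_fn=train_inpf, steps=2000)
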
def input_fn(data_file, num_epochs, shuffle, batch_size):
    """Create an input function for the Estimator."""
    assert tf.gfile.Exists(data_file), "{0} not found.".format(data_file)

    def parse_csv(line):
        print("Parsing", data_file)
        # tf.decode_csv turns one CSV line into a list of Tensors, one per column.
        # record_defaults specifies the fill value (and implicitly the dtype) of each column.
        columns = tf.decode_csv(line, record_defaults=_CSV_COLUMN_DEFAULTS)
        features = dict(zip(_CSV_COLUMNS, columns))
        labels = features.pop('income_bracket')
        # tf.equal(x, y) returns a bool Tensor computing x == y element-wise.
        return features, tf.equal(labels, '>50K')

    # DataFrame-to-tensor reference: https://cloud.tencent.com/developer/ask/135418
    dataset = tf.data.TextLineDataset(data_file) \
                .map(parse_csv, num_parallel_calls=5)

    if shuffle:
        dataset = dataset.shuffle(
            buffer_size=_NUM_EXAMPLES['train'] + _NUM_EXAMPLES['validation'])

    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)

    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels
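
The second variant references _CSV_COLUMNS, _CSV_COLUMN_DEFAULTS, and _NUM_EXAMPLES, which are not defined in the snippets above. A minimal sketch of what they would look like (the defaults follow the CSV_COLUMNS order; the example counts are the standard adult.data/adult.test sizes, but treat the exact values as assumptions), plus a note on how this variant plugs into the Estimator:

_CSV_COLUMNS = CSV_COLUMNS  # same column names as defined earlier
# One default per column, in order; [0] marks a numeric column, [''] a string column.
_CSV_COLUMN_DEFAULTS = [[0], [''], [0], [''], [0],
                        [''], [''], [''], [''], [''],
                        [0], [0], [0], [''], ['']]
_NUM_EXAMPLES = {'train': 32561, 'validation': 16281}

# Unlike the first variant (which returns a callable), this one returns
# (features, labels) tensors directly, so wrap the call in a lambda:
# m.train(
#     input_fn=lambda: input_fn(train_file_name, num_epochs=None,
#                               shuffle=True, batch_size=100),
#     steps=train_steps)

Also note that the labels in adult.test carry a trailing period ('>50K.'), which tf.equal(labels, '>50K') will not match unless the file is cleaned first; the pandas variant handles this because '>50K' in x is a substring test.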

model

def build_estimator(model_dir, model_type):
  """Build an estimator."""
  if model_type == "wide":
    m = tf.estimator.LinearClassifier(
        model_dir=model_dir, feature_columns=base_columns + crossed_columns)
  elif model_type == "deep":
    m = tf.estimator.DNNClassifier(
        model_dir=model_dir,
        feature_columns=deep_columns,
        hidden_units=[100, 50])
  else:
    m = tf.estimator.DNNLinearCombinedClassifier(
        model_dir=model_dir,
        linear_feature_columns=crossed_columns,
        dnn_feature_columns=deep_columns,
        dnn_hidden_units=[100, 50])
  return m

train_and_eval

def train_and_eval(model_dir, model_type, train_steps, train_data, test_data):
  """Train and evaluate the model."""
  #train_file_name, test_file_name = maybe_download(train_data, test_data)
  train_file_name = './data/adult.data'
  test_file_name = './data/adult.test'
  #model_dir = tempfile.mkdtemp() if not model_dir else model_dir

  m = build_estimator(model_dir, model_type)
  # set num_epochs to None to get infinite stream of data.
  m.train(
      input_fn=input_fn(train_file_name, num_epochs=None, shuffle=True),
      steps=train_steps)
  # set steps to None to run evaluation until all data consumed.
  results = m.evaluate(
      input_fn=input_fn(test_file_name, num_epochs=1, shuffle=False),
      steps=None)
  print("model directory = %s" % model_dir)
  for key in sorted(results):
    print("%s: %s" % (key, results[key]))

main

model_dir = './model2/wide_deep'  # note: main() uses this hard-coded path, not the --model_dir flag
def main(_):
  train_and_eval(model_dir, FLAGS.model_type, FLAGS.train_steps,
                 FLAGS.train_data, FLAGS.test_data)

if __name__ == "__main__":
  parser = argparse.ArgumentParser()
  parser.register("type", "bool", lambda v: v.lower() == "true")
  parser.add_argument(
      "--model_dir",
      type=str,
      default="",
      help="Base directory for output models."
  )
  parser.add_argument(
      "--model_type",
      type=str,
      default="wide_n_deep",
      help="Valid model types: {'wide', 'deep', 'wide_n_deep'}."
  )
  parser.add_argument(
      "--train_steps",
      type=int,
      default=2000,
      help="Number of training steps."
  )
  parser.add_argument(
      "--train_data",
      type=str,
      default="",
      help="Path to the training data."
  )
  parser.add_argument(
      "--test_data",
      type=str,
      default="",
      help="Path to the test data."
  )
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
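
With the flags above, the script can be launched from the command line, for example (the file name is just a placeholder for whatever you saved the script as):

python wide_deep_census.py --model_type=wide_n_deep --train_steps=2000

Note that with the code as shown, --train_data and --test_data are ignored, because train_and_eval hard-codes ./data/adult.data and ./data/adult.test.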

Training on your own dataset

To train on your own data, the only thing that really needs to change is the input_fn. In my case the data lives in a Hive table stored on HDFS, one record per row in the format feature1,feature2,feature3,…,label. The platform is PySpark: spark.sql reads the Hive table and the result is converted to a pandas DataFrame. If your data is already a pandas DataFrame or a NumPy array, that works just as well. The hidden risk is that with tens of millions of rows and hundreds of features per sample, the conversion to pandas can run out of memory; the next step is to try reading the HDFS data directly.

def input_fn(num_epochs, shuffle):
    feature_list_sql = "select \
                base_sex,base_age,base_edu, \
                base_marry,base_profession,base_city_level, \
                browse_times_15_cate,browse_times_30_cate,browse_times_60_cate,browse_times_90_cate, \
                label \
                from X_X_feature_v1 \
                where id_ ='92' and train_flag='1'"
    feature_df = spark.sql(feature_list_sql)
    feature_df_pd = feature_df.limit(10000).toPandas()
    feature_df_pd['label'] = feature_df_pd['label'].astype("int")
    feature_df_pd['base_age'] = feature_df_pd['base_age'].astype("int")
    feature_df_pd = feature_df_pd.dropna(how="any", axis=0)
    labels = feature_df_pd['label']
    # Alternatively, pop the label so it is not also left inside the feature DataFrame:
    # labels = feature_df_pd.pop('label')
    print('******************************************')
    print(feature_df_pd.columns)
    print('******************************************')
    
    return tf.estimator.inputs.pandas_input_fn(
      x=feature_df_pd,
      y=labels,
      batch_size=100,
      num_epochs=num_epochs,
      shuffle=shuffle,
      num_threads=5)
    #labels = feature_df_pd["label"].apply(lambda x: ">50K" in x).astype(int)
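
For the "read HDFS directly" follow-up mentioned above, one possible direction is to skip the Spark-to-pandas conversion and point tf.data at the files behind the Hive table. This is only a sketch under assumptions: it assumes your TensorFlow build has HDFS support, that the table's underlying files are plain CSV text, and that a parse_line function (similar to parse_csv above, adapted to this schema) exists:

def hdfs_input_fn(file_pattern, num_epochs, shuffle, batch_size):
    # file_pattern is hypothetical, e.g. "hdfs://namenode:8020/warehouse/x_x_feature_v1/part-*"
    files = tf.data.Dataset.list_files(file_pattern)
    dataset = files.flat_map(tf.data.TextLineDataset)
    # parse_line: decode one CSV row into (features, label), analogous to parse_csv above.
    dataset = dataset.map(parse_line, num_parallel_calls=5)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()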

References:
https://blog.csdn.net/u013608336/article/details/78031788
https://www.jianshu.com/p/6868fc1f65d0
