Training on the Census Income Data Set
Training set
The training data is the Census Income Data Set, which contains roughly 48,000 samples. Attributes include age, occupation, education, and income; income is a binary label, either >50K or <=50K. The data splits into roughly 32,000 training samples and 16,000 test samples.
The attributes are as follows:
Field | Values | Description |
---|---|---|
age | continuous | The person's age |
fnlwgt | continuous | Final sampling weight: roughly, how many people in the population this record represents |
education-num | continuous | Highest education level in numeric form |
capital-gain | continuous | Recorded capital gains |
capital-loss | continuous | Recorded capital losses |
hours-per-week | continuous | Hours worked per week |
workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked | Type of employer (government, military, private, etc.) |
education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool | Highest education level |
marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse | Marital status |
occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces | Occupation |
relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried | Relationship within the household |
race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black | Race |
sex | Female, Male | Sex |
native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands | Country of origin |
income | >50K, <=50K | Whether annual income exceeds $50,000 |
The data files read for training are comma-separated; two files are downloaded:
adult.data: training set.
adult.test: test set.
In adult.data, the last field of each record is the label (<=50K or >50K).
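Each record in these files is one comma-separated line whose last field is the label. A minimal parsing sketch (the sample row below is illustrative, constructed to match the column schema above):

```python
# An illustrative row in the adult.data format: 14 features followed by the label.
row = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
       "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")

# Fields are comma-separated, with a stray space after each comma.
fields = [f.strip() for f in row.split(",")]
label = fields[-1]

print(len(fields))  # 15 columns: 14 features + 1 label
print(label)        # <=50K
```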
Features
The dataset contains raw features, some still in string form; they cannot be fed to a model directly and need feature processing.
For feature engineering on continuous and categorical data, see the Feature column section of 《DNNLinear组合分类器的使用 & Feature column》.
The code below, taken together, builds a complete model and runs training and evaluation; if you hit any problems, feel free to leave a comment. First, simple module imports and parameter setup:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import sys
import tempfile
import pandas as pd
from six.moves import urllib
import tensorflow as tf
# The raw data has no header row; these are the feature column names.
CSV_COLUMNS = [
"age", "workclass", "fnlwgt", "education", "education_num",
"marital_status", "occupation", "relationship", "race", "gender",
"capital_gain", "capital_loss", "hours_per_week", "native_country",
"income_bracket"
]
For the continuous numeric features above (age, education_num, etc.), use numeric_column directly:
# Continuous base columns.
age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")
Categorical features are handled as follows:
gender = tf.feature_column.categorical_column_with_vocabulary_list(
"gender", ["Female", "Male"])
education = tf.feature_column.categorical_column_with_vocabulary_list(
"education", [
"Bachelors", "HS-grad", "11th", "Masters", "9th",
"Some-college", "Assoc-acdm", "Assoc-voc", "7th-8th",
"Doctorate", "Prof-school", "5th-6th", "10th", "1st-4th",
"Preschool", "12th"
])
marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
"marital_status", [
"Married-civ-spouse", "Divorced", "Married-spouse-absent",
"Never-married", "Separated", "Married-AF-spouse", "Widowed"
])
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
"relationship", [
"Husband", "Not-in-family", "Wife", "Own-child", "Unmarried",
"Other-relative"
])
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
"workclass", [
"Self-emp-not-inc", "Private", "State-gov", "Federal-gov",
"Local-gov", "?", "Self-emp-inc", "Without-pay", "Never-worked"
])
# To show an example of hashing:
occupation = tf.feature_column.categorical_column_with_hash_bucket(
"occupation", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket(
"native_country", hash_bucket_size=1000)
# Transformations.
age_buckets = tf.feature_column.bucketized_column(
age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
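With these boundaries, bucketized_column splits age into 11 ranges: (-inf, 18), [18, 25), …, [65, +inf). A pure-Python sketch of the bucket-assignment rule:

```python
import bisect

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def age_bucket(age):
    # Bucket 0 is (-inf, 18); bucket i is [boundaries[i-1], boundaries[i]);
    # the last bucket is [65, +inf). Boundary values fall into the upper bucket.
    return bisect.bisect_right(boundaries, age)

print(age_bucket(17))  # 0  (below the first boundary)
print(age_bucket(18))  # 1  (boundary is inclusive on the left of its bucket)
print(age_bucket(70))  # 10 (at or above the last boundary)
```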
Set up the wide & deep feature columns:
# Wide columns and deep columns.
base_columns = [
gender, education, marital_status, relationship, workclass, occupation,
native_country, age_buckets,
]
crossed_columns = [
tf.feature_column.crossed_column(
["education", "occupation"], hash_bucket_size=1000),
tf.feature_column.crossed_column(
[age_buckets, "education", "occupation"], hash_bucket_size=1000),
tf.feature_column.crossed_column(
["native_country", "occupation"], hash_bucket_size=1000)
]
deep_columns = [
tf.feature_column.indicator_column(workclass),
tf.feature_column.indicator_column(education),
tf.feature_column.indicator_column(gender),
tf.feature_column.indicator_column(relationship),
# To show an example of embedding
tf.feature_column.embedding_column(native_country, dimension=8),
tf.feature_column.embedding_column(occupation, dimension=8),
age,
education_num,
capital_gain,
capital_loss,
hours_per_week,
]
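For intuition, a crossed column can be thought of as concatenating the crossed feature values and hashing the result into hash_bucket_size buckets. The sketch below uses a plain string join and Python's built-in hash as stand-ins for TensorFlow's internal fingerprinting, so the bucket ids will not match TF's:

```python
def cross_bucket(values, hash_bucket_size=1000):
    # Join the feature values into one key and hash it into a fixed
    # number of buckets, the same idea as tf.feature_column.crossed_column.
    return hash("_X_".join(values)) % hash_bucket_size

b = cross_bucket(["Bachelors", "Tech-support"])
print(b)  # some bucket id in [0, 1000)
```

The same pair of values always lands in the same bucket, but unrelated pairs can collide; a larger hash_bucket_size trades memory for fewer collisions.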
input_fn
input_fn is a particularly important piece of the DNNLinearCombined classifier. This post shows two input_fn approaches; I lean toward the first.
def input_fn(data_file, num_epochs, shuffle):
    """Input builder function."""
    df_data = pd.read_csv(
        tf.gfile.Open(data_file),
        names=CSV_COLUMNS,
        skipinitialspace=True,
        engine="python",
        skiprows=1)
    # remove NaN elements
    df_data = df_data.dropna(how="any", axis=0)
    labels = df_data["income_bracket"].apply(lambda x: ">50K" in x).astype(int)
    return tf.estimator.inputs.pandas_input_fn(
        x=df_data,
        y=labels,
        batch_size=100,
        num_epochs=num_epochs,
        shuffle=shuffle,
        num_threads=5)
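One detail in the label line above: in adult.test the labels carry a trailing period (">50K.", "<=50K."), which is why a substring test (">50K" in x) is used rather than an equality check. A quick pandas illustration:

```python
import pandas as pd

# Labels as they appear in adult.data vs. adult.test (trailing period).
s = pd.Series(["<=50K", ">50K", ">50K.", "<=50K."])
labels = s.apply(lambda x: ">50K" in x).astype(int)
print(labels.tolist())  # [0, 1, 1, 0]
```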
def input_fn(data_file, num_epochs, shuffle, batch_size):
    """Create an input function for the Estimator."""
    assert tf.gfile.Exists(data_file), "{0} not found.".format(data_file)

    def parse_csv(line):
        print("Parsing", data_file)
        # tf.decode_csv turns each CSV line into a list of Tensors, one per
        # column. record_defaults gives the fill value (and dtype) per column.
        columns = tf.decode_csv(line, record_defaults=_CSV_COLUMN_DEFAULTS)
        features = dict(zip(_CSV_COLUMNS, columns))
        labels = features.pop('income_bracket')
        # tf.equal(x, y) returns a bool Tensor, x == y element-wise
        return features, tf.equal(labels, '>50K')

    dataset = tf.data.TextLineDataset(data_file) \
        .map(parse_csv, num_parallel_calls=5)
    # DataFrame-to-tensor discussion: https://cloud.tencent.com/developer/ask/135418
    if shuffle:
        dataset = dataset.shuffle(
            buffer_size=_NUM_EXAMPLES['train'] + _NUM_EXAMPLES['validation'])
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    # Unlike the pandas version, this returns tensors directly, so wrap the
    # call in a lambda when passing it to Estimator.train:
    # input_fn=lambda: input_fn(...)
    return batch_features, batch_labels
model
def build_estimator(model_dir, model_type):
    """Build an estimator."""
    if model_type == "wide":
        m = tf.estimator.LinearClassifier(
            model_dir=model_dir, feature_columns=base_columns + crossed_columns)
    elif model_type == "deep":
        m = tf.estimator.DNNClassifier(
            model_dir=model_dir,
            feature_columns=deep_columns,
            hidden_units=[100, 50])
    else:
        m = tf.estimator.DNNLinearCombinedClassifier(
            model_dir=model_dir,
            linear_feature_columns=crossed_columns,
            dnn_feature_columns=deep_columns,
            dnn_hidden_units=[100, 50])
    return m
train_and_eval
def train_and_eval(model_dir, model_type, train_steps, train_data, test_data):
    """Train and evaluate the model."""
    # train_file_name, test_file_name = maybe_download(train_data, test_data)
    train_file_name = './data/adult.data'
    test_file_name = './data/adult.test'
    # model_dir = tempfile.mkdtemp() if not model_dir else model_dir
    m = build_estimator(model_dir, model_type)
    # Set num_epochs to None to get an infinite stream of data.
    m.train(
        input_fn=input_fn(train_file_name, num_epochs=None, shuffle=True),
        steps=train_steps)
    # Set steps to None to run evaluation until all data is consumed.
    results = m.evaluate(
        input_fn=input_fn(test_file_name, num_epochs=1, shuffle=False),
        steps=None)
    print("model directory = %s" % model_dir)
    for key in sorted(results):
        print("%s: %s" % (key, results[key]))
main
model_dir = './model2/wide_deep'  # note: this hard-coded path is used instead of the --model_dir flag below
def main(_):
    train_and_eval(model_dir, FLAGS.model_type, FLAGS.train_steps,
                   FLAGS.train_data, FLAGS.test_data)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.register("type", "bool", lambda v: v.lower() == "true")
    parser.add_argument(
        "--model_dir",
        type=str,
        default="",
        help="Base directory for output models."
    )
    parser.add_argument(
        "--model_type",
        type=str,
        default="wide_n_deep",
        help="Valid model types: {'wide', 'deep', 'wide_n_deep'}."
    )
    parser.add_argument(
        "--train_steps",
        type=int,
        default=2000,
        help="Number of training steps."
    )
    parser.add_argument(
        "--train_data",
        type=str,
        default="",
        help="Path to the training data."
    )
    parser.add_argument(
        "--test_data",
        type=str,
        default="",
        help="Path to the test data."
    )
    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
Training on your own dataset
To train on your own data, all that really needs to change is the input_fn. Here the data is a Hive table stored on HDFS, in the format feature1,feature2,feature3,…,label. The platform is PySpark: spark.sql reads the Hive table and the result is converted to a pandas DataFrame. If your data is already a pandas DataFrame or a NumPy array, that works just as well. The lurking risk: with tens of millions of rows and feature vectors of a hundred-plus dimensions, the conversion to pandas can run out of memory; a next step is to read the HDFS data directly.
def input_fn(num_epochs, shuffle):
    feature_list_sql = "select \
        base_sex,base_age,base_edu, \
        base_marry,base_profession,base_city_level, \
        browse_times_15_cate,browse_times_30_cate,browse_times_60_cate,browse_times_90_cate, \
        label \
        from X_X_feature_v1 \
        where id_ ='92' and train_flag='1'"
    feature_df = spark.sql(feature_list_sql)
    feature_df_pd = feature_df.limit(10000).toPandas()
    feature_df_pd['label'] = feature_df_pd['label'].astype("int")
    feature_df_pd['base_age'] = feature_df_pd['base_age'].astype("int")
    feature_df_pd = feature_df_pd.dropna(how="any", axis=0)
    labels = feature_df_pd['label']
    # labels = feature_df_pd.pop('label')
    # labels = feature_df_pd["label"].apply(lambda x: ">50K" in x).astype(int)
    print('******************************************')
    print(feature_df_pd.columns)
    print('******************************************')
    return tf.estimator.inputs.pandas_input_fn(
        x=feature_df_pd,
        y=labels,
        batch_size=100,
        num_epochs=num_epochs,
        shuffle=shuffle,
        num_threads=5)
References:
https://blog.csdn.net/u013608336/article/details/78031788
https://www.jianshu.com/p/6868fc1f65d0