Deep Learning Notes: Deep Learning Framework TensorFlow (6) [TensorFlow Linear Model Tutorial]

References:

  1. https://www.tensorflow.org/versions/r0.12/tutorials/wide/index.html#tensorflow-linear-model-tutorial

TensorFlow Linear Model Tutorial

In this tutorial, we will use the TF.Learn API in TensorFlow to solve a binary classification problem: Given census data about a person such as age, gender, education and occupation (the features), we will try to predict whether or not the person earns more than 50,000 dollars a year (the target label). We will train a logistic regression model, and given an individual’s information our model will output a number between 0 and 1, which can be interpreted as the probability that the individual has an annual income of over 50,000 dollars.

Reading The Census Data:

The dataset we’ll be using is the Census Income Dataset. You can download the training data and test data manually or use code like this:

import tensorflow as tf
import tempfile
import urllib.request
train_file = tempfile.NamedTemporaryFile()
test_file = tempfile.NamedTemporaryFile()
urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",train_file.name)
urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",test_file.name)

Once the CSV files are downloaded, let’s read them into Pandas dataframes.
(pandas is a Python data analysis package. Introduction: http://www.open-open.com/lib/view/open1402477162868.html, installation steps: http://www.cnblogs.com/lxmhhy/p/6029465.html, GitHub: https://github.com/pandas-dev)

import pandas as pd
COLUMNS = ["age","workclass","fnlwgt","education","education_num",
           "marital_status","occupation","relationship","race","gender",
           "capital_gain","capital_loss","hours_per_week","native_country",
           "income_bracket"]
df_train=pd.read_csv("adult.data",names=COLUMNS,skipinitialspace=True)
df_test = pd.read_csv("adult.test", names=COLUMNS, skipinitialspace=True, skiprows=1)

(I had already downloaded the data manually, so here I read adult.data (https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data) and adult.test (https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test) directly.)

Since the task is a binary classification problem, we’ll construct a label column named “label” whose value is 1 if the income is over 50K, and 0 otherwise.

LABEL_COLUMN = "label"
df_train[LABEL_COLUMN] = (df_train["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)
df_test[LABEL_COLUMN] = (df_test["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)

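(My understanding: lambda defines an anonymous function whose argument x is each value in the column. The expression ">50K" in x evaluates to True or False, and astype(int) converts that to an integer, so the label is 1 if the income is over 50K and 0 otherwise.)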
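As a small illustration of the labeling rule above, here is a sketch in plain Python on a few made-up income_bracket strings (the sample values and the helper name to_label are assumptions for illustration). The entries in adult.test end with a period, which is why a substring test is used rather than an equality test; the sketch also shows that the looser test "50K" in x would mark every row as positive.

```python
# Hypothetical sample values of the income_bracket column.
# Entries from adult.test carry a trailing period (e.g. ">50K.").
income_brackets = ["<=50K", ">50K", "<=50K.", ">50K."]


def to_label(income_bracket):
    """Return 1 if the bracket string denotes income above 50K, else 0."""
    return int(">50K" in income_bracket)


labels = [to_label(value) for value in income_brackets]

# A looser substring test matches both "<=50K" and ">50K".
loose_labels = [int("50K" in value) for value in income_brackets]

print(labels)        # [0, 1, 0, 1]
print(loose_labels)  # [1, 1, 1, 1]
```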

Next, let’s take a look at the dataframe and see which columns we can use to predict the target label. The columns can be grouped into two types—categorical and continuous columns:
1. A column is called categorical if its value can only be one of the categories in a finite set. For example, the native country of a person (U.S., India, Japan, etc.) or the education level (high school, college, etc.) are categorical columns.
2. A column is called continuous if its value can be any numerical value in a continuous range. For example, the capital gain of a person (e.g. $14,084) is a continuous column.

CATEGORICAL_COLUMNS = ["workclass","education","marital_status","occupation","relationship","race","gender","native_country"]
CONTINUOUS_COLUMNS = ["age","education_num","capital_gain","capital_loss","hours_per_week"]

Here’s a list of columns available in the Census Income dataset:
[Image not preserved: table listing the columns of the Census Income dataset]

Converting Data into Tensors:
When building a TF.Learn model, the input data is specified by means of an Input Builder function. This builder function will not be called until it is later passed to TF.Learn methods such as fit and evaluate. The purpose of this function is to construct the input data, which is represented in the form of Tensors or SparseTensors. In more detail, the Input Builder function returns the following as a pair:
1. feature_cols: A dict from feature column names to Tensors or SparseTensors.
2. label: A Tensor containing the label column.

The keys of the feature_cols will be used to construct columns in the next section. Because we want to call the fit and evaluate methods with different data, we define two different input builder functions, train_input_fn and test_input_fn which are identical except that they pass different data to input_fn. Note that input_fn will be called while constructing the TensorFlow graph, not while running the graph. What it is returning is a representation of the input data as the fundamental unit of TensorFlow computations, a Tensor (or SparseTensor).

Our model represents the input data as constant tensors, meaning that the tensor represents a constant value, in this case the values of a particular column of df_train or df_test. This is the simplest way to pass data into TensorFlow. Another more advanced way to represent input data would be to construct an Input Reader that represents a file or other data source, and iterates through the file as TensorFlow runs the graph. Each continuous column in the train or test dataframe will be converted into a Tensor, which in general is a good format to represent dense data. For categorical data, we must represent the data as a SparseTensor. This data format is good for representing sparse data.

import tensorflow as tf
def input_fn(df):
    #Creates a dictionary mapping from each continuous feature column name(k) to
    #the values of that column stored in a constant Tensor.
    continuous_cols = {k:tf.constant(df[k].values) for k in CONTINUOUS_COLUMNS}
    #Creates a dictionary mapping from each categorical feature column name(k) to the values of that column stored in a tf.SparseTensor.
    categorical_cols = {k:tf.SparseTensor(indices=[[i,0] for i in range(df[k].size)],values=df[k].values,dense_shape=[df[k].size,1]) for k in CATEGORICAL_COLUMNS}
    #Each categorical feature becomes a matrix with 1 column and 32561 rows (the number of people in the census data).
    #Because k iterates over CATEGORICAL_COLUMNS, categorical_cols ends up with eight entries, one per categorical feature.
    #Merges the two dictionaries into one.
    feature_cols = dict(continuous_cols)
    feature_cols.update(categorical_cols)
    #Converts the label column into a constant Tensor.
    label = tf.constant(df[LABEL_COLUMN].values)
    #Returns the feature columns and the label.
    return feature_cols,label

def train_input_fn():
    return input_fn(df_train)
def eval_input_fn():
    return input_fn(df_test)
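To make the SparseTensor arguments used in input_fn more concrete, the following sketch builds the same three pieces (indices, values, dense_shape) in plain Python for a made-up three-row column. The helper name to_sparse_components and the sample values are assumptions for illustration; no TensorFlow call is involved.

```python
def to_sparse_components(column_values):
    """Build the (indices, values, dense_shape) triple for one string column.

    Mirrors the arguments passed to tf.SparseTensor in input_fn: row i holds
    exactly one value, stored at position [i, 0] of an N x 1 matrix.
    """
    n = len(column_values)
    indices = [[i, 0] for i in range(n)]
    values = list(column_values)
    dense_shape = [n, 1]
    return indices, values, dense_shape


# Hypothetical three-row "education" column.
indices, values, dense_shape = to_sparse_components(
    ["Bachelors", "Masters", "Doctorate"])

print(indices)      # [[0, 0], [1, 0], [2, 0]]
print(values)       # ['Bachelors', 'Masters', 'Doctorate']
print(dense_shape)  # [3, 1]
```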

Selecting and Engineering Features for the Model:
Selecting and crafting the right set of feature columns is key to learning an effective model. A feature column can be either one of the raw columns in the original dataframe (let’s call them base feature columns), or any new columns created based on some transformations defined over one or multiple base columns (let’s call them derived feature columns). Basically, “feature column” is an abstract concept of any raw or derived variable that can be used to predict the target label.

Base Categorical Feature Columns:

To define a feature column for a categorical feature, we can create a SparseColumn using the TF.Learn API. If you know the set of all possible feature values of a column and there are only a few of them, you can use sparse_column_with_keys. Each key in the list will get assigned an auto-incremental ID starting from 0. For example, for the gender column we can assign the feature string “Female” to an integer ID of 0 and “Male” to 1 by doing:

gender = tf.contrib.layers.sparse_column_with_keys(column_name="gender", keys=["Female", "Male"])

What if we don’t know the set of possible values in advance? Not a problem. We can use sparse_column_with_hash_bucket instead:

education = tf.contrib.layers.sparse_column_with_hash_bucket("education",hash_bucket_size=1000)

What will happen is that each possible value in the feature column education will be hashed to an integer ID as we encounter them in training. See an example illustration below:

ID    Feature
9     “Bachelors”
103   “Doctorate”
375   “Masters”

No matter which way we choose to define a SparseColumn, each feature string will be mapped into an integer ID by looking up a fixed mapping or by hashing. Note that hashing collisions are possible, but may not significantly impact the model quality. Under the hood, the LinearModel class is responsible for managing the mapping and creating tf.Variable to store the model parameters (also known as model weights) for each feature ID. The model parameters will be learned through the model training process we’ll go through later.
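The idea of hashing a feature string into a fixed number of buckets can be sketched in plain Python. This is only an illustration: SHA-256 is an assumption chosen for reproducibility, not the hash function TensorFlow uses internally, and hash_bucket_id is a hypothetical helper name.

```python
import hashlib


def hash_bucket_id(feature_string, hash_bucket_size):
    """Map a string to an integer ID in the range [0, hash_bucket_size).

    SHA-256 is used purely for illustration; it is not the hash that
    sparse_column_with_hash_bucket applies internally.
    """
    digest = hashlib.sha256(feature_string.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


for value in ["Bachelors", "Doctorate", "Masters"]:
    # The same string always lands in the same bucket.
    print(value, hash_bucket_id(value, hash_bucket_size=1000))
```

With a small hash_bucket_size, distinct strings are more likely to share an ID, which is the collision case mentioned above.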

We’ll use a similar trick to define the other categorical features:

relationship = tf.contrib.layers.sparse_column_with_hash_bucket("relationship",hash_bucket_size = 1000)
workclass = tf.contrib.layers.sparse_column_with_hash_bucket("workclass",hash_bucket_size=100)
occupation = tf.contrib.layers.sparse_column_with_hash_bucket("occupation",hash_bucket_size = 1000)
native_country = tf.contrib.layers.sparse_column_with_hash_bucket("native_country",hash_bucket_size=1000)

Base Continuous Feature Columns:
Similarly, we can define a RealValuedColumn for each continuous feature column that we want to use in the model:

age = tf.contrib.layers.real_valued_column("age")
education_num = tf.contrib.layers.real_valued_column("education_num")
capital_gain = tf.contrib.layers.real_valued_column("capital_gain")
capital_loss = tf.contrib.layers.real_valued_column("capital_loss")
hours_per_week = tf.contrib.layers.real_valued_column("hours_per_week")

Making Continuous Features Categorical through Bucketization:

Sometimes the relationship between a continuous feature and the label is not linear. As a hypothetical example, a person’s income may grow with age in the early stage of one’s career, then the growth may slow at some point, and finally the income decreases after retirement. In this scenario, using the raw age as a real-valued feature column might not be a good choice because the model can only learn one of the three cases:

  1. Income always increases at some rate as age grows (positive correlation)
  2. Income always decreases at some rate as age grows (negative correlation), or
  3. Income stays the same no matter at what age (no correlation)


If we want to learn the fine-grained correlation between income and each age group separately, we can leverage bucketization. Bucketization is a process of dividing the entire range of a continuous feature into a set of consecutive bins/buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket that value falls into. So, we can define a bucketized_column over age as:

age_buckets = tf.contrib.layers.bucketized_column(age,boundaries = [18,25,30,35,40,45,50,55,60,65])

where boundaries is a list of bucket boundaries. In this case, there are 10 boundaries, resulting in 11 age group buckets (from age 17 and below, 18-24, 25-29, …, to 65 and over).
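The mapping from an age to a bucket ID can be sketched with the standard library. The helper name bucketize is an assumption for illustration; it treats each boundary as the inclusive lower edge of a bucket, which matches the groups listed above (17 and below, 18-24, 25-29, …, 65 and over).

```python
from bisect import bisect_right

AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def bucketize(value, boundaries):
    """Return the bucket ID, i.e. the number of boundaries <= value."""
    return bisect_right(boundaries, value)


ages = [17, 18, 24, 25, 64, 65, 90]
bucket_ids = [bucketize(age, AGE_BOUNDARIES) for age in ages]
print(bucket_ids)  # [0, 1, 1, 2, 9, 10, 10]
```

Ten boundaries give IDs 0 through 10, so there are 11 buckets in total.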

Intersecting Multiple Columns with CrossedColumn:
Using each base feature column separately may not be enough to explain the data. For example, the correlation between education and the label (earning > 50,000 dollars) may be different for different occupations. Therefore, if we only learn a single model weight for education=”Bachelors” and education=”Masters”, we won’t be able to capture every single education-occupation combination (e.g. distinguishing between education=”Bachelors” AND occupation=”Exec-managerial” and education=”Bachelors” AND occupation=”Craft-repair”). To learn the differences between different feature combinations, we can add crossed feature columns to the model.

education_x_occupation = tf.contrib.layers.crossed_column([education,occupation],hash_bucket_size=int(1e4))

We can also create a CrossedColumn over more than two columns. Each constituent column can be either a base feature column that is categorical (SparseColumn), a bucketized real-valued feature column (BucketizedColumn), or even another CrossedColumn. Here’s an example:

age_buckets_x_education_x_occupation = tf.contrib.layers.crossed_column([age_buckets,education,occupation],hash_bucket_size = int(1e6))
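Conceptually, a crossed feature treats each combination of values as its own categorical value and hashes it into a bucket. The sketch below shows that idea in plain Python; the separator "_X_", the SHA-256 hash, and the helper name crossed_bucket_id are assumptions for illustration and do not reproduce how crossed_column computes its IDs.

```python
import hashlib


def crossed_bucket_id(feature_values, hash_bucket_size):
    """Hash a combination of feature values to an ID in [0, hash_bucket_size).

    The combination is joined into one string so that, for example,
    ("Bachelors", "Exec-managerial") and ("Bachelors", "Craft-repair")
    become two different keys, each with its own model weight.
    """
    joined = "_X_".join(feature_values)
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


pairs = [
    ("Bachelors", "Exec-managerial"),
    ("Bachelors", "Craft-repair"),
    ("Masters", "Exec-managerial"),
]
for education_value, occupation_value in pairs:
    bucket = crossed_bucket_id([education_value, occupation_value], 10000)
    print(education_value, occupation_value, bucket)
```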

Defining The Logistic Regression Model:

After processing the input data and defining all the feature columns, we’re now ready to put them all together and build a Logistic Regression model. In the previous section we’ve seen several types of base and derived feature columns, including:
1. SparseColumn
2. RealValuedColumn
3. BucketizedColumn
4. CrossedColumn

All of these are subclasses of the abstract FeatureColumn class, and can be added to the feature_columns field of a model:

model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.LinearClassifier(feature_columns=[gender,native_country,education,occupation,workclass,marital_status,race,age_buckets,education_x_occupation,age_buckets_x_education_x_occupation],model_dir = model_dir)

The model also automatically learns a bias term, which controls the prediction one would make without observing any features (see the section “How Logistic Regression Works” for more explanations). The learned model files will be stored in model_dir.

Training and Evaluating Our Model
After adding all the features to the model, now let’s look at how to actually train the model. Training a model is just a one-liner using the TF.Learn API:

m.fit(input_fn = train_input_fn,steps = 200)

After the model is trained, we can evaluate how good our model is at predicting the labels of the holdout data:

results = m.evaluate(input_fn=eval_input_fn,steps=1)
for key in sorted(results):
    print("%s:%s"%(key,results[key]))

The first line of the output should be something like accuracy: 0.83557522, which means the accuracy is 83.6%. Feel free to try more features and transformations and see if you can do even better!
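For reference, accuracy is simply the fraction of examples whose predicted label equals the true label. A minimal sketch on made-up labels and predictions (the numbers below are hypothetical and unrelated to the census results):

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical labels and model predictions for six people.
true_labels = [1, 0, 0, 1, 0, 1]
predicted_labels = [1, 0, 1, 1, 0, 0]

# Four of the six predictions match.
score = accuracy(true_labels, predicted_labels)
print(score)
```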

If you’d like to see a working end-to-end example, you can download our example code (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/wide_n_deep_tutorial.py) and set the model_type flag to wide.

Adding Regularization to Prevent Overfitting:
Regularization is a technique used to avoid overfitting. Overfitting happens when your model does well on the data it is trained on, but worse on test data that the model has not seen before, such as live traffic. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observed training data. Regularization lets you control your model’s complexity and makes the model more generalizable to unseen data.

In the Linear Model library, you can add L1 and L2 regularizations to the model as:

m = tf.contrib.learn.LinearClassifier(feature_columns=[gender,native_country,education,occupation,workclass,marital_status,race,age_buckets,education_x_occupation,age_buckets_x_education_x_occupation],
    optimizer=tf.train.FtrlOptimizer(
        learning_rate=0.1,
        l1_regularization_strength=1.0,
        l2_regularization_strength=1.0),
    model_dir=model_dir)

One important difference between L1 and L2 regularization is that L1 regularization tends to make model weights stay at zero, creating sparser models, whereas L2 regularization also tries to make the model weights closer to zero but not necessarily zero. Therefore, if you increase the strength of L1 regularization, you will have a smaller model size because many of the model weights will be zero. This is often desirable when the feature space is very large but sparse, and when there are resource constraints that prevent you from serving a model that is too large.

In practice, you should try various combinations of L1 and L2 regularization strengths and find the parameters that best control overfitting and give you a desirable model size.
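To illustrate why L1 regularization yields exact zeros while L2 only shrinks weights, here is a small sketch of the two shrinkage rules applied to a made-up weight vector. These are the textbook proximal updates for an L1 penalty (soft-thresholding) and an L2 penalty (scaling), shown as a simplification; they are not the actual update performed by FtrlOptimizer, and the function names and numbers are assumptions for illustration.

```python
def l1_shrink(weight, strength):
    """Soft-thresholding: move the weight toward zero by `strength`,
    and set it to exactly 0 if its magnitude is at most `strength`."""
    if weight > strength:
        return weight - strength
    if weight < -strength:
        return weight + strength
    return 0.0


def l2_shrink(weight, strength):
    """Scale the weight toward zero; a nonzero weight stays nonzero."""
    return weight / (1.0 + strength)


# Hypothetical model weights.
weights = [2.5, -0.3, 0.05, -1.2]
l1_weights = [l1_shrink(w, 0.5) for w in weights]
l2_weights = [l2_shrink(w, 0.5) for w in weights]

print(l1_weights)  # the two small weights become exactly 0.0
print(l2_weights)  # every weight is smaller in magnitude, none is 0
```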

Reference sites:
1. A good blog post on the gradient descent algorithm: http://blog.csdn.net/xuelabizp/article/details/50878013
2. On why L1 and L2 regularization can prevent overfitting: http://blog.csdn.net/losteng/article/details/50942889, http://blog.csdn.net/jackie_zhu/article/details/52134592, http://www.2cto.com/kf/201609/545625.html

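Notes:
1. L1 regularization adds to the original loss function the regularization coefficient multiplied by the sum of the absolute values of the weights. L1 regularization can produce a sparse model, because it drives many weights to exactly 0.
2. L2 regularization: the fitting process generally favors weights that are as small as possible, ultimately producing a model in which all parameters are relatively small.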

How Logistic Regression Works:
Finally, let’s take a minute to talk about what the Logistic Regression model actually looks like in case you’re not already familiar with it. We’ll denote the label as Y, and the set of observed features as a feature vector x=[x1,x2,…,xd]. We define Y=1 if an individual earned > 50,000 dollars and Y=0 otherwise. In Logistic Regression, the probability of the label being positive (Y=1) given the features x is given as:
$$P(Y=1 \mid x) = \frac{1}{1+\exp(-(w^{T}x+b))}$$
where w=[w1,w2,…,wd] are the model weights for the features x=[x1,x2,…,xd], and b is a constant that is often called the bias of the model. The equation consists of two parts, a linear model and a logistic function:
Linear Model: First, we can see that wTx+b=b+w1x1+…+wdxd is a linear model where the output is a linear function of the input features x. The bias b is the prediction one would make without observing any features. The model weight wi reflects how the feature xi is correlated with the positive label. If xi is positively correlated with the positive label, the weight wi increases, and the probability P(Y=1|x) will be closer to 1. On the other hand, if xi is negatively correlated with the positive label, then the weight wi decreases and the probability P(Y=1|x) will be closer to 0.

Logistic Function: Second, we can see that there’s a logistic function (also known as the sigmoid function) S(t)=1/(1+exp⁡(−t)) being applied to the linear model. The logistic function is used to convert the output of the linear model wTx+b from any real number into the range of [0,1], which can be interpreted as a probability.
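The two parts described above, the linear model wTx+b followed by the logistic function S(t), can be written out directly. The weights, bias, and feature values below are made-up numbers for illustration, and predict_probability is a hypothetical helper name.

```python
import math


def sigmoid(t):
    """Logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def predict_probability(weights, bias, features):
    """Compute P(Y=1|x) = S(w^T x + b) for one example."""
    linear_output = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(linear_output)


# Hypothetical two-feature model: the linear output is about
# -0.2 + 0.8 * 1.0 + (-0.5) * 2.0 = -0.4, so the probability is below 0.5.
probability = predict_probability(weights=[0.8, -0.5], bias=-0.2,
                                  features=[1.0, 2.0])
print(probability)

# With all-zero features only the bias matters; a zero bias gives 0.5.
print(predict_probability([0.8, -0.5], 0.0, [0.0, 0.0]))  # 0.5
```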

Model training is an optimization problem: The goal is to find a set of model weights (i.e. model parameters) to minimize a loss function defined over the training data, such as logistic loss for Logistic Regression models. The loss function measures the discrepancy between the ground-truth label and the model’s prediction. If the prediction is very close to the ground-truth label, the loss value will be low; if the prediction is very far from the label, then the loss value would be high.
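As a concrete instance of such a loss function, the sketch below computes the standard logistic (cross-entropy) loss for a single example. The function name and the sample probabilities are assumptions for illustration; the source does not spell out the formula, so this is the usual textbook definition rather than a statement about the library’s internals.

```python
import math


def logistic_loss(label, probability):
    """Logistic loss for one example.

    label is the ground-truth label (0 or 1); probability is the model's
    predicted P(Y=1|x), assumed to lie strictly between 0 and 1.
    """
    return -(label * math.log(probability)
             + (1 - label) * math.log(1.0 - probability))


# A confident correct prediction gives a low loss; a confident wrong
# prediction gives a high loss.
close_loss = logistic_loss(1, 0.9)
far_loss = logistic_loss(1, 0.1)
print(close_loss, far_loss)
```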

Learn Deeper

If you’re interested in learning more, check out our Wide & Deep Learning Tutorial(https://www.tensorflow.org/versions/r0.12/tutorials/wide_and_deep/index.html) where we’ll show you how to combine the strengths of linear models and deep neural networks by jointly training them using the TF.Learn API.

The complete code is as follows:

import tensorflow as tf
import pandas as pd
import tempfile
COLUMNS = ["age","workclass","fnlwgt","education","education_num",
           "marital_status","occupation","relationship","race","gender",
           "capital_gain","capital_loss","hours_per_week","native_country",
           "income_bracket"]
df_train=pd.read_csv("adult.data",names=COLUMNS,skipinitialspace=True)
df_test = pd.read_csv("adult.test", names=COLUMNS, skipinitialspace=True, skiprows=1)
LABEL_COLUMN = "label"
df_train[LABEL_COLUMN] = (df_train["income_bracket"].apply(lambda x:">50K" in x)).astype(int)
df_test[LABEL_COLUMN] = (df_test["income_bracket"].apply(lambda x :">50K" in x)).astype(int)
CATEGORICAL_COLUMNS = ["workclass","education","marital_status","occupation","relationship","race",
                       "gender","native_country"]
CONTINUOUS_COLUMNS = ["age","education_num","capital_gain","capital_loss","hours_per_week"]

def input_fn(df):
    #Creates a dictionary mapping from each continuous feature column name (k) to
    #the values of that column stored in a constant Tensor.
    continuous_cols = {
        k:tf.constant(df[k].values)
        for k in CONTINUOUS_COLUMNS
    }
    #Creates a dictionary mapping from each categorical feature column name (k)
    #to the values of that column stored in a tf.SparseTensor
    categorical_cols = {
        k: tf.SparseTensor(
            indices=[[i, 0] for i in range(df[k].size)],
            values=df[k].values,
            dense_shape=[df[k].size, 1])
                      for k in CATEGORICAL_COLUMNS}
    #Merges the two dictionaries into one.
    feature_cols = dict(continuous_cols)
    feature_cols.update(categorical_cols)
    #Convert the label column into constant Tensor
    label = tf.constant(df[LABEL_COLUMN].values)
    #Returns the feature columns and the label
    return feature_cols,label

def train_input_fn():
    return input_fn(df_train)
def eval_input_fn():
    return input_fn(df_test)

gender = tf.contrib.layers.sparse_column_with_keys(
    column_name="gender",keys=["Female","Male"])
marital_status = tf.contrib.layers.sparse_column_with_hash_bucket("marital_status",hash_bucket_size=1000)
race = tf.contrib.layers.sparse_column_with_keys(column_name="race",keys=["Asian-Pac-Islander","White","Amer-Indian-Eskimo","Other","Black"])
education = tf.contrib.layers.sparse_column_with_hash_bucket("education",hash_bucket_size=1000)
relationship = tf.contrib.layers.sparse_column_with_hash_bucket("relationship",hash_bucket_size = 1000)
workclass = tf.contrib.layers.sparse_column_with_hash_bucket("workclass",hash_bucket_size=100)
occupation = tf.contrib.layers.sparse_column_with_hash_bucket("occupation",hash_bucket_size = 1000)
native_country = tf.contrib.layers.sparse_column_with_hash_bucket("native_country",hash_bucket_size=1000)
age = tf.contrib.layers.real_valued_column("age")
education_num = tf.contrib.layers.real_valued_column("education_num")
capital_gain = tf.contrib.layers.real_valued_column("capital_gain")
capital_loss = tf.contrib.layers.real_valued_column("capital_loss")
hours_per_week = tf.contrib.layers.real_valued_column("hours_per_week")
age_buckets = tf.contrib.layers.bucketized_column(age,boundaries=[18,25,30,35,40,45,50,55,60,65])
education_x_occupation = tf.contrib.layers.crossed_column([education,occupation],hash_bucket_size=int(1e4))
age_buckets_x_education_x_occupation = tf.contrib.layers.crossed_column([age_buckets,education,occupation],hash_bucket_size=int(1e6))
model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.LinearClassifier(
    feature_columns=[
        gender,native_country,education,occupation,workclass,
        marital_status,race,age_buckets,education_x_occupation,
        age_buckets_x_education_x_occupation],
        model_dir=model_dir
)
m.fit(input_fn=train_input_fn, steps=200)
results = m.evaluate(input_fn=eval_input_fn,steps=1)
for key in sorted(results):
    print("%s:%s"%(key,results[key]))


