Because the features haven't been preprocessed (the preprocessing methods were covered in earlier posts, and the original exercise skips them too), the model's results are not great. Here we only walk through setting up the framework. I won't revisit the linear-regression part of the exercise, since it differs by only a line or two of code.
We first split the data into a training set and a validation set, and create a new column called median_house_value_high, derived from median_house_value: rows where median_house_value is above the 75th percentile are labeled 1, the rest 0. Our focus here is the logistic-regression model.
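Rather than hard-coding a cutoff, the 75th percentile can be computed directly with pandas' quantile. A small self-contained sketch (the values below are toy data, not the real dataset):

```python
import pandas as pd

# Toy frame standing in for california_housing_train.csv (illustrative values)
df = pd.DataFrame({'median_house_value': [100000, 150000, 200000, 250000,
                                          300000, 350000, 400000, 450000]})

# Threshold at the 75th percentile of median_house_value
threshold = df['median_house_value'].quantile(0.75)

# Label is 1 above the threshold, 0 otherwise
df['median_house_value_high'] = (df['median_house_value'] > threshold).astype('float32')

print(threshold)                                  # 362500.0 for this toy data
print(df['median_house_value_high'].tolist())     # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
```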
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.data import Dataset
import matplotlib.pyplot as plt
df = pd.read_csv('california_housing_train.csv')
# 265000 is used as an approximation of the 75th percentile of median_house_value
df['median_house_value_high'] = (df['median_house_value'] > 265000).astype('float32')
df['rooms_per_person'] = df['total_rooms'] / df['population']
df = df.reindex(np.random.permutation(df.index))  # shuffle the rows before splitting
df_features = df.drop(['median_house_value_high', 'median_house_value'], axis=1, inplace=False).copy()
df_targets = df['median_house_value_high'].copy()
training_features = df_features.head(12000).astype('float32')
training_targets = df_targets.head(12000).astype('float32')
validation_features = df_features.tail(5000).astype('float32')
validation_targets = df_targets.tail(5000).astype('float32')
As before, we split the data into training and validation sets. Since the targets are derived from median_house_value, we no longer use median_house_value as an input feature.
Next, as usual, we use a Dataset to turn the DataFrame into tensors:
def my_input_fn(features, targets, batch_size=1, num_epochs=1, shuffle=False):
    features = {key: np.array(value) for key, value in dict(features).items()}
    ds = Dataset.from_tensor_slices((features, targets))
    ds = ds.batch(batch_size).repeat(num_epochs)
    if shuffle:
        ds = ds.shuffle(10000)  # reassign: shuffle returns a new dataset
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
As mentioned in an earlier post, logistic regression depends heavily on regularization. To guard against overfitting, don't forget to add L2 regularization when defining each layer.
def add_layer(inputs, input_size, output_size, activation_function=None):
    weights = tf.Variable(tf.random_normal([input_size, output_size], stddev=0.1))
    # Collect the L2 penalty for this layer's weights into the 'losses' collection
    tf.add_to_collection('losses', tf.contrib.layers.l2_regularizer(.1)(weights))
    biases = tf.Variable(tf.zeros(output_size) + 0.1)
    wx_b = tf.matmul(inputs, weights) + biases
    if activation_function is None:
        outputs = wx_b
    else:
        outputs = activation_function(wx_b)
    return weights, biases, outputs
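For reference, the quantity tf.contrib.layers.l2_regularizer(0.1)(weights) adds to the collection is 0.1 times tf.nn.l2_loss(weights), which is half the sum of squared weights. A numpy sketch of the same computation:

```python
import numpy as np

def l2_penalty(weights, scale=0.1):
    # Same quantity tf.contrib.layers.l2_regularizer(scale) produces:
    # scale * tf.nn.l2_loss(w), where l2_loss is half the sum of squares
    return scale * np.sum(np.square(weights)) / 2.0

w = np.array([[1.0, -2.0], [0.5, 0.0]])
print(l2_penalty(w))  # 0.1 * (1 + 4 + 0.25) / 2 = 0.2625
```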
We covered the log loss in an earlier post; we use it again here, remembering to add in the regularization terms collected above.
def log_loss(pred, ys):
    # Clip predictions away from exact 0 and 1 so tf.log never produces NaN/inf
    pred = tf.clip_by_value(pred, 1e-7, 1 - 1e-7)
    logloss = tf.reduce_mean(-ys * tf.log(pred) - (1 - ys) * tf.log(1 - pred))
    loss = logloss + tf.add_n(tf.get_collection('losses'))
    return loss
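Without the regularization term, the expression above is the usual binary cross-entropy. A small numpy check of the formula on hand-picked values:

```python
import numpy as np

def log_loss_np(pred, ys):
    # Binary cross-entropy, same form as the TF expression above
    return np.mean(-ys * np.log(pred) - (1 - ys) * np.log(1 - pred))

ys = np.array([1.0, 0.0, 1.0, 0.0])
pred = np.array([0.9, 0.1, 0.8, 0.2])
print(log_loss_np(pred, ys))  # mean of -log(0.9), -log(0.9), -log(0.8), -log(0.8)
```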
Next we need to define the training step; I again use Adam.
def train_step(learning_rate, loss):
    train = tf.train.AdamOptimizer(learning_rate).minimize(loss)
    return train
Since my_input_fn returns the features as a dict, and calling expand_dims on them one by one last time was tedious, this time we write a helper, expand_dim:
def expand_dim(_dict):
    for key in _dict:
        _dict[key] = tf.expand_dims(_dict[key], -1)  # (batch,) -> (batch, 1)
    return _dict
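What expand_dims does here: each feature tensor comes out of the dataset with shape (batch,), while the matmul in each layer expects a column of shape (batch, 1). A numpy illustration of the same reshaping:

```python
import numpy as np

batch = np.array([1.0, 2.0, 3.0])   # shape (3,), as yielded per feature
column = np.expand_dims(batch, -1)  # shape (3, 1), ready to concat and matmul
print(column.shape)  # (3, 1)
```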
The original exercise asks us to plot the ROC curve, so let's define that. Here the argument biases is the classification threshold. I've looked it over carefully; hopefully there are no mistakes.
def roc(pred, targets, biases):
    if len(pred) != len(targets):
        raise Exception('Prediction and target lengths differ')
    TP = 0
    TN = 0
    FP = 0
    FN = 0
    for i in range(len(pred)):
        if pred[i] > biases and targets[i] == 1:
            TP += 1
        elif pred[i] > biases and targets[i] == 0:
            FP += 1
        elif targets[i] == 1:  # pred <= threshold: a prediction exactly at the threshold counts as negative
            FN += 1
        else:
            TN += 1
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    TPR = TP / (TP + FN)
    FPR = FP / (FP + TN)
    return accuracy, TPR, FPR
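As a sanity check, the same confusion-matrix counts can be done with numpy on a toy batch small enough to count by hand:

```python
import numpy as np

pred = np.array([0.9, 0.8, 0.4, 0.3, 0.2])
targets = np.array([1, 0, 1, 0, 0])
thr = 0.5

TP = np.sum((pred > thr) & (targets == 1))   # 1 (the 0.9)
FP = np.sum((pred > thr) & (targets == 0))   # 1 (the 0.8)
FN = np.sum((pred <= thr) & (targets == 1))  # 1 (the 0.4)
TN = np.sum((pred <= thr) & (targets == 0))  # 2

print((TP + TN) / len(pred))  # accuracy = 0.6
print(TP / (TP + FN))         # TPR = 0.5
print(FP / (FP + TN))         # FPR = 1/3
```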
Now build the network the usual way: the final layer uses a sigmoid activation; which activations to use elsewhere, and how many hidden layers to add, is up to you. Using tanh in the first layer here is actually unwise because of gradient saturation, but since we haven't preprocessed the data, we'll make do for now.
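The saturation concern is easy to see from tanh's derivative, 1 - tanh(x)²: with unscaled inputs (raw populations and room counts in the thousands), pre-activations are large and the gradient is effectively zero, so the first layer barely learns:

```python
import numpy as np

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

print(tanh_grad(0.0))   # 1.0 at the origin: full gradient
print(tanh_grad(5.0))   # ~1.8e-4: already saturated
print(tanh_grad(20.0))  # ~0: gradient has vanished entirely
```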
xs, ys = my_input_fn(training_features, training_targets, batch_size=200, num_epochs=100)
# The validation iterator is drawn from repeatedly below, hence the large num_epochs
xv, yv = my_input_fn(validation_features, validation_targets, batch_size=5000, num_epochs=1100)
xs = expand_dim(xs)
xv = expand_dim(xv)
# Both dicts come from the same DataFrame columns, so .values() yields the
# nine features in the same order for training and validation
xs1, xs2, xs3, xs4, xs5, xs6, xs7, xs8, xs9 = xs.values()
_inputs = tf.concat([xs1, xs2, xs3, xs4, xs5, xs6, xs7, xs8, xs9], -1)
xv1, xv2, xv3, xv4, xv5, xv6, xv7, xv8, xv9 = xv.values()
xv_inputs = tf.concat([xv1, xv2, xv3, xv4, xv5, xv6, xv7, xv8, xv9], -1)
w1, b1, l1 = add_layer(_inputs, 9, 100, activation_function=tf.nn.tanh)
w2, b2, l2 = add_layer(l1, 100, 40, activation_function=None)
w3, b3, pred = add_layer(l2, 40, 1, activation_function=tf.nn.sigmoid)
loss = log_loss(pred, ys)
train = train_step(0.0001, loss)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
Train as usual, printing the loss as we go. If you remember the previous post, a good model's prediction bias should be close to 0 (though the converse does not hold). We print the prediction bias here as well.
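Prediction bias here means the mean prediction minus the mean label; for a well-calibrated model it should sit near zero. A toy numpy version of the same quantity computed in the loop below:

```python
import numpy as np

pred = np.array([0.30, 0.20, 0.25, 0.25])    # model's predicted probabilities
targets = np.array([0.0, 0.0, 1.0, 0.0])     # true labels (mean 0.25)

# Prediction bias: mean prediction minus mean label; ~0 for a calibrated model
bias = np.mean(pred) - np.mean(targets)
print(bias)  # 0.25 - 0.25 = 0.0
```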
# Build the validation ops once, outside the loop; rebuilding them every
# 50 steps would keep adding new nodes to the graph
v_l1 = tf.nn.tanh(tf.matmul(xv_inputs, w1) + b1)
v_l2 = tf.matmul(v_l1, w2) + b2
v_pred = tf.nn.sigmoid(tf.matmul(v_l2, w3) + b3)
v_loss = log_loss(v_pred, yv)
projection_bias = tf.reduce_mean(v_pred - yv)
for i in range(5000):
    sess.run(train)
    if i % 50 == 0:
        print(sess.run([v_loss, projection_bias]))
Finally, we plot the ROC curve (I'll skip computing AUC here; if you want to treat this model seriously, do the feature engineering first):
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
plt.ion()
plt.show()
ax.set_xlabel('FPR')
ax.set_ylabel('TPR')
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
# Run the validation predictions once; only the threshold changes in the sweep,
# so there's no need to rebuild or rerun the graph on every iteration
v_l1 = tf.nn.tanh(tf.matmul(xv_inputs, w1) + b1)
v_l2 = tf.matmul(v_l1, w2) + b2
v_pred = tf.nn.sigmoid(tf.matmul(v_l2, w3) + b3)
pred_roc, targets_roc = sess.run([v_pred, yv])
pred_roc = pred_roc.flatten()  # (5000, 1) -> (5000,) so roc compares scalars
for i in np.arange(0., 1., 0.001):
    accuracy, tpr, fpr = roc(pred_roc, targets_roc, i)
    print('accuracy:', accuracy, 'biases:', i)
    ax.scatter(fpr, tpr)
    plt.pause(0.1)
Here is the figure I got:
Our dataset isn't large, and the classes really are imbalanced. Once the threshold (biases) reaches about 0.3, the accuracy barely changes: the model is essentially chasing true negatives. We have plenty more programming exercises ahead, and we can revisit the remaining issues then. (Incidentally, scatter's default coloring is not a single color, which can be changed with the c parameter. If you want an outline around each point, use the linewidths parameter. scatter has plenty of parameters, and a quick search turns them up, so I won't belabor them here.)
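The imbalance explains the plateau: with a 75th-percentile cutoff, roughly 75% of the labels are 0, so a model that always predicts the negative class already scores about 0.75 accuracy. Any useful classifier has to beat this baseline:

```python
import numpy as np

# 75/25 label split, as produced by the 75th-percentile cutoff
targets = np.array([0] * 75 + [1] * 25)
all_negative = np.zeros_like(targets)  # a "model" that always predicts 0

accuracy = np.mean(all_negative == targets)
print(accuracy)  # 0.75: the baseline from predicting only true negatives
```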