AI Newbie Diary #12, ML Study Notes #8: Logistic Regression in Practice

Preface

This post continues the programming exercise from Section 3 of the previous post.

(screenshot: where the exercise sits in the course)

As in the earlier exercises, we will use the California housing dataset, but this time we will predict whether the housing cost of a given city block is high, turning the task into a binary classification problem.

Learning objectives: reframe the house-value predictor from the preceding exercises as a binary classification model, and compare the effectiveness of logistic regression versus linear regression for this kind of problem.

Course content

1 Framing the problem as binary classification

ps: The dataset's target is median_house_value, a numeric (continuous) feature. What we want now is to judge whether that price is high or low. As covered earlier, that means picking a threshold that tells us whether a given median_house_value counts as high or low: every house value above the threshold is labeled 1, everything else 0.
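As a quick sanity check (my own addition, not part of the exercise), once the DataFrame is loaded just below, you can see where the 265,000 threshold used later sits in the distribution. It is roughly the 75th percentile, so about a quarter of the blocks end up labeled 1:

# Where does the 265,000 threshold sit in the distribution?
print(california_housing_dataframe["median_house_value"].quantile(0.75))

# Fraction of examples that would be labeled 1 at this threshold.
print((california_housing_dataframe["median_house_value"] > 265000).mean())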

As before, we load the California housing data and randomize it.

from __future__ import print_function

import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

# If you hit an SSL certificate error, add the following:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
# Load the data.
california_housing_dataframe = pd.read_csv("https://download.mlcc.google.cn/mledu-datasets/california_housing_train.csv", sep=",")

# Randomize the row order.
california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))

Then define the features and labels.

def preprocess_features(california_housing_dataframe):
  """Prepares input features from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the features to be used for the model, including
    synthetic features.
  """
  selected_features = california_housing_dataframe[
    ["latitude",
     "longitude",
     "housing_median_age",
     "total_rooms",
     "total_bedrooms",
     "population",
     "households",
     "median_income"]]
  processed_features = selected_features.copy()
  # Create a synthetic feature.
  processed_features["rooms_per_person"] = (
    california_housing_dataframe["total_rooms"] /
    california_housing_dataframe["population"])
  return processed_features

def preprocess_targets(california_housing_dataframe):
  """Prepares target features (i.e., labels) from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the target feature.
  """
  output_targets = pd.DataFrame()
  # Create a boolean categorical feature representing whether the
  # median_house_value is above a set threshold.
  output_targets["median_house_value_is_high"] = (
    california_housing_dataframe["median_house_value"] > 265000).astype(float)
  return output_targets

ps: Note that something is different here. Previously the label y was defined as median_house_value, which made this a linear regression problem. Now the label y is defined as median_house_value_is_high, which turns it into a binary-classification logistic regression problem.

Then split the data into a training set and a validation set. (ps: there are 17,000 rows in total; 12,000 go to the training set and 5,000 to the validation set. Remember where the test set is? In an earlier post we fetched a separate test dataset at the end.)

# Choose the first 12000 (out of 17000) examples for training.
training_examples = preprocess_features(california_housing_dataframe.head(12000))
training_targets = preprocess_targets(california_housing_dataframe.head(12000))

# Choose the last 5000 (out of 17000) examples for validation.
validation_examples = preprocess_features(california_housing_dataframe.tail(5000))
validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))

# Double-check that we've done the right thing.
print("Training examples summary:")
display.display(training_examples.describe())
print("Validation examples summary:")
display.display(validation_examples.describe())

print("Training targets summary:")
display.display(training_targets.describe())
print("Validation targets summary:")
display.display(validation_targets.describe())

2 How would linear regression perform?

The author is clearly setting a trap here: tackling a logistic-regression task with the linear-regression approach. The code is almost the same as the linear-regression code from the earlier posts, so I won't paste it all; a rough sketch is below, followed by the run's output.
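For reference, here is a minimal sketch of what the omitted setup roughly looks like, pieced together from the earlier exercises in this series. construct_feature_columns and my_input_fn are the helper functions defined in the previous posts, and the hyperparameter values are purely illustrative:

# Minimal sketch of the omitted LinearRegressor setup (illustrative values).
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.00002)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=construct_feature_columns(training_examples),
    optimizer=my_optimizer)

# Train the regressor directly on the binary 0/1 target.
training_input_fn = lambda: my_input_fn(training_examples,
                                        training_targets["median_house_value_is_high"],
                                        batch_size=20)
linear_regressor.train(input_fn=training_input_fn, steps=500)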
Running it:
(screenshot: training and validation RMSE curves)
ps: No matter how I tuned the hyperparameters, the result was still mediocre; the error never converged.

2-1 Task 1: Can we compute LogLoss for these predictions?

Examine the predictions and decide whether or not they can be used to compute LogLoss.

LinearRegressor uses L2 loss, which does not do a good job of penalizing misclassifications when the output is interpreted as a probability. For example, there should be a big difference between a negative example being classified as positive with probability 0.9 versus 0.9999, yet L2 loss barely distinguishes the two cases.

By contrast, LogLoss penalizes such "confidence errors" much more heavily. Recall that LogLoss is defined as:

$$\text{LogLoss} = \sum_{(x,y)\in D} -y \cdot \log(y_{pred}) - (1 - y) \cdot \log(1 - y_{pred})$$
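A quick numeric check (my own illustration, not part of the exercise) makes the contrast concrete for the negative example above:

import numpy as np

# A negative example (true label y = 0) predicted positive with 0.9 vs 0.9999.
for y_pred in (0.9, 0.9999):
    l2 = (0 - y_pred) ** 2     # squared error barely changes: 0.81 -> ~1.0
    ll = -np.log(1 - y_pred)   # LogLoss term for y = 0 jumps: ~2.3 -> ~9.2
    print("y_pred=%.4f  L2=%.4f  LogLoss=%.2f" % (y_pred, l2, ll))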
But first we need the prediction values themselves, which we can obtain with LinearRegressor.predict.

Can we use these predictions and the corresponding targets to compute LogLoss?

ps: Oh boy, this one is hard. Am I supposed to implement the formula above?... Totally lost. I'm giving up and looking at the answer.

The official answer:

predict_validation_input_fn = lambda: my_input_fn(validation_examples, 
                                                  validation_targets["median_house_value_is_high"], 
                                                  num_epochs=1, 
                                                  shuffle=False)

validation_predictions = linear_regressor.predict(input_fn=predict_validation_input_fn)
validation_predictions = np.array([item['predictions'][0] for item in validation_predictions])

_ = plt.hist(validation_predictions)

This makes predictions with the linear regressor, converts the result to a NumPy array, and finally plots a histogram.
(screenshot: histogram of the validation predictions)
ps: emm... this answer is just grrreat (read: I can't follow it). That's it? Any reader who gets it, please teach me. On reflection, the point seems to be this: LinearRegressor's raw outputs are arbitrary real values, nothing constrains them to [0, 1], and LogLoss takes log(y_pred) and log(1 - y_pred), which is undefined for values outside (0, 1). The histogram lets us see whether the predictions can actually be read as probabilities.
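A quick check (my own addition, not in the official answer) quantifies this:

# What fraction of the regressor's outputs cannot be read as probabilities?
# LogLoss needs predictions strictly inside (0, 1).
out_of_range = np.mean((validation_predictions <= 0) | (validation_predictions >= 1))
print("Fraction of predictions outside (0, 1): %0.2f" % out_of_range)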

2-2 Task 2: Train a logistic regression model and compute LogLoss on the validation set

Using logistic regression is very simple: just replace LinearRegressor with LinearClassifier. Complete the code below.

Note: when running train() and predict() on a LinearClassifier model, you can access the real-valued predicted probabilities via the "probabilities" key in the returned dict, e.g. predictions["probabilities"]. Sklearn's log_loss function can then conveniently compute LogLoss from these probabilities.

Analysis: the lesson of this task is that for logistic regression we should use the matching model, LinearClassifier, and the results will naturally be much better. To compute LogLoss we need y and y_pred which, as covered in the previous post, are the actual sample labels and the corresponding predicted probabilities in [0, 1]. We can pull y_pred out of the results of LinearClassifier.predict(), and finally compute the loss with Sklearn's log_loss function.

def train_linear_classifier_model(
    learning_rate,
    steps,
    batch_size,
    training_examples,
    training_targets,
    validation_examples,
    validation_targets):
  """Trains a linear classification model.
  
  In addition to training, this function also prints training progress information,
  as well as a plot of the training and validation loss over time.
  
  Args:
    learning_rate: A `float`, the learning rate.
    steps: A non-zero `int`, the total number of training steps. A training step
      consists of a forward and backward pass using a single batch.
    batch_size: A non-zero `int`, the batch size.
    training_examples: A `DataFrame` containing one or more columns from
      `california_housing_dataframe` to use as input features for training.
    training_targets: A `DataFrame` containing exactly one column from
      `california_housing_dataframe` to use as target for training.
    validation_examples: A `DataFrame` containing one or more columns from
      `california_housing_dataframe` to use as input features for validation.
    validation_targets: A `DataFrame` containing exactly one column from
      `california_housing_dataframe` to use as target for validation.
      
  Returns:
    A `LinearClassifier` object trained on the training data.
  """

  periods = 10
  steps_per_period = steps / periods
  
  # Create a linear classifier object.
  my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
  my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)  
  linear_classifier = tf.estimator.LinearClassifier(
      feature_columns=construct_feature_columns(training_examples),
      optimizer=my_optimizer
  )
  
  # Create input functions.
  training_input_fn = lambda: my_input_fn(training_examples, 
                                          training_targets["median_house_value_is_high"], 
                                          batch_size=batch_size)
  predict_training_input_fn = lambda: my_input_fn(training_examples, 
                                                  training_targets["median_house_value_is_high"], 
                                                  num_epochs=1, 
                                                  shuffle=False)
  predict_validation_input_fn = lambda: my_input_fn(validation_examples, 
                                                    validation_targets["median_house_value_is_high"], 
                                                    num_epochs=1, 
                                                    shuffle=False)
  
  # Train the model, but do so inside a loop so that we can periodically assess
  # loss metrics.
  print("Training model...")
  print("LogLoss (on training data):")
  training_log_losses = []
  validation_log_losses = []
  for period in range(0, periods):
    # Train the model, starting from the prior state.
    linear_classifier.train(
        input_fn=training_input_fn,
        steps=steps_per_period
    )
    # Take a break and compute predictions.    
    training_probabilities = linear_classifier.predict(input_fn=predict_training_input_fn)
    training_probabilities = np.array([item['probabilities'] for item in training_probabilities])
    
    validation_probabilities = linear_classifier.predict(input_fn=predict_validation_input_fn)
    validation_probabilities = np.array([item['probabilities'] for item in validation_probabilities])
    
    training_log_loss = metrics.log_loss(training_targets, training_probabilities)
    validation_log_loss = metrics.log_loss(validation_targets, validation_probabilities)
    # Occasionally print the current loss.
    print("  period %02d : %0.2f" % (period, training_log_loss))
    # Add the loss metrics from this period to our list.
    training_log_losses.append(training_log_loss)
    validation_log_losses.append(validation_log_loss)
  print("Model training finished.")
  
  # Output a graph of loss metrics over periods.
  plt.ylabel("LogLoss")
  plt.xlabel("Periods")
  plt.title("LogLoss vs. Periods")
  plt.tight_layout()
  plt.plot(training_log_losses, label="training")
  plt.plot(validation_log_losses, label="validation")
  plt.legend()

  return linear_classifier


Note: compared with the earlier linear-regression code, a few things changed.
1) metrics.log_loss(training_targets, training_probabilities)
metrics was imported at the top and belongs to sklearn; calling its log_loss computes the loss.

The training_targets argument holds the sample labels, i.e. the median_house_value_is_high column we built earlier, which only takes the binary values 0 and 1; training_probabilities is extracted from the results returned by predict().
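Here is a tiny standalone example (my own, with made-up numbers) of how sklearn's log_loss consumes the two-column [P(class 0), P(class 1)] array that predict() yields here:

from sklearn import metrics

# Three examples with true labels 1, 0, 1 and per-class probabilities
# in the same shape as 'probabilities' above.
y_true = [1, 0, 1]
y_prob = [[0.2, 0.8],
          [0.7, 0.3],
          [0.1, 0.9]]
print(metrics.log_loss(y_true, y_prob))  # = -(ln 0.8 + ln 0.7 + ln 0.9) / 3, ~0.23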
Printing it as before:
(screenshot: the extracted probabilities array)
And the output of one run:
(screenshot: per-period LogLoss during training)
The final training metric is no longer the RMSE squared loss but LogLoss, the log loss:
(screenshot: "LogLoss vs. Periods" curves for training and validation)

2-3 Task 3: Compute accuracy and plot an ROC curve for the validation set

Some metrics that are very useful for classification are model accuracy, the ROC curve, and the area under the ROC curve (AUC). We will examine these metrics.

LinearClassifier.evaluate computes useful metrics such as accuracy and AUC.

evaluation_metrics = linear_classifier.evaluate(input_fn=predict_validation_input_fn)

print("AUC on the validation set: %0.2f" % evaluation_metrics['auc'])
print("Accuracy on the validation set: %0.2f" % evaluation_metrics['accuracy'])

Output:
(screenshot: AUC and accuracy on the validation set)
You can use the class probabilities (such as those computed by LinearClassifier.predict) together with Sklearn's roc_curve to obtain the true-positive and false-positive rates needed to plot an ROC curve.

validation_probabilities = linear_classifier.predict(input_fn=predict_validation_input_fn)
# Get just the probabilities for the positive class.
validation_probabilities = np.array([item['probabilities'][1] for item in validation_probabilities])

false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(
    validation_targets, validation_probabilities)
plt.plot(false_positive_rate, true_positive_rate, label="our model")
plt.plot([0, 1], [0, 1], label="random classifier")
_ = plt.legend(loc=2)

Output:
(screenshot: ROC curve of our model against the random-classifier diagonal)
See whether you can tune the learning settings of the model trained in Task 2 to improve the AUC.

Often, improving some metrics comes at the expense of others, so you will need to find settings that strike a good trade-off.

Check whether all of the metrics improve at the same time.

Analysis: this shows how to compute AUC and how to plot the TPR-vs-FPR curve, i.e. the ROC curve. We can then tune the model's training hyperparameters to improve AUC. If you read the link I gave in the AUC post, you know that the closer AUC is to 1, the better the model. Of course, when comparing different models you might not simply pick the one with the largest area, but since this is a single model, we just push its AUC as high as we can.
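As a cross-check (my own addition), the same two numbers can be computed straight from the probabilities with sklearn, independently of evaluate(). This assumes the validation_probabilities array of P(class 1) values from the roc_curve snippet above:

# Recompute evaluate()'s metrics directly from the class-1 probabilities.
labels = validation_targets["median_house_value_is_high"]
auc = metrics.roc_auc_score(labels, validation_probabilities)
accuracy = np.mean((validation_probabilities > 0.5) == labels)  # default 0.5 threshold
print("AUC: %0.2f  Accuracy: %0.2f" % (auc, accuracy))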

Let's look at the official result:

linear_classifier = train_linear_classifier_model(
    learning_rate=0.000003,
    steps=20000,
    batch_size=500,
    training_examples=training_examples,
    training_targets=training_targets,
    validation_examples=validation_examples,
    validation_targets=validation_targets)

evaluation_metrics = linear_classifier.evaluate(input_fn=predict_validation_input_fn)

print("AUC on the validation set: %0.2f" % evaluation_metrics['auc'])
print("Accuracy on the validation set: %0.2f" % evaluation_metrics['accuracy'])

(screenshot: AUC and accuracy on the validation set after the longer training run)
The official answer comes with an explanation: "One possible solution that works is to just train for longer, as long as we don't overfit. We can do this by increasing the number of steps and/or the batch size. All metrics improve at the same time, so our loss metric is a good proxy for both AUC and accuracy. Notice how it takes many, many more iterations just to squeeze out a little more AUC. This commonly happens, but even the small gain is often worth the cost."

Summary

This post walked through the logistic regression workflow: converting the task into a binary classification problem, computing the log loss with log_loss, and finally the ROC curve and AUC.
