TensorFlow in Practice
First, a quick word on what TensorFlow is: an open-source machine-learning library released by Google.
We will call the TensorFlow API from Python to carry out a series of machine-learning operations.
In the first post I covered how to install Anaconda; here we will write our code in its bundled Spyder IDE:
The interface (which you can rearrange) is similar to MATLAB, with a file-editing pane and a console pane, which makes debugging convenient:
Now let's write our first piece of code, in the following steps:
1. Load the libraries
2. Load the data
3. Set up the features and the label
4. Configure the LinearRegressor
5. Define the input function
6. Train the model
7. Evaluate the model
Loading the libraries && basic settings
Load the libraries; for what each one does, see its own documentation.
See also the TensorFlow library documentation.
import math
from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset
Next come the basic settings, mainly the logging verbosity and pandas display options:
# Set the logging verbosity to errors only.
# Tutorial: http://lib.csdn.net/article/aiframework/61081
tf.logging.set_verbosity(tf.logging.ERROR)
#display 10 rows at most
pd.options.display.max_rows = 10
#set display format
pd.options.display.float_format = '{:.1f}'.format
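As a quick illustration of what the two pandas options do (this small demo is my own addition, not part of the original exercise):

```python
import pandas as pd

pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

# A tiny frame to show the float format in action.
df = pd.DataFrame({"value": [3.14159, 2.71828]})
print(df)  # floats are rendered with one decimal place: 3.1 and 2.7
```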
Loading the data
#load data
california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")
Randomize the data so that no pathological ordering affects training:
# Randomly reorder the rows.
california_housing_dataframe = california_housing_dataframe.reindex(np.random.permutation(california_housing_dataframe.index))
# Scale median_house_value to units of thousands.
california_housing_dataframe["median_house_value"] /= 1000.0
Check the data: look at simple summary statistics for each column and investigate anything anomalous:
print(california_housing_dataframe.describe())
Setting the features and the label
For this data we need to choose the feature(s) and the label for this round of learning; to start, we will use total_rooms to predict median_house_value.
# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]
# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]
# Define the label.
targets = california_housing_dataframe["median_house_value"]
Configuring the LinearRegressor
We will configure a linear regression model with LinearRegressor and train it with GradientDescentOptimizer, which implements mini-batch stochastic gradient descent (SGD).
# Define and tune learning_rate to optimize; 0.0000001 is just a starting point.
learning_rate = 0.0000001
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
# Clip gradients to a maximum norm of 5 to keep training stable.
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
linear_regressor = tf.estimator.LinearRegressor(feature_columns=feature_columns,optimizer=my_optimizer)
The input function
With the model defined, we need to feed it data, which requires an input (preprocessing) function that does three things:
1. Convert the data imported from the CSV into a dict of NumPy arrays.
2. Split the data into batches of batch_size, repeated for the specified number of epochs (num_epochs), i.e. the data is learned from multiple times.
3. Build an iterator over the dataset; the trainer calls it to get the next batch of data.
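To make the batching/epoch idea concrete without invoking TensorFlow, here is a plain-Python sketch (my own illustration, not how Dataset is actually implemented) of what splitting 4 examples into batches of 2 over 2 epochs produces:

```python
def iterate_batches(data, batch_size, num_epochs):
    """Yield successive batches of `data`, repeating for num_epochs."""
    for _ in range(num_epochs):
        for i in range(0, len(data), batch_size):
            yield data[i:i + batch_size]

batches = list(iterate_batches([10, 20, 30, 40], batch_size=2, num_epochs=2))
print(batches)  # [[10, 20], [30, 40], [10, 20], [30, 40]]
```

Each epoch walks through the whole dataset once, batch by batch; `num_epochs=None` in the real input function simply means repeat forever.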
def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Trains a linear regression model of one feature.

    Args:
      features: pandas DataFrame of features
      targets: pandas DataFrame of targets
      batch_size: Size of batches to be passed to the model
      shuffle: True or False. Whether to shuffle the data.
      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
    Returns:
      Tuple of (features, labels) for the next data batch
    """
    # Convert pandas data into a dict of np arrays.
    features = {key: np.array(value) for key, value in dict(features).items()}

    # Construct a dataset, and configure batching/repeating.
    ds = Dataset.from_tensor_slices((features, targets))  # warning: 2GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)

    # Shuffle the data, if specified.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)

    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
A word on step 1, converting to a dict of arrays: data imported through pandas arrives as a pandas DataFrame.
A DataFrame is much like a data frame in R: it has an index column, and the data is laid out in columns, one per feature.
What my_input_fn uses instead is a dict of arrays: the content is identical, but each feature name becomes a key mapping to a NumPy array of that column's values.
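Here is that conversion in isolation, on a tiny made-up frame (the two feature columns are illustrative, not from the real dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"total_rooms": [5612.0, 7650.0], "population": [1015.0, 1129.0]})

# The same dict comprehension as in my_input_fn: each column becomes
# a key mapping to a NumPy array of that column's values.
features = {key: np.array(value) for key, value in dict(df).items()}

print(sorted(features.keys()))  # ['population', 'total_rooms']
print(features["total_rooms"])  # [5612. 7650.]
```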
Training the model
This step is simple; just call the train function:
_ = linear_regressor.train(input_fn=lambda: my_input_fn(my_feature, targets), steps=100)
The lambda wrapper is there so we can pass our arguments along: train() expects an input_fn that takes no arguments, and the lambda binds my_feature and targets for us.
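Here is that lambda trick in isolation (the toy functions below are my own stand-ins, not the real TensorFlow ones): the trainer only knows how to call a zero-argument function, and the lambda freezes our arguments into one.

```python
def my_input_fn(features, targets):
    return features, targets

def train(input_fn):
    # The trainer only knows how to call input_fn with no arguments.
    return input_fn()

# The lambda captures `features` and `targets`, so train() can call it argument-free.
features, targets = [1, 2, 3], [10, 20, 30]
result = train(lambda: my_input_fn(features, targets))
print(result)  # ([1, 2, 3], [10, 20, 30])
```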
Evaluating the model
OK, the machine-learning part is done; now we can make predictions and evaluate the model.
Just as in training, the data we predict on needs an input function:
# Since we're making just one prediction for each example, we don't need to repeat or shuffle the data here.
prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)
Make the predictions:
# Call predict() on the linear_regressor to make predictions.
predictions = linear_regressor.predict(input_fn=prediction_input_fn)
Next we analyze how good the predictions are. For the features we predicted on, we have both the actual labels and the predicted values; the mean squared error between them summarizes the prediction quality well:
# Format predictions as a NumPy array, so we can calculate error metrics.
predictions = np.array([item['predictions'][0] for item in predictions])
# Print Mean Squared Error and Root Mean Squared Error.
mean_squared_error = metrics.mean_squared_error(predictions, targets)
root_mean_squared_error = math.sqrt(mean_squared_error)
print("Mean Squared Error (on training data): %0.3f" % mean_squared_error)
print("Root Mean Squared Error (on training data): %0.3f" % root_mean_squared_error)
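Whether an RMSE is "good" depends on the scale of the labels. A useful sanity check (this comparison is my own addition) is to compare it with the spread of the targets: an RMSE close to the full min-max range means the model explains very little.

```python
import numpy as np

targets = np.array([66.9, 80.1, 500.0, 112.5, 342.2])  # made-up label values, in thousands
rmse = 150.0                                            # a hypothetical RMSE

min_value, max_value = targets.min(), targets.max()
value_range = max_value - min_value

print("Min. target: %0.1f" % min_value)     # 66.9
print("Max. target: %0.1f" % max_value)     # 500.0
print("Target range: %0.1f" % value_range)  # 433.1
print("RMSE / range: %0.2f" % (rmse / value_range))
```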
If you're in the mood, you can also take a sample of the data and plot the fitted regression line against a scatter of the actual values, to see the error directly:
# Draw a random sample of the data to plot.
sample = california_housing_dataframe.sample(n=300)
# Get the min and max total_rooms values.
x_0 = sample["total_rooms"].min()
x_1 = sample["total_rooms"].max()
# Retrieve the final weight and bias generated during training.
weight = linear_regressor.get_variable_value('linear/linear_model/total_rooms/weights')[0]
bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')
# Get the predicted median_house_values for the min and max total_rooms values.
y_0 = weight * x_0 + bias
y_1 = weight * x_1 + bias
# Plot our regression line from (x_0, y_0) to (x_1, y_1).
plt.plot([x_0, x_1], [y_0, y_1], c='r')
# Label the graph axes.
plt.ylabel("median_house_value")
plt.xlabel("total_rooms")
# Plot a scatter plot from our data sample.
plt.scatter(sample["total_rooms"], sample["median_house_value"])
# Display graph.
plt.show()
I would have liked to include the plot here, but Google's CSV suddenly became unreachable, so I'll leave it out.
Exercise
We just did a lot of work to build and evaluate a model; can you wrap it all into a single function?
Make learning_rate a tunable parameter, and report progress as the model runs (hint: split the steps into steps_per_period, then compute and print the loss in each period):
Add output at the start and end of training to make the process easier to follow:
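For the structure of that wrapper, here is a TF-free sketch in plain NumPy: a tiny gradient-descent loop on a toy one-feature regression that splits `steps` into periods and reports the loss after each one. The function name `train_model` and all numbers are illustrative; the real exercise would wrap the LinearRegressor calls the same way.

```python
import numpy as np

def train_model(features, targets, learning_rate=0.01, steps=100, periods=10):
    """Fit y ~ w*x + b by gradient descent, printing RMSE once per period."""
    w, b = 0.0, 0.0
    steps_per_period = steps // periods
    print("Training model...")
    for period in range(periods):
        for _ in range(steps_per_period):
            predictions = w * features + b
            error = predictions - targets
            # Gradients of the mean squared error w.r.t. w and b.
            w -= learning_rate * 2.0 * np.mean(error * features)
            b -= learning_rate * 2.0 * np.mean(error)
        rmse = np.sqrt(np.mean((w * features + b - targets) ** 2))
        print("  period %02d : RMSE %0.3f" % (period, rmse))
    print("Model training finished.")
    return w, b

# Toy data generated from y = 2x + 1; the loop should recover w ≈ 2, b ≈ 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
w, b = train_model(x, y, learning_rate=0.05, steps=2000, periods=10)
```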
Done.