Logistic regression in TensorFlow, trained and accuracy-tested on Kaggle's Titanic dataset
There are three main steps:
- 1 Dataset feature analysis and preprocessing
- 2 Logistic regression in TensorFlow
- 3 Training and accuracy testing
1 Dataset feature analysis and preprocessing
First, look at the features of the dataset.
Analysis yields the following observations:
- The training data contains 891 passengers in total, but some attributes have missing values, e.g.:
  - Age is recorded for only 714 passengers
  - Cabin is known for only 204 passengers
- Some attributes are categorical, e.g. Sex (male/female), and must be converted to numeric features
- Some attributes are irrelevant features, e.g. PassengerId has no bearing on survival and should be dropped
- The numeric features have different scales and need to be normalized
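The missing-value counts above can be inspected with pandas; here is a minimal sketch using a toy DataFrame as a stand-in for train.csv (the values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the Titanic training data: Age and Cabin have gaps.
df = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
})
# count() returns the number of non-null entries per column,
# exposing which attributes are incomplete
print(df.count())
```

On the real train.csv, the same call shows 714 non-null Age entries and 204 non-null Cabin entries out of 891 rows.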
Accordingly, the following preprocessing steps are needed:
- Handle missing attribute values:
  - A RandomForestRegressor can be fitted on the available data to fill in the missing Age values
  - Cabin has too many missing values to impute reliably, so it is simply dropped
- Convert categorical attributes to numeric features, e.g. in Sex, "male" is encoded as 1 and "female" as 0
- Drop irrelevant attributes/features: PassengerId and Name
  (I also dropped the Ticket feature here: it has too many distinct values to show any obvious correlation with survival, so I set it aside for now)
- Normalize the numeric feature values
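The categorical-to-numeric conversion and the normalization steps can be sketched on toy data with pandas alone (the full code below uses sklearn's StandardScaler for scaling instead; the Fare values here are invented):

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                   'Fare': [7.25, 71.28, 8.05]})
# One-hot encode the categorical Sex attribute into Sex_male / Sex_female columns
dummies = pd.get_dummies(df['Sex'], prefix='Sex')
df = pd.concat([df.drop('Sex', axis=1), dummies], axis=1)
# Standardize Fare to zero mean and unit variance
df['Fare_scaled'] = (df['Fare'] - df['Fare'].mean()) / df['Fare'].std()
print(df)
```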
2 Logistic regression in TensorFlow
Logistic Regression
Using the sigmoid function σ, for a sample X = (x_1, x_2, ..., x_n)^T the binary-classification hypothesis can be written as
h_θ(x) = σ(θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n)
where θ = (θ_0, θ_1, ..., θ_n)^T are the parameters to be learned; the linear part is the familiar
y = Wx + b
Solving for θ proceeds as follows:
- First define an overall loss function of the form
J(θ) = −(1/m) ∑_{i=1}^{m} cost(x_i, y_i)
Note: in practice, the per-sample loss cost(x_i, y_i) is usually taken to be the log-likelihood, i.e.
cost(x_i, y_i) = y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i)))
- Then iterate over the training samples, optimizing θ until the loss is minimized
Therefore, in this problem the loss function for TensorFlow training is set to
J(θ) = −(1/m) ∑_{i=1}^{m} [y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i)))]
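The hypothesis and loss above can be checked numerically with a small NumPy sketch (the sample values and θ are toy numbers, not taken from the dataset):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A few toy samples with two features each, binary labels, and a toy theta
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([0.1, -0.2])

# h_theta(x) = sigmoid(theta^T x): probabilities in (0, 1)
h = sigmoid(X @ theta)

# J(theta) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) )
m = len(y)
J = -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
print(J)
```

Minimizing J over θ (here by gradient descent in TensorFlow) recovers the logistic-regression parameters.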
Training a model with TensorFlow follows roughly these steps:
- Set the hyperparameters, e.g. learning rate and number of epochs
- Define the graph: variables, model, and optimization method, e.g. x, y, and the loss function
- Initialize the variables: init = tf.global_variables_initializer()
- Open a session and start training
The details are in the code below.
3 Training and accuracy testing
The model training process is as follows:
The loss as a function of epoch is shown below:
The test accuracy is 82%.
Complete code
from __future__ import print_function, division
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn  # imported for its plot styling
from sklearn.ensemble import RandomForestRegressor
import sklearn.preprocessing as preprocessing
from sklearn.model_selection import train_test_split
def set_missing_ages(data):  # fill missing Age values with a RandomForestRegressor
    age_df = data[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
    known_age = age_df[age_df.Age.notnull()].values
    unknown_age = age_df[age_df.Age.isnull()].values
    # y -- age
    y = known_age[:, 0]
    # X -- features
    X = known_age[:, 1:]
    # fit a RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)
    predictedAges = rfr.predict(unknown_age[:, 1:])
    data.loc[(data.Age.isnull()), 'Age'] = predictedAges
    return data
def attribute_to_number(data):  # convert categorical attributes to numeric features
    dummies_Pclass = pd.get_dummies(data['Pclass'], prefix='Pclass')
    dummies_Embarked = pd.get_dummies(data['Embarked'], prefix='Embarked')
    dummies_Sex = pd.get_dummies(data['Sex'], prefix='Sex')
    data = pd.concat([data, dummies_Pclass, dummies_Embarked, dummies_Sex], axis=1)
    data.drop(['Pclass', 'Sex', 'Embarked'], axis=1, inplace=True)
    return data
def Scales(data):  # standardize numeric features to zero mean and unit variance
    scaler = preprocessing.StandardScaler()
    data['Age_scaled'] = scaler.fit_transform(data['Age'].values.reshape(-1, 1))
    data['Fare_scaled'] = scaler.fit_transform(data['Fare'].values.reshape(-1, 1))
    data['SibSp_scaled'] = scaler.fit_transform(data['SibSp'].values.reshape(-1, 1))
    data['Parch_scaled'] = scaler.fit_transform(data['Parch'].values.reshape(-1, 1))
    data.drop(['Parch', 'SibSp', 'Fare', 'Age'], axis=1, inplace=True)
    return data
def DataPreProcess(in_data):  # data preprocessing
    in_data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
    data_ages_fitted = set_missing_ages(in_data)  # fill missing Age values
    data = attribute_to_number(data_ages_fitted)  # categorical -> numeric features
    data_scaled = Scales(data)                    # normalize numeric features
    # split into features X and label y
    data_copy = data_scaled.copy(deep=True)
    data_copy.drop(
        ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Sex_female', 'Sex_male',
         'Age_scaled', 'Fare_scaled', 'SibSp_scaled', 'Parch_scaled'], axis=1, inplace=True)
    data_y = np.array(data_copy)
    data_scaled.drop(['Survived'], axis=1, inplace=True)
    data_X = np.array(data_scaled)
    return data_X, data_y
def LR(data_X, data_y):  # logistic regression in TensorFlow
    X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.4, random_state=0)
    # one-hot encode the labels: column 0 = not survived, column 1 = survived
    y_train = np.concatenate([1 - y_train, y_train], axis=1)
    y_test = np.concatenate([1 - y_test, y_test], axis=1)
    learning_rate = 0.001
    training_epochs = 50
    batch_size = 50
    display_step = 10
    n_samples = X_train.shape[0]   # number of samples
    n_features = X_train.shape[1]  # number of features
    n_class = 2
    x = tf.placeholder(tf.float32, [None, n_features])
    y = tf.placeholder(tf.float32, [None, n_class])
    W = tf.Variable(tf.zeros([n_features, n_class]), name="weight")
    b = tf.Variable(tf.zeros([n_class]), name="bias")
    # predicted logits
    pred = tf.matmul(x, W) + b
    # accuracy
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    # cross entropy
    cost = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
    init = tf.global_variables_initializer()
    # train
    with tf.Session() as sess:
        sess.run(init)
        for epoch in range(training_epochs):
            avg_cost = 0
            total_batch = int(n_samples / batch_size)
            for i in range(total_batch):
                _, c = sess.run([optimizer, cost],
                                feed_dict={x: X_train[i * batch_size: (i + 1) * batch_size],
                                           y: y_train[i * batch_size: (i + 1) * batch_size, :]})
                avg_cost += c / total_batch
            plt.plot(epoch + 1, avg_cost, 'co')
            if (epoch + 1) % display_step == 0:
                print("Epoch:", "%04d" % (epoch + 1), "cost=", avg_cost)
        print("Optimization Finished!")
        # evaluate on the held-out test split
        print("Testing Accuracy:", accuracy.eval({x: X_test, y: y_test}))
        plt.xlabel("Epoch")
        plt.ylabel("Cost")
        plt.show()
if __name__ == "__main__":
    data = pd.read_csv("/home/yimi/LearnTF/logistic/data/train.csv")
    data_X, data_y = DataPreProcess(data)
    LR(data_X, data_y)