# A Summary of the Logistic Regression Model in Machine Learning: From Theory to sklearn Practice


## 0x00 Basic Principles

In a classification problem the output takes discrete values, e.g. $y \in \{0, 1, 2, 3, \ldots, n\}$, so a straight-line fit from linear regression is clearly unsuitable. Logistic regression instead passes the linear combination $\theta^T x$ through the sigmoid function, giving the hypothesis

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Its output is read as the probability that $y = 1$, and the two class probabilities are complementary:

$$P(y=0 \mid x;\theta) + P(y=1 \mid x;\theta) = 1$$

The squared-error cost carried over from linear regression,

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

is non-convex once $h_\theta$ is the sigmoid, so logistic regression minimizes the log-loss instead:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
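The hypothesis and the log-loss cost can be sketched in a few lines of NumPy (a minimal illustration; `X0`, `y0`, and `theta` below are toy stand-ins, not data from this article):

```python
import numpy as np

def sigmoid(z):
    # h_theta(x) = 1 / (1 + exp(-theta^T x)), applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # log-loss J(theta), averaged over the m training rows of X
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# the two class probabilities are complementary: sigmoid(z) + sigmoid(-z) = 1
z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z) + sigmoid(-z))  # sums are all 1 (up to floating point)

# at theta = 0 every prediction is 0.5, so the cost is exactly log(2)
X0 = np.ones((4, 2))
y0 = np.array([0.0, 1.0, 0.0, 1.0])
print(cost(np.zeros(2), X0, y0))
```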

## 0x01 Algorithm Implementation

The sigmoid function:

```matlab
function g = sigmoid(z)
  % element-wise logistic function
  g = 1 ./ (1 + exp(-z));
end
```

Implementation of the cost function:

```matlab
function [J, grad] = costFunction(theta, X, y)
  % Initialization
  m = length(y);
  grad = zeros(size(theta));

  % Log-loss cost
  h = sigmoid(X * theta);
  J = (1/m) * sum((-y .* log(h)) - ((1 - y) .* log(1 - h)));

  % Partial derivative of the cost with respect to each theta(i)
  for i = 1:size(theta, 1)
    grad(i) = (1/m) * sum((h - y) .* X(:, i));
  end
end
```
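The per-component loop computes $\partial J / \partial \theta_i = \frac{1}{m}\sum (h - y)\, x_i$, which collapses to a single matrix product in vectorized form. A NumPy sketch, with the gradient verified against central finite differences on randomly generated toy data (the data and shapes are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    m = len(y)
    h = sigmoid(X @ theta)
    J = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    grad = X.T @ (h - y) / m   # vectorized form of the per-component loop
    return J, grad

# finite-difference check of the gradient on toy data
rng = np.random.default_rng(0)
X = np.c_[np.ones(20), rng.normal(size=(20, 2))]  # intercept column + 2 features
y = (X[:, 1] + X[:, 2] > 0).astype(float)
theta = rng.normal(size=3)
J, grad = cost_and_grad(theta, X, y)

eps = 1e-6
num = np.array([(cost_and_grad(theta + eps * np.eye(3)[i], X, y)[0]
                 - cost_and_grad(theta - eps * np.eye(3)[i], X, y)[0]) / (2 * eps)
                for i in range(3)])
print(np.allclose(grad, num, atol=1e-5))
```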

The prediction function thresholds the sigmoid output at 0.5:

```matlab
function p = predict(theta, X)
  m = size(X, 1);
  p = zeros(m, 1);

  % Assign class 1 whenever the predicted probability is at least 0.5
  for i = 1:m
    prob = sigmoid(X(i, :) * theta);
    if prob >= 0.5
      p(i) = 1;
    end
  end
end
```
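Because the sigmoid is monotonic, $h_\theta(x) \geq 0.5$ is equivalent to $\theta^T x \geq 0$, and the loop reduces to one vectorized comparison. A NumPy sketch (`theta` and `X` below are made-up toy values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    # thresholding at 0.5 is the same as checking X @ theta >= 0
    return (sigmoid(X @ theta) >= 0.5).astype(int)

theta = np.array([0.0, 1.0])
X = np.array([[1.0, -2.0],
              [1.0,  3.0]])
print(predict(theta, X))  # → [0 1]
```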

## 0x03 sklearn Practice

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_curve, roc_curve, auc

data = pd.read_csv('ex2data1.txt', sep=',',
                   skiprows=[2], names=['score1', 'score2', 'result'])
score_data = data.loc[:, ['score1', 'score2']]
result_data = data.result

# Average accuracy over ten random train/test splits
p = 0
for i in range(10):
    x_train, x_test, y_train, y_test = \
        train_test_split(score_data, result_data, test_size=0.2)
    model = LogisticRegression(C=1e9)
    model.fit(x_train, y_train)
    predict_y = model.predict(x_test)
    p += np.mean(predict_y == y_test)

# Split the samples by class for plotting
pos_data = data[data.result == 1].loc[:, ['score1', 'score2']]
neg_data = data[data.result == 0].loc[:, ['score1', 'score2']]

# Evaluate the model on a grid covering the feature space
h = 0.02
x_min, x_max = score_data.score1.min() - .5, score_data.score1.max() + .5
y_min, y_max = score_data.score2.min() - .5, score_data.score2.max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

# Plot the decision boundary and the scatter of both classes
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
plt.scatter(x=pos_data.score1, y=pos_data.score2, color='black', marker='o')
plt.scatter(x=neg_data.score1, y=neg_data.score2, color='red', marker='*')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()

# Model performance on the last split
answer = model.predict_proba(x_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, answer)
report = answer > 0.5
print(classification_report(y_test, report, target_names=['neg', 'pos']))
print("average precision:", p / 10)
```

Sample output:

```
             precision    recall  f1-score   support

        neg       0.88      0.88      0.88         8
        pos       0.92      0.92      0.92        12

avg / total       0.90      0.90      0.90        20

average precision: 0.9
```
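The script imports `roc_curve` and `auc` but never calls them; they could round out the evaluation as follows. This is a sketch on synthetic data standing in for `x_test`/`y_test`; note that the ROC curve wants the predicted probability of the positive class, not hard labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc

# toy stand-ins for the train/test data used in the script above
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = LogisticRegression(C=1e9).fit(X, y)

scores = model.predict_proba(X)[:, 1]      # P(y = 1) for each sample
fpr, tpr, thresholds = roc_curve(y, scores)
roc_auc = auc(fpr, tpr)
print("AUC:", roc_auc)
```

Plotting `tpr` against `fpr` then gives the ROC curve, with `roc_auc` as the area under it.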

## 0x04 Summary

