Data Mining Homework 1: Linear Regression, Logistic Regression, and Support Vector Machines

First assignment for the Data Mining course of the Software Engineering program at Sun Yat-sen University.


GitHub repository: https://github.com/linjiafengyang/DataMining
Official reference answer: Answer to HW1

Linear Regression

To study the relationship between his students' math scores and their scores in other subjects, a head teacher randomly drew one sample of 5 students from the whole class during a periodic test. The math and other-subject scores of the 5 sampled students are given in the following table:
[Table omitted: the math score m and the four other-subject scores of the 5 students; the same figures appear in the arrays Y and X in the code below.]
Using these data, build a multiple linear regression model of m on the other variables.
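For reference (notation added here, not from the original post), with the four other-subject scores written as x1, ..., x4, the model to be fitted is the standard multiple linear regression hypothesis:

$$ \hat{m} = h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4 $$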

The gradient descent code is as follows:
import numpy as np

"""
梯度下降法
"""

# feature scaling
def featuresNormalization(x):
    x_mean = np.mean(x, axis=0) # column means
    x_max = np.max(x, axis=0) # column maxima
    x_min = np.min(x, axis=0) # column minima
    x_s = x_max - x_min
    x = (x - x_mean) / x_s # mean-normalize each column by its range
    return x, x_mean, x_s

# m denotes the number of examples here, not the number of features
def gradientDescent(x, y, theta, alpha, m, numIterations):
    x_T = np.transpose(x)
    for i in range(0, numIterations):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        # avg cost function J
        cost = np.sum(loss ** 2) / (2 * m)
        print("Iteration %d | Cost: %f" % (i, cost))
        # avg gradient per example
        gradient = np.dot(x_T, loss) / m
        # update theta
        theta = theta - alpha * gradient
        print("Iteration %d | Theta: %s" % (i, theta))
    return theta

# data: Y holds the math scores (m), X the four other-subject scores
Y = np.array([89, 91, 93, 95, 97])
X = np.array([[87, 72, 83, 90],
              [89, 76, 88, 93],
              [89, 74, 82, 91],
              [92, 71, 91, 89],
              [93, 76, 89, 94]])
X, X_mean, X_s = featuresNormalization(X) # feature scaling
m = len(X)  # number of examples (np.alen was removed from modern NumPy)
ones = np.ones(m)
X = np.column_stack((ones, X))  # prepend the intercept column
n = len(X[0])  # number of parameters, including the intercept
alpha = 1
theta = np.zeros(n)

print(theta)
print(X)

# 6354 iterations are enough to match sklearn's prediction
theta = gradientDescent(X, Y, theta, alpha, m, 6354)
print("Theta: ", theta)

x_predict = np.array([[88, 73, 87, 92]])
x_predict = (x_predict - X_mean) / X_s  # apply the training-set scaling
m = len(x_predict)
ones = np.ones(m)
x_predict = np.column_stack((ones, x_predict))
result = np.dot(x_predict, theta)
print("Predict result: %.4f" % result[0])

The output is as follows:
[Screenshot of the run output omitted.]

The normal equation code is as follows:
import numpy as np

Y = np.array([89, 91, 93, 95, 97])
X = np.array([[87, 72, 83, 90],
              [89, 76, 88, 93],
              [89, 74, 82, 91],
              [92, 71, 91, 89],
              [93, 76, 89, 94]])
m = len(X)  # number of examples
ones = np.ones(m)
X = np.column_stack((ones, X))
X_T = np.transpose(X)

# theta = (X'X)^(-1)X'Y
# theta = np.dot(np.dot(np.linalg.inv(np.dot(X_T, X)), X_T), Y)
temp1 = np.dot(X_T, X)
temp2 = np.linalg.inv(temp1)
temp3 = np.dot(temp2, X_T)
theta = np.dot(temp3, Y)
print("Theta: ", theta)

x_predict = [1, 88, 73, 87, 92]
print("Predict result: ", np.dot(x_predict, theta))

The output is as follows:
[Screenshot of the run output omitted.]
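Explicitly inverting X'X works for this tiny, well-conditioned system; a numerically safer cross-check (an addition, not from the original post) is np.linalg.lstsq, which solves the same least-squares problem without forming the inverse:

# sanity check: solve min ||X*theta - Y||^2 directly
theta_lstsq, residuals, rank, sv = np.linalg.lstsq(X, Y, rcond=None)
print("Theta via lstsq: ", theta_lstsq)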

The regularized normal equation code is as follows:
import numpy as np

"""
正则化
"""

Y = np.array([89, 91, 93, 95, 97])
X = np.array([[87, 72, 83, 90],
              [89, 76, 88, 93],
              [89, 74, 82, 91],
              [92, 71, 91, 89],
              [93, 76, 89, 94]])
m = len(X)  # number of examples
ones = np.ones(m)
X = np.column_stack((ones, X))  # prepend the intercept column
X_T = np.transpose(X)
lamda = 1
# regularization matrix: identity with the top-left entry zeroed,
# sized to the number of parameters (X happens to be square here,
# so the original np.eye over the number of rows gave the same 5x5 matrix)
matrix = np.eye(X.shape[1])
matrix[0][0] = 0  # do not penalize the intercept
# print(matrix)

# without regularization:
# theta = (X'X)^(-1)X'Y
# temp1 = np.dot(X_T, X)

# with L2 regularization:
# theta = (X'X + lamda*matrix)^(-1)X'Y
temp1 = np.dot(X_T, X) + lamda * matrix

temp2 = np.linalg.inv(temp1)
temp3 = np.dot(temp2, X_T)
theta = np.dot(temp3, Y)
print("Theta: ", theta)

x_predict = [1, 88, 73, 87, 92]
print("Predict result: ", np.dot(x_predict, theta))

The output is as follows:
[Screenshot of the run output omitted.]
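How strongly the coefficients are shrunk depends on lamda. A quick illustrative sweep (the values below are chosen arbitrarily, not part of the original assignment) shows them being pulled toward zero as lamda grows:

# illustrative sweep over regularization strengths
for lam in [0, 0.1, 1, 10, 100]:
    t = np.dot(np.dot(np.linalg.inv(np.dot(X_T, X) + lam * matrix), X_T), Y)
    print("lamda = %g | theta: %s" % (lam, t))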

The scikit-learn linear regression code is as follows:
import numpy as np
from sklearn import linear_model

"""
scikit-learn
"""

Y = np.array([89, 91, 93, 95, 97])
X = np.array([[87, 72, 83, 90],
              [89, 76, 88, 93],
              [89, 74, 82, 91],
              [92, 71, 91, 89],
              [93, 76, 89, 94]])
# LinearRegression fits its own intercept, so no manual column of ones is needed

model = linear_model.LinearRegression()
model.fit(X, Y)
x_predict = np.array([[88, 73, 87, 92]])
result = model.predict(x_predict)
print(model.intercept_)
print(model.coef_)
print("Predict result: ", result)

The output is as follows:
[Screenshot of the run output omitted.]
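For the regularized variant, sklearn's Ridge is the natural counterpart of the L2-regularized normal equation above: its alpha parameter plays the role of lamda, and like the hand-rolled version it leaves the intercept unpenalized (the coefficients will not match exactly, since Ridge also centers the data before solving). A minimal sketch reusing X and Y from this script:

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)  # alpha here plays the role of lamda above
ridge.fit(X, Y)
print(ridge.intercept_, ridge.coef_)
print("Ridge predict result: ", ridge.predict(np.array([[88, 73, 87, 92]])))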


Logistic Regression

Researchers conducted a 1:1 matched case-control study of the relationship between estrogen use and the incidence of endometrial cancer. Cases and controls were matched on similar age, identical marital status, and the same community of residence. Data were collected on age, estrogen use, history of gallbladder disease, hypertension, and non-estrogen drug use. The variables are defined as follows:
match: matched-pair group ID
case: case=1 case (diseased); case=0 control (not diseased)
est: est=1 has used estrogen; est=0 has not
gall: gall=1 history of gallbladder disease; gall=0 no such history
hyper: hyper=1 has hypertension; hyper=0 does not
nonest: nonest=1 has used non-estrogen drugs; nonest=0 has not
The data table itself is omitted; its contents appear in the code.
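The "simple" model fitted below is ordinary logistic regression; the hypothesis and the average cross-entropy cost it minimizes are the standard ones:

$$ P(\mathrm{case}=1 \mid x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big] $$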

The simple logistic regression code is as follows:
import math
import numpy as np
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation, removed in modern scikit-learn

def sigmoid(x):
    result = 1 / (1 + np.exp(-x))
    return result

# m denotes the number of examples here, not the number of features
def gradientDescent(x, y, theta, alpha, m, numIterations):
    x_T = np.transpose(x)
    y_T = np.transpose(y)
    for i in range(0, numIterations):
        hypothesis = sigmoid(np.dot(x, theta))
        loss = hypothesis - y
        # avg cross-entropy cost J (note log(1 - h), not 1 - log(h))
        cost = -(np.dot(y_T, np.log(hypothesis)) +
                 np.dot(1 - y_T, np.log(1 - hypothesis))) / m
        print("Iteration %d | Cost: %f" % (i, cost))
        