Data Mining Homework 1: Linear Regression, Logistic Regression, and Support Vector Machines

First assignment for the Data Mining course of the Software Engineering program at Sun Yat-sen University.


GitHub repository: https://github.com/linjiafengyang/DataMining
Official reference answer: Answer to HW1

Linear Regression

To study the relationship between his students' math scores and their scores in other subjects, a head teacher randomly drew one sample of 5 students from the whole class during a periodic test. The math and other-subject scores of the 5 sampled students are given in the following table:
[Table omitted: the math score m and the four other-subject scores of the 5 students; the same figures appear in the arrays Y and X in the code below.]
Using these data, build a multiple linear regression model of m on the other variables.
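For reference (notation added here, not from the original post), with the four other-subject scores written as x1, ..., x4, the model to be fitted is the standard multiple linear regression hypothesis:

$$ \hat{m} = h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4 $$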

The gradient descent code is as follows:
import numpy as np

"""
梯度下降法
"""

# feature scaling
def featuresNormalization(x):
    x_mean = np.mean(x, axis=0) # column means
    x_max = np.max(x, axis=0) # column maxima
    x_min = np.min(x, axis=0) # column minima
    x_s = x_max - x_min
    x = (x - x_mean) / x_s # mean-normalize each column by its range
    return x, x_mean, x_s

# m denotes the number of examples here, not the number of features
def gradientDescent(x, y, theta, alpha, m, numIterations):
    x_T = np.transpose(x)
    for i in range(0, numIterations):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        # avg cost function J
        cost = np.sum(loss ** 2) / (2 * m)
        print("Iteration %d | Cost: %f" % (i, cost))
        # avg gradient per example
        gradient = np.dot(x_T, loss) / m
        # update theta
        theta = theta - alpha * gradient
        print("Iteration %d | Theta: %s" % (i, theta))
    return theta

# data: Y holds the math scores (m), X the four other-subject scores
Y = np.array([89, 91, 93, 95, 97])
X = np.array([[87, 72, 83, 90],
              [89, 76, 88, 93],
              [89, 74, 82, 91],
              [92, 71, 91, 89],
              [93, 76, 89, 94]])
X, X_mean, X_s = featuresNormalization(X) # feature scaling
m = len(X)  # number of examples (np.alen was removed from modern NumPy)
ones = np.ones(m)
X = np.column_stack((ones, X))  # prepend the intercept column
n = len(X[0])  # number of parameters, including the intercept
alpha = 1
theta = np.zeros(n)

print(theta)
print(X)

# 6354 iterations are enough to match sklearn's prediction
theta = gradientDescent(X, Y, theta, alpha, m, 6354)
print("Theta: ", theta)

x_predict = np.array([[88, 73, 87, 92]])
x_predict = (x_predict - X_mean) / X_s  # apply the training-set scaling
m = len(x_predict)
ones = np.ones(m)
x_predict = np.column_stack((ones, x_predict))
result = np.dot(x_predict, theta)
print("Predict result: %.4f" % result[0])

The output is as follows:
[Screenshot of the run output omitted.]

The normal equation code is as follows:
import numpy as np

Y = np.array([89, 91, 93, 95, 97])
X = np.array([[87, 72, 83, 90],
              [89, 76, 88, 93],
              [89, 74, 82, 91],
              [92, 71, 91, 89],
              [93, 76, 89, 94]])
m = len(X)  # number of examples
ones = np.ones(m)
X = np.column_stack((ones, X))
X_T = np.transpose(X)

# theta = (X'X)^(-1)X'Y
# theta = np.dot(np.dot(np.linalg.inv(np.dot(X_T, X)), X_T), Y)
temp1 = np.dot(X_T, X)
temp2 = np.linalg.inv(temp1)
temp3 = np.dot(temp2, X_T)
theta = np.dot(temp3, Y)
print("Theta: ", theta)

x_predict = [1, 88, 73, 87, 92]
print("Predict result: ", np.dot(x_predict, theta))

The output is as follows:
[Screenshot of the run output omitted.]
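Explicitly inverting X'X works for this tiny, well-conditioned system; a numerically safer cross-check (an addition, not from the original post) is np.linalg.lstsq, which solves the same least-squares problem without forming the inverse:

# sanity check: solve min ||X*theta - Y||^2 directly
theta_lstsq, residuals, rank, sv = np.linalg.lstsq(X, Y, rcond=None)
print("Theta via lstsq: ", theta_lstsq)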

The regularized normal equation code is as follows:
import numpy as np

"""
正则化
"""

Y = np.array([89, 91, 93, 95, 97])
X = np.array([[87, 72, 83, 90],
              [89, 76, 88, 93],
              [89, 74, 82, 91],
              [92, 71, 91, 89],
              [93, 76, 89, 94]])
m = len(X)  # number of examples
ones = np.ones(m)
X = np.column_stack((ones, X))  # prepend the intercept column
X_T = np.transpose(X)
lamda = 1
# regularization matrix: identity with the top-left entry zeroed,
# sized to the number of parameters (X happens to be square here,
# so the original np.eye over the number of rows gave the same 5x5 matrix)
matrix = np.eye(X.shape[1])
matrix[0][0] = 0  # do not penalize the intercept
# print(matrix)

# without regularization:
# theta = (X'X)^(-1)X'Y
# temp1 = np.dot(X_T, X)

# with L2 regularization:
# theta = (X'X + lamda*matrix)^(-1)X'Y
temp1 = np.dot(X_T, X) + lamda * matrix

temp2 = np.linalg.inv(temp1)
temp3 = np.dot(temp2, X_T)
theta = np.dot(temp3, Y)
print("Theta: ", theta)

x_predict = [1, 88, 73, 87, 92]
print("Predict result: ", np.dot(x_predict, theta))

The output is as follows:
[Screenshot of the run output omitted.]
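How strongly the coefficients are shrunk depends on lamda. A quick illustrative sweep (the values below are chosen arbitrarily, not part of the original assignment) shows them being pulled toward zero as lamda grows:

# illustrative sweep over regularization strengths
for lam in [0, 0.1, 1, 10, 100]:
    t = np.dot(np.dot(np.linalg.inv(np.dot(X_T, X) + lam * matrix), X_T), Y)
    print("lamda = %g | theta: %s" % (lam, t))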

The scikit-learn linear regression code is as follows:
import numpy as np
from sklearn import linear_model

"""
scikit-learn
"""

Y = np.array([89, 91, 93, 95, 97])
X = np.array([[87, 72, 83, 90],
              [89, 76, 88, 93],
              [89, 74, 82, 91],
              [92, 71, 91, 89],
              [93, 76, 89, 94]])
# LinearRegression fits its own intercept, so no manual column of ones is needed

model = linear_model.LinearRegression()
model.fit(X, Y)
x_predict = np.array([[88, 73, 87, 92]])
result = model.predict(x_predict)
print(model.intercept_)
print(model.coef_)
print("Predict result: ", result)

The output is as follows:
[Screenshot of the run output omitted.]
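For the regularized variant, sklearn's Ridge is the natural counterpart of the L2-regularized normal equation above: its alpha parameter plays the role of lamda, and like the hand-rolled version it leaves the intercept unpenalized (the coefficients will not match exactly, since Ridge also centers the data before solving). A minimal sketch reusing X and Y from this script:

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)  # alpha here plays the role of lamda above
ridge.fit(X, Y)
print(ridge.intercept_, ridge.coef_)
print("Ridge predict result: ", ridge.predict(np.array([[88, 73, 87, 92]])))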


Logistic Regression

Researchers conducted a 1:1 matched case-control study of the relationship between estrogen use and the incidence of endometrial cancer. Cases and controls were matched on similar age, identical marital status, and the same community of residence. Data were collected on age, estrogen use, history of gallbladder disease, hypertension, and non-estrogen drug use. The variables are defined as follows:
match: matched-pair group ID
case: case=1 case (diseased); case=0 control (not diseased)
est: est=1 has used estrogen; est=0 has not
gall: gall=1 history of gallbladder disease; gall=0 no such history
hyper: hyper=1 has hypertension; hyper=0 does not
nonest: nonest=1 has used non-estrogen drugs; nonest=0 has not
The data table itself is omitted; its contents appear in the code.
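The "simple" model fitted below is ordinary logistic regression; the hypothesis and the average cross-entropy cost it minimizes are the standard ones:

$$ P(\mathrm{case}=1 \mid x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big] $$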

The simple logistic regression code is as follows:
import math
import numpy as np
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation, removed in modern scikit-learn

def sigmoid(x):
    result = 1 / (1 + np.exp(-x))
    return result

# m denotes the number of examples here, not the number of features
def gradientDescent(x, y, theta, alpha, m, numIterations):
    x_T = np.transpose(x)
    y_T = np.transpose(y)
    for i in range(0, numIterations):
        hypothesis = sigmoid(np.dot(x, theta))
        loss = hypothesis - y
        # avg cross-entropy cost J (note log(1 - h), not 1 - log(h))
        cost = -(np.dot(y_T, np.log(hypothesis)) +
                 np.dot(1 - y_T, np.log(1 - hypothesis))) / m
        print("Iteration %d | Cost: %f" % (i, cost))
        