Machine Learning (Regression)

慢慢ss

Category: Machine Learning

Article link: https://blog.csdn.net/adamyouyou/article/details/89432428

Reference: https://www.cnblogs.com/futurehau/p/6105011.html
Model performance evaluation: https://www.cnblogs.com/aviator999/p/10049646.html
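Preliminary formulas:
1. The normal distribution, also called the Gaussian distribution
Normal density with mean μ and standard deviation σ:
f(x) = 1 / (σ * sqrt(2π)) * exp(-(x - μ)^2 / (2σ^2))
2. Logarithm rules:
log(a * b) = log(a) + log(b), log(a / b) = log(a) - log(b), log(a^k) = k * log(a)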

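Regression is not a single algorithm but a family of methods for predicting continuous values.
Classification, by contrast, predicts discrete values.
1. Basic linear regression (Basic Regression Model)

2. Generalized linear regression (GLM: Generalized Linear Model): first compute the linear predictor Z = WX + b, then obtain the prediction as predict(y) = f(Z). Here f is the inverse of the link function, although it is often loosely called the link function itself.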
I. Linear regression
1. Definition and computation of linear regression

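Why is least squares used as the loss function? It can be derived from maximum likelihood estimation (MLE): assume the samples x1, x2, …, xn are independent and the errors follow a normal distribution. Maximizing the likelihood is then equivalent to minimizing the sum of squared errors.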
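The derivation can be sketched as follows. This is the standard argument, written under the assumption (not spelled out in the original) that each target is y_i = w x_i + b + ε_i with independent errors ε_i drawn from N(0, σ²); the symbols ε_i, σ and n are introduced here for illustration.

```latex
% Likelihood of n independent samples under Gaussian noise
L(w, b) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}
          \exp\!\left( -\frac{(y_i - w x_i - b)^2}{2\sigma^2} \right)

% Taking the logarithm turns the product into a sum
\log L(w, b) = -n \log\!\left( \sqrt{2\pi}\,\sigma \right)
               - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w x_i - b)^2

% The first term does not depend on w or b, so
\arg\max_{w, b} \log L(w, b) = \arg\min_{w, b} \sum_{i=1}^{n} (y_i - w x_i - b)^2
```

The last line is exactly the least-squares objective, which is why the two logarithm rules and the normal density are the only ingredients needed.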

To generalize better and prevent overfitting, a regularization term is added to the loss.

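Linear regression is a method of analysis that uses a regression equation (function) to model the relationship between one or more independent variables (features) and a dependent variable (target).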

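Characteristics: with a single independent variable it is called simple (univariate) regression; with more than one it is called multiple regression.
Y = WX + b, where W is the weight and b is the bias (intercept).
The bias lets the fitted line shift away from the origin, so that data not centered at zero can still be fitted.
If X has m dimensions and Y and b have n dimensions, W is a matrix with m × n entries (in R, a matrix product with %*% requires conformable shapes, for example an n × m matrix times a vector of length m).
Given the features X and the targets Y, we need to find the weights W and the intercept b. This can be done with ordinary least squares (OLS), which minimizes the sum of squared differences between the predicted and true values (the loss function).
As a function of each parameter the squared loss is a parabola, so its lowest point can be found by setting the derivative to zero.
w = cov(x, y) / Var(x), where cov(x, y) is the covariance and Var(x) is the variance.
Rearranging y = wx + a gives a = y - wx.
The fitted line passes through the point of means, so the intercept a (the bias b above) can be computed as
a = mean(y) - w * mean(x)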
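The two closed-form expressions above can be checked numerically. The following is a minimal sketch in Python with NumPy; the function name fit_simple_linear and the toy data are made up for illustration.

```python
import numpy as np

def fit_simple_linear(x, y):
    """Closed-form simple linear regression: returns (w, a) for y ~ w*x + a."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Sample covariance and variance; both use ddof=1, so the factor cancels.
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    var_x = np.var(x, ddof=1)
    w = cov_xy / var_x               # slope: cov(x, y) / Var(x)
    a = y.mean() - w * x.mean()      # intercept: mean(y) - w * mean(x)
    return w, a

# Toy data (hypothetical): y = 2x + 1 with no noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0
w, a = fit_simple_linear(x, y)
print(w, a)
```

On noise-free data the function recovers the slope and intercept that generated it; on noisy data it gives the same line as a degree-1 np.polyfit.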
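2. Measuring the strength of a linear relationship: the correlation coefficient. The most common one is the Pearson correlation coefficient:
ρ(x, y) = Corr(x, y) = cov(x, y) / (sd(x) * sd(y))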
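As a small illustration of the formula, the sketch below computes the Pearson coefficient directly from the covariance and the standard deviations. The helper name pearson_corr and the data are assumptions for the example; NumPy's own np.corrcoef computes the same quantity.

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation: cov(x, y) / (sd(x) * sd(y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    return cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 3.0 * x - 2.0       # perfectly increasing linear relation
y_down = -0.5 * x + 4.0    # perfectly decreasing linear relation

r_up = pearson_corr(x, y_up)
r_down = pearson_corr(x, y_down)
print(r_up, r_down)
```

A perfect increasing linear relation gives a coefficient of 1 and a perfect decreasing one gives -1 (up to floating-point rounding); values near 0 indicate little linear relationship.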
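3. Optimization algorithms
Goal: optimize the weights and bias, i.e. find the W that minimizes the loss. Even at the optimal W the loss is generally not zero.
(1) Normal equation
Formula:
W = (X^T X)^(-1) X^T Y
The matrix inverse is used because matrices have no division operation.
For scalars this reduces to W = Y / X = (1 / X^2) * (X * Y).
Implemented as an R function: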

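reg <- function(x, y) {
  x <- as.matrix(x)
  # prepend a column of ones so that the first weight is the intercept
  x <- cbind(Intercept = 1, x)
  # solve() inverts a matrix, %*% multiplies two matrices, t() transposes
  w <- solve(t(x) %*% x) %*% t(x) %*% y
  colnames(w) <- "estimate"
  print(w)
}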

(2) Gradient descent
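The original gives no details for this method, so the following is only a minimal sketch of batch gradient descent for linear regression with a mean-squared-error loss. The learning rate, iteration count, function name and synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=2000):
    """Batch gradient descent on the MSE loss of y ~ X @ w + b."""
    n, m = X.shape
    w = np.zeros(m)
    b = 0.0
    for _ in range(n_iters):
        error = X @ w + b - y                  # prediction minus target
        grad_w = (2.0 / n) * (X.T @ error)     # d(MSE)/dw
        grad_b = (2.0 / n) * error.sum()       # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic data (hypothetical): two features, known weights, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w, true_b = np.array([1.5, -2.0]), 0.5
y = X @ true_w + true_b + 0.01 * rng.normal(size=200)

w_gd, b_gd = gradient_descent(X, y)

# Normal-equation solution on the same data, for comparison.
Xb = np.column_stack([np.ones(len(X)), X])
w_ne = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(w_gd, b_gd, w_ne)
```

On a well-conditioned problem like this one both approaches arrive at the same weights. Gradient descent is usually preferred when the number of features is large enough that forming and inverting X^T X becomes expensive.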
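II. Generalized linear regression: logistic regression
Logistic regression solves binary classification problems. For multiclass problems, either combine several binary logistic regressions or use softmax regression.
Use cases:
• Ad click-through rate
• Spam detection
• Disease diagnosis
• Financial fraud
• Fake accounts
The input to logistic regression is the output of a linear regression, and the function applied to it is the sigmoid (the inverse of the logit link).
The logistic regression formula:
h(x) = 1 / (1 + e^(-z)), where z = WX + b
The usual decision threshold on the sigmoid output is 0.5: when building a classifier, results below 0.5 are assigned to class A and results above 0.5 to class B.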
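The thresholding rule can be sketched in a few lines. The weights, bias and inputs below are hypothetical values picked so the arithmetic is easy to follow, and assigning an output of exactly 0.5 to class B is an assumption, since the text leaves that case open.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(X, w, b, threshold=0.5):
    """Return class labels ('A' or 'B') and the underlying probabilities."""
    p = sigmoid(X @ w + b)
    labels = np.where(p >= threshold, "B", "A")
    return labels, p

# Hypothetical weights and inputs.
w = np.array([2.0, -1.0])
b = 0.0
X = np.array([
    [1.0, 0.0],   # z = 2,  p is about 0.88 -> B
    [0.0, 3.0],   # z = -3, p is about 0.05 -> A
    [0.5, 1.0],   # z = 0,  p = 0.5         -> B (tie assigned to B here)
])
labels, probs = predict_class(X, w, b)
print(labels, probs)
```

Moving the threshold away from 0.5 trades false positives against false negatives, which matters in applications such as fraud or disease detection where the two kinds of error have different costs.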

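The scikit-learn interface for logistic regression
Reference: https://www.cnblogs.com/pinard/p/6035872.html

In scikit-learn, logistic regression is mainly exposed through the classes LogisticRegression and LogisticRegressionCV, plus the helper function logistic_regression_path found in older releases. The main difference between the two classes is that LogisticRegressionCV uses cross-validation to choose the regularization coefficient C, whereas LogisticRegression requires you to specify C yourself each time. Apart from cross-validation and the choice of C, the two classes are used in essentially the same way.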
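The difference between the two classes might look like this in practice. This is a sketch on synthetic data; the dataset, the value C=1.0 and the candidate list passed as Cs are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# Synthetic binary classification data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
noise = 0.5 * rng.normal(size=300)
y = (X[:, 0] + X[:, 1] + noise > 0).astype(int)

# LogisticRegression: the regularization coefficient C is set by hand.
clf = LogisticRegression(C=1.0)
clf.fit(X, y)

# LogisticRegressionCV: pass candidate values and let cross-validation choose.
candidate_cs = [0.01, 0.1, 1.0, 10.0]
clf_cv = LogisticRegressionCV(Cs=candidate_cs, cv=5)
clf_cv.fit(X, y)

chosen_c = float(np.ravel(clf_cv.C_)[0])   # the C selected by cross-validation
print(clf.score(X, y), clf_cv.score(X, y), chosen_c)
```

In both classes a smaller C means stronger regularization, since C is the inverse of the regularization strength.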

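3. A key regression metric: R^2 (R squared)
R^2 is the ratio of the regression sum of squares to the total sum of squares, i.e. the proportion of the total variation in the target that the regression explains. The larger this proportion, the better the model fits. For a least-squares fit with an intercept, evaluated on its own training data, R^2 lies between 0 and 1; the closer it is to 1, the better the fit, and a value above 0.8 is commonly taken as a rule of thumb for a good fit.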
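To make the definition concrete, the sketch below computes R^2 in the equivalent form 1 - SS_res / SS_tot and compares it with the ratio described above for a least-squares line. The function name r_squared and the small arrays are made up for the example.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0, 9.0])
r2_perfect = r_squared(y_true, y_true)                      # exact predictions
r2_mean = r_squared(y_true, np.full(4, y_true.mean()))      # always predict the mean
r2_bad = r_squared(y_true, y_true[::-1])                    # worse than the mean
print(r2_perfect, r2_mean, r2_bad)

# For a least-squares line with intercept, SS_reg / SS_tot gives the same value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
ratio = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
print(ratio, r_squared(y, y_hat))
```

The third case shows that R^2 computed this way can be negative when predictions are worse than simply predicting the mean, which can happen on held-out data.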
III. Regression in R

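insurance <- read.csv("F:\\insurance.csv")
str(insurance)
summary(insurance$expenses)
hist(insurance$expenses)
table(insurance$region)
# correlation matrix of the numeric features
cor(insurance[c('age','bmi','children')])
# use expenses ~ . so that "." covers every column except the target;
# with insurance$expense ~ . the expenses column is also used as a predictor
# and the fit is trivially perfect
ins_model <- lm(expenses ~ ., data = insurance)
summary(ins_model)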

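Output. Note that this run used the formula insurance$expense ~ ., so the expenses column itself appears among the predictors; that is why its coefficient is 1, every other estimate is numerically zero, and R-squared is 1.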

> insurance <- read.csv("F:\\insurance.csv")
> str(insurance)
'data.frame':	1338 obs. of  7 variables:
 $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
 $ bmi     : num  27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
 $ children: int  0 1 3 0 0 0 1 3 2 0 ...
 $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
 $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
 $ expenses: num  16885 1726 4449 21984 3867 ...
> summary(insurance$expenses)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1122    4740    9382   13270   16640   63770 
> table(insurance$region)

northeast northwest southeast southwest 
      324       325       364       325 
> # correlation matrix of the numeric features
> cor(insurance[c('age','bmi','children')])
              age        bmi   children
age      1.000000 0.10934101 0.04246900
bmi      0.109341 1.00000000 0.01264471
children 0.042469 0.01264471 1.00000000
> ins_model <- lm(insurance$expense ~ .,data=insurance)
> summary(ins_model)

Call:
lm(formula = insurance$expense ~ ., data = insurance)

Residuals:
       Min         1Q     Median         3Q        Max 
-2.299e-11 -1.680e-12 -4.300e-13  8.400e-13  7.183e-10 

Coefficients:
                  Estimate Std. Error    t value Pr(>|t|)    
(Intercept)      4.162e-11  3.416e-12  1.218e+01  < 2e-16 ***
age             -9.013e-13  4.539e-14 -1.985e+01  < 2e-16 ***
sexmale          4.778e-13  1.093e-12  4.370e-01 0.662075    
bmi             -1.224e-12  9.873e-14 -1.239e+01  < 2e-16 ***
children        -1.700e-12  4.544e-13 -3.741e+00 0.000191 ***
smokeryes       -8.316e-11  2.540e-12 -3.274e+01  < 2e-16 ***
regionnorthwest  1.278e-12  1.564e-12  8.170e-01 0.414012    
regionsoutheast  3.686e-12  1.574e-12  2.341e+00 0.019364 *  
regionsouthwest  3.442e-12  1.571e-12  2.190e+00 0.028666 *  
expenses         1.000e+00  9.005e-17  1.110e+16  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.99e-11 on 1328 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1 
F-statistic: 5.501e+31 on 9 and 1328 DF,  p-value: < 2.2e-16

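Model tuning

# model tuning: engineer additional features
# bmi30: indicator for a BMI of 30 or more
insurance$bmi30 <- ifelse(insurance$bmi>=30,1,0)
# age2: squared age, to capture a nonlinear effect of age
insurance$age2 <- insurance$age^2
# bmi30*smoker adds both main effects and their interaction
ins_model2 <- lm(insurance$expenses ~insurance$age+insurance$age2+insurance$children
                 +insurance$bmi+insurance$sex+bmi30*smoker+insurance$region,data=insurance)
summary(ins_model2)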

Output

> ins_model2 <- lm(insurance$expenses ~insurance$age+insurance$age2+insurance$children
+                  +insurance$bmi+insurance$sex+bmi30*smoker+insurance$region,data=insurance)
> summary(ins_model2)

Call:
lm(formula = insurance$expenses ~ insurance$age + insurance$age2 + 
    insurance$children + insurance$bmi + insurance$sex + bmi30 * 
    smoker + insurance$region, data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-17297.1  -1656.0  -1262.7   -727.8  24161.6 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 139.0053  1363.1359   0.102 0.918792    
insurance$age               -32.6181    59.8250  -0.545 0.585690    
insurance$age2                3.7307     0.7463   4.999 6.54e-07 ***
insurance$children          678.6017   105.8855   6.409 2.03e-10 ***
insurance$bmi               119.7715    34.2796   3.494 0.000492 ***
insurance$sexmale          -496.7690   244.3713  -2.033 0.042267 *  
bmi30                      -997.9355   422.9607  -2.359 0.018449 *  
smokeryes                 13404.5952   439.9591  30.468  < 2e-16 ***
insurance$regionnorthwest  -279.1661   349.2826  -0.799 0.424285    
insurance$regionsoutheast  -828.0345   351.6484  -2.355 0.018682 *  
insurance$regionsouthwest -1222.1619   350.5314  -3.487 0.000505 ***
bmi30:smokeryes           19810.1534   604.6769  32.762  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4445 on 1326 degrees of freedom
Multiple R-squared:  0.8664,	Adjusted R-squared:  0.8653 
F-statistic: 781.7 on 11 and 1326 DF,  p-value: < 2.2e-16

IV. Python implementation
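The original leaves this section empty. A minimal sketch of what a scikit-learn implementation could look like is shown here, using synthetic data in place of the insurance dataset. The variable names liner, x_train, x_test, y_train and y_test follow the evaluation snippet later in the post; the data and coefficients are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): three features with known coefficients.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 3))
true_coef = np.array([3.0, -1.0, 0.5])
true_intercept = 4.0
y = X @ true_coef + true_intercept + rng.normal(scale=0.1, size=500)

# Hold out 20% of the samples for testing.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit ordinary least squares on the training split.
liner = LinearRegression()
liner.fit(x_train, y_train)

# Predictions on both splits, for later evaluation.
y_train_pred = liner.predict(x_train)
y_test_pred = liner.predict(x_test)
print(liner.coef_, liner.intercept_)
```

With real data such as the insurance table, categorical columns (sex, smoker, region) would first need to be encoded numerically, a step that R's lm performs automatically for factors.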
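V. Model performance evaluation
1. Mean squared error (MSE): MSE = (1/n) * Σ (y_i - ŷ_i)^2
2. Coefficient of determination R^2: R^2 = 1 - Σ (y_i - ŷ_i)^2 / Σ (y_i - mean(y))^2
The model's score() method returns R^2.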

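from sklearn.metrics import mean_squared_error, r2_score

# liner is assumed to be an already fitted regressor (e.g. LinearRegression),
# and x_train, x_test, y_train, y_test an existing train/test split
y_train_pred = liner.predict(x_train)
y_test_pred = liner.predict(x_test)

print('R2 = ', liner.score(x_test, y_test))
# alternatively, compute the mean squared error and R^2 explicitly
print("MSE of train: %.2f, test: %.2f" % (
                    mean_squared_error(y_train, y_train_pred),
                    mean_squared_error(y_test, y_test_pred)))

print("R^2 of train: %.2f, test: %.2f" % (
                    r2_score(y_train, y_train_pred),
                    r2_score(y_test, y_test_pred)))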

VI. Extras:
Print the coefficients and the intercept of the linear model

print(liner.coef_,liner.intercept_)

VII. Adding a regularization term to prevent overfitting
L1 regularization: LASSO regression
L2 regularization: Ridge regression
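As a rough illustration of what the penalty does: L1 adds λ * Σ|w_j| to the squared-error loss and L2 adds λ * Σ w_j^2. Ridge has a closed-form solution, sketched below with NumPy; the function name ridge_fit, the value of λ and the synthetic data are assumptions for the example, and the intercept is left out for brevity. LASSO has no closed form and is normally fitted iteratively, for instance with sklearn.linear_model.Lasso.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge weights: (X^T X + lam * I)^(-1) X^T y, no intercept."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# Synthetic data (hypothetical): five features, two of them irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)

w_ols = ridge_fit(X, y, lam=0.0)      # lam = 0 reduces to ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)   # lam > 0 shrinks the weights toward zero
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The ridge weights always have a smaller norm than the unregularized ones, which is the mechanism by which the penalty limits overfitting; the price is some bias in the estimates.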
