1. Regression Algorithms
-
Simple Linear Regression
Data (Salary_Data.csv):

YearsExperience  Salary
1.1              39343.0
1.3              46205.0
1.5              37731.0
2.0              43525.0
2.2              39891.0
2.9              56642.0
...
Principle: fit a line y = b0 + b1*x by ordinary least squares, i.e. choose b0 and b1 to minimize the sum of squared residuals between the observed salaries and the line's predictions.
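The closed-form least-squares estimates can be computed by hand before reaching for sklearn; a minimal numpy sketch using the first few rows of the data above:

import numpy as np

x = np.array([1.1, 1.3, 1.5, 2.0, 2.2])                      # YearsExperience
y = np.array([39343.0, 46205.0, 37731.0, 43525.0, 39891.0])  # Salary

# OLS closed form: b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # intercept and slope of the fitted line y = b0 + b1*x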
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv("Salary_Data.csv")
X = dataset.iloc[:, :-1].values  # independent variable
y = dataset.iloc[:, 1].values    # dependent variable

# Split into training and test sets (test share usually no more than 0.4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Feature scaling is not needed for simple linear regression
# Fit a linear regression on the training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict the test-set results
y_pred = regressor.predict(X_test)

# Plot
fig = plt.figure(figsize=(8, 4))     # figure size: 8 wide, 4 tall
train_ax = fig.add_subplot(1, 2, 1)  # first plot of a 1-row, 2-column grid
test_ax = fig.add_subplot(1, 2, 2)
train_ax.scatter(X_train, y_train, color="red")
train_ax.plot(X_train, regressor.predict(X_train), color="blue")  # fitted regression line
train_ax.set_title("Salary VS Experience (training set)")
train_ax.set_xlabel("Years of Experience")
train_ax.set_ylabel("Salary")
test_ax.scatter(X_test, y_test, color="red")
test_ax.plot(X_train, regressor.predict(X_train), color="blue")   # same fitted line
test_ax.set_title("Salary VS Experience (test set)")
test_ax.set_xlabel("Years of Experience")
test_ax.set_ylabel("Salary")
plt.show()
Result: training-set and test-set scatter plots with the fitted regression line (figure not shown).
-
Multiple Linear Regression (y = b0 + b1*x1 + b2*x2 + … + bn*xn; use a model-building approach to keep as few independent variables as possible)
Principle: extend simple linear regression to several predictors; the coefficients are again estimated by ordinary least squares.
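In matrix form the model is y = X·b with a leading column of ones standing in for b0, and OLS solves the resulting least-squares problem; a minimal numpy sketch (the toy numbers are made up for illustration):

import numpy as np

# Toy design matrix: 4 samples, 2 predictors (made-up values)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([6.0, 5.0, 12.0, 11.0])

# Prepend a column of ones so the first coefficient is the intercept b0
X1 = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X1 · b = y (lstsq is more stable than inverting X'X)
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(b)  # [b0, b1, b2]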
Conditions for applying multiple linear regression: linearity, homoscedasticity, multivariate normality, independence of errors, and absence of multicollinearity.
Convert categorical data into dummy variables (one category, e.g. California, can be left out as the baseline).
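A quick way to see this encoding is pandas.get_dummies, where drop_first=True leaves out the first category (California here) as the all-zeros baseline; a minimal sketch with State values from the data below:

import pandas as pd

states = pd.Series(["New York", "California", "Florida", "New York"], name="State")

# drop_first=True omits California, avoiding the dummy variable trap
print(pd.get_dummies(states, drop_first=True))  # columns: Florida, New York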
Model building: score comparison (fit all possible candidate models and keep the one with the best goodness-of-fit score, e.g. an information criterion)
Model building: bidirectional elimination (alternate forward-selection and backward-elimination steps)
Model building: backward elimination (recommended; start from all predictors and repeatedly drop the least significant one, see the sketch after this list)
Model building: forward selection (start from no predictors and repeatedly add the most significant one)
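Rather than repeating the fit/summary/drop step by hand for each round (as the script below does), backward elimination can be automated; a minimal sketch, assuming X already contains the column of ones for the intercept and using a 0.05 significance level:

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    """Repeatedly drop the predictor whose coefficient has the highest p-value > sl."""
    X = X.copy()
    model = sm.OLS(endog=y, exog=X).fit()
    while X.shape[1] > 1 and model.pvalues.max() > sl:
        worst = int(np.argmax(model.pvalues))  # least significant column
        X = np.delete(X, worst, axis=1)
        model = sm.OLS(endog=y, exog=X).fit()
    return model, X

Called as model, X_opt = backward_elimination(X_train, y_train), this automates the manual rounds shown in the commented-out lines of the script below.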
Data (50_Startups.csv):

R&D Spend  Administration  Marketing Spend  State       Profit
165349.20  136897.80       471784.10        New York    192261.83
162597.70  151377.59       443898.53        California  191792.06
153441.51  101145.55       407934.54        Florida     191050.39
144372.41  118671.85       383199.62        New York    182901.99
142107.34  91391.77        366168.42        Florida     166187.94
131876.90  99814.71        362861.36        New York    156991.12
...
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import pandas as pd
import numpy as np

dataset = pd.read_csv("50_Startups.csv")

# ----------- Build the model with the all-in approach -----------
X = dataset.iloc[:, :-1].values  # independent variables
y = dataset.iloc[:, 4].values    # dependent variable

# Encode the categorical State column as dummy variables.
# ColumnTransformer replaces the LabelEncoder + OneHotEncoder(categorical_features=...)
# pattern, which was removed from recent sklearn versions; the dummy columns come first.
ct = ColumnTransformer([("state", OneHotEncoder(), [3])], remainder="passthrough")
X = ct.fit_transform(X).astype(float)  # cast: passthrough columns keep object dtype otherwise

# Avoiding the dummy variable trap: drop one dummy column
X = X[:, 1:]

# Split into training and test sets (test share usually no more than 0.4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Build the regressor
regression = LinearRegression()
regression.fit(X_train, y_train)

# Predict the test set
y_pred = regression.predict(X_test)

# ----------- Build the model with backward elimination -----------
# y = b0 + b1*x1 + b2*x2 + ... + bn*xn
# Reuse X_train/y_train from the preprocessing above, and prepend a column
# of ones so that the first OLS coefficient acts as the intercept b0
X_train = np.append(arr=np.ones(shape=(X_train.shape[0], 1)), values=X_train, axis=1)

# ####### Backward elimination ########
# At every step, refit and drop the predictor with the highest p-value:
# X_opt = X_train[:, [0, 1, 2, 3, 4, 5]]
# regression_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
# print(regression_OLS.summary())  # detailed statistics of the fitted model
# X_opt = X_train[:, [0, 1, 3, 4, 5]]
# regression_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
# print(regression_OLS.summary())
# X_opt = X_train[:, [0, 3, 4, 5]]
# regression_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
# print(regression_OLS.summary())
# X_opt = X_train[:, [0, 3, 5]]
# regression_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
# print(regression_OLS.summary())
X_opt = X_train[:, [0, 3]]
regression_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
print(regression_OLS.summary())  # detailed statistics of the fitted model
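The P>|t| column printed by summary() is what drives each elimination round; the same p-values are also available programmatically through the fitted model's pvalues attribute:

# Each entry corresponds to a column of X_opt; backward elimination drops
# the column with the largest p-value while that value stays above 0.05
print(regression_OLS.pvalues)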
Model-building approaches
All-in, backward elimination (recommended), forward selection, bidirectional elimination, score comparison
-
Polynomial Regression (y = b0 + b1*x + b2*x^2 + …)
R-squared: R^2 = 1 - SS_res / SS_tot, where SS_res is the residual sum of squares and SS_tot the total sum of squares; the closer to 1, the better the fit.
Adjusted R-squared: adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1) for n samples and p predictors; it penalizes adding predictors that do not improve the fit.
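Both quantities are straightforward to compute from predictions; a minimal sketch (the toy arrays are made up for illustration):

import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y_true, y_pred, p):
    """n samples, p predictors (intercept excluded)."""
    n = len(y_true)
    return 1 - (1 - r_squared(y_true, y_pred)) * (n - 1) / (n - p - 1)

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # made-up values
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(r_squared(y_true, y_pred), adjusted_r_squared(y_true, y_pred, p=1))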
Example: four rounds of backward elimination, used to see how removing predictors affects R-squared and adjusted R-squared.
Data (Position_Salaries.csv):

Position           Level  Salary
Business Analyst   1      45000
Junior Consultant  2      50000
Senior Consultant  3      60000
Manager            4      80000
Country Manager    5      110000
Region Manager     6      150000
...
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Fit a plain linear model
lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Fit a polynomial model
poly_reg = PolynomialFeatures(degree=4)  # polynomial degree 4; change it to adjust the fit
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)

# Plot
X_grid = np.arange(X.min(), X.max(), 0.1)
X_grid = X_grid.reshape(len(X_grid), 1)
fig = plt.figure(figsize=(8, 4))       # figure size: 8 wide, 4 tall
lin_reg_ax = fig.add_subplot(1, 2, 1)  # first plot of a 1-row, 2-column grid
lin_reg2_ax = fig.add_subplot(1, 2, 2)
lin_reg_ax.scatter(X, y, color="red")
lin_reg_ax.plot(X_grid, lin_reg.predict(X_grid), color="blue")  # linear fit
lin_reg_ax.set_title("Truth VS Bluff (Linear Regression)")
lin_reg_ax.set_xlabel("Position Level")
lin_reg_ax.set_ylabel("Salary")
lin_reg2_ax.scatter(X, y, color="red")
lin_reg2_ax.plot(X_grid, lin_reg_2.predict(poly_reg.transform(X_grid)), color="blue")  # polynomial fit
lin_reg2_ax.set_title("Truth VS Bluff (Polynomial Regression)")
lin_reg2_ax.set_xlabel("Position Level")
lin_reg2_ax.set_ylabel("Salary")
plt.show()
Result: side-by-side plots of the linear and polynomial fits against the data points (figure not shown).
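As a usage example, both fitted models can be asked for a single new prediction; the position level 6.5 is just an illustrative value:

# Predict the salary for a hypothetical position level of 6.5
level = [[6.5]]
print(lin_reg.predict(level))                        # straight-line estimate
print(lin_reg_2.predict(poly_reg.transform(level)))  # degree-4 polynomial estimate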