1. Regression Algorithms
-
Simple Linear Regression
Data (Salary_Data.csv):

YearsExperience  Salary
1.1              39343.0
1.3              46205.0
1.5              37731.0
2.0              43525.0
2.2              39891.0
2.9              56642.0
...
Principle: fit a line y = b0 + b1*x by ordinary least squares, i.e. choose b0 and b1 to minimize the sum of squared residuals between the observed salaries and the line's predictions.
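The closed-form least-squares estimates can be computed by hand before reaching for sklearn; a minimal numpy sketch using the first few rows of the data above:

import numpy as np

x = np.array([1.1, 1.3, 1.5, 2.0, 2.2])                      # YearsExperience
y = np.array([39343.0, 46205.0, 37731.0, 43525.0, 39891.0])  # Salary

# OLS closed form: b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # intercept and slope of the fitted line y = b0 + b1*x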
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv("Salary_Data.csv")
X = dataset.iloc[:, :-1].values  # independent variable
y = dataset.iloc[:, 1].values    # dependent variable

# Split into training and test sets (test share usually no more than 0.4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Feature scaling is not needed for simple linear regression
# Fit a linear regression on the training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict the test-set results
y_pred = regressor.predict(X_test)

# Plot
fig = plt.figure(figsize=(8, 4))     # figure size: 8 wide, 4 tall
train_ax = fig.add_subplot(1, 2, 1)  # first plot of a 1-row, 2-column grid
test_ax = fig.add_subplot(1, 2, 2)
train_ax.scatter(X_train, y_train, color="red")
train_ax.plot(X_train, regressor.predict(X_train), color="blue")  # fitted regression line
train_ax.set_title("Salary VS Experience (training set)")
train_ax.set_xlabel("Years of Experience")
train_ax.set_ylabel("Salary")
test_ax.scatter(X_test, y_test, color="red")
test_ax.plot(X_train, regressor.predict(X_train), color="blue")   # same fitted line
test_ax.set_title("Salary VS Experience (test set)")
test_ax.set_xlabel("Years of Experience")
test_ax.set_ylabel("Salary")
plt.show()
Result: training-set and test-set scatter plots with the fitted regression line (figure not shown).
-
Multiple Linear Regression (y = b0 + b1*x1 + b2*x2 + … + bn*xn; use a model-building approach to keep as few independent variables as possible)
Principle: extend simple linear regression to several predictors; the coefficients are again estimated by ordinary least squares.
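In matrix form the model is y = X·b with a leading column of ones standing in for b0, and OLS solves the resulting least-squares problem; a minimal numpy sketch (the toy numbers are made up for illustration):

import numpy as np

# Toy design matrix: 4 samples, 2 predictors (made-up values)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([6.0, 5.0, 12.0, 11.0])

# Prepend a column of ones so the first coefficient is the intercept b0
X1 = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X1 · b = y (lstsq is more stable than inverting X'X)
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(b)  # [b0, b1, b2]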
Conditions for applying multiple linear regression: linearity, homoscedasticity, multivariate normality, independence of errors, and absence of multicollinearity.
Convert categorical data into dummy variables (one category, e.g. California, can be left out as the baseline).
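A quick way to see this encoding is pandas.get_dummies, where drop_first=True leaves out the first category (California here) as the all-zeros baseline; a minimal sketch with State values from the data below:

import pandas as pd

states = pd.Series(["New York", "California", "Florida", "New York"], name="State")

# drop_first=True omits California, avoiding the dummy variable trap
print(pd.get_dummies(states, drop_first=True))  # columns: Florida, New York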
Model building: score comparison (fit all possible candidate models and keep the one with the best goodness-of-fit score, e.g. an information criterion)
Model building: bidirectional elimination (alternate forward-selection and backward-elimination steps)
Model building: backward elimination (recommended; start from all predictors and repeatedly drop the least significant one, see the sketch after this list)
Model building: forward selection (start from no predictors and repeatedly add the most significant one)
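Rather than repeating the fit/summary/drop step by hand for each round (as the script below does), backward elimination can be automated; a minimal sketch, assuming X already contains the column of ones for the intercept and using a 0.05 significance level:

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    """Repeatedly drop the predictor whose coefficient has the highest p-value > sl."""
    X = X.copy()
    model = sm.OLS(endog=y, exog=X).fit()
    while X.shape[1] > 1 and model.pvalues.max() > sl:
        worst = int(np.argmax(model.pvalues))  # least significant column
        X = np.delete(X, worst, axis=1)
        model = sm.OLS(endog=y, exog=X).fit()
    return model, X

Called as model, X_opt = backward_elimination(X_train, y_train), this automates the manual rounds shown in the commented-out lines of the script below.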
Data (50_Startups.csv):

R&D Spend  Administration  Marketing Spend  State       Profit
165349.20  136897.80       471784.10        New York    192261.83
162597.70  151377.59       443898.53        California  191792.06
153441.51  101145.55       407934.54        Florida     191050.39
144372.41  118671.85       383199.62        New York    182901.99
142107.34  91391.77        366168.42        Florida     166187.94
131876.90  99814.71        362861.36        New York    156991.12
...
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import pandas as pd
import numpy as np

dataset = pd.read_csv("50_Startups.csv")

# ----------- Build the model with the all-in approach -----------
X = dataset.iloc[:, :-1].values  # independent variables
y = dataset.iloc[:, 4].values    # dependent variable

# Encode the categorical State column as dummy variables.
# ColumnTransformer replaces the LabelEncoder + OneHotEncoder(categorical_features=...)
# pattern, which was removed from recent sklearn versions; the dummy columns come first.
ct = ColumnTransformer([("state", OneHotEncoder(), [3])], remainder="passthrough")
X = ct.fit_transform(X).astype(float)  # cast: passthrough columns keep object dtype otherwise

# Avoiding the dummy variable trap: drop one dummy column
X = X[:, 1:]

# Split into training and test sets (test share usually no more than 0.4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Build the regressor
regression = LinearRegression()
regression.fit(X_train, y_train)

# Predict the test set
y_pred = regression.predict(X_test)

# ----------- Build the model with backward elimination -----------
# y = b0 + b1*x1 + b2*x2 + ... + bn*xn
# Reuse X_train/y_train from the preprocessing above, and prepend a column
# of ones so that the first OLS coefficient acts as the intercept b0
X_train = np.append(arr=np.ones(shape=(X_train.shape[0], 1)), values=X_train, axis=1)

# ####### Backward elimination ########
# At every step, refit and drop the predictor with the highest p-value:
# X_opt = X_train[:, [0, 1, 2, 3, 4, 5]]
# regression_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
# print(regression_OLS.summary())  # detailed statistics of the fitted model
# X_opt = X_train[:, [0, 1, 3, 4, 5]]
# regression_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
# print(regression_OLS.summary())
# X_opt = X_train[:, [0, 3, 4, 5]]
# regression_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
# print(regression_OLS.summary())
# X_opt = X_train[:, [0, 3, 5]]
# regression_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
# print(regression_OLS.summary())
X_opt = X_train[:, [0, 3]]
regression_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
print(regression_OLS.summary())  # detailed statistics of the fitted model
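The P>|t| column printed by summary() is what drives each elimination round; the same p-values are also available programmatically through the fitted model's pvalues attribute:

# Each entry corresponds to a column of X_opt; backward elimination drops
# the column with the largest p-value while that value stays above 0.05
print(regression_OLS.pvalues)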
Model-building approaches
All-in, backward elimination (recommended), forward selection, bidirectional elimination, score comparison
-
Polynomial Regression (y = b0 + b1*x + b2*x^2 + …)
R-squared: R^2 = 1 - SS_res / SS_tot, where SS_res is the residual sum of squares and SS_tot the total sum of squares; the closer to 1, the better the fit.
Adjusted R-squared: adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1) for n samples and p predictors; it penalizes adding predictors that do not improve the fit.
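Both quantities are straightforward to compute from predictions; a minimal sketch (the toy arrays are made up for illustration):

import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y_true, y_pred, p):
    """n samples, p predictors (intercept excluded)."""
    n = len(y_true)
    return 1 - (1 - r_squared(y_true, y_pred)) * (n - 1) / (n - p - 1)

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # made-up values
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(r_squared(y_true, y_pred), adjusted_r_squared(y_true, y_pred, p=1))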
Example: four rounds of backward elimination, used to see how removing predictors affects R-squared and adjusted R-squared.
Data (Position_Salaries.csv):

Position           Level  Salary
Business Analyst   1      45000
Junior Consultant  2      50000
Senior Consultant  3      60000
Manager            4      80000
Country Manager    5      110000
Region Manager     6      150000
...
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Fit a plain linear model
lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Fit a polynomial model
poly_reg = PolynomialFeatures(degree=4)  # polynomial degree 4; change it to adjust the fit
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)

# Plot
X_grid = np.arange(X.min(), X.max(), 0.1)
X_grid = X_grid.reshape(len(X_grid), 1)
fig = plt.figure(figsize=(8, 4))       # figure size: 8 wide, 4 tall
lin_reg_ax = fig.add_subplot(1, 2, 1)  # first plot of a 1-row, 2-column grid
lin_reg2_ax = fig.add_subplot(1, 2, 2)
lin_reg_ax.scatter(X, y, color="red")
lin_reg_ax.plot(X_grid, lin_reg.predict(X_grid), color="blue")  # linear fit
lin_reg_ax.set_title("Truth VS Bluff (Linear Regression)")
lin_reg_ax.set_xlabel("Position Level")
lin_reg_ax.set_ylabel("Salary")
lin_reg2_ax.scatter(X, y, color="red")
lin_reg2_ax.plot(X_grid, lin_reg_2.predict(poly_reg.transform(X_grid)), color="blue")  # polynomial fit
lin_reg2_ax.set_title("Truth VS Bluff (Polynomial Regression)")
lin_reg2_ax.set_xlabel("Position Level")
lin_reg2_ax.set_ylabel("Salary")
plt.show()
Result: side-by-side plots of the linear and polynomial fits against the data points (figure not shown).
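As a usage example, both fitted models can be asked for a single new prediction; the position level 6.5 is just an illustrative value:

# Predict the salary for a hypothetical position level of 6.5
level = [[6.5]]
print(lin_reg.predict(level))                        # straight-line estimate
print(lin_reg_2.predict(poly_reg.transform(level)))  # degree-4 polynomial estimate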