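Learning Objectives
Compared with simple linear regression, multiple linear regression introduces more independent variables, but the steps are essentially the same:
- Data preprocessing
- Fitting a linear model to the training set
- Predicting the results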
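Written out, the only change from the simple case is the number of terms. The following is a sketch in standard notation; the symbols $b_0, \dots, b_n$ (coefficients) and $x_1, \dots, x_n$ (independent variables) are assumed here and do not appear in the original text:

```latex
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n
```

Simple linear regression corresponds to the case $n = 1$.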
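Step-by-Step Explanation
Encoding the Raw Data
Because the raw data contains strings, they must first be converted to integers with LabelEncoder and then encoded with OneHotEncoder. This yields three codes, 100, 010, and 001, one for each of the three distinct strings in the raw data.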
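The two encoding stages can be tried in isolation. The sketch below assumes a recent scikit-learn release and uses a small hypothetical array called `states` (only the three state names come from the text); it is an illustration of the idea, not the tutorial's exact preprocessing.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample of the state column; the three names come from the text
states = np.array(["New York", "California", "Florida", "New York"])

# Step 1: map each string to an integer (classes are sorted alphabetically)
labels = LabelEncoder().fit_transform(states)

# Step 2: expand each integer into its own 0/1 indicator column
one_hot = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()

print(labels)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, which is the 100 / 010 / 001 pattern described above.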
Avoiding the Dummy Variable Trap
# Avoiding Dummy Variable Trap
x = x[:, 1:]
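The code above avoids the so-called dummy variable trap. The raw data contains only three state names (California, Florida, New York), so in theory two binary digits, which can represent four states, are enough. Here the states are encoded as 100, 010, and 001; dropping the first digit leaves 00, 10, and 01, which are still distinguishable, so the first column is simply discarded. If all three columns were kept, they would sum to 1 in every row, which duplicates the information already carried by the model's intercept.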
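The redundancy can be checked numerically. The sketch below builds a small hypothetical dummy matrix (the rows are made up for illustration), adds an intercept column of ones, and compares the matrix rank with and without the first dummy column.

```python
import numpy as np

# Hypothetical one-hot rows for California (100), Florida (010), New York (001)
dummies = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
], dtype=float)
intercept = np.ones((dummies.shape[0], 1))

# Intercept plus all three dummy columns: 4 columns
full = np.hstack([intercept, dummies])
# Intercept plus the last two dummy columns (first one dropped): 3 columns
reduced = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)
print(reduced.shape[1], rank_reduced)
```

The full matrix has more columns than its rank, so one column is a linear combination of the others; after dropping the first dummy column the number of columns and the rank agree.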
Source Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
step = "Step 1: Data Preprocessing"
# Importing the dataset
dataset = pd.read_csv("50_Startups.csv")
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
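# Encoding categorical data
# Note: the categorical_features argument was removed in scikit-learn 0.22, so this step needs an older release
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
onehotencoder = OneHotEncoder(categorical_features=[3])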
x[:, 3] = labelencoder.fit_transform(x[:, 3])
x = onehotencoder.fit_transform(x).toarray()
# Avoiding Dummy Variable Trap
x = x[:, 1:]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
step = "Step 2: Fitting Multiple Linear Regression to the Training set"
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor = regressor.fit(x_train, y_train)
step = "Step 3: Predicting the Test set results"
y_pred = regressor.predict(x_test)
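The `categorical_features` argument of `OneHotEncoder` used above was removed in scikit-learn 0.22. One possible way to get the same three steps on a current release is to route the state column through `ColumnTransformer`, sketched below. Because `50_Startups.csv` is not included here, the example runs on a synthetic stand-in: the column names, the numeric ranges, and the formula used to generate `Profit` are all assumptions made up for illustration, not values from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for 50_Startups.csv (made-up numbers, assumed column names)
rng = np.random.default_rng(0)
n = 50
state_names = np.array(["California", "Florida", "New York"])
dataset = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 160000, n),
    "Administration": rng.uniform(50000, 180000, n),
    "Marketing Spend": rng.uniform(0, 470000, n),
    "State": state_names[np.arange(n) % 3],
})
# Hypothetical relationship, chosen only so the model has something to fit
dataset["Profit"] = (50000 + 0.8 * dataset["R&D Spend"]
                     + 0.03 * dataset["Marketing Spend"]
                     + rng.normal(0, 5000, n))

# Step 1: data preprocessing
x = dataset.iloc[:, :-1]
y = dataset.iloc[:, 4].values

# One-hot encode column 3 (the state); drop="first" discards one dummy column,
# which plays the role of x = x[:, 1:]. The other columns pass through unchanged.
encoder = ColumnTransformer(
    [("state", OneHotEncoder(drop="first"), [3])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
x = encoder.fit_transform(x)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

# Step 2: fitting multiple linear regression to the training set
regressor = LinearRegression()
regressor.fit(x_train, y_train)

# Step 3: predicting the test set results
y_pred = regressor.predict(x_test)
print(x.shape, y_pred.shape)
```

With this layout the two remaining dummy columns come first, followed by the three numeric columns, so the encoded matrix has five features per row.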