L2 check-in study notes

Data preprocessing

Importing the dataset

import numpy as np
import pandas as pd
dataset = pd.read_csv(r'C:\Users\Robert\Desktop\MyWork\PytorchProject\klearning\L1\Data.csv')
dataset
   Country   Age   Salary  Purchased
0  France    44.0  72000.0  No
1  Spain     27.0  48000.0  Yes
2  Germany   30.0  54000.0  No
3  Spain     38.0  61000.0  No
4  Germany   40.0  NaN      Yes
5  France    35.0  58000.0  Yes
6  Spain     NaN   52000.0  No
7  France    48.0  79000.0  Yes
8  Germany   50.0  83000.0  No
9  France    37.0  67000.0  Yes
X = dataset.iloc[ : , :-1].values
Y = dataset.iloc[ : , 3].values
X,Y
(array([['France', 44.0, 72000.0],
        ['Spain', 27.0, 48000.0],
        ['Germany', 30.0, 54000.0],
        ['Spain', 38.0, 61000.0],
        ['Germany', 40.0, nan],
        ['France', 35.0, 58000.0],
        ['Spain', nan, 52000.0],
        ['France', 48.0, 79000.0],
        ['Germany', 50.0, 83000.0],
        ['France', 37.0, 67000.0]], dtype=object),
 array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
       dtype=object))

Handling missing data

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
imputer = imputer.fit(X[ : , 1:3])

X[ : , 1:3] = imputer.transform(X[ : , 1:3])
X
array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

Label encoding

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder_X = LabelEncoder()
X[ : , 0]      = labelencoder_X.fit_transform(X[ : , 0])
X
array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)
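
The notes import OneHotEncoder but only use LabelEncoder. The following is a minimal sketch, on a small made-up `countries` array (an assumption, not the Data.csv file), of how the integer codes above arise and what a one-hot encoding of the same column could look like. LabelEncoder assigns codes in sorted order of the class names, which is why France, Germany and Spain become 0, 1 and 2.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample of the Country column (not read from Data.csv)
countries = np.array(['France', 'Spain', 'Germany', 'Spain'])

# LabelEncoder maps each distinct string to an integer, in sorted order
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(countries)
print(label_encoder.classes_)
print(codes)

# OneHotEncoder expects a 2-D input; it returns a sparse matrix by default,
# so convert it to a dense array for display
one_hot_encoder = OneHotEncoder()
one_hot = one_hot_encoder.fit_transform(countries.reshape(-1, 1)).toarray()
print(one_hot_encoder.categories_[0])
print(one_hot)
```

One-hot encoding avoids implying an order between countries, which the integer codes 0, 1, 2 would otherwise suggest to a model.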

Splitting into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,
                                                    Y,
                                                    test_size = 0.2,
                                                    random_state = 0)

Feature standardization

from sklearn.preprocessing import StandardScaler

sc_X    = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test  = sc_X.transform(X_test)
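
The scaler is fitted on the training set only and then reused on the test set. A minimal sketch of what that means, using small made-up arrays (`X_train_demo` and `X_test_demo` are assumptions for illustration): after fitting, each training column has mean 0 and standard deviation 1, and the test rows are shifted and scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data, two features on very different scales
X_train_demo = np.array([[1.0, 10.0],
                         [2.0, 20.0],
                         [3.0, 30.0],
                         [4.0, 40.0]])
X_test_demo = np.array([[2.5, 25.0]])

scaler = StandardScaler()
# fit_transform learns the per-column mean and standard deviation from the training data
X_train_scaled = scaler.fit_transform(X_train_demo)
# transform reuses those training statistics; it does not refit on the test data
X_test_scaled = scaler.transform(X_test_demo)

print(scaler.mean_)                 # per-column means learned from the training set
print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(X_test_scaled)
```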

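SimpleImputer summary

sklearn.impute.SimpleImputer is a scikit-learn class for imputing missing values in a dataset. It replaces each missing value with a statistic of its column (such as the mean, median, or most frequent value) or with a specified constant.

Replacing missing values with the mean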

import numpy as np
from sklearn.impute import SimpleImputer

# Create a dataset that contains missing values
X = [[1, 2], [np.nan, 3], [7, 6], [4, np.nan]]

# Create a SimpleImputer that replaces missing values with the column mean
imputer = SimpleImputer(strategy='mean')

# Fit the imputer and transform the data
X_imputed = imputer.fit_transform(X)

print(X_imputed)

Replacing missing values with a constant

# Create a SimpleImputer that replaces missing values with the constant -1
imputer = SimpleImputer(strategy='constant', fill_value=-1)

# Fit the imputer and transform the data
X_imputed = imputer.fit_transform(X)

print(X_imputed)
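
The summary also mentions the median and the most frequent value, which the two examples above do not show. A minimal sketch of those strategies, using a made-up array `X_demo` (an assumption, chosen so that the median and the mode of the first column differ):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column
X_demo = np.array([[1, 2],
                   [np.nan, 3],
                   [7, 3],
                   [1, np.nan],
                   [4, 10]])

# strategy='median' fills each gap with the column median
median_imputer = SimpleImputer(strategy='median')
X_median = median_imputer.fit_transform(X_demo)

# strategy='most_frequent' fills each gap with the column mode
mode_imputer = SimpleImputer(strategy='most_frequent')
X_mode = mode_imputer.fit_transform(X_demo)

# statistics_ holds the fill value learned for each column
print(median_imputer.statistics_)
print(mode_imputer.statistics_)
print(X_median)
print(X_mode)
```

The median is less sensitive to outliers than the mean, and most_frequent also works for categorical columns.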
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_csv(r'C:\Users\Robert\Desktop\MyWork\PytorchProject\klearning\learning_base\studentscores.csv')
X = dataset.iloc[ : , :1].values
Y = dataset.iloc[ : ,1].values

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=1/4,
                                                    random_state=0)
dataset
    Hours  Scores
0     2.5      21
1     5.1      47
2     3.2      27
3     8.5      75
4     3.5      30
5     1.5      20
6     9.2      88
7     5.5      60
8     8.3      81
9     2.7      25
10    7.7      85
11    5.9      62
12    4.5      41
13    3.3      42
14    1.1      17
15    8.9      95
16    2.5      30
17    1.9      24
18    6.1      67
19    7.4      69
20    2.7      30
21    4.8      54
22    3.8      35
23    6.9      76
24    7.8      86
25    9.1      93
26    9.2      93
27    9.5      93

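train_test_split(): splits a dataset into a training set and a test set.

  • X: the features of the full dataset to be split;
  • Y: the targets (labels) of the full dataset to be split;
  • test_size: the proportion of the full dataset assigned to the test set (roughly the ratio of the number of rows in X_test to the number of rows in X);
  • random_state:
    if omitted (None), the data is shuffled differently on each run, so the split may differ;
    if set to an integer (including 0), the same split is produced on every run.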
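
The effect of test_size and random_state described above can be checked with a short sketch. The arrays `X_demo` and `y_demo` are made up for illustration and are not the studentscores data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same integer random_state (0 here) give the same split
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

print(a_train.shape, a_test.shape)      # 8 training rows and 2 test rows
print(np.array_equal(a_test, b_test))   # True: the split is reproducible

# With random_state=None (the default) the shuffle is not fixed,
# so repeated calls may return different splits
c_train, c_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2)
print(c_test)
```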

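Linear regression model

Simple (one-variable) linear regression equation: $Y = aX + b$

  • The sklearn.linear_model package implements generalized linear models, including linear regression, Ridge regression, Bayesian regression, and others. LinearRegression is one of the simpler linear regression models among them.

Simple linear regression model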

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor = regressor.fit(X_train, Y_train)

Predicting results

Y_pred = regressor.predict(X_test)
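
To connect the fitted model back to the equation Y = aX + b, the slope and intercept can be read from the coef_ and intercept_ attributes. A minimal sketch, assuming made-up data generated from Y = 2X + 1 (the arrays `hours` and `scores` are hypothetical, not the studentscores file):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line Y = 2X + 1
hours = np.array([[1.0], [2.0], [3.0], [4.0]])
scores = 2.0 * hours.ravel() + 1.0

model = LinearRegression().fit(hours, scores)

a = model.coef_[0]      # slope a in Y = aX + b
b = model.intercept_    # intercept b in Y = aX + b
print(a, b)

# predict() evaluates a * X + b for new inputs
new_pred = model.predict(np.array([[5.0]]))
print(new_pred)
```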

Visualization

# Visualize the training set
plt.scatter(X_train, Y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.show()


# Visualize the test set
plt.scatter(X_test, Y_test, color='red')
plt.plot(X_test, regressor.predict(X_test), color='blue')
plt.show()


# Import the iris dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['花萼-length', '花萼-width', '花瓣-length', '花瓣-width', 'class']

dataset = pd.read_csv(url, names=names)
dataset
     花萼-length  花萼-width  花瓣-length  花瓣-width  class
0    5.1         3.5        1.4         0.2        Iris-setosa
1    4.9         3.0        1.4         0.2        Iris-setosa
2    4.7         3.2        1.3         0.2        Iris-setosa
3    4.6         3.1        1.5         0.2        Iris-setosa
4    5.0         3.6        1.4         0.2        Iris-setosa
...  ...         ...        ...         ...        ...
145  6.7         3.0        5.2         2.3        Iris-virginica
146  6.3         2.5        5.0         1.9        Iris-virginica
147  6.5         3.0        5.2         2.0        Iris-virginica
148  6.2         3.4        5.4         2.3        Iris-virginica
149  5.9         3.0        5.1         1.8        Iris-virginica

150 rows × 5 columns

Multiple linear regression model

Multiple linear regression equation: $Y = aX_1 + bX_2 + cX_3 + \cdots + nX_n$
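
As a sketch of how this equation relates to a fitted model: in scikit-learn the coefficients a, b, c, ... are stored in coef_, and by default LinearRegression also fits a constant intercept term that the equation above does not show. The data below is synthetic (`X_demo`, `true_coef` and the intercept 4.0 are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: Y = 1*X1 + 2*X2 + 3*X3 + 4, with no noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_coef = np.array([1.0, 2.0, 3.0])
y_demo = X_demo @ true_coef + 4.0

model = LinearRegression().fit(X_demo, y_demo)

# The prediction is the weighted sum of the features plus the intercept
manual_pred = X_demo @ model.coef_ + model.intercept_

print(model.coef_)        # close to [1, 2, 3]
print(model.intercept_)   # close to 4
print(np.allclose(manual_pred, model.predict(X_demo)))
```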

Importing the data

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['花萼-length', '花萼-width', '花瓣-length', '花瓣-width', 'class']

dataset = pd.read_csv(url, names=names)
dataset

Data analysis

import matplotlib.pyplot as plt

plt.plot(dataset['花萼-length'], dataset['花瓣-width'], 'x', label="marker='x'")
plt.plot(dataset['花萼-width'],  dataset['花瓣-width'], 'o', label="marker='o'")
plt.plot(dataset['花瓣-length'], dataset['花瓣-width'], 'v', label="marker='v'")

plt.legend(numpoints=1)
plt.show()

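X = dataset.iloc[:, :-1].values
# Collect the distinct class labels
str_set = set(dataset.iloc[:, -1].values)
# Use a dict comprehension to map each class label in the set to an integer index.
# Note: set iteration order is not guaranteed, so this mapping may differ between runs.
my_dict = {value: index for index, value in enumerate(str_set)}
Y = np.array([my_dict[item] for item in dataset.iloc[:, -1].values])
X, Y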
(array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
        [5.5, 4.2, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5. , 3.2, 1.2, 0.2],
        [5.5, 3.5, 1.3, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [4.4, 3. , 1.3, 0.2],
        [5.1, 3.4, 1.5, 0.2],
        [5. , 3.5, 1.3, 0.3],
        [4.5, 2.3, 1.3, 0.3],
        [4.4, 3.2, 1.3, 0.2],
        [5. , 3.5, 1.6, 0.6],
        [5.1, 3.8, 1.9, 0.4],
        [4.8, 3. , 1.4, 0.3],
        [5.1, 3.8, 1.6, 0.2],
        [4.6, 3.2, 1.4, 0.2],
        [5.3, 3.7, 1.5, 0.2],
        [5. , 3.3, 1.4, 0.2],
        [7. , 3.2, 4.7, 1.4],
        [6.4, 3.2, 4.5, 1.5],
        [6.9, 3.1, 4.9, 1.5],
        [5.5, 2.3, 4. , 1.3],
        [6.5, 2.8, 4.6, 1.5],
        [5.7, 2.8, 4.5, 1.3],
        [6.3, 3.3, 4.7, 1.6],
        [4.9, 2.4, 3.3, 1. ],
        [6.6, 2.9, 4.6, 1.3],
        [5.2, 2.7, 3.9, 1.4],
        [5. , 2. , 3.5, 1. ],
        [5.9, 3. , 4.2, 1.5],
        [6. , 2.2, 4. , 1. ],
        [6.1, 2.9, 4.7, 1.4],
        [5.6, 2.9, 3.6, 1.3],
        [6.7, 3.1, 4.4, 1.4],
        [5.6, 3. , 4.5, 1.5],
        [5.8, 2.7, 4.1, 1. ],
        [6.2, 2.2, 4.5, 1.5],
        [5.6, 2.5, 3.9, 1.1],
        [5.9, 3.2, 4.8, 1.8],
        [6.1, 2.8, 4. , 1.3],
        [6.3, 2.5, 4.9, 1.5],
        [6.1, 2.8, 4.7, 1.2],
        [6.4, 2.9, 4.3, 1.3],
        [6.6, 3. , 4.4, 1.4],
        [6.8, 2.8, 4.8, 1.4],
        [6.7, 3. , 5. , 1.7],
        [6. , 2.9, 4.5, 1.5],
        [5.7, 2.6, 3.5, 1. ],
        [5.5, 2.4, 3.8, 1.1],
        [5.5, 2.4, 3.7, 1. ],
        [5.8, 2.7, 3.9, 1.2],
        [6. , 2.7, 5.1, 1.6],
        [5.4, 3. , 4.5, 1.5],
        [6. , 3.4, 4.5, 1.6],
        [6.7, 3.1, 4.7, 1.5],
        [6.3, 2.3, 4.4, 1.3],
        [5.6, 3. , 4.1, 1.3],
        [5.5, 2.5, 4. , 1.3],
        [5.5, 2.6, 4.4, 1.2],
        [6.1, 3. , 4.6, 1.4],
        [5.8, 2.6, 4. , 1.2],
        [5. , 2.3, 3.3, 1. ],
        [5.6, 2.7, 4.2, 1.3],
        [5.7, 3. , 4.2, 1.2],
        [5.7, 2.9, 4.2, 1.3],
        [6.2, 2.9, 4.3, 1.3],
        [5.1, 2.5, 3. , 1.1],
        [5.7, 2.8, 4.1, 1.3],
        [6.3, 3.3, 6. , 2.5],
        [5.8, 2.7, 5.1, 1.9],
        [7.1, 3. , 5.9, 2.1],
        [6.3, 2.9, 5.6, 1.8],
        [6.5, 3. , 5.8, 2.2],
        [7.6, 3. , 6.6, 2.1],
        [4.9, 2.5, 4.5, 1.7],
        [7.3, 2.9, 6.3, 1.8],
        [6.7, 2.5, 5.8, 1.8],
        [7.2, 3.6, 6.1, 2.5],
        [6.5, 3.2, 5.1, 2. ],
        [6.4, 2.7, 5.3, 1.9],
        [6.8, 3. , 5.5, 2.1],
        [5.7, 2.5, 5. , 2. ],
        [5.8, 2.8, 5.1, 2.4],
        [6.4, 3.2, 5.3, 2.3],
        [6.5, 3. , 5.5, 1.8],
        [7.7, 3.8, 6.7, 2.2],
        [7.7, 2.6, 6.9, 2.3],
        [6. , 2.2, 5. , 1.5],
        [6.9, 3.2, 5.7, 2.3],
        [5.6, 2.8, 4.9, 2. ],
        [7.7, 2.8, 6.7, 2. ],
        [6.3, 2.7, 4.9, 1.8],
        [6.7, 3.3, 5.7, 2.1],
        [7.2, 3.2, 6. , 1.8],
        [6.2, 2.8, 4.8, 1.8],
        [6.1, 3. , 4.9, 1.8],
        [6.4, 2.8, 5.6, 2.1],
        [7.2, 3. , 5.8, 1.6],
        [7.4, 2.8, 6.1, 1.9],
        [7.9, 3.8, 6.4, 2. ],
        [6.4, 2.8, 5.6, 2.2],
        [6.3, 2.8, 5.1, 1.5],
        [6.1, 2.6, 5.6, 1.4],
        [7.7, 3. , 6.1, 2.3],
        [6.3, 3.4, 5.6, 2.4],
        [6.4, 3.1, 5.5, 1.8],
        [6. , 3. , 4.8, 1.8],
        [6.9, 3.1, 5.4, 2.1],
        [6.7, 3.1, 5.6, 2.4],
        [6.9, 3.1, 5.1, 2.3],
        [5.8, 2.7, 5.1, 1.9],
        [6.8, 3.2, 5.9, 2.3],
        [6.7, 3.3, 5.7, 2.5],
        [6.7, 3. , 5.2, 2.3],
        [6.3, 2.5, 5. , 1.9],
        [6.5, 3. , 5.2, 2. ],
        [6.2, 3.4, 5.4, 2.3],
        [5.9, 3. , 5.1, 1.8]]),
 array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
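
Because the mapping above is built from a set, which integer each class receives can change from run to run. One possible alternative, sketched here on a few made-up labels (the `labels` array is an assumption), is np.unique with return_inverse=True, which always numbers the classes in sorted order. Note that this would give Iris-setosa the code 0 rather than the 2 seen in the output above.

```python
import numpy as np

# Hypothetical sample of the class column
labels = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa'])

# classes holds the sorted distinct labels;
# y_encoded holds, for each row, the index of its label in classes
classes, y_encoded = np.unique(labels, return_inverse=True)

print(classes)
print(y_encoded)
```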

Building the training and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=0.2,
                                                    random_state=0)
X_test,Y_test

(array([[5.8, 2.8, 5.1, 2.4],
        [6. , 2.2, 4. , 1. ],
        [5.5, 4.2, 1.4, 0.2],
        [7.3, 2.9, 6.3, 1.8],
        [5. , 3.4, 1.5, 0.2],
        [6.3, 3.3, 6. , 2.5],
        [5. , 3.5, 1.3, 0.3],
        [6.7, 3.1, 4.7, 1.5],
        [6.8, 2.8, 4.8, 1.4],
        [6.1, 2.8, 4. , 1.3],
        [6.1, 2.6, 5.6, 1.4],
        [6.4, 3.2, 4.5, 1.5],
        [6.1, 2.8, 4.7, 1.2],
        [6.5, 2.8, 4.6, 1.5],
        [6.1, 2.9, 4.7, 1.4],
        [4.9, 3.1, 1.5, 0.1],
        [6. , 2.9, 4.5, 1.5],
        [5.5, 2.6, 4.4, 1.2],
        [4.8, 3. , 1.4, 0.3],
        [5.4, 3.9, 1.3, 0.4],
        [5.6, 2.8, 4.9, 2. ],
        [5.6, 3. , 4.5, 1.5],
        [4.8, 3.4, 1.9, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [6.2, 2.8, 4.8, 1.8],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.8, 1.9, 0.4],
        [6.2, 2.9, 4.3, 1.3],
        [5. , 2.3, 3.3, 1. ],
        [5. , 3.4, 1.6, 0.4]]),
 array([0, 1, 2, 0, 2, 0, 2, 1, 1, 1, 0, 1, 1, 1, 1, 2, 1, 1, 2, 2, 0, 1,
        2, 2, 0, 2, 2, 1, 1, 2]))

Training the multiple linear regression model

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, Y_train)

Predicting on the test set

y_pred = regressor.predict(X_test)
y_pred
array([-0.06703909,  1.03511926,  2.14594224,  0.18626863,  2.03780827,
       -0.26136881,  2.02684168,  0.6804437 ,  0.71699886,  0.88711034,
        0.45090173,  0.69846035,  0.7872897 ,  0.67013112,  0.66952456,
        2.07605449,  0.64392345,  0.78436045,  1.9626113 ,  2.02449662,
        0.20114236,  0.6053914 ,  1.92471089,  1.97696001,  0.40966628,
        2.11806588,  1.85084934,  0.8328718 ,  1.09349115,  1.89308423])
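
The predictions are continuous values, while the targets are the integer class codes 0, 1 and 2. One simple way to compare them, sketched below, is to round each prediction to the nearest valid code and measure the fraction that match. This sketch makes two assumptions that differ from the notes: it loads the copy of the iris data bundled with scikit-learn via load_iris (so no download is needed), and it uses that loader's own label codes, which need not match the set-based mapping above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Bundled iris data (assumption: used here instead of the UCI URL)
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris,
                                          test_size=0.2,
                                          random_state=0)

reg = LinearRegression().fit(X_tr, y_tr)
raw_pred = reg.predict(X_te)

# Round to the nearest integer and clip to the valid class codes 0..2
class_pred = np.clip(np.rint(raw_pred), 0, 2).astype(int)

# Fraction of test samples whose rounded prediction equals the true code
accuracy = np.mean(class_pred == y_te)
print(class_pred)
print(accuracy)
```

Linear regression treats the class codes as ordered numbers, so a classifier would normally be the more natural choice for this dataset.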

Visualization

# The x-axis holds the true labels and the y-axis holds the predictions
plt.scatter(Y_test, y_pred, color='red')

plt.xlabel("True")
plt.ylabel("Prediction")

plt.show()
