L2 Check-in Study Notes

Latest recommended article published on 2024-09-13 18:30:21

无涯学徒1998

Tags: study notes

Article link: https://blog.csdn.net/Inface0443/article/details/141916795

🍨 This post is a study log from the 🔗365-Day Deep Learning Training Camp
🍖 Original author: K同学啊

Data preprocessing

Importing the dataset

import numpy  as np
import pandas as pd
dataset = pd.read_csv(r'C:\Users\Robert\Desktop\MyWork\PytorchProject\klearning\L1\Data.csv')
dataset

	Country	Age	Salary	Purchased
0	France	44.0	72000.0	No
1	Spain	27.0	48000.0	Yes
2	Germany	30.0	54000.0	No
3	Spain	38.0	61000.0	No
4	Germany	40.0	NaN	Yes
5	France	35.0	58000.0	Yes
6	Spain	NaN	52000.0	No
7	France	48.0	79000.0	Yes
8	Germany	50.0	83000.0	No
9	France	37.0	67000.0	Yes

X = dataset.iloc[ : , :-1].values
Y = dataset.iloc[ : , 3].values

X,Y

(array([['France', 44.0, 72000.0],
        ['Spain', 27.0, 48000.0],
        ['Germany', 30.0, 54000.0],
        ['Spain', 38.0, 61000.0],
        ['Germany', 40.0, nan],
        ['France', 35.0, 58000.0],
        ['Spain', nan, 52000.0],
        ['France', 48.0, 79000.0],
        ['Germany', 50.0, 83000.0],
        ['France', 37.0, 67000.0]], dtype=object),
 array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
       dtype=object))

Handling missing data

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
imputer = imputer.fit(X[ : , 1:3])

X[ : , 1:3] = imputer.transform(X[ : , 1:3])
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
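As a quick check on the imputed values above, the column means can be recomputed by hand. This is a minimal sketch that retypes the Age and Salary columns from the table (with np.nan for the missing entries) instead of reading Data.csv; the variable names are illustrative only.

```python
import numpy as np

# Age and Salary columns retyped from the table above; np.nan marks missing entries
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan, 58000,
                   52000, 79000, 83000, 67000], dtype=float)

# np.nanmean skips NaN entries, matching what strategy='mean' computes per column
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

print(age_mean, salary_mean)
```

The two means should agree with the values that SimpleImputer wrote into rows 6 and 4 of X.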

Label encoding

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder_X = LabelEncoder()
X[ : , 0]      = labelencoder_X.fit_transform(X[ : , 0])
X

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)
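LabelEncoder assigns integers in sorted order (France=0, Germany=1, Spain=2), which can suggest an ordering that the countries do not actually have. The OneHotEncoder imported above is one common alternative. The sketch below is an illustration rather than the original post's code, and it uses a small hand-typed country column instead of the full X.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A small hand-typed country column (2-D, as OneHotEncoder expects)
countries = np.array([['France'], ['Spain'], ['Germany'], ['Spain']])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for display
one_hot = encoder.fit_transform(countries).toarray()

# Categories are stored in sorted order: France, Germany, Spain
print(encoder.categories_[0])
print(one_hot)
```

Each row then has a single 1 in the column of its country, so no country is numerically "larger" than another.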

Splitting into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,
                                                    Y,
                                                    test_size = 0.2,
                                                    random_state = 0)

Feature standardization

from sklearn.preprocessing import StandardScaler

sc_X    = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test  = sc_X.transform(X_test)
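The code above fits the scaler on the training set only and then reuses it on the test set. A minimal sketch of what that means, assuming a tiny made-up single-feature dataset: the test values are standardized with the training mean and standard deviation, not their own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data, for illustration only
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Manual equivalent: (x - training mean) / training std (population std, ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(test_scaled, manual)
```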

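SimpleImputer summary

sklearn.impute.SimpleImputer is a scikit-learn class for imputing missing values in a dataset. It replaces each missing value with a per-column statistic (the mean, median, or most frequent value) or with a specified constant.

Replacing missing values with the mean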

import numpy as np
from sklearn.impute import SimpleImputer

# Build a dataset that contains missing values
X = [[1, 2], [np.nan, 3], [7, 6], [4, np.nan]]

# Create a SimpleImputer that replaces missing values with the column mean
imputer = SimpleImputer(strategy='mean')

# Fit the imputer and transform the data
X_imputed = imputer.fit_transform(X)

print(X_imputed)

Replacing missing values with a constant

# Create a SimpleImputer that replaces missing values with the constant -1
imputer = SimpleImputer(strategy='constant', fill_value=-1)

# Fit the imputer and transform the data
X_imputed = imputer.fit_transform(X)

print(X_imputed)
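The summary also mentions the median and the most frequent value. A minimal sketch of those two strategies, assuming a small made-up array (not the one used above) so that each column has a clear mode:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: column 0 has a repeated 1, column 1 has a repeated 2
X_demo = np.array([[1, 2], [np.nan, 2], [7, 6], [1, np.nan], [4, 2]], dtype=float)

median_imputer = SimpleImputer(strategy='median')
mode_imputer = SimpleImputer(strategy='most_frequent')

X_median = median_imputer.fit_transform(X_demo)
X_mode = mode_imputer.fit_transform(X_demo)

# statistics_ holds the fill value learned for each column
print(median_imputer.statistics_)
print(mode_imputer.statistics_)
```

The median is less sensitive to outliers than the mean, and most_frequent also works for categorical columns.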

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_csv(r'C:\Users\Robert\Desktop\MyWork\PytorchProject\klearning\learning_base\studentscores.csv')
X = dataset.iloc[ : , :1].values
Y = dataset.iloc[ : ,1].values

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=1/4,
                                                    random_state=0)
dataset

	Hours	Scores
0	2.5	21
1	5.1	47
2	3.2	27
3	8.5	75
4	3.5	30
5	1.5	20
6	9.2	88
7	5.5	60
8	8.3	81
9	2.7	25
10	7.7	85
11	5.9	62
12	4.5	41
13	3.3	42
14	1.1	17
15	8.9	95
16	2.5	30
17	1.9	24
18	6.1	67
19	7.4	69
20	2.7	30
21	4.8	54
22	3.8	35
23	6.9	76
24	7.8	86
25	9.1	93
26	9.2	93
27	9.5	93

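train_test_split(): splits a dataset into a training set and a test set.

X: the features of the full dataset to be split;
Y: the targets of the full dataset to be split;
test_size: the fraction of the full dataset assigned to the test set (roughly the ratio of X_test to X);
random_state:
If omitted (None), the data is shuffled differently on each run, so the split may vary.
If set to any integer (including 0), the same split is produced on every run.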
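The effect of random_state can be checked directly. A minimal sketch, assuming a small made-up array in place of the student scores data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same integer seed (0 included) give the same split
_, _, _, y_test_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
_, _, _, y_test_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

same_split = np.array_equal(y_test_a, y_test_b)
n_test = len(y_test_a)  # 20% of 10 samples

print(same_split, n_test)
```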

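Linear regression models

The simple (single-variable) linear regression equation is $Y = aX + b$.

The sklearn.linear_model module implements a family of linear models, including ordinary linear regression, Ridge regression, and Bayesian regression. LinearRegression is the simplest of these: it fits the line by ordinary least squares.

Simple linear regression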

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor = regressor.fit(X_train, Y_train)

Predicting results

Y_pred = regressor.predict(X_test)
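To connect the fitted model back to $Y = aX + b$, the slope and intercept can be read from the estimator's coef_ and intercept_ attributes. A minimal sketch, assuming made-up points that lie exactly on the line Y = 2X + 1 instead of the student scores data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on Y = 2X + 1
hours = np.array([[1.0], [2.0], [3.0], [4.0]])
scores = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(hours, scores)

slope = model.coef_[0]        # a in Y = aX + b
intercept = model.intercept_  # b in Y = aX + b
prediction = model.predict(np.array([[5.0]]))[0]

print(slope, intercept, prediction)
```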

Visualization

plt.scatter(X_train, Y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
# Visualize the training set
plt.show()

[Figure: training-set points (red) with the fitted regression line (blue)]

plt.scatter(X_test, Y_test, color='red')
plt.plot(X_test, regressor.predict(X_test), color='blue')
# Visualize the test set
plt.show()

[Figure: test-set points (red) with the fitted regression line (blue)]

# Import the Iris dataset (in the column names, 花萼 = sepal and 花瓣 = petal)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['花萼-length', '花萼-width', '花瓣-length', '花瓣-width', 'class']

dataset = pd.read_csv(url, names=names)
dataset

	花萼-length	花萼-width	花瓣-length	花瓣-width	class
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	Iris-virginica
146	6.3	2.5	5.0	1.9	Iris-virginica
147	6.5	3.0	5.2	2.0	Iris-virginica
148	6.2	3.4	5.4	2.3	Iris-virginica
149	5.9	3.0	5.1	1.8	Iris-virginica

150 rows × 5 columns

Multiple linear regression

The multiple linear regression equation is $Y = a_1 X_1 + a_2 X_2 + \dots + a_n X_n + b$.

Importing the data

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['花萼-length', '花萼-width', '花瓣-length', '花瓣-width', 'class']

dataset = pd.read_csv(url, names=names)
dataset

Data analysis

import matplotlib.pyplot as plt

plt.plot(dataset['花萼-length'], dataset['花瓣-width'], 'x', label="marker='x'")
plt.plot(dataset['花萼-width'],  dataset['花瓣-width'], 'o', label="marker='o'")
plt.plot(dataset['花瓣-length'], dataset['花瓣-width'], 'v', label="marker='v'")

plt.legend(numpoints=1)
plt.show()

X = dataset.iloc[ : ,:-1].values
str_set = set(list(dataset.iloc[ : ,  -1 ].values))
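# Build a label-to-index dictionary from the set with a dict comprehension.
# Note: set iteration order is not guaranteed, so the mapping can differ between runs;
# the output below comes from one particular run.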
my_dict = {value:index  for index, value in enumerate(str_set)}
Y =np.array([my_dict[item] for item in dataset.iloc[ : ,  -1 ].values])
X,Y

(array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
        [5.5, 4.2, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5. , 3.2, 1.2, 0.2],
        [5.5, 3.5, 1.3, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [4.4, 3. , 1.3, 0.2],
        [5.1, 3.4, 1.5, 0.2],
        [5. , 3.5, 1.3, 0.3],
        [4.5, 2.3, 1.3, 0.3],
        [4.4, 3.2, 1.3, 0.2],
        [5. , 3.5, 1.6, 0.6],
        [5.1, 3.8, 1.9, 0.4],
        [4.8, 3. , 1.4, 0.3],
        [5.1, 3.8, 1.6, 0.2],
        [4.6, 3.2, 1.4, 0.2],
        [5.3, 3.7, 1.5, 0.2],
        [5. , 3.3, 1.4, 0.2],
        [7. , 3.2, 4.7, 1.4],
        [6.4, 3.2, 4.5, 1.5],
        [6.9, 3.1, 4.9, 1.5],
        [5.5, 2.3, 4. , 1.3],
        [6.5, 2.8, 4.6, 1.5],
        [5.7, 2.8, 4.5, 1.3],
        [6.3, 3.3, 4.7, 1.6],
        [4.9, 2.4, 3.3, 1. ],
        [6.6, 2.9, 4.6, 1.3],
        [5.2, 2.7, 3.9, 1.4],
        [5. , 2. , 3.5, 1. ],
        [5.9, 3. , 4.2, 1.5],
        [6. , 2.2, 4. , 1. ],
        [6.1, 2.9, 4.7, 1.4],
        [5.6, 2.9, 3.6, 1.3],
        [6.7, 3.1, 4.4, 1.4],
        [5.6, 3. , 4.5, 1.5],
        [5.8, 2.7, 4.1, 1. ],
        [6.2, 2.2, 4.5, 1.5],
        [5.6, 2.5, 3.9, 1.1],
        [5.9, 3.2, 4.8, 1.8],
        [6.1, 2.8, 4. , 1.3],
        [6.3, 2.5, 4.9, 1.5],
        [6.1, 2.8, 4.7, 1.2],
        [6.4, 2.9, 4.3, 1.3],
        [6.6, 3. , 4.4, 1.4],
        [6.8, 2.8, 4.8, 1.4],
        [6.7, 3. , 5. , 1.7],
        [6. , 2.9, 4.5, 1.5],
        [5.7, 2.6, 3.5, 1. ],
        [5.5, 2.4, 3.8, 1.1],
        [5.5, 2.4, 3.7, 1. ],
        [5.8, 2.7, 3.9, 1.2],
        [6. , 2.7, 5.1, 1.6],
        [5.4, 3. , 4.5, 1.5],
        [6. , 3.4, 4.5, 1.6],
        [6.7, 3.1, 4.7, 1.5],
        [6.3, 2.3, 4.4, 1.3],
        [5.6, 3. , 4.1, 1.3],
        [5.5, 2.5, 4. , 1.3],
        [5.5, 2.6, 4.4, 1.2],
        [6.1, 3. , 4.6, 1.4],
        [5.8, 2.6, 4. , 1.2],
        [5. , 2.3, 3.3, 1. ],
        [5.6, 2.7, 4.2, 1.3],
        [5.7, 3. , 4.2, 1.2],
        [5.7, 2.9, 4.2, 1.3],
        [6.2, 2.9, 4.3, 1.3],
        [5.1, 2.5, 3. , 1.1],
        [5.7, 2.8, 4.1, 1.3],
        [6.3, 3.3, 6. , 2.5],
        [5.8, 2.7, 5.1, 1.9],
        [7.1, 3. , 5.9, 2.1],
        [6.3, 2.9, 5.6, 1.8],
        [6.5, 3. , 5.8, 2.2],
        [7.6, 3. , 6.6, 2.1],
        [4.9, 2.5, 4.5, 1.7],
        [7.3, 2.9, 6.3, 1.8],
        [6.7, 2.5, 5.8, 1.8],
        [7.2, 3.6, 6.1, 2.5],
        [6.5, 3.2, 5.1, 2. ],
        [6.4, 2.7, 5.3, 1.9],
        [6.8, 3. , 5.5, 2.1],
        [5.7, 2.5, 5. , 2. ],
        [5.8, 2.8, 5.1, 2.4],
        [6.4, 3.2, 5.3, 2.3],
        [6.5, 3. , 5.5, 1.8],
        [7.7, 3.8, 6.7, 2.2],
        [7.7, 2.6, 6.9, 2.3],
        [6. , 2.2, 5. , 1.5],
        [6.9, 3.2, 5.7, 2.3],
        [5.6, 2.8, 4.9, 2. ],
        [7.7, 2.8, 6.7, 2. ],
        [6.3, 2.7, 4.9, 1.8],
        [6.7, 3.3, 5.7, 2.1],
        [7.2, 3.2, 6. , 1.8],
        [6.2, 2.8, 4.8, 1.8],
        [6.1, 3. , 4.9, 1.8],
        [6.4, 2.8, 5.6, 2.1],
        [7.2, 3. , 5.8, 1.6],
        [7.4, 2.8, 6.1, 1.9],
        [7.9, 3.8, 6.4, 2. ],
        [6.4, 2.8, 5.6, 2.2],
        [6.3, 2.8, 5.1, 1.5],
        [6.1, 2.6, 5.6, 1.4],
        [7.7, 3. , 6.1, 2.3],
        [6.3, 3.4, 5.6, 2.4],
        [6.4, 3.1, 5.5, 1.8],
        [6. , 3. , 4.8, 1.8],
        [6.9, 3.1, 5.4, 2.1],
        [6.7, 3.1, 5.6, 2.4],
        [6.9, 3.1, 5.1, 2.3],
        [5.8, 2.7, 5.1, 1.9],
        [6.8, 3.2, 5.9, 2.3],
        [6.7, 3.3, 5.7, 2.5],
        [6.7, 3. , 5.2, 2.3],
        [6.3, 2.5, 5. , 1.9],
        [6.5, 3. , 5.2, 2. ],
        [6.2, 3.4, 5.4, 2.3],
        [5.9, 3. , 5.1, 1.8]]),
 array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
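Because the mapping above is built from a set, the integer assigned to each class can change from run to run. One way to make it reproducible is to sort the class names first. This is a sketch with a few hand-typed labels, and it yields setosa=0, versicolor=1, virginica=2, which differs from the mapping in the output shown above.

```python
import numpy as np

# A few hand-typed labels standing in for the 'class' column
labels = ['Iris-setosa', 'Iris-virginica', 'Iris-versicolor', 'Iris-setosa']

# Sorting the unique names fixes the order, and therefore the mapping
classes = sorted(set(labels))
mapping = {name: index for index, name in enumerate(classes)}
encoded = np.array([mapping[name] for name in labels])

print(mapping)
print(encoded)
```

sklearn.preprocessing.LabelEncoder, used earlier in these notes, gives the same sorted assignment.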

Building the training and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=0.2,
                                                    random_state=0)
X_test,Y_test

(array([[5.8, 2.8, 5.1, 2.4],
        [6. , 2.2, 4. , 1. ],
        [5.5, 4.2, 1.4, 0.2],
        [7.3, 2.9, 6.3, 1.8],
        [5. , 3.4, 1.5, 0.2],
        [6.3, 3.3, 6. , 2.5],
        [5. , 3.5, 1.3, 0.3],
        [6.7, 3.1, 4.7, 1.5],
        [6.8, 2.8, 4.8, 1.4],
        [6.1, 2.8, 4. , 1.3],
        [6.1, 2.6, 5.6, 1.4],
        [6.4, 3.2, 4.5, 1.5],
        [6.1, 2.8, 4.7, 1.2],
        [6.5, 2.8, 4.6, 1.5],
        [6.1, 2.9, 4.7, 1.4],
        [4.9, 3.1, 1.5, 0.1],
        [6. , 2.9, 4.5, 1.5],
        [5.5, 2.6, 4.4, 1.2],
        [4.8, 3. , 1.4, 0.3],
        [5.4, 3.9, 1.3, 0.4],
        [5.6, 2.8, 4.9, 2. ],
        [5.6, 3. , 4.5, 1.5],
        [4.8, 3.4, 1.9, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [6.2, 2.8, 4.8, 1.8],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.8, 1.9, 0.4],
        [6.2, 2.9, 4.3, 1.3],
        [5. , 2.3, 3.3, 1. ],
        [5. , 3.4, 1.6, 0.4]]),
 array([0, 1, 2, 0, 2, 0, 2, 1, 1, 1, 0, 1, 1, 1, 1, 2, 1, 1, 2, 2, 0, 1,
        2, 2, 0, 2, 2, 1, 1, 2]))

Training the multiple linear regression model

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, Y_train)

Predicting on the test set

y_pred = regressor.predict(X_test)
y_pred

array([-0.06703909,  1.03511926,  2.14594224,  0.18626863,  2.03780827,
       -0.26136881,  2.02684168,  0.6804437 ,  0.71699886,  0.88711034,
        0.45090173,  0.69846035,  0.7872897 ,  0.67013112,  0.66952456,
        2.07605449,  0.64392345,  0.78436045,  1.9626113 ,  2.02449662,
        0.20114236,  0.6053914 ,  1.92471089,  1.97696001,  0.40966628,
        2.11806588,  1.85084934,  0.8328718 ,  1.09349115,  1.89308423])
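The predictions are continuous values, while the targets are the integer class indices 0, 1 and 2. One rough way to compare them is to round each prediction to the nearest valid index. The sketch below does this for only the first five test samples copied from the outputs above, so the resulting score is an illustration and not the model's accuracy on the full test set.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# First five true labels and predictions, copied from the outputs above
y_true = np.array([0, 1, 2, 0, 2])
y_pred = np.array([-0.06703909, 1.03511926, 2.14594224, 0.18626863, 2.03780827])

# Regression error on the raw continuous predictions
mse = mean_squared_error(y_true, y_pred)

# Round to the nearest class index and clip into the valid range 0..2
y_class = np.clip(np.rint(y_pred), 0, 2).astype(int)
accuracy = np.mean(y_class == y_true)

print(mse, y_class, accuracy)
```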

Visualization

plt.scatter(Y_test,y_pred, color='red')

# The x-axis holds the true labels and the y-axis the model's predictions
plt.xlabel("True")
plt.ylabel("Prediction")

plt.show()

[Figure: scatter plot of predicted values against true labels on the test set]
