【吃瓜训练营】西瓜书第一章和第二章第二次-CSDN博客

本文链接：https://blog.csdn.net/klmyty/article/details/129036321

机器学习第一、二章

第一章绪论

1.2 基本术语

D={x1,x2,…,xm} 表示m个示例的数据集，每个示例由d个属性(attribute，feature)描述，每个示例xi=(xi1;xi2;…;xij)是d维样本空间X中的一个向量。
在csv文件中d为列数，m为行数，xi为第i行向量，是样本（instance，sample)
假设hypothesis：学得模型对应了关于数据的某种潜在的规律
ground-truth：潜在规律自身

1.5 发展历程

从样例中学习：

符号主义学习：
- 决策树
- 基于逻辑的学习（eg.ILP归纳逻辑程序设计）
基于神经网络的连接主义学习（eg.BP神经网络）
- 连接主义学习的最大局限是其“试错性”；
- 简单地说,其学习过程涉及大量参数，而参数的设置缺乏理论指导，主要靠手工“调参”；
- 夸张一点说，参数调节上失之毫厘，学习结果可能谬以千里
“统计学习”：
- 支持向量机SVM
- 核方法kernel method
深度学习：很多层的神经网络
多用于语音、图像等复杂对象中

第二章模型评估与选择

2.1 经验误差与过拟合

精度accuracy=1-错误率，错误率为E,当m个样本有a个分类错误，E=a/m
误差：学习器的实际预测输出与样本的真实输出之间的差异称为“误差”(error)
- 学习器在训练集上的误差称为“训练误差”(training error)或"经验误差”(empirical error),
- 在新样本上的误差称为“泛化误差"(generalization error)
NP问题：
- NP（非确定性多项式）问题是可以在多项式时间内猜测和验证其解的问题，并且猜测是以非确定性方式完成的，这意味着没有遵循特定规则。
- 如果一个问题是 NP 问题，并且所有其他 NP 问题都可以在多项式时间内归结为它，则该问题被认为是 NP 完全问题。（NP-complete）
- 如果解决问题的算法可以转化为解决任何其他 NP 问题的算法，则该问题被认为是 NP 难问题（NP-hard)
- 人们认为证明一个问题是 NP 问题比证明它是 NP-hard 问题更容易。既是 NP 又是 NP-hard 的问题称为 NP-完全问题。
测试集训练集的划分方法：
- 留出法 hold-out
  
  需注意的是，训练/测试集的划分要尽可能保持数据分布的一致性，避免因数据划分过程引入额外的偏差而对最终结果产生影响，例如在分类任务中至少要保持样本的类别比例相似.
  
  分层采样stratified sampling：保留类别比例的采样
  
  在train_test_split()中加入stratify=y可以分层抽样，保持类别比例
```
# Split the data into training and test sets using stratified sampling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Check the class distribution in the training set
print("Class distribution in training set: ", np.bincount(y_train))

# Check the class distribution in the test set
print("Class distribution in test set: ", np.bincount(y_test))
```
- 交叉验证 cross validation
  - 先将数据集 D 划分为 k 个大小相似的互斥子集，每个子集都尽可能保持数据分布的一致性，即从D 中通过分层采样得到
  - 然后，每次用 k - 1 个子集的并集作为训练集，余下的那个子集作为测试集
  - 这样就可获得 k 组训练/测试集，从而可进行 k 次训练和测试，最终返回的是这 k 个测试结果的均值
  - 5折交叉验证代码示例

from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# create the dataset

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
y = np.array([1, 2, 3, 4, 5])

# initialize the model

clf = RandomForestClassifier()

# define the number of folds

kfold = KFold(n_splits=5, shuffle=True)

# loop over each fold

for train_index, test_index in kfold.split(X):
  X_train, X_test = X[train_index], X[test_index]
  y_train, y_test = y[train_index], y[test_index]

# fit the model on the training data

  clf.fit(X_train, y_train)

# evaluate the model on the test data

  score = clf.score(X_test, y_test)

  print(f"Score: {score}")