Machine Learning Study Notes 6

C-beams

Last modified 2024-04-24 16:08:12

First published 2024-04-24 16:07:56

Article link: https://blog.csdn.net/2401_82787858/article/details/138155463

L11 Decision Trees and Random Forests

Classification tree

internal node : 1) split dimension index j and split value s; 2) two child nodes, each an internal node or a leaf

leaf node : label

features: $[x_1,x_2,x_3,x_4,x_5,x_6]$ = [date, age, height, weight, sinus tachycardia?, min systolic bp]

labels y : 1: high risk; -1: low risk

Regression tree

features: $[x_1,x_2]$ = [temperature (deg C), precipitation (cm/hr)]

labels y : km run

Tree defines an axis-aligned "partition" of the feature space
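
To make the node structure concrete, here is a minimal sketch of how a tree with internal nodes (split dimension j, split value s, two children) and leaf nodes (a label) could produce a prediction. The class names `Leaf` and `Node`, the `predict` helper, and the example tree values are hypothetical; the convention that $x_j \geq s$ goes to the right child is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    label: float  # predicted value stored at the leaf


@dataclass
class Node:
    j: int          # split dimension index
    s: float        # split value
    left: "Tree"    # child used when x[j] < s
    right: "Tree"   # child used when x[j] >= s


Tree = Union[Leaf, Node]


def predict(tree: Tree, x: Sequence[float]) -> float:
    """Walk from the root to a leaf, comparing one feature per internal node."""
    while isinstance(tree, Node):
        tree = tree.right if x[tree.j] >= tree.s else tree.left
    return tree.label


# Hypothetical regression tree over [temperature (deg C), precipitation (cm/hr)];
# the leaf labels stand for km run and are made up for illustration.
tree = Node(
    j=1, s=0.5,
    left=Node(j=0, s=10.0, left=Leaf(3.0), right=Leaf(8.0)),
    right=Leaf(0.0),
)

print(predict(tree, [20.0, 0.1]))  # warm and dry -> 8.0
```

Each internal node tests a single coordinate against a threshold, which is why the resulting regions of feature space are axis-aligned boxes.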

Decision tree

Recall the familiar pattern for producing a prediction $\hat{y}$:

Choose how to predict the label (given features & parameters)
Choose a loss (between guess & actual label)
Choose parameters by trying to minimize the training loss

Parameters here:

For each internal node: split dimension, split value, child nodes
For each leaf node: label
Note: the parameters here don't have a fixed dimension
So we can't apply (S)GD
We'll develop a heuristic: 1) build, 2) prune

Building a decision tree

Regression tree with squared error loss

BuildTree(I, k)
- if $|I|\leq k$
  - Set $\hat{y}=\mathrm{average}_{i\in I}\,y^{(i)}$
  - return Leaf(label = $\hat y$)
- else
  - for each split dim j & value s
    - Set $I_{j,s}^+=\{i\in I \mid x^{(i)}_j\geq s\}$
    - Set $I_{j,s}^-=\{i\in I \mid x^{(i)}_j< s\}$
    - Set $\hat y_{j,s}^+=\mathrm{average}_{i\in I^+_{j,s}}\,y^{(i)}$
    - Set $\hat y_{j,s}^-=\mathrm{average}_{i\in I^-_{j,s}}\,y^{(i)}$
    - Set $E_{j,s}=\sum_{i\in I^+_{j,s}}(y^{(i)}-\hat y_{j,s}^+)^2+\sum_{i\in I^-_{j,s}}(y^{(i)}-\hat y_{j,s}^-)^2$
  - Set $(j^*,s^*)=\arg\min_{j,s}E_{j,s}$
  - return Node($j^*$, $s^*$, BuildTree($I_{j^*,s^*}^-$, k), BuildTree($I_{j^*,s^*}^+$, k))
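
The BuildTree pseudocode above can be sketched in NumPy roughly as follows. This is an illustration under stated assumptions, not a reference implementation: the function name `build_tree` and the nested-dict tree representation are hypothetical, and the candidate split values are taken to be the observed feature values (skipping the smallest one so that both sides of a split are non-empty).

```python
import numpy as np


def build_tree(X, y, k):
    """Greedy regression tree with squared error loss; leaves hold at most k points."""
    if len(y) <= k:
        return {"label": float(np.mean(y))}

    best = None  # (E_js, j, s) for the best split found so far
    for j in range(X.shape[1]):
        # Assumption: candidate split values are the observed values of feature j.
        for s in np.unique(X[:, j])[1:]:
            plus = X[:, j] >= s
            y_plus, y_minus = y[plus], y[~plus]
            err = (np.sum((y_plus - y_plus.mean()) ** 2)
                   + np.sum((y_minus - y_minus.mean()) ** 2))
            if best is None or err < best[0]:
                best = (err, j, float(s))

    if best is None:
        # All remaining points are identical in every feature: no valid split.
        return {"label": float(np.mean(y))}

    _, j, s = best
    plus = X[:, j] >= s
    return {
        "j": j,
        "s": s,
        "minus": build_tree(X[~plus], y[~plus], k),
        "plus": build_tree(X[plus], y[plus], k),
    }


# Tiny made-up data set: the labels jump between x = 2 and x = 3.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 0.0, 10.0, 10.0])
tree = build_tree(X, y, k=2)
print(tree)
```

On this toy data the split at s = 3 has zero squared error, so it is the one the greedy search picks.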

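Regularize, prune, and ensemble

"Cost complexity" of a tree T, where |T| is the number of leaves and $\alpha\geq 0$ trades off training loss against tree size: $C_{\alpha}(T)=\sum_{i=1}^nL(T(x^{(i)}),y^{(i)})+\alpha |T|$
Pruning
- For each $\alpha$, choose $T_{\alpha}$ by pruning subtrees of the built tree until further pruning no longer lowers $C_{\alpha}$
- Choose the final tree (i.e. the value of $\alpha$) by cross-validation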
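
As a rough illustration of the two pruning steps, the sketch below uses scikit-learn's cost-complexity pruning support: `cost_complexity_pruning_path` lists candidate values of $\alpha$, and the `ccp_alpha` parameter fits the correspondingly pruned tree. The synthetic data is an assumption made for the example, and scikit-learn normalizes its training-loss term differently from the sum above, so its $\alpha$ values are not on the same scale as in the formula.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (an assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# Step 0: build a large tree.
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Step 1: candidate alphas at which the best pruned subtree changes.
path = full_tree.cost_complexity_pruning_path(X, y)
# Clip tiny negative values from floating-point error, drop duplicates,
# and thin the list to keep the example fast.
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))[::5]

# Step 2: pick alpha by 5-fold cross-validation (default score for regressors is R^2).
scores = [
    cross_val_score(DecisionTreeRegressor(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
    for a in alphas
]
best_alpha = float(alphas[int(np.argmax(scores))])

pruned_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print("leaves before pruning:", full_tree.get_n_leaves())
print("leaves after pruning: ", pruned_tree.get_n_leaves())
```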
Ensembling: using multiple machine learning predictors to make one (ideally much better) predictor

Bagging

One of multiple ways to make and use an ensemble
Bagging = Bootstrap aggregating
- Training data $D_n$
- For b = 1, ..., B
  - Draw a new "data set" $\tilde D_n^{(b)}$ of size n by sampling with replacement from $D_n$
  - Train a predictor $\hat f^{(b)}$ on $\tilde D_n^{(b)}$
- Return
  - For regression: the predictor $\hat f_{bag}(x)=\frac{1}{B}\sum_{b=1}^B\hat f^{(b)}(x)$
  - For classification: the predictor at a point is the class with the highest vote count at that point
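
The bagging recipe for regression could be sketched like this. The helper names `fit_bagged` and `predict_bagged`, the choice of scikit-learn's `DecisionTreeRegressor` as the base predictor, and the synthetic data are all assumptions made for the example; for classification the averaging step would be replaced by a majority vote.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_bagged(X, y, B=25, seed=0):
    """Train B predictors, each on a bootstrap sample of size n."""
    rng = np.random.default_rng(seed)
    n = len(X)
    predictors = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # n indices drawn with replacement
        predictors.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))
    return predictors


def predict_bagged(predictors, X):
    """Average the B individual predictions at each query point."""
    return np.mean([f.predict(X) for f in predictors], axis=0)


# Synthetic 1-D regression data (an assumption, for illustration only).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=100)

predictors = fit_bagged(X, y, B=25)
y_hat = predict_bagged(predictors, X)
print(y_hat[:5])
```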

Random forests

Bagging + decision trees + extra randomness
Random forest
- For b = 1, ..., B
  - Draw a new "data set" $\tilde D_n^{(b)}$ of size n by sampling with replacement from $D_n$
  - Build a tree on $\tilde D_n^{(b)}$ by recursively repeating the following until the minimum node size k is reached:
    - Select m features uniformly at random, without replacement, from the d features
    - Pick the best split dimension and split value among the m features
    - Build two children
- Return: average for regression; vote for classification
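
The extra randomness sits in the per-node step: pick m of the d features, then search for the best split only among those. A minimal sketch for squared-error regression follows. The function name `best_split_among_random_features` is hypothetical, the candidate split values are assumed to be the observed feature values, and the data is synthetic.

```python
import numpy as np


def best_split_among_random_features(X, y, m, rng):
    """Return the sampled feature indices and the best (error, j, s) among them."""
    d = X.shape[1]
    features = rng.choice(d, size=m, replace=False)  # m features, without replacement
    best = None
    for j in features:
        # Skip the smallest observed value so both sides of the split are non-empty.
        for s in np.unique(X[:, j])[1:]:
            plus = X[:, j] >= s
            err = (np.sum((y[plus] - y[plus].mean()) ** 2)
                   + np.sum((y[~plus] - y[~plus].mean()) ** 2))
            if best is None or err < best[0]:
                best = (err, int(j), float(s))
    return features, best


# Synthetic data where the label depends only on feature 2.
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 4))
y = (X[:, 2] >= 0.5).astype(float)

# With m = d every feature is a candidate, so the informative one can be found.
features, (err, j, s) = best_split_among_random_features(X, y, m=4, rng=rng)
print("candidate features:", sorted(features.tolist()))
print("chosen split: feature", j, "at value", s, "with squared error", err)
```

With m smaller than d, some nodes never see the most informative feature, which makes the individual trees differ more from one another before they are averaged or voted.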

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from matplotlib.colors import ListedColormap

# Load the iris data set
iris = load_iris()
X = iris.data[:, [2, 3]]  # keep the last two features: petal length and petal width
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Build the random forest model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_clf.predict(X_test)

# Compute the model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Visualize the decision boundary
def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.2):
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # Predict the class at every point of a grid covering the data
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.5, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # Plot the samples of each class with its own color and marker
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1], alpha=0.8,
                    c=[cmap(idx)], marker=markers[idx], label=cl)

    # Optionally circle the test samples (hollow markers)
    if test_idx is not None:
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1], facecolors='none', edgecolors='black',
                    alpha=1.0, linewidths=1, marker='o', s=100, label='Test Set')

plot_decision_regions(X_train, y_train, classifier=rf_clf)
plt.title('Random Forest Classifier - Decision Boundary (Training Set)')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.legend(loc='upper left')
plt.show()

Accuracy: 1.0

L12 Clustering

Food distribution placement

Where should I have my k food trucks park?
Want to minimize the total loss over the people we serve.
Inputs: person i's location $x^{(i)}$
Outputs: truck j's location $\mu^{(j)}$
Index of the truck that person i walks to: $y^{(i)}$
Loss if person i walks to truck j: $||x^{(i)}-\mu^{(j)}||_2^2$
Loss across all people:

$\sum_{j=1}^k\sum_{i=1}^n1\{y^{(i)}=j\}||x^{(i)}-\mu^{(j)}||_2^2$

a.k.a. the k-means objective
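
As a small worked example of this objective, the sketch below evaluates it for three people and two trucks. The function name `kmeans_loss` and the data are made up for illustration, and the assignments use 0-based truck indices (an assumption, since the notes count from 1).

```python
import numpy as np


def kmeans_loss(X, mu, y):
    """Sum over people of the squared distance to their assigned truck."""
    # mu[y] picks, for each person i, the location of truck y[i]; this is
    # the double sum with the indicator 1{y_i = j} written as an index lookup.
    return float(np.sum((X - mu[y]) ** 2))


X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])  # person locations
mu = np.array([[1.0, 0.0], [10.0, 0.0]])             # truck locations
y = np.array([0, 0, 1])                              # truck index per person

print(kmeans_loss(X, mu, y))  # 1 + 1 + 0 = 2.0
```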

k-means algorithm

k-means$(k,\tau)$
- Init $\{\mu^{(j)}\}_{j=1}^k,\{y^{(i)}\}_{i=1}^n$
- for t = 1 to $\tau$
  - $y_{old}=y$
  - for i = 1 to n
    - $y^{(i)}=\arg\min_{j}||x^{(i)}-\mu^{(j)}||_2^2$
  - for j = 1 to k
    - $\mu^{(j)}=\frac{\sum_{i=1}^n1\{y^{(i)}=j\}x^{(i)}}{\sum_{i=1}^n1\{y^{(i)}=j\}}$
  - if $y=y_{old}$
    - break
- return $\{\mu^{(j)}\}_{j=1}^k,\{y^{(i)}\}_{i=1}^n$
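
The algorithm above translates fairly directly into NumPy. The sketch below makes a few choices the pseudocode leaves open, all of them assumptions: the means are initialized at k distinct, randomly chosen data points, the assignments start from a placeholder value, and a cluster that ends up empty keeps its previous mean.

```python
import numpy as np


def kmeans(X, k, tau, seed=0):
    """Alternate assignment and mean-update steps for at most tau rounds."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Assumption: initialize the means at k distinct data points.
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)
    y = np.full(n, -1)  # placeholder assignments before the first round

    for _ in range(tau):
        y_old = y
        # Assignment step: squared distance from every point to every mean, shape (n, k).
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        y = dists.argmin(axis=1)
        # Update step: each mean becomes the average of its assigned points.
        for j in range(k):
            members = X[y == j]
            if len(members) > 0:  # an empty cluster keeps its old mean
                mu[j] = members.mean(axis=0)
        if np.array_equal(y, y_old):
            break
    return mu, y


# Two well-separated made-up groups of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
mu, y = kmeans(X, k=2, tau=100)
print(mu)
print(y)
```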

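Comparison to classification

We did not use any labeled data
The cluster indices $\{y^{(i)}\}_{i=1}^n$ can be permuted and we would still get the same k clusters
The output is really just a partition of the data
We group the data points according to how similar they are
An example of unsupervised learning: there is no labeled data, and we're finding a pattern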

Initialization

For large enough $\tau$, the algorithm converges (to a local minimum of the k-means objective)
The initialization can make a big difference to which solution is found
One option: random restarts
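
Random restarts can be sketched as: run k-means several times from different random initializations and keep the run with the lowest loss. The example below uses scikit-learn's `KMeans` with `n_init=1` so that each run is a single initialization, and reads the final loss from `inertia_`. The synthetic data and the number of restarts are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic blobs (an assumption, for illustration only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.7, size=(50, 2)) for c in centers])

# One k-means run per seed, each from a different random initialization.
runs = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    for seed in range(10)
]
losses = [run.inertia_ for run in runs]

# Keep the run that reached the lowest k-means objective.
best = runs[int(np.argmin(losses))]
print("loss per restart:", np.round(losses, 2))
print("best loss:", round(best.inertia_, 2))
```

Setting `n_init` to a value greater than 1 asks `KMeans` to do this restart-and-keep-the-best loop internally.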

Effect of k and choosing k

Different values of k give different results
A larger k gives a smaller loss, so the loss alone cannot tell us which k to use
Sometimes we know k
Sometimes we'd like to choose/learn k
How to choose k depends on what you'd like to do
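
To see the effect of k on the loss, the following sketch fits k-means for several values of k and prints the final objective (`inertia_` in scikit-learn). The synthetic three-blob data is an assumption made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic blobs (an assumption, for illustration only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.7, size=(50, 2)) for c in centers])

# Final k-means loss for k = 1, ..., 6.
losses = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 7)
}
for k, loss in losses.items():
    print(f"k = {k}: loss = {loss:.1f}")
```

With k = 1 the loss is simply the total squared distance to the overall mean, and it keeps shrinking as k grows, which is why some criterion other than the training loss is needed to choose k.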

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the iris data set
iris = load_iris()
X = iris.data
y = iris.target

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce to two dimensions with PCA (used for plotting only)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Fit the k-means model on the standardized 4-D features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X_scaled)

# Get the cluster assignments, and project the 4-D cluster centers
# into the same PCA space as the plotted points
y_pred = kmeans.labels_
cluster_centers = pca.transform(kmeans.cluster_centers_)

# Visualize the clustering
plt.figure(figsize=(10, 8))

# Scatter plot of the data colored by the true species labels
plt.subplot(2, 1, 1)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', s=50, alpha=0.8)
plt.title('Original Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

# Scatter plot of the data colored by the k-means cluster assignments
plt.subplot(2, 1, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_pred, cmap='viridis', s=50, alpha=0.8)
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], c='red', marker='x', s=200, label='Cluster Centers')
plt.title('K-Means Clustering')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()

plt.tight_layout()
plt.show()
