[Machine Learning] Notes (4)

白鸥问舟

Last modified: 2024-05-07 22:21:07


Category: Machine Learning. Tags: Machine Learning

First published: 2022-08-02 16:05:28

Article link: https://blog.csdn.net/weixin_43763353/article/details/126086834


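Unsupervised Learning

Clustering

A clustering algorithm looks at a collection of data points and automatically finds the ones that are related or similar to each other.

In supervised learning, data comes as pairs of input and output (label); in unsupervised learning, the data has inputs only. A clustering algorithm looks for one particular kind of structure in such unlabeled data: it groups the points.

Applications of clustering: finding similar news articles, comparing DNA, processing astronomical data.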

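K-means Clustering

The algorithm looks for the centers of groups in the data (treating each data point as a coordinate), called cluster centroids.

Algorithm Overview

1. Randomly initialize $k$ cluster centroids for the dataset ($k$ is chosen by hand).
2. Go through all data points and assign each one to its nearest centroid.
3. For each centroid, compute the mean of the points assigned to it and move the centroid to that mean.
4. Repeat steps 2 and 3 until neither the assignments nor the centroid positions change; the algorithm has then converged.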

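Details:

When randomly initializing the $k$ centroids $\mu_1, ..., \mu_k$, each centroid must have the same dimension as the data points.
$c^{(i)}$ is the index of the cluster that $x^{(i)}$ is assigned to, with values $1, 2, ..., k$; $\mu_{c^{(i)}}$ is the centroid of that cluster.
The distance between a data point and a centroid is measured with the squared norm $||x^{(i)}-\mu_k||^2$; the cluster mean is computed over every dimension of the data.
If a cluster ends up with no points, either delete it (leaving $k - 1$ clusters) or randomly re-initialize the $k$ centroids.
The squared norm is used because the mean of a cluster is exactly the point that minimizes the sum of squared distances to its members, so moving a centroid to the mean minimizes the cost function for the current assignment.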

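Cost Function

As in supervised learning, K-means has a cost function:
$J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_k)=\frac{1}{m} \sum_{i=1}^{m} ||x^{(i)}-\mu_{c^{(i)}}||^2 \tag{1}$
Equation $(1)$ is also called the distortion function.
Minimizing $(1)$ searches for the best assignments $c^{(1)},...,c^{(m)}$ and centroids $\mu_1,...,\mu_k$.
Both steps of K-means, assigning points to clusters and moving the centroids, can only lower the cost or leave it unchanged. So if the implementation is correct, the cost never increases from one iteration to the next.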
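
The claim that neither step can raise the cost can be checked numerically. The following is a minimal sketch on synthetic data; the helper names `distortion`, `assign` and `update` are made up for this illustration and are not part of the original notes.

```python
import numpy as np

def distortion(X, idx, centroids):
    """Mean squared distance between each point and its assigned centroid (Eq. 1)."""
    return np.mean(np.sum((X - centroids[idx]) ** 2, axis=1))

def assign(X, centroids):
    """Index of the nearest centroid for every row of X."""
    d2 = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    return np.argmin(d2, axis=1)

def update(X, idx, centroids):
    """Move each centroid to the mean of its points (empty clusters stay put)."""
    new = centroids.copy()
    for k in range(centroids.shape[0]):
        if np.any(idx == k):
            new[k] = X[idx == k].mean(axis=0)
    return new

# Synthetic data: two blobs in 2-D (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids = X[rng.permutation(len(X))[:2]].copy()

history = []
idx = assign(X, centroids)
for _ in range(5):
    history.append(distortion(X, idx, centroids))  # cost after the assignment step
    centroids = update(X, idx, centroids)
    history.append(distortion(X, idx, centroids))  # cost after the update step
    idx = assign(X, centroids)
```

Each entry of `history` should be no larger than the one before it; a sequence that goes up would point to a bug in one of the two steps.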

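Initialization

The number of clusters should satisfy $k < m$.
The usual way to initialize the $k$ centroids is to pick $k$ of the $m$ data points $x^{(i)}$ at random.

The initial positions affect where the algorithm converges, and a poor start can leave it in a local optimum. It is therefore common to run the algorithm several times and keep the run with the lowest distortion.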
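
One way to implement the "run several times, keep the best" idea is sketched below. It is self-contained rather than built on the functions later in this post; `kmeans_once`, `kmeans_best_of`, the number of restarts and the iteration cap are all assumptions chosen for illustration.

```python
import numpy as np

def kmeans_once(X, K, rng, max_iters=20):
    """One K-means run from a random start; returns centroids, assignments, distortion."""
    centroids = X[rng.permutation(X.shape[0])[:K]].astype(float)
    for _ in range(max_iters):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        for k in range(K):
            if np.any(idx == k):          # leave an empty cluster where it is
                centroids[k] = X[idx == k].mean(axis=0)
    # Final assignment and distortion for the returned centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)
    J = d2[np.arange(X.shape[0]), idx].mean()
    return centroids, idx, J

def kmeans_best_of(X, K, n_restarts=10, seed=0):
    """Run K-means n_restarts times and keep the run with the lowest distortion."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, K, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])
```

Usage would look like `centroids, idx, J = kmeans_best_of(X, K=3)`; more restarts cost more time but lower the chance of ending in a poor local optimum.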

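The choice of $k$ depends on what the application needs and on how a given $k$ performs in it. In practice a larger $k$ usually costs more, while a smaller $k$ usually gives a coarser grouping, so $k$ is a trade-off between the two.
Elbow method:
Try several values of $k$ and plot the distortion against $k$. The curve generally decreases, because the lowest achievable distortion can only go down as $k$ grows.
Choose the $k$ at the "elbow" of the curve, the point after which the decrease slows down noticeably.
In some cases, however, the curve falls smoothly and there is no clear elbow to pick.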
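
A rough sketch of how the distortion curve for the elbow method might be produced is given below. The data is synthetic (three well-separated blobs), and `kmeans_distortion` is a hypothetical helper that keeps the best of a few restarts for each $k$.

```python
import numpy as np

def kmeans_distortion(X, K, rng, n_restarts=5, max_iters=20):
    """Lowest distortion found over a few random restarts of K-means."""
    best = np.inf
    for _ in range(n_restarts):
        centroids = X[rng.permutation(len(X))[:K]].astype(float)
        for _ in range(max_iters):
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            idx = d2.argmin(axis=1)
            for k in range(K):
                if np.any(idx == k):
                    centroids[k] = X[idx == k].mean(axis=0)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).mean())
    return best

# Synthetic data with three clusters (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0.0, 5.0, 10.0)])

ks = list(range(1, 7))
curve = [kmeans_distortion(X, k, rng) for k in ks]
for k, J in zip(ks, curve):
    print(f"k={k}: distortion={J:.3f}")
```

Plotting `ks` against `curve` (for example with matplotlib) shows the shape of the curve; for data like this the drop is expected to flatten after the true number of clusters, but on real data the elbow is often much less obvious.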

Implementation

Centroid initialization

import numpy as np
import matplotlib.pyplot as plt

def kMeans_init_centroids(X, K):
    """
    This function initializes K centroids that are to be 
    used in K-Means on the dataset X
    
    Args:
        X (ndarray): Data points 
        K (int):     number of centroids/clusters
    
    Returns:
        centroids (ndarray): Initialized centroids
    """
    
    # Randomly reorder the indices of examples
    randidx = np.random.permutation(X.shape[0])
    
    # Take the first K examples as centroids
    centroids = X[randidx[:K]]
    
    return centroids

Finding the closest centroid

def find_closest_centroids(X, centroids):
    """
    Computes the centroid memberships for every example
    
    Args:
        X (ndarray): (m, n) Input values      
        centroids (ndarray): k centroids
    
    Returns:
        idx (array_like): (m,) closest centroids
    
    """
    # Number of centroids
    K = centroids.shape[0]
    idx = np.zeros(X.shape[0], dtype=int)

    for i in range(X.shape[0]):
        # Distances between X[i] and each centroids[j]
        distance = []
        for j in range(K):
            # Euclidean norm of (X[i] - centroids[j])
            norm_ij = np.linalg.norm(X[i] - centroids[j])
            distance.append(norm_ij)
        # Index of the smallest distance
        idx[i] = np.argmin(distance)
    
    return idx

Moving each centroid to the mean of its cluster

def compute_centroids(X, idx, K):
    """
    Returns the new centroids by computing the means of the 
    data points assigned to each centroid.
    
    Args:
        X (ndarray):   (m, n) Data points
        idx (ndarray): (m,) Array containing index of closest centroid for each 
                       example in X. Concretely, idx[i] contains the index of 
                       the centroid closest to example i
        K (int):       number of centroids
    
    Returns:
        centroids (ndarray): (K, n) New centroids computed
    """
    
    # Useful variables
    m, n = X.shape
    
    centroids = np.zeros((K, n))
    
    for k in range(K):
        # All data points in X assigned to centroid k
        points = X[idx == k]
        # Mean of the assigned points; an empty cluster keeps a zero row
        # here, and the caller should drop or re-initialize it
        if points.shape[0] > 0:
            centroids[k] = np.mean(points, axis=0)
    
    return centroids

Running the algorithm (optionally plotting how the centroids move)

def run_kMeans(X, initial_centroids, max_iters=10, plot_progress=False):
    """
    Runs the K-Means algorithm on data matrix X, where each row of X
    is a single example
    """
    
    # Initialize values
    m, n = X.shape
    K = initial_centroids.shape[0]
    centroids = initial_centroids
    previous_centroids = centroids
    idx = np.zeros(m, dtype=int)

    # Run K-Means
    for i in range(max_iters):

        # Output progress
        print("K-Means iteration %d/%d" % (i, max_iters - 1))

        # For each example in X, assign it to the closest centroid
        idx = find_closest_centroids(X, centroids)

        # Optionally plot progress; plot_progress_kMeans is a plotting
        # helper that is not defined in this post
        if plot_progress:
            plot_progress_kMeans(X, centroids, previous_centroids, idx, K, i)
            previous_centroids = centroids

        # Given the memberships, compute new centroids
        centroids = compute_centroids(X, idx, K)
    if plot_progress:
        plt.show()
    return centroids, idx

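Anomaly Detection

An anomaly detection algorithm learns from a dataset of normal (unlabeled) examples and thereby becomes able to flag anomalous ones.

Application: the algorithm learns what normal aircraft-engine heat, vibration and similar readings look like, then checks whether a new engine is within spec.

Density Estimation

The algorithm builds a probability model $p(x)$ of the training data; if $p(x_{test})<\varepsilon$, the new example is considered anomalous.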

Gaussian Distribution

Also called the normal distribution or bell-shaped distribution.
$p(x;\mu,\sigma ^2) = \frac{1}{\sqrt{2 \pi\sigma^2 }}e^{ - \frac{(x - \mu)^2}{2 \sigma ^2} }\tag{2}$
where $\mu$ is the mean and $\sigma^2$ is the variance. For feature $j$ they are estimated from the training set as
$\left\{\begin{matrix} \mu_j = \frac{1}{m} \sum_{i=1}^m x_j^{(i)} \\ \sigma_j^2 = \frac{1}{m} \sum_{i=1}^m (x_j^{(i)} - \mu_j)^2 \end{matrix}\right. \tag{3}$

The dataset is $\{x^{(1)}, ..., x^{(m)}\}$, where each $x^{(i)}$ is a vector of length $n$: $x^{(i)}=[x^{(i)}_1,x^{(i)}_2,...,x^{(i)}_n]$.
For example, there are $m$ engine samples, each containing $n$ kinds of measurements such as heat and vibration.

Then $p(x^{(i)})=\prod_{j=1}^{n}p(x^{(i)}_j; \mu_j,\sigma ^2_j) \tag{4}$
Equation $(4)$ assumes that $x^{(i)}_1, ..., x^{(i)}_n$ are statistically independent, but in practice the algorithm usually still works when they are not.

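Algorithm Overview

1. Build the dataset: choose $n$ features that may indicate anomalies, giving $m$ training vectors $x^{(i)}$ of length $n$.
2. Compute $\mu_j$ and $\sigma^2_j$ for every feature using $(3)$.
3. For a new example $x^{test}$, compute $p(x^{test})$ and compare it with the threshold $\varepsilon$ to decide whether it is anomalous.

Because $p(x^{test})$ is a product, a single anomalous feature (a very small $p(x^{test}_j; \mu_j,\sigma ^2_j)$) is enough to make $p(x^{test})$ very small.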
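
The three steps can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data: `estimate_params` and `density` are hypothetical names, and the threshold value is picked by hand for the example rather than tuned.

```python
import numpy as np

def estimate_params(X):
    """Per-feature mean and variance (1/m normalization, as in Eq. 3)."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    return mu, var

def density(X, mu, var):
    """p(x) = prod_j N(x_j; mu_j, var_j) for every row of X (Eq. 4)."""
    per_feature = np.exp(-(X - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return per_feature.prod(axis=1)

# Synthetic "normal" training data with two features (illustrative only)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[10.0, 5.0], scale=[1.0, 0.5], size=(500, 2))
mu, var = estimate_params(X_train)

x_test = np.array([[10.1, 5.0],    # close to the training mean
                   [10.0, 9.0]])   # second feature far from its mean
p = density(x_test, mu, var)

epsilon = 1e-4                     # illustrative threshold, not a tuned value
is_anomaly = p < epsilon
```

The second test point is normal in its first feature, yet the tiny factor contributed by the second feature is enough to push the whole product below the threshold.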

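Choosing the Threshold

Tuning a parameter requires a way to evaluate the algorithm. In supervised learning one can use prediction or classification accuracy; an unsupervised model can be evaluated by running it on a small amount of labeled data.

Following this idea, create a training set $\{x^{(1)}, ..., x^{(m)}\}$, a validation set $\{x^{(1)}_{val}, ..., x^{(m_{val})}_{val}\}$ and a test set $\{x^{(1)}_{test}, ..., x^{(m_{test})}_{test}\}$. The training set contains normal examples (since they are all normal they need no labels, so training is still unsupervised), while the validation and test sets include some anomalous examples.

If there are very few anomalous examples, the test set can be merged into the validation set, at the price of losing an unbiased estimate of the model's performance.
Note that anomalous examples are usually far fewer than normal ones, so the dataset is skewed; performance should then be measured with precision, recall and the $F_1$ score.

With this setup, the validation set is used to tune the model's hyperparameter (here the threshold $\varepsilon$).

Question: if anomalous examples are available to build labeled validation and test sets, why not train a supervised model?
Answer: it depends on the data available.

When anomalous examples exist but are far fewer than normal ones, anomaly detection is the more reasonable choice; when the two are comparable in number, supervised learning fits better.
The reason is that with too few anomalous examples a supervised model cannot learn enough to recognize anomalies.

Anomaly detection is also the better choice when future anomalies are highly uncertain: if they may look very different from the ones seen so far, a supervised model struggles to place them in the same class. Unsupervised methods are better at finding unknown anomalies, supervised methods at finding known ones.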

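Choosing Features

In supervised learning, a poorly chosen or redundant feature often has little effect on performance, because the labels tell the model which features matter.
In unsupervised learning, such features can hurt performance badly. For example, if the training data consists mostly of blue engines, the model will treat a red engine as anomalous.

Guidelines for feature selection in anomaly detection:
Make the features Gaussian
Features that are roughly Gaussian are easier for the model to fit. For a feature that is not, consider dropping it or transforming it toward a Gaussian shape (for example replacing $x$ with $\log(x+b)$ or $\sqrt{x}$).
The distribution of $x$, or of a transformed $x$, can be inspected as follows: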

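import numpy as np
import matplotlib.pyplot as plt

# x: 1-D array holding one feature of the training set
fig, axes = plt.subplots(1, 3)
axes[0].hist(x, bins=50)               # raw feature
axes[1].hist(x**0.5, bins=50)          # square-root transform
axes[2].hist(np.log(x + 10), bins=50)  # log transform with offset b = 10
plt.show()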

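Any transformation applied to the training set must also be applied to the validation and test sets.

Add input features
If the model does poorly on the validation set, look at the misclassified examples and ask why the model cannot tell that anomaly from normal data, and whether some feature not yet included is decisive for detecting it.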

Implementation

Fitting the Gaussian parameters

def estimate_gaussian(X): 
    """
    Calculates mean and variance of all features 
    in the dataset
    
    Args:
        X (ndarray): (m, n) Data matrix
    
    Returns:
        mu (ndarray): (n,) Mean of all features
        var (ndarray): (n,) Variance of all features
    """

    m, n = X.shape
    # calculate the mean of every feature
    mu = 1 / m * np.sum(X, axis = 0)
    # calculate the variance of every feature
    var = 1 / m * np.sum((X - mu) ** 2, axis = 0)
        
    return mu, var

Threshold selection

def select_threshold(y_val, p_val): 
    """
    Finds the best threshold to use for selecting outliers 
    based on the results from a validation set (p_val) 
    and the ground truth (y_val)
    
    Args:
        y_val (ndarray): Ground truth on validation set
        p_val (ndarray): Results on validation set
        
    Returns:
        epsilon (float): Threshold chosen 
        F1 (float):      F1 score by choosing epsilon as threshold
    """ 

    best_epsilon = 0
    best_F1 = 0
    F1 = 0

    step_size = (max(p_val) - min(p_val)) / 1000

    for epsilon in np.arange(min(p_val), max(p_val), step_size):
        # Predict anomaly (True) when the density is below the threshold
        predictions = (p_val < epsilon)

        # True positives, false positives, false negatives
        tp = np.sum((predictions == 1) & (y_val == 1))
        fp = np.sum((predictions == 1) & (y_val == 0))
        fn = np.sum((predictions == 0) & (y_val == 1))

        # With no true positives, precision and recall are zero or
        # undefined (0/0), so this threshold cannot improve F1
        if tp == 0:
            continue

        prec = tp / (tp + fp)
        rec = tp / (tp + fn)

        # F1 score
        F1 = 2 * prec * rec / (prec + rec)

        if F1 > best_F1:
            best_F1 = F1
            best_epsilon = epsilon
        
    return best_epsilon, best_F1

Finding the anomalies

# Training; multivariate_gaussian(X, mu, var) evaluates p(x) for each row
# of X and is a helper that is not defined in this post
mu, var = estimate_gaussian(X_train)
p = multivariate_gaussian(X_train, mu, var)
# Validation
p_val = multivariate_gaussian(X_val, mu, var)
epsilon, F1 = select_threshold(y_val, p_val)
# Flag anomalies
outliers = p < epsilon

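Recommender Systems

A recommender system learns from user data to predict how a user would rate a new product and recommends products accordingly; alternatively it extracts product features and recommends to users who like similar products.

User data generally consists of preferences (whether the user rated a product and what the rating was). Product information (product features) is needed as well.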

Notation

General Notation	Description
$r (i, j)$	scalar; = 1 if user $j$ rated movie $i$, = 0 otherwise
$y (i, j)$	scalar; rating given by user $j$ to movie $i$ (defined only if $r (i, j) = 1$)
$\mathbf{w}^{(j)}$	vector; parameters for user $j$
$b^{(j)}$	scalar; parameter for user $j$
$m^{(j)}$	scalar; number of movies rated by user $j$
$n^{(i)}$	scalar; number of users who rated movie $i$
$\mathbf{x}^{(i)}$	vector; feature values for movie $i$
$n_u$	number of users
$n_m$	number of movies
$n$	number of features
$\mathbf{X}$	matrix of vectors $\mathbf{x}^{(i)}$
$\mathbf{W}$	matrix of vectors $\mathbf{w}^{(j)}$
$\mathbf{b}$	vector of bias parameters $b^{(j)}$
$\mathbf{R}$	matrix of elements $r (i, j)$
$\mathbf{x}_{m}^{(i)}$	vector; features of movie $i$
$\mathbf{x}_{u}^{(j)}$	vector; features of user $j$

Here $\mathbf{w}^{(j)}$ and $\mathbf{x}^{(i)}$ have length $n$. In the table the products are movies.

The predicted rating of user $j$ for product $i$ is:
$\hat{y}(i,j)=\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)}+b^{(j)}\tag{5}$
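
As a small numeric illustration of equation $(5)$, the predictions for all user/product pairs can be computed at once with a matrix product. All numbers below are made up for the example; the shapes follow the notation table ($\mathbf{X}$ is products by features, $\mathbf{W}$ is users by features, $\mathbf{b}$ is one row with one entry per user).

```python
import numpy as np

# Hypothetical values: 3 products with 2 features each, 2 users
X = np.array([[0.9, 0.0],
              [0.1, 1.0],
              [0.5, 0.5]])          # shape (n_m, n)
W = np.array([[5.0, 0.0],
              [0.0, 5.0]])          # shape (n_u, n)
b = np.array([[0.0, 0.0]])          # shape (1, n_u)

# Y_hat[i, j] = w^(j) . x^(i) + b^(j)
Y_hat = X @ W.T + b                 # shape (n_m, n_u)
```

Here user 0 only weights the first feature and user 1 only the second, so each user's predicted ratings follow the corresponding column of `X`.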
Learning user preferences from known product features
Given product features and user ratings, the parameters $\mathbf{w}^{(j)}$ and $b^{(j)}$ are learned to obtain the user's preference model.
For user $j$ the cost function is:
$J({\mathbf{w}^{(j)},b^{(j)}})= \frac{1}{2m^{(j)}}\sum_{i:r(i,j)=1}^{}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2+ \color{red}\frac{\lambda}{2m^{(j)}}\sum_{k=1}^{n}(\mathbf{w}^{(j)}_k)^2 \tag{6}$
where the red part is the regularization term.

Since $m^{(j)}$ in $(6)$ is a constant, it does not change the minimizing parameters and can be dropped.
Summing over all users gives the cost function:
$J(\mathbf{w}^{(1)},b^{(1)},...,\mathbf{w}^{(n_u)},b^{(n_u)})= \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}^{}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2+ \color{red}\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\mathbf{w}^{(j)}_k)^2 \tag{7}$
In essence, a linear regression model is fitted for each user and then used to predict that user's ratings.

Learning product features from known user preferences
When user preference models and ratings are known but product features are not, the features can be learned.

Since several users have rated product $i$, its features $\mathbf{x}^{(i)}$ can be inferred from the known preferences ($\mathbf{w}^{(j)}$, $b^{(j)}$). The cost function is:
$J({\mathbf{x}^{(i)}})= \frac{1}{2}\sum_{j:r(i,j)=1}^{}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2+ \color{red}\frac{\lambda}{2}\sum_{k=1}^{n}(\mathbf{x}^{(i)}_k)^2 \tag{8}$
Minimizing $(8)$ yields the features of product $i$ from the data of the users who rated it.
For all products the cost function is:
$J(\mathbf{x}^{(1)},...,\mathbf{x}^{(n_m)})= \frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}^{}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2+ \color{red}\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(\mathbf{x}^{(i)}_k)^2 \tag{9}$

Collaborative Filtering

When both the user preference models and the product features are unknown, $(7)$ and $(9)$ can be combined:
$J(\mathbf{x},\mathbf{w},b)= \frac{1}{2}\sum_{(i,j):r(i,j)=1}^{}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2+ \color{red}\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\mathbf{w}^{(j)}_k)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(\mathbf{x}^{(i)}_k)^2 \tag{10}$
Minimizing $(10)$ yields both the product features and the user preference models.
The parameters $\mathbf{x},\mathbf{w},b$ are optimized with gradient descent:
$\left\{\begin{matrix} \mathbf{w}^{(j)}_l=\mathbf{w}^{(j)}_l-\alpha\frac{\partial}{\partial\mathbf{w}^{(j)}_l} J(\mathbf{x},\mathbf{w},b) \\ b^{(j)}=b^{(j)}-\alpha\frac{\partial}{\partial b^{(j)}} J(\mathbf{x},\mathbf{w},b) \\ \mathbf{x}^{(i)}_k=\mathbf{x}^{(i)}_k-\alpha\frac{\partial}{\partial\mathbf{x}^{(i)}_k} J(\mathbf{x},\mathbf{w},b) \end{matrix}\right. \tag{11}$
Collaborative filtering works because several users rate the same product, which makes it possible to estimate that product's features and, in turn, the users' preference models. In short, there must be enough data for the two to "fill each other in".

Collaborative filtering has a cold start problem: it is hard to predict ratings from a new user (who has rated few or no products), or ratings for a new product (which has few or no ratings).

Binary Labels

In the method above the rating $y (i, j)$ is a scalar score. Collaborative filtering still applies when users give binary feedback (like or dislike).

In the binary case, $y (i, j) = 1$ means the user likes the product and $y (i, j) = 0$ means the user does not.

As in logistic regression, the probability that $y (i, j) = 1$ is predicted by:
$f_{\mathbf{x},\mathbf{w},b}(\mathbf{x}^{(i)})=g(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} ) =\frac{1}{1+e^{-(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} )}}\tag{12}$
The loss function is:
$L(f_{\mathbf{x},\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i,j)}) = -y^{(i,j)} \log\left(f_{\mathbf{x},\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i,j)}\right) \log \left( 1 - f_{\mathbf{x},\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \tag{13}$
and the cost function:
$J(\mathbf{x},\mathbf{w},b)=\sum_{(i,j):r(i,j)=1}L(f_{\mathbf{x},\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i,j)})\tag{14}$
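
Equations $(12)$ to $(14)$ could be written in NumPy roughly as follows. This is a sketch only: `binary_cofi_cost` is a hypothetical name, the clipping constant `eps` is an added numerical safeguard, and, like $(14)$ as written, the function includes no regularization term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cofi_cost(X, W, b, Y, R, eps=1e-12):
    """Sum of the logistic loss (Eq. 13) over all rated pairs (Eq. 14).

    X: (n_m, n) product features, W: (n_u, n) user parameters,
    b: (1, n_u) user biases, Y: (n_m, n_u) binary labels,
    R: (n_m, n_u) with R[i, j] = 1 if user j rated product i.
    """
    F = sigmoid(X @ W.T + b)          # predicted P(y = 1), Eq. (12)
    F = np.clip(F, eps, 1 - eps)      # avoid log(0)
    L = -Y * np.log(F) - (1 - Y) * np.log(1 - F)
    return np.sum(L * R)              # unrated pairs contribute nothing
```

With all parameters at zero every prediction is 0.5, so each rated pair contributes $\log 2$ to the cost, which gives a quick sanity check for an implementation.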

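Mean Normalization

When $(10)$ is used to fit the preference models and some user $p$ has not rated any product, minimizing $(10)$ drives $\mathbf{w}^{(p)}$ to 0: since $r(i, p) = 0$ for every product $i$, $\mathbf{w}^{(p)}$ does not appear in the squared-error term of $(10)$, while the regularization term pushes it to be as small as possible.
So for a new user $p$ the algorithm predicts a rating of 0 for every product (assuming $b^{(p)}$ is also 0), which makes it hard to recommend anything.

Compute the mean rating received by product $i$, $\bar{y}^{(i)}=\sum_{j:r(i,j)=1}^{} \frac{y(i,j)}{n^{(i)}}$, and replace each rating by $y'(i,j)=y(i,j)-\bar{y}^{(i)}$ before training.

After mean normalization, the predicted rating of user $j$ for product $i$ is:
$\hat{y}(i,j)=\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)}+b^{(j)}+\bar{y}^{(i)}\tag{15}$

For a new user $p$ the prediction for product $i$ is then $\hat{y}(i,p)=\bar{y}^{(i)}$, so products can be recommended based on how existing users rated them.

Similarly, normalizing by the mean $\bar{y}^{(j)}$ of all ratings given by user $j$ makes it possible to recommend a new product to users who tend to rate favorably.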
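
A per-product mean normalization might look like the sketch below. `normalize_ratings` is a hypothetical function written for this illustration; it is one plausible way to compute $\bar{y}^{(i)}$ and $y'(i,j)$ from the rating matrix and the indicator matrix, not necessarily what any helper used elsewhere in this post does.

```python
import numpy as np

def normalize_ratings(Y, R):
    """Subtract each product's mean rating, using rated entries only.

    Y: (n_m, n_u) ratings, R: (n_m, n_u) indicator of rated entries.
    Returns the normalized ratings and the per-product means, shape (n_m, 1).
    """
    counts = R.sum(axis=1, keepdims=True)          # n^(i): ratings per product
    sums = (Y * R).sum(axis=1, keepdims=True)
    Y_mean = sums / np.maximum(counts, 1)          # products without ratings get mean 0
    Y_norm = (Y - Y_mean) * R                      # unrated entries stay 0
    return Y_norm, Y_mean
```

After training on `Y_norm`, the mean is added back when predicting, as in equation $(15)$.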

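Finding Related Items

Collaborative filtering yields the optimal parameters $\mathbf{x},\mathbf{w},b$. With $\mathbf{w},b$ one can predict a user's rating of a product they have not rated yet and recommend accordingly; with $\mathbf{x}$ one can find similar products to recommend.

Given the features $x^{(i)}$ of product $i$, compute the squared distance to the features $x^{(k)}$ of every other product $k$:
$||x^{(k)}-x^{(i)}||^2= \sum^{n}_{l=1}(x^{(k)}_l-x^{(i)}_l)^2\tag{16}$
The products with the smallest distance are the ones most similar to product $i$ and can be recommended on that basis.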
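
Equation $(16)$ translates directly into a nearest-neighbour lookup over the learned feature matrix. The function name `most_similar_items` and the `top` parameter are assumptions made for this sketch.

```python
import numpy as np

def most_similar_items(X, i, top=3):
    """Indices of the `top` products closest to product i under Eq. (16).

    X: (n_m, n) matrix whose rows are the learned product features.
    """
    d2 = np.sum((X - X[i]) ** 2, axis=1).astype(float)  # squared distances to product i
    d2[i] = np.inf                                      # exclude the product itself
    return np.argsort(d2)[:top]
```

For a large catalog these distances can be precomputed offline, since they depend only on the product features and not on the user.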

Implementation

TensorFlow computes derivatives automatically, so beyond building neural networks it is also very useful for other algorithms.

Use TensorFlow to run gradient descent on the following cost and find the optimal parameter:
$J(w)=(wx-y)^2$

import tensorflow as tf

# tf.Variable marks the parameter we want to optimize
w = tf.Variable(3.0)
x = 1.0  # input feature
y = 1.0  # label
alpha = 0.01  # learning rate

iterations = 30
for step in range(iterations):
    # Use TensorFlow's GradientTape to record the steps used to
    # compute the cost J, to enable automatic differentiation.
    with tf.GradientTape() as tape:
        fwb = w * x
        costJ = (fwb - y) ** 2  # cost function
    # Use the gradient tape to calculate the gradient of the cost
    # with respect to the parameter w
    [dJdw] = tape.gradient(costJ, [w])
    # Run one step of gradient descent by updating the value of w
    # to reduce the cost
    w.assign_add(-alpha * dJdw)

The collaborative filtering cost function, equation $(10)$:

def cofi_cost_func(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    nm, nu = Y.shape
    J = 0
    
    for j in range(nu):
        w = W[j,:]
        b_j = b[0,j]
        for i in range(nm):
            x = X[i,:]
            y = Y[i,j]
            r = R[i,j]
            J += np.square(r * (np.dot(w,x) + b_j - y ) )
    J += lambda_* (np.sum(np.square(W)) + np.sum(np.square(X)))
    J = J/2

    return J

Vectorized implementation of $(10)$ (more efficient):

def cofi_cost_func_v(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering
    Vectorized for speed. Uses tensorflow operations to be compatible with custom training loop.
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    j = (tf.linalg.matmul(X, tf.transpose(W)) + b - Y)*R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J

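Minimizing $(10)$ with the Adam optimizer, an adaptive variant of the gradient descent update $(11)$:

# load_Movie_List_pd, load_ratings_small, normalizeRatings and my_ratings
# are data-loading helpers and data that are not defined in this post
movieList, movieList_df = load_Movie_List_pd()

# Reload ratings and add new ratings
Y, R = load_ratings_small()
Y    = np.c_[my_ratings, Y]
R    = np.c_[(my_ratings != 0).astype(int), R]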
#  Useful Values
num_movies, num_users = Y.shape
num_features = 100

# Set Initial Parameters (W, X), use tf.Variable to track these variables
tf.random.set_seed(1234) # for consistent results
W = tf.Variable(tf.random.normal((num_users,  num_features),dtype=tf.float64),  name='W')
X = tf.Variable(tf.random.normal((num_movies, num_features),dtype=tf.float64),  name='X')
b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float64),  name='b')


# Normalize the Dataset
Ynorm, Ymean = normalizeRatings(Y, R)
# Instantiate an optimizer.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-1)

iterations = 200
lambda_ = 1
for step in range(iterations):
    # Use TensorFlow's GradientTape
    # to record the operations used to compute the cost 
    with tf.GradientTape() as tape:

        # Compute the cost (forward pass included in cost)
        cost_value = cofi_cost_func_v(X, W, b, Ynorm, R, lambda_)

    # Use the gradient tape to automatically retrieve
    # the gradients of the trainable variables with respect to the loss
    grads = tape.gradient( cost_value, [X,W,b] )

    # Run one step of gradient descent by updating
    # the value of the variables to minimize the loss.
    optimizer.apply_gradients( zip(grads, [X,W,b]) )

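Content-based Filtering

Users and products each have features of their own: user features $\mathbf{x}_{u}^{(j)}$ such as age, nationality and average rating given; product features $\mathbf{x}_{m}^{(i)}$ such as origin, price and average rating received.

Content-based filtering matches users and products based on these features.

Deep Learning for Feature Extraction

$\mathbf{x}_{u}^{(j)}$ and $\mathbf{x}_{m}^{(i)}$ generally have different sizes, so the goal is to map them to vectors $\mathbf{v}_{u}^{(j)}$ and $\mathbf{v}_{m}^{(i)}$ of equal size that can be matched.

Neural networks do the mapping: the input layers of the "user network" and the "product network" match the sizes of $\mathbf{x}_{u}^{(j)}$ and $\mathbf{x}_{m}^{(i)}$ respectively, and both output layers have the common size of $\mathbf{v}_{u}^{(j)}$ and $\mathbf{v}_{m}^{(i)}$.

The match (rating) between user $j$ and product $i$ is:
$\mathbf{v}_{u}^{(j)} \cdot \mathbf{v}_{m}^{(i)}\tag{17}$
For binary labels, the probability that user $j$ likes product $i$ is obtained with the sigmoid function $g$:
$g(\mathbf{v}_{u}^{(j)} \cdot \mathbf{v}_{m}^{(i)})\tag{18}$

The parameters of the user network and the product network are trained together, with the cost function:
$J=\sum_{(i,j):r(i,j)=1}^{}(\mathbf{v}_{u}^{(j)} \cdot \mathbf{v}_{m}^{(i)} - y^{(i,j)})^2+ \color{red}\text{NN regularization term}$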

Finding Related Items

As with $(16)$ in collaborative filtering, similar products are found with $(19)$:
$||\mathbf{v}_{m}^{(i)}-\mathbf{v}_{m}^{(k)}||^2= \sum_{l}(v_{m,l}^{(i)}-v_{m,l}^{(k)})^2\tag{19}$
where $l$ runs over the components of the vectors.

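In Practice

Computing the match score $(17)$ between a user and every product in a large catalog is impractical, so recommendation is split into retrieval and ranking.
Retrieval (use user and product features to produce a rough list of candidates):

Products similar to those the user likes (via $(19)$)
The most popular products in the categories the user likes
The most popular products in the user's region

Then remove duplicates and products the user has already viewed or purchased from the retrieved list.

Ranking: compute the match score between the user and each retrieved candidate and sort; this is the final recommendation list.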
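
The two stages can be sketched with plain NumPy, assuming the embeddings have already been computed by the two networks. Everything here is hypothetical: the function `retrieve_and_rank`, its parameters, and the choice to retrieve only by similarity to liked products (the popularity-based sources are left out for brevity).

```python
import numpy as np

def retrieve_and_rank(v_user, V_items, liked, seen, per_liked=2, top=3):
    """Retrieve candidates near the user's liked products, then rank them.

    v_user:  (d,) user embedding v_u
    V_items: (n_m, d) float matrix of product embeddings v_m
    liked:   indices of products the user likes
    seen:    indices of products the user already viewed or purchased
    """
    candidates = set()
    # Retrieval: nearest neighbours of each liked product (squared distance, Eq. 19)
    for i in liked:
        d2 = np.sum((V_items - V_items[i]) ** 2, axis=1)
        d2[i] = np.inf
        candidates.update(np.argsort(d2)[:per_liked].tolist())
    # The set removes duplicates; also drop products the user has already seen
    candidates = np.array(sorted(candidates - set(seen)), dtype=int)
    # Ranking: score only the short list with the dot product v_u . v_m (Eq. 17)
    scores = V_items[candidates] @ v_user
    order = np.argsort(-scores)
    return candidates[order][:top].tolist()
```

The expensive scoring step runs only on the retrieved candidates, which is the point of splitting the work: retrieval can rely on precomputed product-to-product distances, and ranking touches a few hundred products instead of the whole catalog.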

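Implementation

import tensorflow as tf

# num_user_features and num_item_features are the input sizes of the two
# networks; they are assumed to be defined from the data beforehand
num_outputs = 32
tf.random.set_seed(1)
user_NN = tf.keras.models.Sequential([  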
  tf.keras.layers.Dense(256, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(num_outputs), 
])

item_NN = tf.keras.models.Sequential([   
  tf.keras.layers.Dense(256, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(num_outputs),
])

# create the user input and point to the base network
input_user = tf.keras.layers.Input(shape=(num_user_features,))
vu = user_NN(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)

# create the item input and point to the base network
input_item = tf.keras.layers.Input(shape=(num_item_features,))
vm = item_NN(input_item)
vm = tf.linalg.l2_normalize(vm, axis=1)

# compute the dot product of the two vectors vu and vm
output = tf.keras.layers.Dot(axes=1)([vu, vm])

# specify the inputs and output of the model
model = tf.keras.Model([input_user, input_item], output)

# specify the cost function
cost_fn = tf.keras.losses.MeanSquaredError()