[Machine Learning] Notes (4)








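  1. For the dataset, randomly place $k$ cluster centroids ($k$ is chosen by hand).
  2. Go through all data points and assign each one to its nearest cluster centroid.
  3. For each cluster centroid, compute the mean of the data points assigned to it and move the centroid to that mean.
  4. Repeat steps 2 and 3 until neither the assignments nor the centroid positions change; the algorithm has then converged.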


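  • When randomly initializing the cluster centroids $\mu_k$ ($k$ centroids), each centroid must have the same form (same dimensionality) as the data points.
  • $c^{(i)}$ is the index of the cluster to which data point $x^{(i)}$ is assigned and takes values $1,2,...,k$; $\mu_{c^{(i)}}$ is the centroid of the cluster that contains $x^{(i)}$.
  • The distance between a centroid and a data point is computed with the squared norm $||x^{(i)}-\mu_k||^2$; when computing the mean of the points in a cluster, the mean is taken over every dimension of the data.
  • If a cluster ends up with zero data points, either delete that cluster (leaving $k-1$ clusters) or randomly re-initialize all $k$ centroids.
  • Using the squared norm means that minimizing the cost function moves each centroid to the center (mean) of its cluster.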
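The "delete the empty cluster" option from the list above could be sketched as follows. The function name `update_centroids_drop_empty` and the toy data are assumptions made for this illustration, and cluster indices are 0-based here.

```python
import numpy as np

def update_centroids_drop_empty(X, idx, K):
    """Mean of the points in each cluster; clusters with no points are dropped."""
    new_centroids = [X[idx == k].mean(axis=0) for k in range(K) if np.any(idx == k)]
    return np.array(new_centroids)

# Toy data (made up): three points, three clusters, cluster 1 is empty
X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 4.0]])
idx = np.array([0, 0, 2])
centroids = update_centroids_drop_empty(X, idx, K=3)
print(centroids.shape)  # one centroid fewer than K
```

After a cluster is dropped, the assignments have to be recomputed against the remaining centroids, because the old indices no longer line up with the rows of `centroids`.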

$$J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_k)=\frac{1}{m} \sum_{i=1}^{m} ||x^{(i)}-\mu_{c^{(i)}}||^2 \tag{1}$$
Equation $(1)$ is also called the distortion function.
Minimizing $(1)$ looks for the best assignments $c^{(1)},...,c^{(m)}$ and the best centroids $\mu_1,...,\mu_k$.
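As a rough illustration of equation (1), the distortion can be computed with NumPy as below. The function name `distortion` and the toy arrays are assumptions made for this sketch, and cluster indices are 0-based rather than the 1-based $c^{(i)}$ used in the text.

```python
import numpy as np

def distortion(X, idx, centroids):
    """Mean squared distance between each point and its assigned centroid."""
    diffs = X - centroids[idx]          # (m, n): x^(i) - mu_{c^(i)}
    return np.mean(np.sum(diffs ** 2, axis=1))

# Toy data (made up): two clusters of two points each
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
idx = np.array([0, 0, 1, 1])
J = distortion(X, idx, centroids)
print(J)
```

Every point in this toy example lies at distance 1 from its centroid, so the averaged squared distance is easy to check by hand.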


Clearly, $k<m$.
When randomly initializing the $k$ centroids $\mu_k$, the usual practice is to pick $k$ of the $m$ data points $x^{(i)}$ at random and use them as the initial centroids.


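The value of $k$ can be chosen according to what the application needs and how a given $k$ actually performs in that application. In practice a larger $k$ usually means higher cost and a smaller $k$ usually means coarser grouping, so the choice of $k$ is a trade-off that has to be settled in actual use.
Elbow method:
Try several values of $k$ and plot the distortion function against $k$. The curve is generally monotonically decreasing, because the minimum distortion can only go down as $k$ grows.
Choose the $k$ at the "elbow" of the curve, the point after which the curve falls noticeably more slowly.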
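A minimal sketch of the elbow method, assuming a basic K-means written inline so that the block is self-contained. The function name `kmeans_distortion`, the synthetic blobs, and the choice to keep the best of five random initializations per $k$ are all assumptions made for illustration.

```python
import numpy as np

def kmeans_distortion(X, k, iters=20, seed=0):
    """Run a basic K-means and return the final distortion, as in (1)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.permutation(len(X))[:k]].astype(float)
    for _ in range(iters):
        # squared distances between every point and every centroid: (m, k)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):        # leave an empty cluster where it is
                centroids[j] = X[idx == j].mean(axis=0)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

# Toy data: three well-separated blobs (made up for illustration)
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

ks = list(range(1, 7))
# keep the best of a few random initializations for each k
distortions = [min(kmeans_distortion(X, k, seed=s) for s in range(5)) for k in ks]
for k, J in zip(ks, distortions):
    print(k, round(J, 3))
```

Plotting `distortions` against `ks` (for example with `matplotlib.pyplot.plot`) gives the curve whose elbow is read off by eye.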



import numpy as np

def kMeans_init_centroids(X, K):
    """
    Initializes K centroids that are to be
    used in K-Means on the dataset X.

    Args:
        X (ndarray): Data points
        K (int):     number of centroids/clusters
    Returns:
        centroids (ndarray): Initialized centroids
    """
    # Randomly reorder the indices of examples
    randidx = np.random.permutation(X.shape[0])
    # Take the first K examples as centroids
    centroids = X[randidx[:K]]
    return centroids


def find_closest_centroids(X, centroids):
    """
    Computes the centroid memberships for every example.

    Args:
        X (ndarray): (m, n) Input values
        centroids (ndarray): (K, n) centroids
    Returns:
        idx (ndarray): (m,) index of the closest centroid for each example
    """
    # Set K
    K = centroids.shape[0]
    idx = np.zeros(X.shape[0], dtype=int)

    for i in range(X.shape[0]):
        # List to hold the distance between X[i] and each centroids[j]
        distance = []
        for j in range(K):
            # Euclidean norm of (X[i] - centroids[j])
            norm_ij = np.linalg.norm(X[i] - centroids[j])
            distance.append(norm_ij)
        # Index of the minimum value in distance
        idx[i] = np.argmin(distance)
    return idx


def compute_centroids(X, idx, K):
    """
    Returns the new centroids by computing the means of the
    data points assigned to each centroid.

    Args:
        X (ndarray):   (m, n) Data points
        idx (ndarray): (m,) Array containing index of closest centroid for each
                       example in X. Concretely, idx[i] contains the index of
                       the centroid closest to example i
        K (int):       number of centroids
    Returns:
        centroids (ndarray): (K, n) New centroids computed
    """
    m, n = X.shape
    centroids = np.zeros((K, n))
    for k in range(K):
        # All data points in X assigned to centroid k
        points = X[idx == k]
        # Mean of the assigned points (NaN if the cluster is empty)
        centroids[k] = np.mean(points, axis=0)
    return centroids


def run_kMeans(X, initial_centroids, max_iters=10, plot_progress=False):
    """
    Runs the K-Means algorithm on data matrix X, where each row of X
    is a single example.
    """
    # Initialize values
    m, n = X.shape
    K = initial_centroids.shape[0]
    centroids = initial_centroids
    previous_centroids = centroids
    idx = np.zeros(m)
    # Run K-Means
    for i in range(max_iters):
        # Output progress
        print("K-Means iteration %d/%d" % (i, max_iters-1))
        # For each example in X, assign it to the closest centroid
        idx = find_closest_centroids(X, centroids)
        # Optionally plot progress (plot_progress_kMeans is a plotting
        # helper that is not defined in these notes)
        if plot_progress:
            plot_progress_kMeans(X, centroids, previous_centroids, idx, K, i)
            previous_centroids = centroids
        # Given the memberships, compute new centroids
        centroids = compute_centroids(X, idx, K)
    return centroids, idx

Anomaly Detection



Density estimation

This algorithm tries to build a probability distribution model $p(x)$ of the training set; if $p(x_{test})<\varepsilon$, the new data point is considered anomalous.

Gaussian distribution

Also called the normal distribution or the bell-shaped distribution.
$$p(x ; \mu,\sigma ^2) = \frac{1}{\sqrt{2 \pi\sigma^2 }}e^{ - \frac{(x - \mu)^2}{2 \sigma ^2} }\tag{2}$$
where $\mu$ is the mean and $\sigma^2$ is the variance. For feature $j$ they are estimated from the training set as:
$$\left\{\begin{matrix} \mu_j = \frac{1}{m} \sum_{i=1}^m x_j^{(i)} \\ \sigma_j^2 = \frac{1}{m} \sum_{i=1}^m (x_j^{(i)} - \mu_j)^2 \end{matrix}\right. \tag{3}$$

The dataset is $\{x^{(1)}, ..., x^{(m)}\}$, where each $x^{(i)}$ is a vector of length $n$, i.e. $x^{(i)}=[x^{(i)}_1,x^{(i)}_2,...,x^{(i)}_n]$.
For example, there may be samples from $m$ engines, each containing $n$ kinds of measurements such as heat output and vibration.

$$p(x^{(i)})=\prod_{j=1}^{n}p(x^{(i)}_j; \mu_j,\sigma ^2_j) \tag{4}$$
Equation $(4)$ assumes that $x^{(i)}_1$ through $x^{(i)}_n$ are statistically independent, but in practice the algorithm still works when they are not.

  1. Build the dataset: choose $m$ vectors $x^{(i)}$ of length $n$ as the features for anomaly detection.
  2. Use $(3)$ to compute $\mu_j$ and $\sigma^2_j$ for each feature.
  3. For a new data point $x^{test}$, compute $p(x^{test})$ and use the threshold $\varepsilon$ to decide whether it is anomalous.

From the algorithm above, if even one of the input features is anomalous ($p(x^{test}_j; \mu_j,\sigma ^2_j)$ is very small), the final $p(x^{test})$ will also be very small.
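The three steps above can be sketched in NumPy as follows. The function names `fit_gaussian` and `density`, the synthetic training data, and the threshold value are assumptions chosen for illustration only.

```python
import numpy as np

def fit_gaussian(X):
    """Per-feature mean and variance, as in (3) (variance uses 1/m)."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    return mu, var

def density(X, mu, var):
    """p(x) as the product of per-feature Gaussian densities, as in (4)."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * var)
    expo = np.exp(-((X - mu) ** 2) / (2.0 * var))
    return np.prod(coef * expo, axis=1)

# Synthetic "normal" training data with two features (made up)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, 10.0], scale=[1.0, 2.0], size=(500, 2))
mu, var = fit_gaussian(X_train)

x_test = np.array([[5.0, 10.0],    # typical point
                   [5.0, 30.0]])   # second feature far from its mean
p = density(x_test, mu, var)
epsilon = 1e-4                     # arbitrary threshold for this sketch
is_anomaly = p < epsilon
print(p, is_anomaly)
```

Only the second feature of the second test point is unusual, yet the product in (4) makes the whole $p(x)$ tiny, which is the behaviour described in the text.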



Following this idea, create a training set $\{x^{(1)}, ..., x^{(m)}\}$, a validation set $\{x^{(1)}_{val}, ..., x^{(m_{val})}_{val}\}$ and a test set $\{x^{(1)}_{test}, ..., x^{(m_{test})}_{test}\}$. The training set contains normal data only (since everything in it is normal it is effectively unlabeled, so this is still unsupervised learning), while the validation and test sets contain some anomalous examples.

Note that anomalous examples are usually far fewer than normal ones, i.e. the dataset is skewed. In that case precision, recall and the $F_1$ score should be used to evaluate the model.

With these metrics, the validation set is used to tune the model's hyperparameter (here the threshold $\varepsilon$).






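Features that follow a Gaussian distribution make it easier for the model to fit the data. For a feature that is not Gaussian, consider dropping it or transforming it toward a Gaussian (for example, replacing $x$ with $\log(x+b)$ or $\sqrt{x}$).
The distribution of $x$, or of a transformed $x$, can be inspected as follows: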

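import numpy as np
import matplotlib.pyplot as plt

# x is assumed to be a 1-D array holding the non-negative values of one feature
plt.hist(x, bins=50)               # original feature
plt.hist(x**0.5, bins=50)          # square-root transform
plt.hist(np.log(x + 10), bins=50)  # log transform with offset b = 10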





def estimate_gaussian(X):
    """
    Calculates mean and variance of all features
    in the dataset.

    Args:
        X (ndarray): (m, n) Data matrix
    Returns:
        mu (ndarray): (n,) Mean of all features
        var (ndarray): (n,) Variance of all features
    """
    m, n = X.shape
    # Mean of every feature
    mu = 1 / m * np.sum(X, axis=0)
    # Variance of every feature
    var = 1 / m * np.sum((X - mu) ** 2, axis=0)
    return mu, var


def select_threshold(y_val, p_val):
    """
    Finds the best threshold to use for selecting outliers
    based on the results from a validation set (p_val)
    and the ground truth (y_val).

    Args:
        y_val (ndarray): Ground truth on validation set
        p_val (ndarray): Results on validation set
    Returns:
        epsilon (float): Threshold chosen
        F1 (float):      F1 score by choosing epsilon as threshold
    """
    best_epsilon = 0
    best_F1 = 0
    step_size = (max(p_val) - min(p_val)) / 1000
    for epsilon in np.arange(min(p_val), max(p_val), step_size):
        # Predictions for each example using epsilon as threshold
        predictions = (p_val < epsilon)
        tp = np.sum((predictions == 1) & (y_val == 1))  # true positives
        fp = np.sum((predictions == 1) & (y_val == 0))  # false positives
        fn = np.sum((predictions == 0) & (y_val == 1))  # false negatives
        if tp == 0:
            # Precision and recall are 0 or undefined; F1 cannot beat best_F1
            continue
        prec = tp / (tp + fp)  # precision
        rec = tp / (tp + fn)   # recall
        F1 = 2 * prec * rec / (prec + rec)
        if F1 > best_F1:
            best_F1 = F1
            best_epsilon = epsilon
    return best_epsilon, best_F1


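# Training: fit the Gaussian parameters
# (multivariate_gaussian is a helper that is not defined in these notes;
#  it is assumed to return p(x) for every row of its first argument)
mu, var = estimate_gaussian(X_train)
p = multivariate_gaussian(X_train, mu, var)
# Validation: choose the threshold epsilon
p_val = multivariate_gaussian(X_val, mu, var)
epsilon, F1 = select_threshold(y_val, p_val)
# Find the anomalies
outliers = p < epsilon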

Recommender Systems




| Notation | Description |
|---|---|
| $r(i,j)$ | scalar; = 1 if user $j$ rated movie $i$, = 0 otherwise |
| $y(i,j)$ | scalar; rating given by user $j$ on movie $i$ (defined only if $r(i,j)=1$) |
| $\mathbf{w}^{(j)}$ | vector; parameters for user $j$ |
| $b^{(j)}$ | scalar; parameter for user $j$ |
| $m^{(j)}$ | scalar; number of movies rated by user $j$ |
| $n^{(i)}$ | scalar; number of users who rated movie $i$ |
| $\mathbf{x}^{(i)}$ | vector; feature ratings for movie $i$ |
| $n_u$ | number of users |
| $n_m$ | number of movies |
| $n$ | number of features |
| $\mathbf{X}$ | matrix of vectors $\mathbf{x}^{(i)}$ |
| $\mathbf{W}$ | matrix of vectors $\mathbf{w}^{(j)}$ |
| $\mathbf{b}$ | vector of bias parameters $b^{(j)}$ |
| $\mathbf{R}$ | matrix of elements $r(i,j)$ |
| $\mathbf{x}_{m}^{(i)}$ | vector; features of movie $i$ |
| $\mathbf{x}_{u}^{(j)}$ | vector; features of user $j$ |

Here $\mathbf{w}^{(j)}$ and $\mathbf{x}^{(i)}$ both have length $n$.

The predicted rating of product $i$ by user $j$ is:
$$\hat{y}(i,j)=\mathbf{w}^{(j)}\cdot\mathbf{x}^{(i)}+b^{(j)}\tag{5}$$
When the product features and the users' ratings are known, the parameters $\mathbf{w}^{(j)}$ and $b^{(j)}$ have to be learned to obtain a model of each user's preferences.
For user $j$, the cost function is:
$$J({\mathbf{w}^{(j)},b^{(j)}})= \frac{1}{2m^{(j)}}\sum_{i:r(i,j)=1}^{}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2+ \color{red}\frac{\lambda}{2m^{(j)}}\sum_{k=1}^{n}(\mathbf{w}^{(j)}_k)^2 \tag{6}$$

Since $m^{(j)}$ in $(6)$ is a constant, it does not change the parameters that minimize the cost and can be dropped. Summing over all users then gives:
$$J(\mathbf{w}^{(1)},b^{(1)},...,\mathbf{w}^{(n_u)},b^{(n_u)})= \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}^{}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2+ \color{red}\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\mathbf{w}^{(j)}_k)^2 \tag{7}$$


Because several users have already rated product $i$, the product features $\mathbf{x}^{(i)}$ can be inferred from known user preferences ($\mathbf{w}^{(j)}$, $b^{(j)}$). The cost function is then:
$$J({\mathbf{x}^{(i)}})= \frac{1}{2}\sum_{j:r(i,j)=1}^{}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2+ \color{red}\frac{\lambda}{2}\sum_{k=1}^{n}(\mathbf{x}^{(i)}_k)^2 \tag{8}$$
Minimizing $(8)$ yields the features of product $i$ from the data of the users who rated it. Summing over all products:
$$J(\mathbf{x}^{(1)},...,\mathbf{x}^{(n_m)})= \frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}^{}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2+ \color{red}\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(\mathbf{x}^{(i)}_k)^2 \tag{9}$$

Collaborative filtering

In summary, when both the user preference model and the product features are unknown, $(7)$ and $(9)$ can be combined into:
$$J(\mathbf{x},\mathbf{w},b)= \frac{1}{2}\sum_{(i,j):r(i,j)=1}^{}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2+ \color{red}\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\mathbf{w}^{(j)}_k)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(\mathbf{x}^{(i)}_k)^2 \tag{10}$$
Minimizing $(10)$ yields both the product features and the user preference model.
The parameters $\mathbf{x},\mathbf{w},b$ are optimized with gradient descent:
$$\left\{\begin{matrix} \mathbf{w}^{(j)}_l=\mathbf{w}^{(j)}_l-\alpha\frac{\partial}{\partial\mathbf{w}^{(j)}_l} J(\mathbf{x},\mathbf{w},b) \\ b^{(j)}=b^{(j)}-\alpha\frac{\partial}{\partial b^{(j)}} J(\mathbf{x},\mathbf{w},b) \\ \mathbf{x}^{(i)}_k=\mathbf{x}^{(i)}_k-\alpha\frac{\partial}{\partial\mathbf{x}^{(i)}_k} J(\mathbf{x},\mathbf{w},b) \end{matrix}\right. \tag{11}$$
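For reference, the partial derivatives that appear in the gradient descent updates can be worked out term by term from the collaborative filtering cost (squared error plus the two $\frac{\lambda}{2}$ regularization terms). This is a derivation sketch, with $e^{(i,j)}$ introduced here as shorthand for the prediction error:

$$
\begin{aligned}
e^{(i,j)} &= \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)} \\
\frac{\partial J}{\partial \mathbf{w}^{(j)}_l} &= \sum_{i:r(i,j)=1} e^{(i,j)}\,\mathbf{x}^{(i)}_l + \lambda\,\mathbf{w}^{(j)}_l \\
\frac{\partial J}{\partial b^{(j)}} &= \sum_{i:r(i,j)=1} e^{(i,j)} \\
\frac{\partial J}{\partial \mathbf{x}^{(i)}_k} &= \sum_{j:r(i,j)=1} e^{(i,j)}\,\mathbf{w}^{(j)}_k + \lambda\,\mathbf{x}^{(i)}_k
\end{aligned}
$$

The factor $\frac{1}{2}$ in the cost cancels the 2 produced by differentiating the square, and $b^{(j)}$ has no regularization term, so its derivative is just the summed error.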



In the method above, a user's rating of a product $y(i,j)$ is a scalar "score". Collaborative filtering still applies when users give products a binary rating (like or dislike).

In the binary case, $y(i,j)=1$ means the user likes the product and $y(i,j)=0$ means the user does not.

As in logistic regression, the probability that $y(i,j)=1$ is predicted by:
$$f_{\mathbf{x},\mathbf{w},b}(\mathbf{x}^{(i)})=g(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} ) =\frac{1}{1+e^{-(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} )}}\tag{12}$$
The loss for a single rating and the overall cost are:
$$L(f_{\mathbf{x},\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i,j)}) = -y^{(i,j)} \log\left(f_{\mathbf{x},\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i,j)}\right) \log \left( 1 - f_{\mathbf{x},\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \tag{13}$$
$$J(\mathbf{x},\mathbf{w},b)=\sum_{(i,j):r(i,j)=1}L(f_{\mathbf{x},\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i,j)})\tag{14}$$
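A minimal NumPy sketch of the binary-label cost (14), assuming the same matrix shapes as the rest of these notes (`X` is items by features, `W` is users by features, `b` is 1 by users). The function names and the clipping constant are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cofi_cost(X, W, b, Y, R):
    """Sum of binary cross-entropy losses over the rated (i, j) pairs."""
    F = sigmoid(X @ W.T + b)                 # (num_items, num_users)
    F = np.clip(F, 1e-12, 1 - 1e-12)         # avoid log(0)
    L = -Y * np.log(F) - (1 - Y) * np.log(1 - F)
    return np.sum(R * L)                     # only entries with r(i,j) = 1 count

# Toy example (made up): 3 items, 2 users, 2 features, all parameters zero
X = np.zeros((3, 2))
W = np.zeros((2, 2))
b = np.zeros((1, 2))
Y = np.array([[1, 0], [0, 1], [0, 0]])
R = np.array([[1, 0], [1, 1], [0, 1]])
cost = binary_cofi_cost(X, W, b, Y, R)
print(cost)
```

With all parameters at zero every predicted probability is 0.5, so each of the four rated entries contributes $\log 2$ to the cost, which makes the result easy to verify.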

Mean normalization

When fitting the user preference model with $(10)$, if some user $p$ has not rated any product, minimizing $(10)$ drives that user's parameters $\mathbf{w}^{(p)}$ to 0: since $r(i,p)=0$ for every product $i$, $\mathbf{w}^{(p)}$ has no effect on the squared-error term of $(10)$, while the regularization term pushes $\mathbf{w}^{(p)}$ to be as small as possible.
So for a new user $p$ the algorithm predicts a rating of 0 for every product (assuming $b^{(p)}$ is also 0), which makes it hard to recommend anything.

Compute the mean rating received by product $i$, $\bar{y}^{(i)}=\frac{1}{n^{(i)}}\sum_{j:r(i,j)=1}^{} y(i,j)$, and replace each rating by $y'(i,j)=y(i,j)-\bar{y}^{(i)}$.

After this mean normalization, the predicted rating of product $i$ by user $j$ is:
$$\hat{y}(i,j)=\mathbf{w}^{(j)}\cdot\mathbf{x}^{(i)}+b^{(j)}+\bar{y}^{(i)}\tag{15}$$

For a new user $p$, the algorithm then predicts $\hat{y}(i,p)=\bar{y}^{(i)}$ for product $i$, so products can be recommended to the new user based on how existing users rated them.

Similarly, normalizing by the mean $\bar{y}^{(j)}$ of all ratings given by user $j$ makes it possible to recommend a new product to users who tend to give high ratings.
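Per-product mean normalization could be implemented along these lines. The function name `normalize_ratings` and the small rating matrix are assumptions for illustration; this is a sketch, not necessarily how the `normalizeRatings` helper used later in these notes is written.

```python
import numpy as np

def normalize_ratings(Y, R):
    """Subtract each item's mean rating, using only entries with R == 1."""
    counts = R.sum(axis=1, keepdims=True)                  # n^(i) per item
    Ymean = (Y * R).sum(axis=1, keepdims=True) / np.maximum(counts, 1)
    Ynorm = (Y - Ymean) * R                                # unrated entries stay 0
    return Ynorm, Ymean

# Toy ratings (made up): 3 items x 3 users; the third user has rated nothing
Y = np.array([[5.0, 4.0, 0.0],
              [0.0, 1.0, 0.0],
              [2.0, 0.0, 0.0]])
R = np.array([[1, 1, 0],
              [0, 1, 0],
              [1, 0, 0]])
Ynorm, Ymean = normalize_ratings(Y, R)

# For a user with w = 0 and b = 0, the prediction is just the item mean
pred_new_user = 0.0 + Ymean[:, 0]
print(Ymean.ravel(), pred_new_user)
```

The `np.maximum(counts, 1)` guard only prevents a division by zero for an item that nobody has rated; such an item simply gets a mean of 0 in this sketch.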


Collaborative filtering produces the optimal parameters $\mathbf{x},\mathbf{w},b$. With $\mathbf{w},b$ we can predict the rating a user would give to a product they have not rated yet and recommend accordingly; with $\mathbf{x}$ we can find similar products to recommend.

For the features $x^{(i)}$ of product $i$, compute the squared distance to the features $x^{(k)}$ of every other product $k$:
$$||x^{(k)}-x^{(i)}||^2= \sum^{n}_{l=1}(x^{(k)}_l-x^{(i)}_l)^2\tag{16}$$
The products with the smallest distance are the ones most similar to product $i$ and can be recommended on that basis.
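The similarity search based on (16) can be sketched as below. The function name `most_similar` and the toy feature matrix are assumptions made for illustration.

```python
import numpy as np

def most_similar(X, i, top=3):
    """Indices of the `top` items whose features are closest to item i."""
    X = np.asarray(X, dtype=float)
    d2 = np.sum((X - X[i]) ** 2, axis=1)   # squared distance to every item
    d2[i] = np.inf                         # exclude the item itself
    return np.argsort(d2)[:top]

# Toy learned features (made up): 4 items, 2 features
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 0.1],
              [5.0, 5.0]])
neighbours = most_similar(X, i=0, top=2)
print(neighbours)
```

Item 2 is closest to item 0 (squared distance 0.01), followed by item 1 (squared distance 1), so those two are returned in that order.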



$$J(w)=(w\cdot x-y)^2$$

import tensorflow as tf

w = tf.Variable(3.0)  # tf.Variable marks the parameter we want to optimize
x = 1.0       # input feature
y = 1.0       # label
alpha = 0.01  # learning rate

iterations = 30
for iter in range(iterations):
    # Use TensorFlow's GradientTape to record the steps used to
    # compute the cost J, to enable automatic differentiation.
    with tf.GradientTape() as tape:
        fwb = w * x
        costJ = (fwb - y)**2  # cost function
    # Use the gradient tape to calculate the gradient of the cost
    # with respect to the parameter w
    [dJdw] = tape.gradient(costJ, [w])
    # Run one step of gradient descent by updating the value of w
    # to reduce the cost
    w.assign_add(-alpha * dJdw)

The collaborative filtering cost function, i.e. equation $(10)$:

import numpy as np

def cofi_cost_func(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering.

    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    nm, nu = Y.shape
    J = 0
    for j in range(nu):
        w = W[j,:]
        b_j = b[0,j]
        for i in range(nm):
            x = X[i,:]
            y = Y[i,j]
            r = R[i,j]
            # r is 0 or 1, so unrated entries contribute nothing
            J += np.square(r * (np.dot(w,x) + b_j - y ) )
    # Regularization; the division by 2 below applies to both parts of the cost
    J += lambda_* (np.sum(np.square(W)) + np.sum(np.square(X)))
    J = J/2

    return J

Vectorized implementation of $(10)$ (more efficient):

def cofi_cost_func_v(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering.
    Vectorized for speed. Uses tensorflow operations to be compatible with custom training loop.

    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    # Prediction errors, zeroed where no rating exists
    j = (tf.linalg.matmul(X, tf.transpose(W)) + b - Y)*R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J

Gradient descent, i.e. equation $(11)$, implemented with the Adam optimizer:

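import numpy as np
import tensorflow as tf
from tensorflow import keras

# load_Movie_List_pd, load_ratings_small, normalizeRatings and my_ratings come
# from helper code that is not shown in these notes
movieList, movieList_df = load_Movie_List_pd()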

# Reload ratings and add new ratings
Y, R = load_ratings_small()
Y    = np.c_[my_ratings, Y]
R    = np.c_[(my_ratings != 0).astype(int), R]
#  Useful Values
num_movies, num_users = Y.shape
num_features = 100

# Set Initial Parameters (W, X), use tf.Variable to track these variables
tf.random.set_seed(1234) # for consistent results
W = tf.Variable(tf.random.normal((num_users,  num_features),dtype=tf.float64),  name='W')
X = tf.Variable(tf.random.normal((num_movies, num_features),dtype=tf.float64),  name='X')
b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float64),  name='b')

# Normalize the Dataset
Ynorm, Ymean = normalizeRatings(Y, R)
# Instantiate an optimizer.
optimizer = keras.optimizers.Adam(learning_rate=1e-1)

iterations = 200
lambda_ = 1
for iter in range(iterations):
    # Use TensorFlow’s GradientTape
    # to record the operations used to compute the cost 
    with tf.GradientTape() as tape:

        # Compute the cost (forward pass included in cost)
        cost_value = cofi_cost_func_v(X, W, b, Ynorm, R, lambda_)

    # Use the gradient tape to automatically retrieve
    # the gradients of the trainable variables with respect to the loss
    grads = tape.gradient( cost_value, [X,W,b] )

    # Run one step of gradient descent by updating
    # the value of the variables to minimize the loss.
    optimizer.apply_gradients( zip(grads, [X,W,b]) )


Both users and products have features of their own: user features $\mathbf{x}_{u}^{(j)}$ such as age, nationality and average rating given; product features $\mathbf{x}_{m}^{(i)}$ such as place of origin, price and average rating received.



$\mathbf{x}_{u}^{(j)}$ and $\mathbf{x}_{m}^{(i)}$ generally differ in size, so the goal is to extract from them two vectors of the same size, $\mathbf{v}_{u}^{(j)}$ and $\mathbf{v}_{m}^{(i)}$, that can be matched against each other.

Neural networks are used for the extraction. The input layers of the "user network" and the "product network" match the sizes of $\mathbf{x}_{u}^{(j)}$ and $\mathbf{x}_{m}^{(i)}$ respectively, and their output layers have the size of $\mathbf{v}_{u}^{(j)}$ and $\mathbf{v}_{m}^{(i)}$.

The match score (rating) between user $j$ and product $i$ is:
$$P= \mathbf{v}_{u}^{(j)}\cdot\mathbf{v}_{m}^{(i)}\tag{17}$$
For binary labels, the probability that user $j$ matches (likes) product $i$ can be computed with the sigmoid function:
$$P=g( \mathbf{v}_{u}^{(j)}\cdot\mathbf{v}_{m}^{(i)})\tag{18}$$

$$J=\sum_{(i,j):r(i,j)=1}^{}(P - y^{(i,j)})^2+ \color{red}\text{neural network regularization term}$$


As with $(16)$ in collaborative filtering, $(19)$ is used to find similar products:
$$||\mathbf{v}_{m}^{(i)}-\mathbf{v}_{m}^{(k)}||^2= \sum^{n}_{l=1}(\mathbf{v}_{m,l}^{(i)}-\mathbf{v}_{m,l}^{(k)})^2\tag{19}$$


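In practice, computing the match score $P$ between a user and a very large number of products is not feasible, so retrieval and ranking (Retrieval & Ranking) can be used instead.
Retrieval (use user and product features to pick out a rough list of candidate products):

  • Products similar to the ones this user likes (via $(19)$)
  • The most popular products in the categories the user likes
  • The most popular products in the user's region


Ranking: compute the match score between the user and each retrieved candidate and sort by it; the result is the final recommendation list.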
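A toy sketch of the two stages, assuming the item vectors $\mathbf{v}_m$ and the user vector $\mathbf{v}_u$ have already been computed. The function names `retrieve` and `rank`, the embeddings, and the choice of two candidates per liked item are all assumptions made for illustration; only the first retrieval rule (similar items) is shown.

```python
import numpy as np

def retrieve(Vm, liked, per_item=2):
    """Candidate set: items closest (squared distance) to each liked item."""
    candidates = set()
    for i in liked:
        d2 = np.sum((Vm - Vm[i]) ** 2, axis=1)
        d2[list(liked)] = np.inf            # never re-recommend liked items
        candidates.update(np.argsort(d2)[:per_item].tolist())
    return sorted(candidates)

def rank(vu, Vm, candidates):
    """Order the candidates by the match score vu . vm, highest first."""
    scores = Vm[candidates] @ vu
    order = np.argsort(-scores)
    return [candidates[t] for t in order]

# Made-up item embeddings (5 items, 2 dimensions) and one user embedding
Vm = np.array([[1.0, 0.0],
               [0.9, 0.1],
               [0.0, 1.0],
               [0.8, 0.2],
               [-1.0, 0.0]])
vu = np.array([1.0, 0.0])

candidates = retrieve(Vm, liked=[0], per_item=2)
recommended = rank(vu, Vm, candidates)
print(candidates, recommended)
```

The point of the split is that the distance search can be precomputed per item, so the more expensive match score only has to be evaluated for the handful of retrieved candidates.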

import tensorflow as tf

# num_user_features and num_item_features are the widths of the user and item
# feature vectors; they depend on the dataset and are not defined in these notes
num_outputs = 32
user_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs),
])

item_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs),
])

# create the user input and point to the base network
input_user = tf.keras.layers.Input(shape=(num_user_features,))
vu = user_NN(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)

# create the item input and point to the base network
input_item = tf.keras.layers.Input(shape=(num_item_features,))
vm = item_NN(input_item)
vm = tf.linalg.l2_normalize(vm, axis=1)

# compute the dot product of the two vectors vu and vm
output = tf.keras.layers.Dot(axes=1)([vu, vm])

# specify the inputs and output of the model
model = tf.keras.Model([input_user, input_item], output)

# specify the cost function
cost_fn = tf.keras.losses.MeanSquaredError()


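  • Collaborative filtering:
    recommends by predicting ratings from the ratings given by other users
  • Content-based filtering:
    recommends by matching user features against product features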


