Andrew Ng cs229 | Week 8 Programming Assignment (Python)

Exercise 8: Anomaly Detection and Recommender Systems

Contents:

1. Included Files

2. Anomaly Detection

3. Recommender Systems

1. Included Files

File name                     Purpose
ex8.py                        Anomaly detection experiment
ex8_cofi.py                   Recommender system experiment
ex8data1.mat                  Anomaly detection dataset 1
ex8data2.mat                  Anomaly detection dataset 2
ex8_movies.mat                Movie ratings dataset
ex8_movieParams.mat           Pre-trained parameters
multivariateGaussian.py       Multivariate Gaussian distribution
visualizeFit.py               Data visualization
checkCostFunction.py          Gradient checking for collaborative filtering
computeNumericalGradient.py   Numerical gradient approximation
loadMovieList.py              Load the movie list
movie_ids.txt                 List of movie titles
normalizeRatings.py           Mean normalization for collaborative filtering
estimateGaussian.py           Gaussian distribution parameter estimation
selectThreshold.py            Threshold selection for anomaly detection
cofiCostFunc.py               Collaborative filtering cost function

Note: the files highlighted in red in the original post (estimateGaussian.py, selectThreshold.py, and cofiCostFunc.py) are the ones you need to complete yourself.

2. Anomaly Detection

  • Import the required packages and initialize:
import matplotlib.pyplot as plt
import numpy as np
import scipy.io as scio

import estimateGaussian as eg
import multivariateGaussian as mvg
import visualizeFit as vf
import selectThreshold as st

plt.ion()
# np.set_printoptions(formatter={'float': '{: 0.6f}'.format})

2.1 Data Visualization

# ===================== Part 1: Load Example Dataset =====================
# We start this exercise by using a small dataset that is easy to visualize.
#
# Our example case consists of two network server statistics across
# several machines: the latency and throughput of each machine.
# This exercise will help us find possibly faulty (or very fast) machines
#

print('Visualizing example dataset for outlier detection.')

#  The following command loads the dataset. You should now have the
#  variables X, Xval, yval in your environment.
data = scio.loadmat('ex8data1.mat')
X = data['X']
Xval = data['Xval']
yval = data['yval'].flatten()

# Visualize the example dataset
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c='b', marker='x', s=15, linewidth=1)
plt.axis([0, 30, 0, 30])
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')

input('Program paused. Press ENTER to continue')
  • Visualization result (figure omitted).

2.2 Estimating the Probability Distribution

  • To perform anomaly detection, we first need to fit a model to the distribution of the data. The Gaussian distribution is:

    $p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

  • To estimate the mean, use:

    $\mu_i = \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)}$

  • And for the variance:

    $\sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} \left(x_i^{(j)} - \mu_i\right)^2$

  • Write the parameter estimation routine estimateGaussian.py:
import numpy as np


def estimate_gaussian(X):
    # Useful variables
    m, n = X.shape

    # You should return these values correctly
    mu = np.zeros(n)
    sigma2 = np.zeros(n)

    # ===================== Your Code Here =====================
    # Instructions: Compute the mean of the data and the variances
    #               In particular, mu[i] should contain the mean of
    #               the data for the i-th feature and sigma2[i]
    #               should contain variance of the i-th feature
    #
    mu = (1 / m) * X.sum(axis=0)

    sigma2 = (1 / m) * ((X - mu) ** 2).sum(axis=0)

    # ==========================================================

    return mu, sigma2
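  • As a quick sanity check (my own addition, not part of the assignment), the estimates should agree with NumPy's built-in mean and population variance:

import numpy as np
import estimateGaussian as eg

# Hypothetical synthetic data: 100 examples, 2 features
X = np.random.randn(100, 2) * np.array([1.5, 0.5]) + np.array([3.0, -1.0])
mu, sigma2 = eg.estimate_gaussian(X)

assert np.allclose(mu, X.mean(axis=0))     # mu[i] is the mean of feature i
assert np.allclose(sigma2, X.var(axis=0))  # np.var defaults to ddof=0, i.e. the same 1/m normalization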
  • Estimate the probability distribution of the training set:
# ===================== Part 2: Estimate the dataset statistics =====================
# For this exercise, we assume a Gaussian distribution for the dataset.
#
# We first estimate the parameters of our assumed Gaussian distribution,
# then compute the probabilities for each of the points and then visualize
# both the overall distribution and where each of the points falls in
# terms of that distribution
#
print('Visualizing Gaussian fit.')

# Estimate mu and sigma2
mu, sigma2 = eg.estimate_gaussian(X)

# Returns the density of the multivariate normal at each data point(row) of X
p = mvg.multivariate_gaussian(X, mu, sigma2)

# Visualize the fit
vf.visualize_fit(X, mu, sigma2)
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')

input('Program paused. Press ENTER to continue')
  • Review the density computation routine multivariateGaussian.py:
import numpy as np


def multivariate_gaussian(X, mu, sigma2):
    # number of features
    k = mu.size

    # If sigma2 is a vector of per-feature variances (the univariate model),
    # convert it into a diagonal covariance matrix and plug it into the
    # multivariate Gaussian formula; the two models are then equivalent.
    # If sigma2 is already a full covariance matrix (the multivariate model),
    # use it directly.
    if sigma2.ndim == 1 or (sigma2.ndim == 2 and (sigma2.shape[1] == 1 or sigma2.shape[0] == 1)):
        sigma2 = np.diag(sigma2.flatten())

    x = X - mu
    p = (2 * np.pi) ** (-k / 2) * np.linalg.det(sigma2) ** (-0.5) * np.exp(-0.5*np.sum(np.dot(x, np.linalg.pinv(sigma2)) * x, axis=1))

    return p
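  • As a cross-check (my own addition, assuming SciPy is available), the density should agree with scipy.stats.multivariate_normal once the variance vector is expanded into a diagonal covariance matrix:

import numpy as np
from scipy.stats import multivariate_normal
import multivariateGaussian as mvg

X = np.random.randn(5, 2)          # hypothetical sample points
mu = np.array([0.0, 0.0])
sigma2 = np.array([1.0, 2.0])      # per-feature variances

p_ours = mvg.multivariate_gaussian(X, mu, sigma2)
p_ref = multivariate_normal.pdf(X, mean=mu, cov=np.diag(sigma2))
assert np.allclose(p_ours, p_ref)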
  • Review the visualization routine visualizeFit.py:
import matplotlib.pyplot as plt
import numpy as np
import multivariateGaussian as mvg


def visualize_fit(X, mu, sigma2):
    # Generate a grid of points
    grid = np.arange(0, 35.5, 0.5)
    x1, x2 = np.meshgrid(grid, grid)
    # Compute the density at every grid point
    Z = mvg.multivariate_gaussian(np.c_[x1.flatten('F'), x2.flatten('F')], mu, sigma2)
    Z = Z.reshape(x1.shape, order='F')

    plt.figure()
    plt.scatter(X[:, 0], X[:, 1], marker='x', c='b', s=15, linewidth=1)

    # Do not plot if there are infinities
    # Draw contour lines of the fitted density over the training set
    if np.sum(np.isinf(X)) == 0:
        lvls = 10 ** np.arange(-20, 0, 3).astype(float)
        plt.contour(x1, x2, Z, levels=lvls, colors='r', linewidths=0.7)
  • Visualization result (figure omitted).

2.3 Selecting the Threshold

  • The F1 score is computed from precision (prec) and recall (rec):

    $F_1 = \frac{2 \cdot prec \cdot rec}{prec + rec}, \qquad prec = \frac{tp}{tp + fp}, \qquad rec = \frac{tp}{tp + fn}$

  • Write the threshold selection routine selectThreshold.py:
import numpy as np


def select_threshold(yval, pval):
    f1 = 0

    # You have to return these values correctly
    best_eps = 0  # best threshold found so far
    best_f1 = 0  # F1 score achieved by the best threshold

    for epsilon in np.linspace(np.min(pval), np.max(pval), num=1001):
        # ===================== Your Code Here =====================
        # Instructions: Compute the F1 score of choosing epsilon as the
        #               threshold and place the value in F1. The code at the
        #               end of the loop will compare the F1 score for this
        #               choice of epsilon and set it to be the best epsilon if
        #               it is better than the current choice of epsilon.
        #
        # Note : You can use predictions = pval < epsilon to get a binary vector
        #        of False(0)'s and True(1)'s of the outlier predictions
        #

        predictions = np.less(pval, epsilon)
        tp = np.sum(np.logical_and(predictions, yval == 1))
        fp = np.sum(np.logical_and(predictions, yval == 0))
        fn = np.sum(np.logical_and(np.logical_not(predictions), yval == 1))
        if tp == 0:
            # No true positives: F1 would be 0 anyway, and the formulas
            # below would divide by zero, so skip this epsilon
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = (2 * precision * recall) / (precision + recall)


        # ==========================================================

        if f1 > best_f1:
            best_f1 = f1
            best_eps = epsilon

    return best_eps, best_f1
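  • A tiny usage example on hypothetical data (my own sketch, just to illustrate the interface): the two lowest-density points are the labeled anomalies, so any epsilon separating them from the rest achieves F1 = 1.0.

import numpy as np
import selectThreshold as st

pval = np.array([0.90, 0.80, 0.85, 0.70, 0.75, 0.01, 0.02, 0.60])
yval = np.array([0, 0, 0, 0, 0, 1, 1, 0])

best_eps, best_f1 = st.select_threshold(yval, pval)
print(best_eps, best_f1)  # some epsilon in (0.02, 0.60] with F1 = 1.0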
  • Select the best threshold parameter based on the validation set:
# ===================== Part 3: Find Outliers =====================
# Now you will find a good epsilon threshold using a cross-validation set
# probabilities given the estimated Gaussian distribution
#
pval = mvg.multivariate_gaussian(Xval, mu, sigma2)

epsilon, f1 = st.select_threshold(yval, pval)
print('Best epsilon found using cross-validation: {:0.4e}'.format(epsilon))
print('Best F1 on Cross Validation Set: {:0.6f}'.format(f1))
print('(you should see a value epsilon of about 8.99e-05 and F1 of about 0.875)')

# Find outliers in the training set and plot
outliers = np.where(p < epsilon)[0]
plt.scatter(X[outliers, 0], X[outliers, 1], marker='o', facecolors='none', edgecolors='r')

input('Program paused. Press ENTER to continue')
  • Based on the best threshold, mark the anomalous examples in the training set (figure omitted). Output:

Best epsilon found using cross-validation: 8.9909e-05
Best F1 on Cross Validation Set: 0.875000
(you should see a value epsilon of about 8.99e-05 and F1 of about 0.875)
2.4 High-Dimensional Data

  • Anomaly detection on a larger dataset

Training an anomaly detector is an unsupervised process: we fit a probability distribution to the training set, which generally consists of normal examples (a few anomalies mixed in do little harm). We then feed the validation examples into the fitted distribution and compare the results with the true validation labels to perform model selection, for example choosing the threshold parameter or deciding whether to add new features. (https://blog.csdn.net/sdu_hao/article/details/84338471)

# ===================== Part 4: Multidimensional Outliers =====================
# We will now use the code from the previous part and apply it to a
# harder problem in which more features describe each datapoint and only
# some features indicate whether a point is an outlier.
#

# Loads the second dataset.
data = scio.loadmat('ex8data2.mat')
X = data['X']
Xval = data['Xval']
yval = data['yval'].flatten()

# Apply the same steps to the larger dataset
mu, sigma2 = eg.estimate_gaussian(X)

# Training set
p  = mvg.multivariate_gaussian(X, mu, sigma2)

# Cross Validation set
pval = mvg.multivariate_gaussian(Xval, mu, sigma2)

# Find the best threshold
epsilon, f1 = st.select_threshold(yval, pval)

print('Best epsilon found using cross-validation: {:0.4e}'.format(epsilon))
print('Best F1 on Cross Validation Set: {:0.6f}'.format(f1))
print('# Outliers found: {}'.format(np.sum(np.less(p, epsilon))))
print('(you should see a value epsilon of about 1.38e-18, F1 of about 0.615, and 117 outliers)')

input('ex8 Finished. Press ENTER to exit')
  • Test output:

Best epsilon found using cross-validation: 1.3772e-18
Best F1 on Cross Validation Set: 0.615385
# Outliers found: 117
(you should see a value epsilon of about 1.38e-18, F1 of about 0.615, and 117 outliers)

3. Recommender Systems

  • Import the required packages and initialize:
import matplotlib.pyplot as plt
import numpy as np
import scipy.io as scio
import scipy.optimize as opt

import cofiCostFunction as ccf
import checkCostFunction as cf
import loadMovieList as lm
import normalizeRatings as nr


plt.ion()
np.set_printoptions(formatter={'float': '{: 0.6f}'.format})

3.1 Movie Ratings Dataset

  • In this part of the exercise you will also work with the matrices X and Theta, whose rows are the movie feature vectors and the user parameter vectors respectively:

    $X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(n_m)})^T \end{bmatrix}, \qquad \Theta = \begin{bmatrix} (\theta^{(1)})^T \\ \vdots \\ (\theta^{(n_u)})^T \end{bmatrix}$

  • Load and visualize the data:
# ===================== Part 1: Loading movie ratings dataset =====================
# We will start by loading the movie ratings dataset to understand the
# structure of the data
print('Loading movie ratings dataset.')

# Load data
data = scio.loadmat('ex8_movies.mat')
Y = data['Y']
R = data['R']

# Y is a 1682 x 943 2-d ndarray, containing ratings (1-5) of 1682 movies by 943 users
#
# R is a 1682 x 943 2-d ndarray, where R[i, j] = 1 if and only if user j gave a
# rating to movie i

# From the matrix, we can compute statistics like average rating.
print('Average rating for movie 0 (Toy Story): {:0.6f}/5'.format(np.mean(Y[0, np.where(R[0] == 1)])))

# We can visualize the ratings matrix by plotting it with plt.imshow
plt.figure()
plt.imshow(Y)
plt.colorbar()
plt.xlabel('Users')
plt.ylabel('Movies')

input('Program paused. Press ENTER to continue')
  • Visualization result (figure omitted).

3.2 Collaborative Filtering Learning Algorithm

  • The collaborative filtering cost function (without regularization) is:

    $J = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2$

  • The collaborative filtering gradients are:

    $\frac{\partial J}{\partial x_k^{(i)}} = \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)}, \qquad \frac{\partial J}{\partial \theta_k^{(j)}} = \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) x_k^{(i)}$

  • With regularization added, the cost function and gradients become:

    $J = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2$

    $\frac{\partial J}{\partial x_k^{(i)}} = \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)}, \qquad \frac{\partial J}{\partial \theta_k^{(j)}} = \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)}$

  • Write the cost computation routine cofiCostFunc.py:
import numpy as np


def cofi_cost_function(params, Y, R, num_users, num_movies, num_features, lmd):
    X = params[0:num_movies * num_features].reshape((num_movies, num_features))
    theta = params[num_movies * num_features:].reshape((num_users, num_features))

    # You need to set the following values correctly.
    cost = 0
    X_grad = np.zeros(X.shape)
    theta_grad = np.zeros(theta.shape)

    # ===================== Your Code Here =====================
    # Instructions: Compute the cost function and gradient for collaborative
    #               filtering. Concretely, you should first implement the cost
    #               function (without regularization) and make sure it is
    #               matches our costs. After that, you should implement the
    #               gradient and use the checkCostFunction routine to check
    #               that the gradient is correct. Finally, you should implement
    #               regularization.
    #
    # Notes: X - num_movies x num_features matrix of movie features
    #        theta - num_users x num_features matrix of user features
    #        Y - num_movies x num_users matrix of user ratings of movies
    #        R - num_movies x num_users matrix, where R[i, j] = 1 if the
    #        i-th movie was rated by the j-th user
    #
    # You should set the following variables correctly
    #
    #        X_grad - num_movies x num_features matrix, containing the
    #                 partial derivatives w.r.t. to each element of X
    #        theta_grad - num_users x num_features matrix, containing the
    #                     partial derivatives w.r.t. to each element of theta

    
    hypothesis = (np.dot(X, theta.T) - Y) * R  # only include entries that have a rating

    cost = (1/2)*np.sum(hypothesis**2) + (lmd/2)*np.sum(theta**2) + (lmd/2)*np.sum(X**2)

    X_grad = np.dot(hypothesis, theta) + lmd * X
    theta_grad = np.dot(hypothesis.T, X) + lmd * theta
    # ==========================================================

    # Pack the gradients of all parameters into a single 1-d vector
    grad = np.concatenate((X_grad.flatten(), theta_grad.flatten()))


    return cost, grad
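  • To convince yourself that the vectorized cost matches the summation formula above, you can compare it against a naive double loop on a tiny random problem (my own sketch, not part of the assignment):

import numpy as np
import cofiCostFunction as ccf

nm, nu, nf = 4, 3, 2  # movies, users, features (hypothetical sizes)
X = np.random.randn(nm, nf)
theta = np.random.randn(nu, nf)
Y = np.random.randn(nm, nu)
R = (np.random.rand(nm, nu) > 0.5).astype(float)
lmd = 1.5

params = np.concatenate((X.flatten(), theta.flatten()))
cost_vec, _ = ccf.cofi_cost_function(params, Y, R, nu, nm, nf, lmd)

# Naive double-loop version of the regularized cost
cost_loop = 0.0
for i in range(nm):
    for j in range(nu):
        if R[i, j] == 1:
            cost_loop += 0.5 * (X[i] @ theta[j] - Y[i, j]) ** 2
cost_loop += (lmd / 2) * (np.sum(theta ** 2) + np.sum(X ** 2))

assert np.isclose(cost_vec, cost_loop)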
  • Review the gradient checking routine checkCostFunction.py:
import numpy as np
import computeNumericalGradient as cng
import cofiCostFunction as ccf

def check_cost_function(lmd):

    # Create small problem
    # Movie feature matrix: 4 movies, 3 features each
    x_t = np.random.rand(4, 3)
    # User parameter matrix: 5 users, 3 features each
    theta_t = np.random.rand(5, 3)

    # Zap out most entries
    Y = np.dot(x_t, theta_t.T)  # 4x5
    Y[np.random.rand(Y.shape[0], Y.shape[1]) > 0.5] = 0
    R = np.zeros(Y.shape)
    # R[i, j] = 1 wherever Y has a rating
    R[Y != 0] = 1

    # Run Gradient Checking
    # Random initial movie feature matrix for the check
    x = np.random.randn(x_t.shape[0], x_t.shape[1])
    # Random initial user parameter matrix for the check
    theta = np.random.randn(theta_t.shape[0], theta_t.shape[1])
    num_users = Y.shape[1]  #5
    num_movies = Y.shape[0]  #4
    num_features = theta_t.shape[1] #3

    def cost_func(p):
        return ccf.cofi_cost_function(p, Y, R, num_users, num_movies, num_features, lmd)

    # Numerical gradient of each parameter: the slope of a secant line obtained
    # by perturbing the parameter by a tiny amount in both directions
    numgrad = cng.compute_numerial_gradient(cost_func, np.concatenate((x.flatten(), theta.flatten())))
    # Analytical gradient of each parameter: the slope of the tangent line
    cost, grad = ccf.cofi_cost_function(np.concatenate((x.flatten(), theta.flatten())), Y, R, num_users, num_movies, num_features, lmd)

    print(np.c_[numgrad, grad])
    print('The above two columns you get should be very similar.\n'
          '(Left-Your Numerical Gradient, Right-Analytical Gradient)')
    # If this value is very small, the two gradients agree and the
    # implementation is correct
    diff = np.linalg.norm(numgrad - grad) / np.linalg.norm(numgrad + grad)
    print('If your gradient implementation is correct, then\n'
          'the relative difference will be small (less than 1e-9).\n'
          'Relative Difference: {:0.3e}'.format(diff))
  • Review the numerical gradient routine computeNumericalGradient.py:
import numpy as np


def compute_numerial_gradient(cost_func, theta):
    numgrad = np.zeros(theta.size)
    perturb = np.zeros(theta.size)

    # Gradient checking compares the secant slope (numerical gradient) with the
    # tangent slope (analytical gradient); if they are close, the implementation
    # is correct. Every parameter is checked in turn.
    e = 1e-4
    for p in range(theta.size):
        perturb[p] = e
        loss1, grad1 = cost_func(theta - perturb)
        loss2, grad2 = cost_func(theta + perturb)

        numgrad[p] = (loss2 - loss1) / (2 * e)
        perturb[p] = 0

    return numgrad
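  • For intuition, here is a toy check (my own addition): for f(theta) = sum(theta^2) the analytic gradient is 2*theta, and the numerical gradient should match it. Note that compute_numerial_gradient expects cost_func to return a (cost, gradient) tuple, so the toy function does too.

import numpy as np
import computeNumericalGradient as cng

def quad(theta):
    return np.sum(theta ** 2), 2 * theta  # (cost, analytic gradient)

theta = np.array([1.0, -2.0, 3.0])
numgrad = cng.compute_numerial_gradient(quad, theta)
print(numgrad)  # approximately [ 2. -4.  6.]
assert np.allclose(numgrad, 2 * theta, atol=1e-6)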
  • Collaborative filtering cost function test:
# ===================== Part 2: Collaborative Filtering Cost function =====================
# You will now implement the cost function for collaborative filtering.
# To help you debug your cost function, we have included set of weights
# that we trained on that. Specifically, you should complete the code in
# cofiCostFunc.py to return cost.
#

# Load pre-trained weights (X, theta, num_users, num_movies, num_features)
data = scio.loadmat('ex8_movieParams.mat')
X = data['X']
theta = data['Theta']
num_users = data['num_users']
num_movies = data['num_movies']
num_features = data['num_features']

# Reduce the data set size so that this runs faster
num_users = 4
num_movies = 5
num_features = 3
X = X[0:num_movies, 0:num_features]
theta = theta[0:num_users, 0:num_features]
Y = Y[0:num_movies, 0:num_users]
R = R[0:num_movies, 0:num_users]

# Evaluate cost function
cost, grad = ccf.cofi_cost_function(np.concatenate((X.flatten(), theta.flatten())), Y, R, num_users, num_movies, num_features, 0)

print('Cost at loaded parameters: {:0.2f}\n(this value should be about 22.22)'.format(cost))

input('Program paused. Press ENTER to continue')
  • Test output:

Cost at loaded parameters: 22.22
(this value should be about 22.22)

  • Test the collaborative filtering gradient:
# ===================== Part 3: Collaborative Filtering Gradient =====================
# Once your cost function matches up with ours, you should now implement
# the collaborative filtering gradient function. Specifically, you should
# complete the code in cofiCostFunction.py to return the grad argument.
#
print('Checking gradients (without regularization) ...')

# Check gradients by running check_cost_function()
cf.check_cost_function(0)

input('Program paused. Press ENTER to continue')
  • Test output:

Checking gradients (without regularization) ...
[[-5.712958 -5.712958]
 [-1.267167 -1.267167]
 [-5.695329 -5.695329]
 [-4.422774 -4.422774]
 [-2.914139 -2.914139]
 [-1.902563 -1.902563]
 [-4.605942 -4.605942]
 [-1.229545 -1.229545]
 [ 0.991195  0.991195]
 [-2.185452 -2.185452]
 [-1.164497 -1.164497]
 [-0.880815 -0.880815]
 [ 0.169959  0.169959]
 [ 1.130411  1.130411]
 [-2.910493 -2.910493]
 [ 0.000000  0.000000]
 [ 0.000000  0.000000]
 [ 0.000000  0.000000]
 [ 0.639982  0.639982]
 [ 2.084916  2.084916]
 [ 0.642260  0.642260]
 [ 3.303854  3.303854]
 [ 3.595377  3.595377]
 [-0.068812 -0.068812]
 [-0.853105 -0.853105]
 [-1.407140 -1.407140]
 [ 0.767546  0.767546]]
The above two columns you get should be very similar.
(Left-Your Numerical Gradient, Right-Analytical Gradient)
If your gradient implementation is correct, then
the relative difference will be small (less than 1e-9).
Relative Difference: 9.937e-13

Program paused. Press ENTER to continue

  • Test the regularized collaborative filtering cost function:
# ===================== Part 4: Collaborative Filtering Cost Regularization =====================
# Now, you should implement regularization for the cost function for
# collaborative filtering. You can implement it by adding the cost of
# regularization to the original cost computation.
#

# Evaluate cost function
cost, _ = ccf.cofi_cost_function(np.concatenate((X.flatten(), theta.flatten())), Y, R, num_users, num_movies, num_features, 1.5)

print('Cost at loaded parameters (lambda = 1.5): {:0.2f}\n'
      '(this value should be about 31.34)'.format(cost))

input('Program paused. Press ENTER to continue')
  • Test output:

Cost at loaded parameters (lambda = 1.5): 31.34
(this value should be about 31.34)

Program paused. Press ENTER to continue

  • Regularized collaborative filtering gradient:
# ===================== Part 5: Collaborative Filtering Gradient Regularization =====================
# Once your cost matches up with ours, you should proceed to implement
# regularization for the gradient.
#

print('Checking Gradients (with regularization) ...')

# Check gradients by running check_cost_function
cf.check_cost_function(1.5)

input('Program paused. Press ENTER to continue')
  • Test output:

Checking Gradients (with regularization) ...
[[-7.124019 -7.124019]
 [-1.718425 -1.718425]
 [ 11.358743  11.358743]
 [-3.972648 -3.972648]
 [-5.995942 -5.995942]
 [ 10.221437  10.221437]
 [-1.027387 -1.027387]
 [-1.052520 -1.052520]
 [-1.275051 -1.275051]
 [-0.704545 -0.704545]
 [-1.757694 -1.757694]
 [-27.506948 -27.506948]
 [-0.236431 -0.236431]
 [ 4.353573  4.353573]
 [-2.711077 -2.711077]
 [ 0.231574  0.231574]
 [ 3.341633  3.341633]
 [ 7.980645  7.980645]
 [ 5.110509  5.110509]
 [-0.425383 -0.425383]
 [-5.251634 -5.251634]
 [-2.903915 -2.903915]
 [ 2.326650  2.326650]
 [ 19.823389  19.823389]
 [-1.351773 -1.351773]
 [ 7.605062  7.605062]
 [ 18.628170  18.628170]]
The above two columns you get should be very similar.
(Left-Your Numerical Gradient, Right-Analytical Gradient)
If your gradient implementation is correct, then
the relative difference will be small (less than 1e-9).
Relative Difference: 2.624e-12

Program paused. Press ENTER to continue

3.3 Learning Movie Recommendations

  • Enter ratings for a new user:
# ===================== Part 6: Entering ratings for a new user =====================
# Before we will train the collaborative filtering model, we will first
# add ratings that correspond to a new user that we just observed. This
# part of the code will also allow you to put in your own ratings for the
# movies in our dataset!
#
movie_list = lm.load_movie_list()

# Initialize my ratings
my_ratings = np.zeros(len(movie_list))

# Check the file movie_ids.txt for id of each movie in our dataset
# For example, Toy Story (1995) has ID 0, so to rate it "4", you can set
my_ratings[0] = 4

# Or suppose you did not enjoy Silence of the Lambs (1991), you can set
my_ratings[97] = 2

# We have selected a few movies we liked / did not like and the ratings we
# gave are as follows:
my_ratings[6] = 3
my_ratings[11] = 5
my_ratings[53] = 4
my_ratings[63] = 5
my_ratings[65] = 3
my_ratings[68] = 5
my_ratings[182] = 4
my_ratings[225] = 5
my_ratings[354] = 5

print('New user ratings:\n')
for i in range(my_ratings.size):
    if my_ratings[i] > 0:
        print('Rated {} for {}'.format(my_ratings[i], movie_list[i]))

input('Program paused. Press ENTER to continue')
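  • The helper loadMovieList.py ships with the assignment; a minimal reconstruction of what it does (my own sketch, assuming each line of movie_ids.txt looks like "1 Toy Story (1995)" and the file is Latin-1 encoded) could be:

def load_movie_list():
    movie_list = []
    with open('movie_ids.txt', encoding='ISO-8859-1') as f:
        for line in f:
            # Each line is "<id> <title>"; drop the leading id
            movie_list.append(line.strip().split(' ', 1)[1])
    return movie_list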
  • Test output:
New user ratings:

Rated 4.0 for Toy Story (1995)
Rated 3.0 for Twelve Monkeys (1995)
Rated 5.0 for Usual Suspects, The (1995)
Rated 4.0 for Outbreak (1995)
Rated 5.0 for Shawshank Redemption, The (1994)
Rated 3.0 for While You Were Sleeping (1995)
Rated 5.0 for Forrest Gump (1994)
Rated 2.0 for Silence of the Lambs, The (1991)
Rated 4.0 for Alien (1979)
Rated 5.0 for Die Hard 2 (1990)
Rated 5.0 for Sphere (1998)

Program paused. Press ENTER to continue
  • Learning the movie ratings:
# ===================== Part 7: Learning Movie Ratings =====================
# Now, you will train the collaborative filtering model on a movie rating
# dataset of 1682 movies and 943 users
#
print('Training collaborative filtering ...\n'
      '(this may take 1 ~ 2 minutes)')


# Load data
data = scio.loadmat('ex8_movies.mat')
Y = data['Y']
R = data['R']

# Y is a 1682x943 matrix, containing ratings (1-5) of 1682 movies by
# 943 users
#
# R is a 1682x943 matrix, where R[i,j] = 1 if and only if user j gave a
# rating to movie i

# Add our own ratings to the data matrix
Y = np.c_[my_ratings, Y]
R = np.c_[(my_ratings != 0), R]

# Normalize Ratings
Ynorm, Ymean = nr.normalize_ratings(Y, R)

# Useful values
num_users = Y.shape[1]
num_movies = Y.shape[0]
num_features = 10

# Set initial parameters (theta, X)
X = np.random.randn(num_movies, num_features)
theta = np.random.randn(num_users, num_features)

initial_params = np.concatenate([X.flatten(), theta.flatten()])

lmd = 10


def cost_func(p):
    return ccf.cofi_cost_function(p, Ynorm, R, num_users, num_movies, num_features, lmd)[0]


def grad_func(p):
    return ccf.cofi_cost_function(p, Ynorm, R, num_users, num_movies, num_features, lmd)[1]

theta, *unused = opt.fmin_cg(cost_func, fprime=grad_func, x0=initial_params, maxiter=100, disp=False, full_output=True)

# Unfold the returned theta back into U and W
X = theta[0:num_movies * num_features].reshape((num_movies, num_features))
theta = theta[num_movies * num_features:].reshape((num_users, num_features))

print('Recommender system learning completed')
print(theta)

input('Program paused. Press ENTER to continue')
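  • normalizeRatings.py is also provided with the assignment; a minimal sketch of the mean normalization it performs (my own reconstruction: subtract each movie's mean rating, computed over rated entries only) might look like:

import numpy as np

def normalize_ratings(Y, R):
    num_movies = Y.shape[0]
    Ymean = np.zeros(num_movies)
    Ynorm = np.zeros(Y.shape)
    for i in range(num_movies):
        idx = np.where(R[i] == 1)[0]  # users who rated movie i
        if idx.size > 0:
            Ymean[i] = np.mean(Y[i, idx])
            Ynorm[i, idx] = Y[i, idx] - Ymean[i]
    return Ynorm, Ymean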
  • Output:

Training collaborative filtering ...
(this may take 1 ~ 2 minutes)
Recommender system learning completed
[[ 0.140276  0.055803 -0.003292 ...,  0.064256 -0.231818 -0.141949]
 [ 0.247992 -0.330579 -0.687267 ...,  0.093129  0.247808 -0.041921]
 [ 0.303776  0.054555 -0.104952 ..., -0.066091  0.172326 -0.022844]
 ..., 
 [-0.004868 -0.069983 -0.046656 ..., -0.125449  0.004130 -0.256860]
 [ 0.214056  0.309996  0.092434 ..., -0.241557 -0.041111 -0.556043]
 [-0.229418 -0.727139  0.624981 ..., -0.131409 -0.590482  0.044262]]

Program paused. Press ENTER to continue

  • Recommendations for you:
# ===================== Part 8: Recommendation for you =====================
# After training the model, you can now make recommendations by computing
# the predictions matrix.
#
p = np.dot(X, theta.T)
my_predictions = p[:, 0] + Ymean

indices = np.argsort(my_predictions)[::-1]
print('\nTop recommendations for you:')
for i in range(10):
    j = indices[i]
    print('Predicting rating {:0.1f} for movie {}'.format(my_predictions[j], movie_list[j]))

print('\nOriginal ratings provided:')
for i in range(my_ratings.size):
    if my_ratings[i] > 0:
        print('Rated {} for {}'.format(my_ratings[i], movie_list[i]))

input('ex8_cofi Finished. Press ENTER to exit')
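  • Note that Ymean is added back to the predictions above: because of mean normalization, a user with no ratings would otherwise be predicted 0 for every movie, whereas adding the mean back predicts each movie's average rating instead.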
  • Recommendation results:

Top recommendations for you:
Predicting rating 5.0 for movie Someone Else's America (1995)
Predicting rating 5.0 for movie Santa with Muscles (1996)
Predicting rating 5.0 for movie Prefontaine (1997)
Predicting rating 5.0 for movie Saint of Fort Washington, The (1993)
Predicting rating 5.0 for movie Star Kid (1997)
Predicting rating 5.0 for movie Entertaining Angels: The Dorothy Day Story (1996)
Predicting rating 5.0 for movie Marlene Dietrich: Shadow and Light (1996)
Predicting rating 5.0 for movie They Made Me a Criminal (1939)
Predicting rating 5.0 for movie Great Day in Harlem, A (1994)
Predicting rating 5.0 for movie Aiqing wansui (1994)

Original ratings provided:
Rated 4.0 for Toy Story (1995)
Rated 3.0 for Twelve Monkeys (1995)
Rated 5.0 for Usual Suspects, The (1995)
Rated 4.0 for Outbreak (1995)
Rated 5.0 for Shawshank Redemption, The (1994)
Rated 3.0 for While You Were Sleeping (1995)
Rated 5.0 for Forrest Gump (1994)
Rated 2.0 for Silence of the Lambs, The (1991)
Rated 4.0 for Alien (1979)
Rated 5.0 for Die Hard 2 (1990)
Rated 5.0 for Sphere (1998)

ex8_cofi Finished. Press ENTER to exit

 

 
