Movie recommendation based on collaborative filtering method

Zhen Zhu
PROPOSAL
Online movies have become increasingly popular over the past ten years, and a tremendous volume of movie resources is uploaded to the Internet every day. Recommender systems have also become a vital part of Internet companies [1]. It is therefore both interesting and important to build an intelligent recommender system that identifies users' real interests. For this course project, we choose the recommendation prediction task on the MovieLens datasets.
RELATED WORK
Such recommendation problems are usually solved by the collaborative filtering (CF) [3] approach, which relies only on information about the behavior of users in the past. There are two primary methods in CF: the neighborhood approach [4] and latent factor modeling [5]. Both of these methods have been proven to be successful for the recommendation problem. A key step in CF is to combine these models. Methods such as regression, gradient boosting decision trees, and neural networks have been used to ensemble CF models.
OUTLINE OF APPROACH
The framework of our approach is shown in Fig. 1.
 
Fig.1 the framework of the approach
1. Use the MovieLens datasets for our project and transform the data into the form required by the data model.
2. Use several different algorithms to calculate the similarity between users or between items.
3. For the user-based method, perform the neighborhood calculation.
4. Use user-based CF and item-based CF to produce recommendations.
5. Design an evaluator to compare the results of the different approaches.
SIMILARITY CALCULATION
Euclidean similarity
Here we calculate the similarity between two points from their Euclidean distance. The Euclidean distance between points p and q is the length of the line segment connecting them ($\overline{pq}$). In Cartesian coordinates, if p = (p1, p2,..., pn) and q = (q1, q2,..., qn) are two points in Euclidean n-space, then the distance d from p to q, or from q to p, is given by the following formula:
$$d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$
The position of a point in a Euclidean n-space is a Euclidean vector. So, p and q are Euclidean vectors, starting from the origin of the space, and their tips indicate two points. The Euclidean norm, or Euclidean length, or magnitude of a vector measures the length of the vector:
$$\|p\| = \sqrt{p_1^2 + p_2^2 + \cdots + p_n^2} = \sqrt{p \cdot p}$$
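The distance formula above can be sketched in a few lines of Python. The conversion from distance to similarity, 1 / (1 + d), is an assumption made for illustration (the text does not state which conversion was used), and the names `euclidean_distance`, `euclidean_similarity`, `alice`, and `bob` are hypothetical.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two equal-length rating vectors."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def euclidean_similarity(p, q):
    """Assumed conversion: identical vectors give 1.0, distant ones approach 0."""
    return 1.0 / (1.0 + euclidean_distance(p, q))

# Two made-up users who rated the same three movies.
alice = [5.0, 3.0, 4.0]
bob = [2.0, 3.0, 0.0]
print(euclidean_distance(alice, bob))    # 5.0
print(euclidean_similarity(alice, bob))
```

With this conversion a smaller distance always yields a larger similarity, which is the only property the neighborhood step needs.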
Log-likelihood similarity
The natural logarithm of the likelihood function, called the log-likelihood, is more convenient to work with for similarity calculation. Because the logarithm is a monotonically increasing function, the logarithm of a function achieves its maximum value at the same points as the function itself, and hence the log-likelihood can be used in place of the likelihood in maximum likelihood estimation and related techniques. Finding the maximum of a function often involves taking the derivative of the function and solving for the parameter being maximized, and this is often easier when the function being maximized is a log-likelihood rather than the original likelihood function.
For example, the gamma distribution has two parameters α and β. The likelihood function is
$$\mathcal{L}(\alpha, \beta \mid x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}.$$
Finding the maximum likelihood estimate of β for a single observed value x looks rather daunting. Its logarithm is much simpler to work with:
$$\log \mathcal{L}(\alpha, \beta \mid x) = \alpha \log \beta - \log \Gamma(\alpha) + (\alpha - 1) \log x - \beta x$$
Gamma function: $\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t} \, dt$
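As a small numerical check of why the logarithm is easier to work with, the sketch below assumes the two-parameter distribution above is the gamma distribution with shape α and rate β (suggested by the Gamma function reference). Setting the derivative of the log-likelihood with respect to β to zero gives α/β − x = 0, so β = α/x; the grid search should land on the same value. The function name and the sample values of `alpha` and `x` are hypothetical.

```python
import math

def gamma_log_likelihood(alpha, beta, x):
    """Log-likelihood of one observation x > 0 under a gamma(shape=alpha, rate=beta) model."""
    return (alpha * math.log(beta) - math.lgamma(alpha)
            + (alpha - 1.0) * math.log(x) - beta * x)

alpha, x = 2.0, 4.0
# Scan candidate beta values and keep the one with the largest log-likelihood.
candidates = [i / 1000.0 for i in range(1, 3001)]
best_beta = max(candidates, key=lambda b: gamma_log_likelihood(alpha, b, x))
print(best_beta, alpha / x)
```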
Pearson Correlation Similarity
Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. When applied to a population, it is commonly represented by the Greek letter ρ and may be referred to as the population correlation coefficient or the population Pearson correlation coefficient. The formula is as follows:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$
where $\operatorname{cov}$ is the covariance and $\sigma_X$ is the standard deviation of $X$.
The formula for ρ can be expressed in terms of mean and expectation. Since
$$\operatorname{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)],$$
where $\sigma_X$ and $\sigma_Y$ are defined as above, $\mu_X$ is the mean of $X$, and $E$ is the expectation,
the formula for ρ can also be written as
$$\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
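The expectation form of ρ translates directly into code when the expectation is replaced by an average over co-rated items. This is a minimal sketch: the function name `pearson` is hypothetical, and returning 0.0 when one vector has zero variance is an assumed convention (the coefficient is undefined in that case).

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length rating vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of covariance and standard deviations cancel out.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumed convention: correlation is undefined here
    return cov / math.sqrt(var_x * var_y)

print(pearson([1, 2, 3], [2, 4, 6]))   # perfectly correlated ratings
print(pearson([1, 2, 3], [3, 2, 1]))   # perfectly anti-correlated ratings
```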
Cosine similarity
Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1].
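The cosine of two vectors can be derived by using the Euclidean dot product formula:
$$a \cdot b = \|a\| \, \|b\| \cos\theta$$
Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as follows:
$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$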
The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, with 0 indicating de-correlation, and in-between values indicating intermediate similarity or dissimilarity.
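A minimal sketch of the cosine formula follows. The function name `cosine_similarity` is hypothetical, and returning 0.0 for an all-zero vector is an assumed convention, since the angle is undefined in that case.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for a zero vector
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [0, 1]))    # orthogonal vectors: 0.0
print(cosine_similarity([1, 2], [2, 4]))    # same orientation, different magnitude
print(cosine_similarity([1, 2], [-1, -2]))  # opposite orientation
```

The second call illustrates the point about orientation: doubling every rating leaves the similarity unchanged.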
For distance calculation, the most common conversions are the following:
  or   
NEIGHBORHOOD CALCULATION
The neighborhood calculation is used only for user-based collaborative filtering. It selects the top N most similar users for the final recommendation.
Fixed-size neighborhoods (K-neighborhoods)
K-neighborhoods means the recommendations are derived from a neighborhood of the K most similar users, so we need to choose a suitable K. If K is smaller, the recommendation is based on fewer similar users and excludes some less-similar users from consideration. If K is larger, the recommendation is based on more users and may include some less-similar users. Only a suitable K improves the effectiveness of the recommendation.
 
Fig.2 An illustration of defining a neighborhood of most similar users by picking a fixed number of closest neighbors. Distance illustrates similarity: farther means less similar.
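Selecting a fixed-size neighborhood amounts to taking the K largest similarity values. The sketch below assumes the similarities between the target user and every other user have already been computed and stored in a dictionary; the function name and the user ids are hypothetical.

```python
import heapq

def k_neighborhood(similarity_to_target, k):
    """Return the k user ids with the highest similarity to the target user."""
    return heapq.nlargest(k, similarity_to_target, key=similarity_to_target.get)

# Made-up similarities between a target user and four other users.
sims = {"u2": 0.9, "u3": 0.1, "u4": 0.5, "u5": 0.7}
print(k_neighborhood(sims, 2))  # ['u2', 'u5']
```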
Threshold-based neighborhood
What if we don't want to build a neighborhood of the n most similar users, but rather try to pick the “pretty similar” users and ignore everyone else? We could pick a similarity threshold and take any users that are at least that similar. The threshold should be between -1 and 1, since the similarity metrics return values in this range. For now, we may use any of the similarity metrics above.
 
Fig.3 An illustration of defining a neighborhood of most-similar users with a similarity threshold
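A threshold-based neighborhood keeps every user whose similarity is at least the threshold, so its size varies from user to user. As a sketch under the same assumption of precomputed similarities (function name and user ids are hypothetical):

```python
def threshold_neighborhood(similarity_to_target, threshold):
    """Return user ids with similarity >= threshold, most similar first."""
    chosen = [u for u, s in similarity_to_target.items() if s >= threshold]
    return sorted(chosen, key=lambda u: -similarity_to_target[u])

# Made-up similarities between a target user and four other users.
sims = {"u2": 0.9, "u3": 0.1, "u4": 0.5, "u5": 0.7}
print(threshold_neighborhood(sims, 0.5))   # ['u2', 'u5', 'u4']
print(threshold_neighborhood(sims, 0.95))  # []
```

The second call shows the main risk of this approach: a threshold that is too strict leaves a user with no neighbors and therefore no recommendations.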
COLLABORATIVE FILTERING
Collaborative filtering (CF) approaches assume that those who agreed in the past tend to agree again in the future. For example, a collaborative filtering or recommendation system for movie tastes could predict which movies a user should like, given a partial list of that user's tastes (likes or dislikes). CF methods have two important steps. First, CF collects taste information from many users. Second, it uses the information gleaned from many users to predict a user's interests and recommend items to that user. Researchers have devised a number of collaborative filtering algorithms, which can be divided into two main categories: user-based and item-based algorithms.
User-based CF
User-based CF, also called nearest-neighbor-based collaborative filtering, utilizes the entire user-item data to generate a prediction. These systems use statistical techniques to find a user's nearest neighbors, who have similar preferences. Once the nearest neighbors are found, these systems use algorithms to combine the neighbors' preferences to produce a prediction or top-N recommendation for the target user. The techniques are popular and widely used in practice.
User-based CF algorithms have been popular and successful for several years, but their widespread use has revealed some potential challenges. In practice, users have purchased only a small percentage of all items, perhaps 1% of 2 million items, so a recommender system based on nearest-neighbor algorithms may be unable to make any item recommendations for a particular user, and the accuracy of its recommendations may be poor. Nearest-neighbor algorithms also require computation that grows with the number of users. With millions of users, a typical web-based recommender system running existing algorithms will suffer serious scalability problems.
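The step of combining the neighbors' preferences can be sketched as a similarity-weighted average of the neighbors' ratings. This weighting scheme is one common choice and is an assumption here, not necessarily the one used in the experiments; the function name, the data layout (`ratings` as a nested dictionary), and the sample users are hypothetical.

```python
def recommend_user_based(target, ratings, neighbors, top_n):
    """ratings: {user: {item: rating}}; neighbors: {user: similarity to target}."""
    seen = ratings[target]
    weighted_sum, weight_total = {}, {}
    for user, sim in neighbors.items():
        if sim <= 0:
            continue  # ignore dissimilar users in this sketch
        for item, rating in ratings[user].items():
            if item in seen:
                continue  # only recommend items the target has not rated
            weighted_sum[item] = weighted_sum.get(item, 0.0) + sim * rating
            weight_total[item] = weight_total.get(item, 0.0) + sim
    scores = {i: weighted_sum[i] / weight_total[i] for i in weighted_sum}
    # Highest predicted rating first; ties broken by item id.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

ratings = {
    "alice": {"A": 5},
    "bob": {"A": 4, "B": 5, "C": 2},
    "carol": {"A": 5, "C": 4, "D": 3},
}
neighbors = {"bob": 0.8, "carol": 0.4}
print(recommend_user_based("alice", ratings, neighbors, 2))  # ['B', 'D']
```

The sparsity problem described above is visible here: an item that none of the neighbors has rated never receives a score, so it can never be recommended.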
Item-based CF
Unlike the user-based CF algorithm discussed above, the item-based approach focuses on the items that the target user has rated. It calculates their similarity to the target item and then selects the k most similar items. Using the weights of those items, we find the items most similar to the target user's preferences that the user has not yet used.
One critical step in the item-based collaborative filtering algorithm is to compute the similarity between items. The basic idea in computing the similarity between two items i and j is to build a co-occurrence matrix. We then take the Euclidean dot product of this matrix with the target user's ratings. The items with the highest resulting scores are the most similar ones.
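The co-occurrence approach can be sketched as follows: count, for every pair of items, how many users rated both, then score each unseen item by the dot product of its co-occurrence row with the target user's rating vector. The function names, the nested-dictionary layout, and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

def cooccurrence_matrix(ratings):
    """co[i][j] = number of users who rated both item i and item j."""
    co = defaultdict(lambda: defaultdict(int))
    for items in ratings.values():
        for i, j in permutations(items, 2):
            co[i][j] += 1
    return co

def recommend_item_based(target, ratings, top_n):
    co = cooccurrence_matrix(ratings)
    seen = ratings[target]
    scores = {}
    for item in co:
        if item in seen:
            continue  # skip items the target user has already rated
        # Dot product of the item's co-occurrence row with the user's ratings.
        scores[item] = sum(co[item].get(j, 0) * r for j, r in seen.items())
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

ratings = {
    "alice": {"A": 5, "B": 3},
    "bob": {"A": 4, "B": 2, "C": 5},
    "carol": {"A": 5, "C": 4, "D": 3},
    "dave": {"B": 4, "D": 2},
}
print(recommend_item_based("alice", ratings, 2))  # ['C', 'D']
```

In this example item C co-occurs twice with A and once with B, giving a score of 2·5 + 1·3 = 13, while D scores 1·5 + 1·3 = 8.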
EXPERIMENT
Data sets
These datasets were collected by the GroupLens Research Project at the University of Minnesota and are well suited to recommendation research. The data set consists of 100,000 ratings (1-5) from 943 users on 1682 movies; each user has rated at least 20 movies and has demographic attributes (age, gender, occupation). The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. The data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from the data set.
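A minimal sketch of preparing such data follows. It assumes each rating is stored as one tab-separated line of user id, item id, rating, and timestamp; the sample lines are made up for illustration, and the function names, the seed, and the training fraction are hypothetical parameters.

```python
import random

def parse_ratings(lines):
    """Parse tab-separated 'user, item, rating, timestamp' lines into tuples."""
    ratings = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user_id, item_id, rating, _timestamp = line.split("\t")
        ratings.append((int(user_id), int(item_id), int(rating)))
    return ratings

def split_ratings(ratings, train_fraction, seed=0):
    """Shuffle reproducibly, then cut into training and test parts."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Ten made-up rating lines in the assumed format.
sample = [f"{u}\t{10 + u}\t{1 + u % 5}\t880000000" for u in range(1, 11)]
train, test = split_ratings(parse_ratings(sample), 0.7)
print(len(train), len(test))  # 7 3
```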
Evaluation
We use precision and recall to measure the results:
$$P = \frac{C}{T}$$
$$R = \frac{C}{N}$$
C is the number of items that the target user likes and that the system identified.
N is the total number of items that the user likes in the test data.
T is the number of items identified by the system.
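With C, N, and T defined as above, precision is C/T and recall is C/N. A minimal sketch for a single user, with a hypothetical function name and made-up item ids:

```python
def precision_recall(recommended, liked):
    """recommended: items the system returned (T); liked: items the user likes in the test data (N)."""
    recommended, liked = set(recommended), set(liked)
    c = len(recommended & liked)  # C: recommended items the user actually likes
    precision = c / len(recommended) if recommended else 0.0
    recall = c / len(liked) if liked else 0.0
    return precision, recall

# Five recommendations, four liked items in the test data, two in common.
p, r = precision_recall([1, 2, 3, 4, 5], [2, 5, 9, 11])
print(p, r)  # 0.4 0.5
```

Averaging these per-user values over all test users is one way to obtain a single figure per configuration; whether the reported numbers were averaged this way is not stated in the text.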
Results
We run experiments with both user-based CF and item-based CF. The neighborhood size is set to 2, the neighborhood threshold to 0.5, and the number of recommendations to 5. The training data accounts for 70% of the data and the test data for 30%. The results are shown in Table I and Table II.
TABLE I. EVALUATION ON USER-BASED CF
| Similarity algorithm | Rating type | Neighborhood calculation | Precision | Recall | Difference |
|---|---|---|---|---|---|
| Euclidean similarity | Score(1-10) | K-neighborhoods | 0.15406236275801483 | 0.14404609475032024 | 0.07389905172235824 |
| Log-likelihood similarity | Score(1-10) | K-neighborhoods | 0.11314553990610342 | 0.13617157490396936 | 0.8677294206556467 |
| Cosine similarity | Score(1-10) | K-neighborhoods | 0.15008714596949832 | 0.12059325650874954 | 1.1111111111111112 |
| Pearson Correlation Similarity | Score(1-10) | K-neighborhoods | 0.12151979565772673 | 0.10734101579171995 | 0.5506329113924048 |
| Euclidean similarity | Boolean(0,1) | K-neighborhoods | 0.08128205128205122 | 0.1065941101152369 | 2.8610988410290643 |
| Log-likelihood similarity | Boolean(0,1) | K-neighborhoods | 0.11395646606914228 | 0.13107127614169872 | 2.324895084732162 |
| Cosine similarity | Boolean(0,1) | K-neighborhoods | 0.09692307692307693 | 0.11662398634229627 | 2.6268349661343318 |
| Pearson Correlation Similarity | Boolean(0,1) | K-neighborhoods | 0.0725752508361203 | 0.07543747332479718 | 2.616772097320647 |
| Euclidean similarity | Score(1-10) | Threshold-based neighborhood | 0.03057017543859648 | 0.04244558258642761 | 0.5040645832107125 |
| Log-likelihood similarity | Score(1-10) | Threshold-based neighborhood | 0.0035851472471190786 | 0.0036491677336747777 | 0.8119956200953123 |
| Cosine similarity | Score(1-10) | Threshold-based neighborhood | 0.008903225806451634 | 0.008578745198463503 | 0.8116693771906173 |
| Pearson Correlation Similarity | Score(1-10) | Threshold-based neighborhood | 0.04111349036402571 | 0.03252240717029451 | 0.6220533633729025 |
| Euclidean similarity | Boolean(0,1) | Threshold-based neighborhood | 0.20906735751295338 | 0.26429790866410546 | 6.877570770773614 |
| Log-likelihood similarity | Boolean(0,1) | Threshold-based neighborhood | 0.10653008962868132 | 0.1333546734955189 | 82.5830467048691 |
| Cosine similarity | Boolean(0,1) | Threshold-based neighborhood | 0.224358974358974 | 0.2836961160904817 | 85.7349085590648 |
| Pearson Correlation Similarity | Boolean(0,1) | Threshold-based neighborhood | 0.17935943060498222 | 0.1814127187366625 | 5.8100268562854955 |
TABLE II. EVALUATION ON ITEM-BASED CF
| Similarity algorithm | Rating type | Precision | Recall | Difference |
|---|---|---|---|---|
| Euclidean similarity | Score(1-10) | 0.0017925736235595393 | 0.0017925736235595393 | 0.7945934427298251 |
| Log-likelihood similarity | Score(1-10) | 0.0 | 0.0 | 0.8183123680491949 |
| Cosine similarity | Score(1-10) | 0.0 | 0.0 | 0.8303853550496187 |
| Pearson Correlation Similarity | Score(1-10) | 0.00870678617157492 | 0.009240290226205712 | 0.6706314833178705 |
| Euclidean similarity | Boolean(0,1) | 0.0025608194622279107 | 0.0025608194622279107 | 47.13259472322781 |
| Log-likelihood similarity | Boolean(0,1) | 0.12522407170294492 | 0.1432565087494666 | 91.85028229060038 |
| Cosine similarity | Boolean(0,1) | 0.07221510883482725 | 0.07407170294494238 | 106.26153096575278 |
| Pearson Correlation Similarity | Boolean(0,1) | 0.0015364916773367482 | 0.0015364916773367482 | 0.7373257024463155 |
| Slope one | Score(1-10) | 0.0015364916773367482 | 0.0015364916773367482 | 0.7373257024463155 |
Unexpectedly, the experimental results of user-based CF are mostly better than those of item-based CF. With threshold-based neighborhoods, predictions on Boolean data are better than predictions on score data, and for Boolean data the threshold-based neighborhood is better than the fixed-size one for most similarity measures. Cosine similarity gives the best overall result: with Boolean data and a threshold-based neighborhood, P reaches about 22.4% and R about 28.4%.


CONCLUSIONS
We find that the similarity algorithms and CF approaches are sensitive to the dataset. The item-based method does better in many application scenarios; on this dataset, however, its results are lower than those of the user-based method. Different methods may therefore suit different datasets. Where possible, several approaches should be tested and the decision made based on the experimental results.
REFERENCES
[1] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King. Recommender systems with social regularization. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, pages 287-296, New York, NY, USA, 2011. ACM.
[2] Y. Niu, Y. Wang, G. Sun, A. Yue, B. Dalessandro, C. Perlich, and B. Hamner. The Tencent dataset and KDD-Cup'12. In KDD-Cup Workshop, 2012.
[3] Y. Niu, Y. Wang, G. Sun, A. Yue, B. Dalessandro, C. Perlich, and B. Hamner. The Tencent dataset and KDD-Cup'12. In KDD-Cup Workshop, 2012.
[4] Y. Koren. The BellKor solution to the Netflix Grand Prize. Tech. report, 2009.
[5] M. Piotte and M. Chabbert. The Pragmatic Theory solution to the Netflix Grand Prize. Tech. report, 2009.
[6] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems 20 (NIPS'07), pages 1257-1264, 2008.
[7] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30-37, 2009.
