A collaborative filtering algorithm usually works by searching a large group of people and finding a smaller set with tastes similar to yours.
The term collaborative filtering was first used by David Goldberg at Xerox PARC in 1992 in a paper called “Using collaborative filtering to weave an information tapestry.” He designed a system called Tapestry that allowed people to annotate documents as either interesting or uninteresting and used this information to filter documents for other people.
There are now hundreds of web sites that employ some sort of collaborative filtering algorithm for movies, music, books, dating, shopping, other web sites, podcasts, articles, and even jokes.
1. User-Based Filtering
Collecting Preferences
The first thing you need is a way to represent different people and their preferences. In Python, a very simple way to do this is to use a nested dictionary.
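A minimal sketch of such a nested dictionary follows. The names, items, and 1-to-5 rating scale are made up for illustration and do not come from any real dataset.

```python
# Outer keys are people, inner keys are items, values are ratings.
# All names and scores here are hypothetical.
critics = {
    "Alice": {"Movie A": 4.0, "Movie B": 3.0, "Movie C": 5.0},
    "Bob": {"Movie A": 3.5, "Movie B": 2.5},
    "Toby": {"Movie A": 4.5, "Movie C": 4.0},
}

# Look up a single rating.
print(critics["Toby"]["Movie A"])

# List everything one person has rated.
print(sorted(critics["Bob"]))

# Add or change a rating in place.
critics["Toby"]["Movie B"] = 1.0
```

The structure is easy to search and modify, and a person who has not rated an item simply has no key for it.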
Finding Similar Users
To find similar users, compare each person with every other person and calculate a similarity score. There are a few ways to do this, such as Euclidean distance and Pearson correlation.
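As a sketch of the first measure, the function below turns Euclidean distance over the items two people have both rated into a score between 0 and 1. The function name, the `1 / (1 + distance)` scaling, and the sample ratings are assumptions chosen for illustration; other scalings are possible.

```python
from math import sqrt

def sim_distance(prefs, p1, p2):
    """Score in (0, 1]: higher means the two people rate shared items alike."""
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    if not shared:
        return 0.0  # nothing in common, so no basis for comparison
    dist = sqrt(sum((prefs[p1][i] - prefs[p2][i]) ** 2 for i in shared))
    return 1.0 / (1.0 + dist)  # distance 0 maps to similarity 1

# Hypothetical ratings.
prefs = {
    "Alice": {"Movie A": 4.0, "Movie B": 3.0},
    "Bob": {"Movie A": 4.0, "Movie B": 3.0, "Movie C": 1.0},
    "Carol": {"Movie A": 1.0, "Movie B": 5.0},
}
print(sim_distance(prefs, "Alice", "Bob"))    # identical on shared items
print(sim_distance(prefs, "Alice", "Carol"))  # further apart, lower score
```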
Which Similarity Metric Should You Use
The best one to use will depend on your application, and it is worth trying Pearson, Euclidean distance, or others to see which you think gives better results.
An example of Pearson correlation is given on p. 35:
In the present example, some blogs contain more entries or much longer entries than others, and will thus contain more words overall. The Pearson correlation will correct for this, since it really tries to determine how well two sets of data fit onto a straight line.
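The scale correction can be seen in a small sketch. The `sim_pearson` function below is one straightforward way to write the coefficient, and the word counts are invented: the "long" blog uses the same words as the "short" one, just ten times as often.

```python
from math import sqrt

def sim_pearson(prefs, p1, p2):
    """Pearson correlation over the keys both entries share (-1 to 1)."""
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0.0
    x = [prefs[p1][item] for item in shared]
    y = [prefs[p2][item] for item in shared]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread = sqrt(sum((a - mean_x) ** 2 for a in x) *
                  sum((b - mean_y) ** 2 for b in y))
    return cov / spread if spread else 0.0  # constant data has no correlation

# Hypothetical word counts per blog.
counts = {
    "short": {"python": 1, "data": 2, "web": 3},
    "long": {"python": 10, "data": 20, "web": 30},
    "other": {"python": 3, "data": 2, "web": 1},
}
print(sim_pearson(counts, "short", "long"))   # same pattern, different scale
print(sim_pearson(counts, "short", "other"))  # opposite pattern
```

The first pair lies on a straight line, so the score is at the top of the range even though the raw counts differ by a factor of ten. A plain Euclidean distance would treat those two blogs as far apart.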
Ranking the Critics
Now that you have functions for comparing two people, you can create a function that scores everyone against a given person and finds the closest matches.
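A sketch of such a ranking function is shown here. The name `top_matches`, its parameters, and the sample ratings are illustrative assumptions. The similarity measure is passed in as an argument so that any metric can be swapped in, and a compact Euclidean-based score is used as the default.

```python
from math import sqrt

def sim_distance(prefs, p1, p2):
    """Inverse Euclidean distance over shared items (assumed default metric)."""
    shared = [i for i in prefs[p1] if i in prefs[p2]]
    if not shared:
        return 0.0
    return 1.0 / (1.0 + sqrt(sum((prefs[p1][i] - prefs[p2][i]) ** 2
                                 for i in shared)))

def top_matches(prefs, person, n=5, similarity=sim_distance):
    """Return up to n (score, name) pairs, most similar first."""
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]  # skip self-comparison
    scores.sort(reverse=True)
    return scores[:n]

# Hypothetical ratings.
critics = {
    "Toby": {"Movie A": 4.5, "Movie C": 4.0},
    "Alice": {"Movie A": 4.5, "Movie B": 3.0, "Movie C": 4.0},
    "Bob": {"Movie A": 1.0, "Movie C": 2.0},
}
print(top_matches(critics, "Toby", n=2))
```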
Recommending Items
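You need to score the items by producing a weighted score that ranks the critics: multiply each critic's similarity to you by the rating they gave each item, then divide the sum by the total similarity of the critics who rated that item. See Table 2-2, "Creating recommendations for Toby."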
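The weighted scoring can be sketched as follows. The function name `get_recommendations` and the sample data are assumptions for illustration. To keep the arithmetic easy to follow, the demo plugs in fixed, made-up similarity values through a hypothetical `fixed_similarity` helper instead of a real metric.

```python
def get_recommendations(prefs, person, similarity):
    """Rank items the person has not rated by similarity-weighted average."""
    totals, sim_sums = {}, {}
    for other, ratings in prefs.items():
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        if sim <= 0:
            continue  # ignore critics with no or negative similarity
        for item, rating in ratings.items():
            if item in prefs[person]:
                continue  # only score items the person has not rated
            totals[item] = totals.get(item, 0.0) + sim * rating
            sim_sums[item] = sim_sums.get(item, 0.0) + sim
    # Dividing by the summed similarity keeps widely rated items from
    # winning just because more critics contributed to their total.
    rankings = [(totals[item] / sim_sums[item], item) for item in totals]
    rankings.sort(reverse=True)
    return rankings

# Hypothetical ratings and hand-picked similarities to Toby.
critics = {
    "Toby": {"Movie A": 4.5},
    "Alice": {"Movie A": 4.0, "Movie B": 4.0, "Movie C": 2.0},
    "Bob": {"Movie A": 3.0, "Movie B": 2.0},
}
sims_to_toby = {"Alice": 0.75, "Bob": 0.25}

def fixed_similarity(prefs, person, other):
    return sims_to_toby[other]

print(get_recommendations(critics, "Toby", fixed_similarity))
```

Here Movie B scores (0.75 × 4.0 + 0.25 × 2.0) / (0.75 + 0.25) = 3.5, and Movie C, rated only by Alice, scores 1.5 / 0.75 = 2.0.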
Matching Products
What if you want to see which products are similar to each other? In this case, you can determine similarity by looking at who liked a particular item and seeing the other things they liked.
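One way to do this, sketched below, is to flip the nested dictionary so that items become the outer keys and people the inner keys. The same person-to-person similarity functions can then be applied to items unchanged. The function name `transform_prefs` and the sample data are illustrative assumptions.

```python
def transform_prefs(prefs):
    """Turn {person: {item: rating}} into {item: {person: rating}}."""
    result = {}
    for person, ratings in prefs.items():
        for item, rating in ratings.items():
            result.setdefault(item, {})[person] = rating
    return result

# Hypothetical ratings.
critics = {
    "Toby": {"Movie A": 4.5},
    "Alice": {"Movie A": 4.0, "Movie B": 3.0},
}
items = transform_prefs(critics)
print(items["Movie A"])  # everyone who rated Movie A, with their ratings
```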
2. Item-Based Filtering
Why Item-Based Filtering
The way the recommendation engine has been implemented so far requires the use of all the rankings from every user in order to create a dataset.
1) A very large site like Amazon has millions of customers and products—comparing a user with every other user and then comparing every product each user has rated can be very slow.
2) Also, a site that sells millions of products may have very little overlap between people, which can make it difficult to decide which people are similar.
In cases with very large datasets, item-based collaborative filtering can give better results, and it allows many of the calculations to be performed in advance.
The general technique for item-based filtering is to precompute the most similar items for each item. Then, when you wish to make recommendations to a user, you look at their top-rated items and create a weighted list of the items most similar to those.
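The two stages can be sketched as follows: an offline step that builds the item similarity table, and a cheap online step that uses it. All function names, the inverse-Euclidean metric, and the sample ratings are assumptions for illustration, not a definitive implementation.

```python
from math import sqrt

def transform_prefs(prefs):
    """Flip {person: {item: rating}} into {item: {person: rating}}."""
    result = {}
    for person, ratings in prefs.items():
        for item, rating in ratings.items():
            result.setdefault(item, {})[person] = rating
    return result

def sim_distance(prefs, a, b):
    """Inverse Euclidean distance over shared keys (assumed metric)."""
    shared = [k for k in prefs[a] if k in prefs[b]]
    if not shared:
        return 0.0
    return 1.0 / (1.0 + sqrt(sum((prefs[a][k] - prefs[b][k]) ** 2
                                 for k in shared)))

def calculate_similar_items(prefs, n=10):
    """Offline step: for each item, store its n most similar items."""
    item_prefs = transform_prefs(prefs)
    table = {}
    for item in item_prefs:
        scores = [(sim_distance(item_prefs, item, other), other)
                  for other in item_prefs if other != item]
        scores.sort(reverse=True)
        table[item] = scores[:n]
    return table

def get_recommended_items(prefs, item_sims, user):
    """Online step: weight each unrated item by the user's own ratings."""
    rated = prefs[user]
    totals, sim_sums = {}, {}
    for item, rating in rated.items():
        for sim, other in item_sims[item]:
            if other in rated:
                continue  # skip items the user has already rated
            totals[other] = totals.get(other, 0.0) + sim * rating
            sim_sums[other] = sim_sums.get(other, 0.0) + sim
    rankings = [(totals[i] / sim_sums[i], i) for i in totals if sim_sums[i] > 0]
    rankings.sort(reverse=True)
    return rankings

# Hypothetical ratings.
ratings = {
    "Alice": {"Movie A": 4.0, "Movie B": 4.0, "Movie C": 1.0},
    "Bob": {"Movie A": 5.0, "Movie B": 5.0, "Movie C": 2.0},
    "Toby": {"Movie A": 5.0, "Movie C": 1.0},
}
item_sims = calculate_similar_items(ratings)  # can be rebuilt infrequently
print(get_recommended_items(ratings, item_sims, "Toby"))
```

Only the second function runs at recommendation time, and it touches just the items the user has rated. No comparison against other users is needed at that point.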