Recommender Systems in Practice: A First Simple Movie Recommender

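After a lot of fiddling, I finally got this working. Over the past while I have been learning Python, and my takeaway is that programming languages have a lot in common. The rest is just rules, and with enough practice you will figure out how to use them. For someone with weak programming skills like me, that was a real confidence boost.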

OK, on to the main topic.

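I built this on top of the program in this article: http://blog.csdn.net/killua_hzl/article/details/7708201. Most of the trouble was in setting up the environment. I started with Python 3.3.2 and kept hitting this error: UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 0: illegal multibyte sequence. There is a fix for it, which I have not tried yet; see the solution in this post: http://www.cnblogs.com/bigbigtree/archive/2013/08/02/3232545.html. I took the lazy route and switched to Python 2.7.5, after which the program ran. Note that you need to create a folder named data in the working directory and put u.data and u.item in it. Here is the program: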

'''
Data set download from : http://www.grouplens.org/system/files/ml-100k.zip

MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota. The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th,
1997 through April 22nd, 1998.

This data set consists of:
    * 100,000 ratings (1-5) from 943 users on 1682 movies.
    * Each user has rated at least 20 movies.
    * Simple demographic info for the users 

u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of
              user id | item id | rating | timestamp.
              The time stamps are unix seconds since 1/1/1970 UTC
u.item     -- Information about the items (movies); this is a tab separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.
'''

from math import sqrt

def loadMovieData(path = "./data"):
    """
    Load movie data from u.data and u.item
    @param path: Data set path
    @return: (movieData, userData), two nested dicts of ratings
    """
    #Get movie's title, keyed by movie id
    movies = {}
    with open(path + '/u.item') as itemFile:
        for line in itemFile:
            (movieId, movieTitle) = line.split('|')[0:2]
            movies[movieId] = movieTitle

    #Load Data
    movieData = {}   # movie title -> {user id: rating}
    userData = {}    # user id -> {movie title: rating}
    with open(path + '/u.data') as dataFile:
        for line in dataFile:
            (userId, itemId, rating, timestamp) = line.strip().split('\t')
            movieData.setdefault(movies[itemId], {})
            movieData[movies[itemId]][userId] = float(rating)
            userData.setdefault(userId, {})
            # Index by itemId; movieId is a stale variable left over from the u.item loop
            userData[userId][movies[itemId]] = float(rating)

    return (movieData, userData)

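def euclidean(data, p1, p2):
    "Calculate a similarity score from the squared Euclidean distance"
    # Sum of squared rating differences over the keys that p1 and p2 share
    distance = sum([pow(data[p1][item]-data[p2][item],2)
                      for item in data[p1] if item in data[p2]])

    # Maps the distance into (0, 1]; with no shared keys the distance is 0, so the score is 1.0
    return 1.0 / (1 + distance)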

def pearson(data, p1, p2):
    "Calculate Pearson correlation coefficient"
    # Only keys present in both p1 and p2 take part
    corrItems = [item for item in data[p1] if item in data[p2]]

    n = len(corrItems)
    if n == 0:
        return 0

    sumX = sum([data[p1][item] for item in corrItems])
    sumY = sum([data[p2][item] for item in corrItems])
    sumXY = sum([data[p1][item] * data[p2][item] for item in corrItems])
    sumXsq = sum([pow(data[p1][item], 2) for item in corrItems])
    sumYsq = sum([pow(data[p2][item], 2) for item in corrItems])

    # The denominator is zero when either side has no variance
    denominator = sqrt((sumXsq - pow(sumX, 2) / n) * (sumYsq - pow(sumY, 2) / n))
    if denominator == 0:
        return 0

    return (sumXY - sumX * sumY / n) / denominator

def getSimilarItems(movieData, n = 20, similarity=pearson):
    """
    Create a dictionary of items showing which other items they are most similar to.
    """

    results = {}
    for movie in movieData:
        #Get the n items most similar to movie
        matches = [(similarity(movieData, movie, otherMovie),otherMovie)
                  for otherMovie in movieData if movie != otherMovie]
        matches.sort()
        matches.reverse()
        results[movie] = matches[0:n]

    return results

def getRecommendationsItems(userData, user, similarItems, n = 10):
    """
    Get recommended items for user
    """
    userRatings = userData[user]
    itemScores = {}
    totalSim = {}

    # Loop over items rated by this user
    for (item, rating) in userRatings.items():
        # Loop over items similar to this one
        for (simValue, simItem) in similarItems[item]:
            # Ignore if this user has already rated this item
            if simItem in userRatings:
                continue
            # Weighted sum of rating times similarity
            itemScores.setdefault(simItem, 0)
            itemScores[simItem] += simValue * rating
            # Sum of all the similarities
            totalSim.setdefault(simItem, 0)
            totalSim[simItem] += simValue

    # Divide each total score by total weighting to get an average
    rankings = [(score / totalSim[item], item) for (item, score) in itemScores.items() if totalSim[item] != 0]
    rankings.sort()
    rankings.reverse()

    return rankings[0:n]

if __name__ == "__main__":

    print ('Loading Data...')
    movieData, userData = loadMovieData("./data")
    print ('Get similarItems...')
    similarItems = getSimilarItems(movieData, 50, euclidean)
    print ('Calculate rankings...')
    rankings = getRecommendationsItems(userData, "87", similarItems)

    print (rankings)
The output I finally got was:
>>> 
Loading Data...
Get similarItems...
Calculate rankings...
[(4.0, '\xc1 k\xf6ldum klaka (Cold Fever) (1994)'), (4.0, 'unknown'), (4.0, 'Zeus and Roxanne (1997)'), (4.0, "Young Poisoner's Handbook, The (1995)"), (4.0, 'Young Guns II (1990)'), (4.0, 'Young Guns (1988)'), (4.0, 'Young Frankenstein (1974)'), (4.0, 'You So Crazy (1994)'), (4.0, 'Year of the Horse (1997)'), (4.0, 'Yankee Zulu (1994)')]

 

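A short explanation of the code above (the notes below only describe what each function does; for details, see the book Programming Collective Intelligence (《集体智慧编程》)):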

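loadMovieData: reads the data. userData is a dictionary keyed by userId that maps each user to the ratings they gave, by movie title. movieData is a dictionary keyed by movie title (looked up from the movie id) that maps each movie to the ratings it received, by userId.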
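As a rough illustration of the loading step, the sketch below builds the same two nested dictionaries while naming the file encoding explicitly, which may be one way to stay on Python 3 instead of downgrading. The function name load_ratings and the default of latin-1 are my assumptions, not part of the original program. latin-1 is a guess based on the '\xc1' byte in the output above; if a file really starts with byte 0xff, as in the error message quoted earlier, it may have been saved in some other encoding and would need a different value. The two demo files are made up and only imitate the layout.

```python
import io
import os
import tempfile

def load_ratings(path, encoding="latin-1"):
    """Hypothetical variant of loadMovieData that names the file encoding."""
    titles = {}
    with io.open(os.path.join(path, "u.item"), encoding=encoding) as f:
        for line in f:
            movie_id, title = line.split("|")[0:2]
            titles[movie_id] = title

    movie_data = {}  # movie title -> {user id: rating}
    user_data = {}   # user id -> {movie title: rating}
    with io.open(os.path.join(path, "u.data"), encoding=encoding) as f:
        for line in f:
            user_id, item_id, rating, _timestamp = line.strip().split("\t")
            movie_data.setdefault(titles[item_id], {})[user_id] = float(rating)
            user_data.setdefault(user_id, {})[titles[item_id]] = float(rating)
    return movie_data, user_data

# Tiny made-up files in the same layout, written to a temporary folder
demo_dir = tempfile.mkdtemp()
with io.open(os.path.join(demo_dir, "u.item"), "w", encoding="latin-1") as f:
    f.write(u"1|Toy Story (1995)|01-Jan-1995||url|0|0\n")
    f.write(u"2|\xc1 k\xf6ldum klaka (Cold Fever) (1994)|08-Mar-1996||url|0|0\n")
with io.open(os.path.join(demo_dir, "u.data"), "w", encoding="latin-1") as f:
    f.write(u"87\t1\t5\t879876447\n")
    f.write(u"87\t2\t4\t879876448\n")
    f.write(u"12\t1\t3\t879876449\n")

movie_data, user_data = load_ratings(demo_dir)
print(user_data["87"])
```

Each user ends up with one entry per movie they rated, and each movie with one entry per user who rated it.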

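euclidean: computes a similarity score based on Euclidean distance. The code sums the squared rating differences (without taking the square root) and returns 1 / (1 + that sum), so a smaller distance gives a higher score.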
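To make the euclidean score concrete, here is a small worked example on made-up ratings. It mirrors the formula in the listing, 1 / (1 + sum of squared differences), and also shows a side effect worth knowing: two movies with no raters in common get a distance of 0 and therefore the maximum score of 1.0. The toy dictionary and the name euclidean_score are invented for illustration.

```python
def euclidean_score(data, p1, p2):
    """Same formula as euclidean in the listing: 1 / (1 + sum of squared differences)."""
    shared = [k for k in data[p1] if k in data[p2]]
    distance = sum((data[p1][k] - data[p2][k]) ** 2 for k in shared)
    return 1.0 / (1 + distance)

# Made-up ratings: movie title -> {user id: rating}
toy = {
    "A": {"u1": 5.0, "u2": 3.0},
    "B": {"u1": 4.0, "u2": 1.0, "u3": 2.0},
    "C": {"u9": 1.0},  # shares no raters with A
}

score_ab = euclidean_score(toy, "A", "B")  # (5-4)^2 + (3-1)^2 = 5, so 1/6
score_ac = euclidean_score(toy, "A", "C")  # nothing shared, distance 0, so 1.0
print(score_ab, score_ac)
```

If that behaviour is unwanted, one possible guard is to return 0 when the shared list is empty, as pearson already does.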

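pearson: computes the Pearson correlation coefficient over the keys the two entries share; it returns 0 when nothing is shared or when either side has no variance.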
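The pearson function uses the sums-based form of the correlation: (sum(xy) - sum(x)*sum(y)/n) divided by sqrt((sum(x^2) - sum(x)^2/n) * (sum(y^2) - sum(y)^2/n)), taken over the users who rated both movies. A quick check on made-up numbers, where pearson_from_lists is a hypothetical helper that works on plain lists rather than the nested dictionaries:

```python
from math import sqrt

def pearson_from_lists(xs, ys):
    """Sums-based Pearson correlation for two equal-length rating lists."""
    n = len(xs)
    if n == 0:
        return 0.0
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x_sq = sum(x * x for x in xs)
    sum_y_sq = sum(y * y for y in ys)
    denominator = sqrt((sum_x_sq - sum_x ** 2 / float(n)) *
                       (sum_y_sq - sum_y ** 2 / float(n)))
    if denominator == 0:
        return 0.0  # one side has no variance, so the correlation is undefined
    return (sum_xy - sum_x * sum_y / float(n)) / denominator

# Made-up ratings from three users for two movies
same_trend = pearson_from_lists([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite_trend = pearson_from_lists([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])
flat = pearson_from_lists([3.0, 3.0, 3.0], [1.0, 2.0, 3.0])
print(same_trend, opposite_trend, flat)
```

Unlike the euclidean score, the result ranges from -1 to 1 and does not change if one user rates everything a fixed amount higher than another.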

getSimilarItems: for every item in movieData, finds the n items with the highest similarity to it.
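The "n most similar" step is a plain sort on (similarity, title) tuples. A minimal sketch on made-up scores, using sorted with reverse=True in place of the sort-then-reverse pair in the listing; top_matches and the numbers are illustrative only:

```python
def top_matches(scores, n):
    """Return the n highest (similarity, title) pairs, best first."""
    pairs = [(sim, title) for title, sim in scores.items()]
    return sorted(pairs, reverse=True)[:n]

# Made-up similarities of other movies to one target movie
scores_for_target = {"B": 0.2, "C": 1.0, "D": 0.5, "E": 1.0}

best_two = top_matches(scores_for_target, 2)
print(best_two)
```

Ties on similarity fall back to comparing titles, so equal scores come out in reverse alphabetical order, which is the pattern visible in the output above.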

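getRecommendationsItems: produces the recommendations for a given user, ranking unrated movies by a similarity-weighted average of the user's own ratings.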
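The prediction in getRecommendationsItems works like this: for each candidate movie, sum similarity * rating over the rated movies that list it as similar, then divide by the sum of those similarities. A small worked example on made-up data, with predict_scores as a hypothetical stand-in for the scoring loop:

```python
def predict_scores(user_ratings, similar_items):
    """Similarity-weighted average rating for each movie the user has not rated."""
    weighted = {}
    total_sim = {}
    for item, rating in user_ratings.items():
        for sim, other in similar_items.get(item, []):
            if other in user_ratings:
                continue  # already rated, nothing to recommend
            weighted[other] = weighted.get(other, 0.0) + sim * rating
            total_sim[other] = total_sim.get(other, 0.0) + sim
    return {m: weighted[m] / total_sim[m] for m in weighted if total_sim[m] != 0}

# Made-up inputs: the user rated A and B; C and D are candidates
user_ratings = {"A": 5.0, "B": 3.0}
similar_items = {
    "A": [(0.5, "C"), (0.9, "B")],
    "B": [(1.0, "C"), (0.25, "D")],
}

predicted = predict_scores(user_ratings, similar_items)
# C: (0.5*5 + 1.0*3) / (0.5 + 1.0) = 5.5 / 1.5
# D: (0.25*3) / 0.25 = 3.0
print(predicted)
```

When a candidate is reached through only one rated movie, as D is here, the prediction collapses to that single rating no matter how weak the similarity is.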

 

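The result itself is easy to read: each tuple is (predicted score, movie title), and these are the 10 movies recommended to this user (user "87" in the call above).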

 

Finally, thanks to my friends in the QQ group for their help. If you spot any problems, corrections are welcome.

 

 

 

   
