Movie Recommendations with Elasticsearch and Mahout

weixin_33675507

Original link: https://my.oschina.net/u/930279/blog/1189246

1. Download the sample data

ml-10m.zip (the MovieLens 10M dataset)
Install Mahout
Install Elasticsearch

2. Create the Elasticsearch index

Each film document should have the following format:

{
  "id": "65006",
  "title": "Impulse",
  "year": "2008",
  "genre": ["Mystery","Thriller"],
  "indicators": ["154","272","154","308", "535", "583", "593", "668", "670", "680", "702", "745"],
  "numFields": 12
}

Elasticsearch is operated with commands of the form

curl -X<VERB> 'http://<HOST>/<PATH>?<QUERY_STRING>' -d '<BODY>'

For details, see the Elasticsearch 101 tutorial.

In this example, the index is created as follows:

curl -XPUT 'http://localhost:9200/bigmovie' -d '
{
  "mappings": {
    "film" : {
      "properties" : {
        "numFields" : { "type" :   "integer" }
      }
    }
  }
}'
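
The same request can also be prepared from Python. The following is a minimal sketch, not part of the original walkthrough: the helper name build_create_index_request is hypothetical, only the standard library is used, and the request is built without being sent. The Content-Type header is an addition that the curl command above does not set; Elasticsearch 6.0 and later reject request bodies that lack it.

```python
import json
import urllib.request


def build_create_index_request(host="localhost:9200", index="bigmovie"):
    """Build (but do not send) the PUT request that creates the index."""
    body = {
        "mappings": {
            "film": {
                "properties": {
                    "numFields": {"type": "integer"}
                }
            }
        }
    }
    return urllib.request.Request(
        url="http://%s/%s" % (host, index),
        data=json.dumps(body).encode("utf-8"),
        # Assumption: needed on Elasticsearch 6.0+, harmless on older versions.
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    request = build_create_index_request()
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

Keeping the request construction separate from the network call makes it possible to inspect the URL and body before anything is sent to the cluster.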

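3. Import the movie details

Unpack the downloaded data file and look at movies.dat, whose records have the format:

MovieID::Title::Genres

For example: 65006::Impulse (2008)::Mystery|Thriller

Convert it with Python so that it can be imported into Elasticsearch.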


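# Python 3; assumes movies.dat is UTF-8 encoded.
import re
import json

# Convert each "MovieID::Title (Year)::Genres" line into a bulk "create"
# action line followed by the document itself.
with open('movies.dat', encoding='utf-8') as movies_file:
    for line in movies_file:
        fixed = line.rstrip().split('::')
        if len(fixed) == 3:
            # Drop double quotes, then strip the trailing " (year)" from the title.
            title = re.sub(r' \(.*\)$', '', fixed[1].replace('"', ''))
            genre = fixed[2].split('|')
            # The year is the four characters before the closing parenthesis.
            year = fixed[1][-5:-1]
            print('{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "%s" } }' % fixed[0])
            print('{ "id": "%s", "title" : "%s", "year":"%s" , "genre":%s }' % (fixed[0], title, year, json.dumps(genre)))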

Run $ python index.py > index.json to format the data and generate index.json. Its format is:

{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "1" } }
{ "id": "1", "title" : "Toy Story", "year":"1995" , "genre":["Adventure", "Animation", "Children", "Comedy", "Fantasy"] }
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "2" } }
{ "id": "2", "title" : "Jumanji", "year":"1995" , "genre":["Adventure", "Children", "Fantasy"] }
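
The conversion can also be written as small functions that are easy to test in isolation. This is a sketch under assumptions, not the script used above: the names parse_movie_line and to_bulk_lines are hypothetical, and the whole document is serialized with json.dumps, so the spacing of the output differs slightly from the sample shown here while remaining valid bulk input.

```python
import json
import re


def parse_movie_line(line):
    """Split one 'MovieID::Title (Year)::Genres' line into a document dict."""
    fields = line.rstrip().split("::")
    if len(fields) != 3:
        return None  # skip malformed lines
    movie_id, raw_title, genres = fields
    return {
        "id": movie_id,
        # Drop double quotes and the trailing " (year)" part of the title.
        "title": re.sub(r" \(.*\)$", "", raw_title.replace('"', "")),
        # The year is the four characters before the closing parenthesis.
        "year": raw_title[-5:-1],
        "genre": genres.split("|"),
    }


def to_bulk_lines(doc, index="bigmovie", doc_type="film"):
    """Return the action line and the source line for one bulk 'create'."""
    action = {"create": {"_index": index, "_type": doc_type, "_id": doc["id"]}}
    return [json.dumps(action), json.dumps(doc)]


if __name__ == "__main__":
    doc = parse_movie_line("65006::Impulse (2008)::Mystery|Thriller")
    for bulk_line in to_bulk_lines(doc):
        print(bulk_line)
```

Serializing with json.dumps instead of string formatting means that titles containing quotes or backslashes are escaped correctly rather than producing invalid JSON.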

The data can now be imported into Elasticsearch:

curl -s -XPOST localhost:9200/_bulk --data-binary @index.json

After that, the documents can be queried through a REST client or curl.

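4. Generate the recommendation data with Mahout

The movie details are now in place; the next step is to generate the recommendation data. Look at the data file ratings.dat, whose format is:

UserID::MovieID::Rating::Timestamp

For example:

71567::2294::5::912577968
71567::2338::2::912578016

ratings.dat uses :: as its separator, but Mahout needs a tab (\t) as the separator, so the file has to be reformatted:

sed -i 's/::/\t/g' ratings.dat

This rewrites ratings.dat in place as tab-separated user, item, rating and timestamp columns, for example: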

71567    2294    5    912577968
71567    2338    2    912578016
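
Where the sed invocation behaves differently (some non-GNU sed versions treat -i and \t in another way), the same substitution can be done in Python. This is a minimal sketch; the function names colons_to_tabs and convert_file and the output file name are hypothetical, and unlike sed -i it writes to a new file instead of editing in place.

```python
def colons_to_tabs(line):
    """Replace every '::' separator in one line with a tab."""
    return line.replace("::", "\t")


def convert_file(src_path, dst_path):
    """Stream src_path to dst_path, converting separators line by line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(colons_to_tabs(line))


if __name__ == "__main__":
    # Prints the converted line: the four fields separated by tabs.
    print(colons_to_tabs("71567::2294::5::912577968"))
    # Hypothetical file names: convert_file("ratings.dat", "ratings.tsv")
```

Processing the file line by line keeps memory use constant, which matters because the ratings file of the 10M dataset is large.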

Mahout can now compute the item similarities from this data.

 mahout itemsimilarity \
  --input /user/user01/mlinput/ratings.dat \
  --output /user/user01/mloutput \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --booleanData TRUE \
  --tempDir /user/user01/temp

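This step is pure computation, so you can set MAHOUT_LOCAL=true and run it without Hadoop. The similarity measure used here is SIMILARITY_LOGLIKELIHOOD (log-likelihood ratio, LLR); other measures can be used as well. The generated files are in the /user/user01/mloutput directory, with one item pair per line in the form item1id item2id similarity, for example: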

64957   64997   0.9604835425701245
64957   65126   0.919355104432831
64957   65133   0.9580439772229588
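
To give a feel for what the log-likelihood ratio measures, the sketch below computes it for a 2x2 co-occurrence table of two items. It illustrates the standard LLR (G^2) statistic and is not Mahout's code; the function names are hypothetical, and the final mapping 1 - 1/(1 + LLR) into the range [0, 1) is an assumption about how the scores in the output above are normalized.

```python
import math


def x_log_x(x):
    """Return x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR (G^2 statistic) of a 2x2 co-occurrence table.

    k11: users who rated both items
    k12: users who rated item A but not item B
    k21: users who rated item B but not item A
    k22: users who rated neither item
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))


def llr_similarity(k11, k12, k21, k22):
    """Map the unbounded LLR into [0, 1). Assumed normalization, see above."""
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))


if __name__ == "__main__":
    # Independent items score (close to) zero; perfectly linked items score high.
    print(log_likelihood_ratio(10, 10, 10, 10))
    print(log_likelihood_ratio(1, 0, 0, 1))
```

Because --booleanData is set, only the fact that a user rated an item enters these counts; the rating value itself is ignored.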

5. Update the movie documents in Elasticsearch

The results generated above now have to be added to the documents in Elasticsearch, for example:

{
  "id": "65006",
  "title": "Impulse",
  "year": "2008",
  "genre": ["Mystery","Thriller"],
  "indicators": ["1076", "1936", "2057", "2204"],
  "numFields": 4
}

Read the file with Python and convert it to JSON.

# Python 3
import csv
import json

def emit(doc_id, indicators):
    # Print one bulk "update" action line followed by its partial document.
    print(json.dumps({"update": {"_id": doc_id}}))
    print(json.dumps({"doc": {"indicators": indicators, "numFields": len(indicators)}}))

### read the output from MAHOUT and collect into hash ###
# Rows for the same item1id are expected to be adjacent in the file.
with open('/user/user01/mloutput/part-r-00000', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
    old_id = ""
    indicators = []
    for row in csv_reader:
        id = row[0]
        if id != old_id and old_id != "":
            # A new item1id starts: write out the finished group.
            emit(old_id, indicators)
            indicators = [row[1]]
        else:
            indicators.append(row[1])
        old_id = id
    # Write out the last group, which the loop itself never emits.
    if old_id != "":
        emit(old_id, indicators)

$ python update.py > update.json

The resulting update.json has the format:

{"update": {"_id": "1"}}
{"doc": {"indicators": ["75", "118", "494", "512", "609", "626", "631", "634", "648", "711", "761", "810", "837", "881", "910", "1022", "1030", "1064", "1301", "1373", "1390", "1588", "1806", "2053", "2083", "2090", "2096", "2102", "2286", "2375", "2378", "2641", "2857", "2947", "3147", "3429", "3438", "3440", "3471", "3483", "3712", "3799", "3836", "4016", "4149", "4544", "4545", "4720", "4732", "4901", "5004", "5159", "5309", "5313", "5323", "5419", "5574", "5803", "5841", "5902", "5940", "6156", "6208", "6250", "6383", "6618", "6713", "6889", "6890", "6909", "6944", "7046", "7099", "7281", "7367", "7374", "7439", "7451", "7980", "8387", "8666", "8780", "8819", "8875", "8974", "9009", "25947", "27721", "31660", "32300", "33646", "40339", "42725", "45517", "46322", "46559", "46972", "47384", "48150", "49272", "55668", "63808"], "numFields": 102}}
{"update": {"_id": "2"}}
{"doc": {"indicators": ["15", "62", "153", "163", "181", "231", "239", "280", "333", "355", "374", "436", "473", "485", "489", "502", "505", "544", "546", "742", "829", "1021", "1474", "1562", "1588", "1590", "1713", "1920", "1967", "2002", "2012", "2045", "2115", "2116", "2139", "2143", "2162", "2296", "2338", "2399", "2408", "2447", "2616", "2793", "2798", "2822", "3157", "3243", "3327", "3438", "3440", "3477", "3591", "3614", "3668", "3802", "3869", "3968", "3972", "4090", "4103", "4247", "4370", "4467", "4677", "4686", "4846", "4967", "4980", "5283", "5313", "5810", "5843", "5970", "6095", "6383", "6385", "6550", "6764", "6863", "6881", "6888", "6952", "7317", "8424", "8536", "8633", "8641", "26870", "27772", "31658", "32954", "33004", "34334", "34437", "39419", "40278", "42011", "45210", "45447", "45720", "48142", "50347", "53464", "55553", "57528"], "numFields": 106}}
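
The grouping step can also be expressed with itertools.groupby, which emits every group including the last one. This is an alternative sketch rather than the script used to produce the file above; the function name bulk_update_lines is hypothetical, and it assumes that rows with the same item1id are adjacent in the Mahout output.

```python
import csv
import io
import json
from itertools import groupby


def bulk_update_lines(rows):
    """Yield bulk 'update' action/doc line pairs from (item1, item2, similarity) rows.

    Assumes rows for the same item1 id are adjacent, because groupby only
    merges consecutive rows with an equal key.
    """
    for item_id, group in groupby(rows, key=lambda row: row[0]):
        indicators = [row[1] for row in group]
        yield json.dumps({"update": {"_id": item_id}})
        yield json.dumps({"doc": {"indicators": indicators,
                                  "numFields": len(indicators)}})


if __name__ == "__main__":
    # Three rows in the Mahout output format shown earlier.
    sample = ("64957\t64997\t0.9604835425701245\n"
              "64957\t65126\t0.919355104432831\n"
              "64957\t65133\t0.9580439772229588\n")
    reader = csv.reader(io.StringIO(sample), delimiter="\t")
    for line in bulk_update_lines(reader):
        print(line)
```

Since the function accepts any iterable of rows, it can be fed from csv.reader over the real part-r-00000 file or from an in-memory list in a test.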

Import it into Elasticsearch:

curl -s -XPOST localhost:9200/bigmovie/film/_bulk --data-binary @update.json; echo

6. Query example

curl 'http://master41:9200/bigmovie/film/_search?pretty' -d '
{
  "query": {
    "function_score": {
      "query": {
         "bool": {
           "must": [ { "match": { "indicators":"1237 551"} } ],
           "must_not": [ { "ids": { "values": ["1237", "551"] } } ]
         }
      },
      "functions":[ {"random_score": {"seed":"48" } } ],
      "score_mode":"sum"
    }
  },
  "fields":["_id","title","genre"],
  "size":"8"
}'
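
The query above matches films whose indicators field contains any of the given movie ids, excludes those movies themselves through must_not, and adds a random_score function to the result score. A small helper can build the same body for an arbitrary list of liked movies. This is a sketch only: the function name build_recommendation_query and its parameters are hypothetical, and size and seed are kept as strings to mirror the request shown above.

```python
import json


def build_recommendation_query(liked_ids, size=8, seed=48):
    """Build the search body that recommends films for the given movie ids."""
    ids = [str(movie_id) for movie_id in liked_ids]
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        # Films whose indicators mention any liked movie.
                        "must": [{"match": {"indicators": " ".join(ids)}}],
                        # Do not recommend the liked movies themselves.
                        "must_not": [{"ids": {"values": ids}}],
                    }
                },
                "functions": [{"random_score": {"seed": str(seed)}}],
                "score_mode": "sum",
            }
        },
        "fields": ["_id", "title", "genre"],
        "size": str(size),
    }


if __name__ == "__main__":
    print(json.dumps(build_recommendation_query(["1237", "551"]), indent=2))
```

The printed JSON can be passed to curl with -d, in the same way as the hand-written body.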
