The three major types of recommendation engines
E-commerce recommendation, content recommendation, and social recommendation.
Building a content recommendation engine
Things to consider when building a content recommendation engine:
- Scenario: a novel-reading website with little manual curation
- Usage pattern: heavy readers
- Steps to build the engine: (1) pick a word-segmentation tool; (2) design the pipeline modules; (3) coding; (4) real-time ranking
Pipeline modules: data preprocessing ---> index generation ---> load into engine ---> receive request ---> respond to request
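The five stages above can be sketched end to end as a toy in-memory engine. The function names and the dict-based "index" here are illustrative assumptions, and whitespace tokenization stands in for real word segmentation:

```python
def preprocess(text):
    """Data preprocessing: tokenize the raw text (whitespace split as a stand-in)."""
    return text.split()

def build_index(docs):
    """Index generation: map each term to the set of doc ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def serve(index, query):
    """Receive a request and respond: rank docs by number of matching query terms."""
    scores = {}
    for term in preprocess(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

docs = {"1": "wizard castle dragon", "2": "castle romance", "3": "space dragon"}
engine = build_index(docs)              # "load into engine": keep the index in memory
print(serve(engine, "dragon castle")[0])  # doc "1" matches both query terms
```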
Data preparation
Download a few novels in txt format from the web as raw data, and store them at a known path so the program can find them.
Preprocessing
Preprocessing starts with a word-segmentation component; here we use jieba. For usage details, see this blog post: https://blog.csdn.net/qq_14997473/article/details/87868832
# -*- coding: utf-8 -*-
import jieba
import string
import codecs  # handy for writing files without encoding trouble


# Preprocessing
class PreFile():
    def __init__(self):
        # each file gets its own minimum-frequency threshold
        for name, count in [("1", 1), ("2", 3), ("3", 5), ("4", 7)]:
            with open("./artical/%s.txt" % name, encoding="utf-8") as f:
                result = self.tag(f, count)
            self.save(result, name)

    def tag(self, file, count):
        """
        Count term frequencies in the file, keeping only terms that
        appear more than `count` times.
        """
        print("object is ready")
        print("================================")
        result = {}
        result_filter = {}
        for line in file.readlines():
            seg_list = jieba.cut(line.strip())
            for seg in seg_list:
                if seg not in result:
                    result[seg] = 0
                result[seg] += 1
        for k, v in result.items():
            if v <= count:
                continue
            if k in string.punctuation:
                continue  # drop English punctuation
            if k in [" ", ",", "。", "、", ";"]:
                continue  # drop Chinese punctuation
            if k in ["你", "我", "他", "的", "和"]:
                continue  # drop stop words
            result_filter[k] = v
        for k, v in result_filter.items():
            print(k + "\t" + str(v))
        return result_filter

    def save(self, re, fn):
        """
        Append the term frequencies to the index file, keyed by file name.
        """
        with codecs.open("./artical/index_cut.txt", "a", "utf-8") as file:
            for k, v in re.items():
                line = "filename: " + fn + "\t" + k + "\t" + str(v) + "\n"
                file.write(line)


if __name__ == "__main__":
    p = PreFile()
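Once index_cut.txt has been written, the engine can load it back and score a query against each file. The parser below assumes the exact `filename: <id>\t<term>\t<count>` line format produced by save() above; the sample lines and query are illustrative:

```python
def load_index(lines):
    """Parse 'filename: <id>\t<term>\t<count>' lines into {doc_id: {term: count}}."""
    index = {}
    for line in lines:
        name_field, term, count = line.rstrip("\n").split("\t")
        doc_id = name_field.split(": ", 1)[1]
        index.setdefault(doc_id, {})[term] = int(count)
    return index

def score(index, query_terms):
    """Score each document by the summed frequency of matching query terms."""
    scores = {}
    for doc_id, terms in index.items():
        scores[doc_id] = sum(terms.get(t, 0) for t in query_terms)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

sample = [
    "filename: 1\t剑\t12\n",
    "filename: 1\t江湖\t8\n",
    "filename: 2\t星舰\t20\n",
]
index = load_index(sample)
print(score(index, ["剑", "江湖"]))  # file 1 ranks first with score 20
```

For real-time ranking, the loaded index would stay resident in memory and score() would run per request.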
When reading the txt files with open(), the program raised the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 0: invalid continuation byte
This is caused by the encoding of txt files saved on Windows. Open the original txt file, save it again with the encoding set to UTF-8, and replace the original file.