在我了解到的所有推荐系统的结构里(都是课本,课堂上的水版)没有购买顺序的考虑,即所有的基础推荐系统都是把购买行为使用bag of words的方法进行过滤和推荐的。
结果呢,基本可以看出有旗鼓相当的一些人,因为之前的顺序,在下次购买时选择了不同的物品。至于为什么没人用这个算法,或者这类算法,请见 下文。
但是,有一个问题就是,真的购买顺序与下次购买无关吗?拿这个例子来说:买耳机,先买了个200块的AKG,又买了个200块的阿斯翠,这是很可能买手会发现国产性价比很不错,转而第三个仍然是用阿斯翠;如果先阿斯翠后AKG,可能会发现动圈的听感要比动铁舒服,可能下一个就是另一个牌子的动圈耳机了。
为了验证这个猜想,我使用有序版的Apriori算法进行测试,数据集是阿里推荐大赛的数据。使用python,很慢,所以只能用数据集的一部分,看一个结果。
算法过程结合了Apriori和skip-k-n-gram,使用Apriori的思想进行协同过滤,而过滤的关键字使用skip-k-n-gram产生,即abc,形成ab,ac,bc(skip-1-2-gram)的组合。
预处理数据:
#/usr/bin/python
import csv
import re
def load(filename):
print 'loading'
reader = csv.reader(file(filename, 'rb'))
pattern = re.compile(r'\d+')
maps = {}
for line in reader:
date = pattern.findall(line[-1])
time = (int)(date[0]) * 100 + (int)(date[1])
uid = (int)(line[0])
op = line[2] + '-' + line[1]
if(uid in maps):
maps[uid].append(op)
else:
maps[uid] = [op]
return maps
返回数据的格式是:{id:[op...]...},其中op是按时间排序的。
使用Apriori的变种进行验证我的想法:
#!/usr/bin/python
import dld
#data = {2:['ab', 'ab', 'ab', 'af', 'ab', 'dd','ab', 'ab', 'ab', 'af', 'ab', 'dd', 'fa', 'af'],\
# 3:['ab', 'ab', 'ab', 'af', 'ab', 'dd','ab', 'ab', 'ab', 'af', 'ab', 'fa', 'dd', 'af']}
data = dld.load('data.csv')
threshold = 10
def patterns(old, uid, i, k):
"""skip k n gram algorithm for user action"""
#for first call
if(not isinstance(old, list)):
old = [old]
#the whole action list
line = data[uid]
result = []#pattern, uid, endpoint in action list
j = 0
while (j < k and (i + j + 1) < len(line)):
result.append([old+[line[i + j + 1]], uid, i + j + 1])
j = j + 1
return result
def passes(pattern, t):
"""reform the result from patterns
and filter the results using threshold t
part of Apriori"""
pattern.sort(key=lambda record : record[0])
print 'sorted %d' %(len(pattern))
passed = []#[pattern, [[uid, endpoint], [uid, endpoint],...]
last = ['', 0, 0]#merge same actions
for item in pattern:
if (item[0] == last[0]):#same
passed[-1][1].append(item[1:])
else:#new
passed.append([item[0], [item[1:]]])
last = item
passed = [i for i in passed if len(i[1]) > t]#filter
return passed
"""" main """
print 'first round'
pattern = [];
for uid in data:
i = 0
line = data[uid]
for item in line:
pattern[len(pattern):] = patterns(item, uid, i ,3)
i = i + 1
#print pattern
passed = passes(pattern, threshold)
print 'passed %d' %(len(passed))
#print passed
j = 0
while (j < 2):
print 'round %d' %(j)
pattern = []
for record in passed:
old = record[0]
for item in record[1]:
pattern[len(pattern):] = patterns(old, item[0], item[1], 3)
#print pattern
passed = passes(pattern, threshold)
print 'passed %d' %(len(passed))
#print passed
j = j + 1
#print passed
print 'statics'
result = [[p[0], len(p[1])] for p in passed]#pattern, count
print 'result %d' %(len(result))
#print result
statics = {}
print 'analyse'
for item in result:
key = sorted(item[0][:-1])
hashkey = '+'.join(key)
if(hashkey in statics):
statics[hashkey].append([item[0], item[1]])
else:
statics[hashkey] = [[item[0], item[1]]]
sop = 0#same action
src = 0#same recommand
for static in statics:
out = statics[static]
before = len(out)
statics[static] = []
last = [[0],0]
maps = {}
for item in out:
if(cmp(item[0][0], item[0][1]) != 0 or cmp(item[0][0], item[0][2]) != 0):#all same action
if(last[0][:-1] == item[0][:-1]):
sop = sop + 1
if(last[0][1] < item[0][1]):
statics[static][-1] = item[0][1]
last = item
else:
if(item[0][-1] not in maps):
statics[static].append(item)
maps[item[0][-1]] = 0
last = item
else:
src = src + 1
#print 'ya filter %d/%d' %(len(statics[static]), before)
for static in statics:
out = statics[static]
if(len(out) > 2):
#print out
pass
print 'final\n positive:%d\n same action:%d\n same reco:%d' %(len(statics), sop, src)
结果呢,基本可以看出有旗鼓相当的一些人,因为之前的顺序,在下次购买时选择了不同的物品。至于为什么没人用这个算法,或者这类算法,请见 下文。