The slow old method:
Recommendation algorithms often use the BPR loss. A common way to do the negative sampling it requires is:
from time import time

import numpy as np


def UniformSample_original_python(dataset):
    """
    The original implementation of BPR sampling in LightGCN.
    :return:
        np.array
    """
    total_start = time()
    user_num = dataset.trainDataSize
    users = np.random.randint(0, dataset.n_users, user_num)  # low, high, size
    allPos = dataset.allPos  # positive items of every user, collected earlier
    S = []
    for i, user in enumerate(users):
        posForUser = allPos[user]  # positive items of this user
        if len(posForUser) == 0:
            continue
        posindex = np.random.randint(0, len(posForUser))  # pick a random positive item
        positem = posForUser[posindex]
        while True:
            negitem = np.random.randint(0, dataset.m_items)  # draw a random candidate negative
            if negitem in posForUser:
                continue
            else:
                break
        S.append([user, positem, negitem])
    print(time() - total_start)
    return np.array(S)
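The above is the negative sampling code from the LightGCN source, and it is the method I used before: it samples a positive and a negative item for one user at a time.
That is fine when the dataset is small, but on large datasets sampling routinely takes more than two minutes, which is far too slow!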
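For context, each [user, positem, negitem] triple produced by the sampler is what the BPR loss consumes: the loss pushes the score of the positive item above the score of the negative one. Here is a minimal NumPy sketch, assuming the model has already produced one score per triple. The names bpr_loss, pos_scores and neg_scores are illustrative and are not taken from LightGCN.

```python
import numpy as np


def bpr_loss(pos_scores, neg_scores):
    """Mean BPR loss: -log(sigmoid(pos - neg)), averaged over all triples."""
    diff = np.asarray(pos_scores, dtype=np.float64) - np.asarray(neg_scores, dtype=np.float64)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); logaddexp evaluates it stably
    return float(np.mean(np.logaddexp(0.0, -diff)))


# Hypothetical scores for three sampled triples
pos = np.array([2.0, 0.5, 1.0])
neg = np.array([0.0, 0.5, 3.0])
loss = bpr_loss(pos, neg)
```

The loss shrinks as positive scores rise above negative ones, which is why every training step needs a fresh batch of sampled negatives and why sampler speed matters.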
The fast new method
Following the CIRS code, negative sampling can be accelerated with the @njit decorator from the numba library:
from numba import njit


@njit
def find_negative(user_ids, video_ids, mat_small, mat_big, df_negative, max_item):
    for i in range(len(user_ids)):
        user, item = user_ids[i], video_ids[i]  # take one interaction at a time
        neg = item + 1
        while neg <= max_item:
            if mat_small[user, neg] or mat_big[user, neg]:  # already interacted, try the next ID
                neg += 1
            else:  # found a negative sample, so stop
                df_negative[i, 0] = user
                df_negative[i, 1] = neg
                break
        else:  # neg went past max_item, so search downward from item - 1
            neg = item - 1
            while neg >= 0:
                if mat_small[user, neg] or mat_big[user, neg]:
                    neg -= 1
                else:
                    df_negative[i, 0] = user
                    df_negative[i, 1] = neg
                    break
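To make the expected inputs concrete, here is a small usage sketch. It assumes that mat_small and mat_big are dense boolean user-by-item matrices (True means the user has interacted with the item) and that df_negative is a preallocated integer array filled in place. The toy data are made up, the function is repeated so the snippet runs on its own, and the try/except fallback is my addition so it still runs (without any speedup) when numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # assumption: plain-Python fallback when numba is missing
    def njit(func):
        return func


@njit
def find_negative(user_ids, video_ids, mat_small, mat_big, df_negative, max_item):
    for i in range(len(user_ids)):
        user, item = user_ids[i], video_ids[i]
        neg = item + 1
        while neg <= max_item:
            if mat_small[user, neg] or mat_big[user, neg]:
                neg += 1
            else:
                df_negative[i, 0] = user
                df_negative[i, 1] = neg
                break
        else:  # upward search ran out of items, so search downward
            neg = item - 1
            while neg >= 0:
                if mat_small[user, neg] or mat_big[user, neg]:
                    neg -= 1
                else:
                    df_negative[i, 0] = user
                    df_negative[i, 1] = neg
                    break


# Toy data: 2 users, 5 items (IDs 0..4), so max_item is 4
mat_small = np.zeros((2, 5), dtype=np.bool_)
mat_big = np.zeros((2, 5), dtype=np.bool_)
mat_small[0, 1] = mat_small[0, 2] = True   # user 0 interacted with items 1 and 2
mat_big[0, 3] = True                       # ... and with item 3 in the other matrix
mat_small[1, 4] = True                     # user 1 interacted with item 4
mat_big[1, 3] = True                       # ... and with item 3

user_ids = np.array([0, 1], dtype=np.int64)
video_ids = np.array([1, 4], dtype=np.int64)   # the positive item of each row
df_negative = np.zeros((2, 2), dtype=np.int64)

find_negative(user_ids, video_ids, mat_small, mat_big, df_negative, 4)
# row 0 -> (user 0, item 4); row 1 -> (user 1, item 2)
```

Note that, unlike the uniform sampler, this function is deterministic: it returns the nearest non-interacted item ID above the positive item and falls back to searching below it. If a user has interacted with every item, the corresponding row of df_negative is left unchanged.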
Sampling time dropped from 111 s to just 3 s!
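The same decorator can, in principle, be applied to the uniform rejection sampler from the first section. The following is only a sketch under assumptions and comes from neither LightGCN nor CIRS: the per-user positive lists are replaced by a dense boolean matrix (called interacted here, a hypothetical name) because numba's nopython mode works most smoothly on plain arrays, and sample_bpr_triples is likewise an illustrative name. I have not benchmarked this variant.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # assumption: plain-Python fallback when numba is missing
    def njit(func):
        return func


@njit
def sample_bpr_triples(users, interacted, n_items):
    """Return rows of [user, positem, negitem] for the given user IDs."""
    out = np.empty((len(users), 3), dtype=np.int64)
    n = 0
    for k in range(len(users)):
        user = users[k]
        pos_items = np.nonzero(interacted[user])[0]
        # skip users with no positive item or with no possible negative item
        if len(pos_items) == 0 or len(pos_items) == n_items:
            continue
        positem = pos_items[np.random.randint(0, len(pos_items))]
        negitem = np.random.randint(0, n_items)
        while interacted[user, negitem]:  # rejection sampling, as in the original
            negitem = np.random.randint(0, n_items)
        out[n, 0] = user
        out[n, 1] = positem
        out[n, 2] = negitem
        n += 1
    return out[:n]


# Toy interaction matrix: 4 users, 6 items; user 3 has no positives
interacted = np.zeros((4, 6), dtype=np.bool_)
interacted[0, [1, 2]] = True
interacted[1, [0]] = True
interacted[2, [3, 4, 5]] = True

users = np.array([0, 1, 2, 3, 0], dtype=np.int64)
triples = sample_bpr_triples(users, interacted, interacted.shape[1])
```

A dense matrix costs n_users × n_items booleans of memory, so for very large catalogs a sparse layout would be needed instead.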