Once the data gets large, the importance of the algorithm shows

翩若惊鸿_

Tags: algorithms, python, machine learning

Article link: https://blog.csdn.net/nkhth/article/details/129661763


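Summary (generated automatically by CSDN): The article describes a problem encountered while preprocessing data for a paper. The original code used a linear search when building a list of unique triples, which made it extremely slow: after a day and a half it had reached only 35%. By using NumPy for matrix operations and a set for deduplication, the author lowered the algorithmic complexity and cut the processing time from hours or even days down to a few seconds. This underscores how much algorithm optimization and good use of Python libraries matter when working with large data.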

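I was running the code from a paper, and something felt off as early as the data preprocessing step: after a day and a half it was only at 35%.
Utterly ridiculous.
This is just data preprocessing. How can it be this slow?
After reading the code carefully, I had a rough idea of what was going on: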

def construct_data(self, n_tim_rel, tim_dis_dict):
    train_kg_ptp, train_kg_upt, train_kg = [], [], []
    train_kg_dict, train_kg = collections.defaultdict(list), collections.defaultdict(list)
    n_locs = len(self.vid_list)
    # get utp-triple
    head_upt = [(triple[0] + n_locs) for triple in self.train_utp]
    rel_upt = [triple[1] for triple in self.train_utp]
    tail_upt = [triple[2] for triple in self.train_utp]
    # get ptp-triple
    head_ptp = [triple[0] for triple in self.train_ptp]
    rel_ptp = [int(tim_dis_dict[tuple(triple[1])] + n_tim_rel) for triple in self.train_ptp]
    tail_ptp = [triple[2] for triple in self.train_ptp]
    print("---------start utp------------")
    for i in tqdm(range(len(head_upt))):
        if [head_upt[i], rel_upt[i], tail_upt[i]] not in train_kg['utp']:
            train_kg_dict[head_upt[i]].append((tail_upt[i], rel_upt[i]))
            train_kg['utp'].append([head_upt[i], rel_upt[i], tail_upt[i]])
    print("---------start ptp------------")
    for j in tqdm(range(len(head_ptp))):
        if [head_ptp[j], rel_ptp[j], tail_ptp[j]] not in train_kg['ptp']:
            train_kg_dict[head_ptp[j]].append((tail_ptp[j], rel_ptp[j]))
            train_kg['ptp'].append([head_ptp[j], rel_ptp[j], tail_ptp[j]])
    print('load KG data.')
    return train_kg_dict, train_kg

The problem lies in these lines:

print("---------start utp------------")
for i in tqdm(range(len(head_upt))):
    if [head_upt[i], rel_upt[i], tail_upt[i]] not in train_kg['utp']:
        train_kg_dict[head_upt[i]].append((tail_upt[i], rel_upt[i]))
        train_kg['utp'].append([head_upt[i], rel_upt[i], tail_upt[i]])
print("---------start ptp------------")
for j in tqdm(range(len(head_ptp))):
    if [head_ptp[j], rel_ptp[j], tail_ptp[j]] not in train_kg['ptp']:
        train_kg_dict[head_ptp[j]].append((tail_ptp[j], rel_ptp[j]))
        train_kg['ptp'].append([head_ptp[j], rel_ptp[j], tail_ptp[j]])

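As you can see, all the paper's author really wants is a list of triples with no duplicates. But the method used is to search the entire list before every insertion. A `not in` test on a Python list is a linear scan, so each check costs O(n), and over n triples the overall complexity becomes O(n^2).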
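To make the cost concrete, here is a minimal sketch that compares the list-scan pattern with a version that tracks already-seen triples in a `set`. It is not taken from the paper's code: the function names `dedup_list_scan` and `dedup_with_set` and the synthetic data are made up for illustration. Both functions keep the first occurrence of each triple in its original order, so they should return the same list.

```python
import random
import time


def dedup_list_scan(triples):
    """Pattern from the original loop: `not in` on a list is a linear scan."""
    out = []
    for t in triples:
        if t not in out:  # cost grows with len(out)
            out.append(t)
    return out


def dedup_with_set(triples):
    """Same result, but membership is checked against a set (average O(1))."""
    seen = set()
    out = []
    for t in triples:
        key = tuple(t)  # lists are unhashable, so use a tuple as the set key
        if key not in seen:
            seen.add(key)
            out.append(t)
    return out


if __name__ == "__main__":
    random.seed(0)
    # Synthetic triples for illustration only; not the paper's data.
    data = [
        [random.randrange(200), random.randrange(5), random.randrange(200)]
        for _ in range(2000)
    ]
    for fn in (dedup_list_scan, dedup_with_set):
        start = time.perf_counter()
        result = fn(data)
        elapsed = time.perf_counter() - start
        print(f"{fn.__name__}: {len(result)} unique, {elapsed:.4f}s")
```

The gap between the two timings widens quickly as the input grows, since the first function does work proportional to the square of the number of unique triples while the second stays roughly linear.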

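The original approach is far too naive. After thinking it over, and with reference to this blog post, I changed it to the following: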

# Requires `import numpy as np` at module level.
print("---------start utp------------")
upt_mat = np.array((head_upt, rel_upt, tail_upt))  # 3 * n
upt_mat = upt_mat.T  # n * 3
temp = list(set([tuple(t) for t in upt_mat]))  # deduplicate via a set of tuples
temp = [list(v) for v in temp]  # tuple -> list
train_kg['utp'] = temp
for i in range(len(temp)):
    train_kg_dict[temp[i][0]].append((temp[i][2], temp[i][1]))
print("---------start ptp------------")
ptp_mat = np.array((head_ptp, rel_ptp, tail_ptp))  # 3 * n
ptp_mat = ptp_mat.T  # n * 3
temp = list(set([tuple(t) for t in ptp_mat]))  # deduplicate via a set of tuples
temp = [list(v) for v in temp]  # tuple -> list
train_kg['ptp'] = temp
for i in range(len(temp)):
    train_kg_dict[temp[i][0]].append((temp[i][2], temp[i][1]))
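
Two things are worth keeping in mind with the set-based rewrite: a `set` does not preserve insertion order, so the triples come out in a different order than in the original loop, and the elements are NumPy integers rather than plain Python `int`. If either matters downstream, one possible alternative is `np.unique(..., axis=0)`, which returns the unique rows in sorted order, followed by `.tolist()` to get Python ints. The following is only a sketch under those assumptions, not the code I actually ran; the helper names `dedup_triples` and `build_kg_dict` and the toy input are hypothetical.

```python
import collections

import numpy as np


def dedup_triples(heads, rels, tails):
    """Return unique [head, rel, tail] rows as lists of Python ints, sorted."""
    mat = np.array((heads, rels, tails)).T  # shape (n, 3)
    unique_rows = np.unique(mat, axis=0)    # unique rows in sorted order
    return unique_rows.tolist()


def build_kg_dict(triples):
    """Map head -> list of (tail, rel), mirroring train_kg_dict above."""
    kg_dict = collections.defaultdict(list)
    for head, rel, tail in triples:
        kg_dict[head].append((tail, rel))
    return kg_dict


# Toy input for illustration only.
heads = [3, 1, 3, 1]
rels = [0, 2, 0, 2]
tails = [7, 5, 7, 6]

triples = dedup_triples(heads, rels, tails)
kg_dict = build_kg_dict(triples)
print(triples)
print(dict(kg_dict))
```

Because the output is sorted, repeated runs produce the same order, which can make results easier to reproduce than iterating over a `set`.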