Business scenario:
We parse and load hundreds of millions of records into SQL Server, so shaving even one second off each batch can add up to roughly a day saved over the whole load.
The Python code being optimized deduplicates a list of objects. The original version:
import rh_utils  # project logging utilities (rh_logger)

object_id_set = []   # ids seen so far (a list, despite the name)
remove_objects = []  # duplicates to strip out afterwards
for obj in objects:
    try:
        if obj['object_id'] in object_id_set:  # scans the whole list: O(n) per lookup
            remove_objects.append(obj)
        else:
            object_id_set.append(obj['object_id'])
    except Exception as e:
        rh_utils.rh_logger.exception('filter object error: {0}, object: {1}'.format(e, obj))
for remove_object in remove_objects:
    objects.remove(remove_object)
The fix is simply to turn object_id_set into a dict:
object_id_set = {}   # ids seen so far, stored as dict keys
remove_objects = []
for obj in objects:
    try:
        if obj['object_id'] in object_id_set:  # hash lookup: O(1) on average
            remove_objects.append(obj)
        else:
            object_id_set[obj['object_id']] = 1
    except Exception as e:
        remove_objects.append(obj)  # this version also drops objects that fail to parse
        rh_utils.rh_logger.exception('filter object error: {0}, object: {1}'.format(e, obj))
for remove_object in remove_objects:
    objects.remove(remove_object)
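As an aside, the trailing objects.remove(...) loop is itself O(n) per call, so with many duplicates it can become the next bottleneck. A minimal one-pass sketch that builds a new list instead, keeping the same object_id key and the rh_utils logger (a set replaces the dict here, since only key membership matters):

import rh_utils  # assumed project logging module, as above

def dedup_objects(objects):
    # Keep the first occurrence of each object_id; skip unparseable objects.
    seen_ids = set()
    kept = []
    for obj in objects:
        try:
            object_id = obj['object_id']
        except Exception as e:
            rh_utils.rh_logger.exception('filter object error: {0}, object: {1}'.format(e, obj))
            continue
        if object_id not in seen_ids:
            seen_ids.add(object_id)
            kept.append(obj)
    return kept

objects = dedup_objects(objects)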
With about 10,000 objects to filter, the original list version took roughly 2 seconds, which was unacceptable; switching to the dict version cut it to around 0.1 seconds.
The explanation: the in operator on a list scans the entire list, so each lookup is O(n) and the loop as a whole is O(n²), while a dict is backed by a hash table and answers membership queries in O(1) on average.
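A quick microbenchmark makes the gap visible. This is a standalone sketch (the 10,000-element size mirrors the scenario above; the random ids and the probe value 999_999 are made up for illustration), using the standard-library timeit module:

import timeit

setup = """
import random
ids = [random.randrange(1_000_000) for _ in range(10_000)]
as_list = list(set(ids))
as_dict = {i: 1 for i in ids}
"""

# Membership test against a list scans every element: O(n) per lookup.
print(timeit.timeit('999_999 in as_list', setup=setup, number=1_000))
# Membership test against a dict hashes the key: O(1) average per lookup.
print(timeit.timeit('999_999 in as_dict', setup=setup, number=1_000))

On a typical machine the dict lookups come out orders of magnitude faster at this size. A plain set gives the same O(1) membership behavior as the dict, which is why the one-pass variant above uses one.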