Sorting lists in Python: what is the fastest way to get a sorted, unique list in Python?


What is the fastest way to get a sorted, unique list in Python? (I have a list of hashable things, and want something I can iterate over - it doesn't matter whether the list is modified in place, or I get a new list, or just an iterable. In my concrete use case, I'm doing this with a throwaway list, so in place would be more memory efficient.)

I've seen solutions like

input = [5, 4, 2, 8, 4, 2, 1]

sorted(set(input))

but it seems to me that first checking for uniqueness and then sorting is wasteful (since when you sort the list, you basically have to determine insertion points, and so you get the uniqueness test as a side effect). Maybe there is something more along the lines of Unix's

cat list | sort | uniq

that just picks out consecutive duplicates in an already sorted list?
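A rough Python equivalent of that pipeline could look like the sketch below: sort first, then keep only elements that differ from their predecessor, just as uniq drops consecutive duplicates (dedupe_sorted is a name made up here purely for illustration):

def dedupe_sorted(items):
    # Keep only elements that differ from the previous one in the sorted
    # sequence, mirroring what `sort | uniq` does with consecutive duplicates.
    result = []
    previous = object()  # sentinel that compares unequal to everything
    for item in sorted(items):
        if item != previous:
            result.append(item)
            previous = item
    return result

As the answer below argues, though, a Python-level loop like this is unlikely to beat sorted(set(input)).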

Note that in the question 'Fastest way to uniqify a list in Python' the list is not sorted, and 'What is the cleanest way to do a sort plus uniq on a Python list?' asks for the cleanest / most Pythonic way; its accepted answer suggests sorted(set(input)), which is what I'm trying to improve on.

Solution

I believe sorted(set(sequence)) is the fastest way of doing it.

Yes, set iterates over the sequence, but that's a C-level loop, which is a lot faster than any looping you would do at the Python level.

Note that even with groupby you still have O(n) + O(n log n) = O(n log n), and what's worse is that groupby requires a Python-level loop, which dramatically increases the constants in that O(n); in the end you obtain worse results.

When speaking of CPython, the way to optimize things is to do as much as you can at the C level (see this answer for another example of counter-intuitive performance). To get a faster solution you would have to reimplement the sort in a C extension, and even then, good luck obtaining something as fast as Python's Timsort!
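As an aside (not part of the original answer): if the data happens to be numeric, one ready-made C-level implementation of "sort + uniq" is numpy's np.unique, which returns the sorted unique values; whether it actually beats sorted(set(...)) depends on the data and on the cost of converting to an array.

import numpy as np

data = [5, 4, 2, 8, 4, 2, 1]
# np.unique sorts and de-duplicates in compiled code and returns an ndarray.
unique_sorted = np.unique(data)  # array([1, 2, 4, 5, 8])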

A small comparison of the "canonical solution" versus the groupby solution:

>>> import timeit
>>> sequence = list(range(500)) + list(range(700)) + list(range(1000))
>>> timeit.timeit('sorted(set(sequence))', 'from __main__ import sequence', number=1000)
0.11532402038574219
>>> import itertools
>>> def my_sort(seq):
...     return list(k for k, _ in itertools.groupby(sorted(seq)))
...
>>> timeit.timeit('my_sort(sequence)', 'from __main__ import sequence, my_sort', number=1000)
0.3162040710449219

As you can see, it's about 3 times slower.

The version provided by jdm is actually even worse:

>>> def make_unique(lst):
...     if len(lst) <= 1:
...         return lst
...     last = lst[-1]
...     for i in range(len(lst) - 2, -1, -1):
...         item = lst[i]
...         if item == last:
...             del lst[i]
...         else:
...             last = item
...
>>> def my_sort2(seq):
...     make_unique(sorted(seq))
...
>>> timeit.timeit('my_sort2(sequence)', 'from __main__ import sequence, my_sort2', number=1000)
0.46814608573913574

About 4 times slower than the canonical solution.

Note that using seq.sort() followed by make_unique(seq) is effectively the same as make_unique(sorted(seq)): since Timsort uses O(n) space you always have some reallocation anyway, so using sorted(seq) does not change the timings much.
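For completeness, the in-place variant that note refers to would look roughly like the sketch below (my_sort3 is a name introduced here just for illustration, not something from the original answer):

def my_sort3(seq):
    seq.sort()        # sort in place instead of allocating a new sorted list
    make_unique(seq)  # then drop consecutive duplicates in place, as above

When timing it, each run needs a fresh copy of the input (e.g. my_sort3(list(sequence))), since the sort mutates its argument; the result is expected to land in the same ballpark as my_sort2 rather than meaningfully improve on it.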

jdm's benchmarks give different results because the inputs he is using are way too small, so almost all the measured time is taken up by the time.clock() calls themselves.
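If you want to re-run such a comparison yourself, a sketch of a more robust setup (not part of the original answer) is to use timeit.repeat on a reasonably large input and take the minimum of several runs, instead of timing single calls with time.clock():

import timeit

sequence = list(range(500)) + list(range(700)) + list(range(1000))
# repeat() performs the whole measurement several times; the minimum is the
# least noisy estimate of the per-call cost.
best = min(timeit.repeat('sorted(set(sequence))',
                         'from __main__ import sequence',
                         repeat=5, number=1000))
print(best)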
