Timsort: the fastest sorting algorithm you've never heard of


cunxiedian8614


Tags: data structures and algorithms, python, java

[Image: Tim Peters]

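Timsort is a sorting algorithm that is efficient for real-world data and was not created in an academic laboratory. Tim Peters created Timsort for the Python programming language in 2001. Timsort first analyses the list it is trying to sort and then chooses an approach based on that analysis.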

Since its invention, Timsort has been used as the default sorting algorithm in Python, in Java (for arrays of objects), on the Android platform, and in GNU Octave.

Timsort's worst-case time complexity is O(n log n). To learn about Big O notation, read this.

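Timsort's worst-case sorting time is the same as merge sort's, which is faster than most other sorts you might know. As you will see shortly, Timsort actually makes use of both insertion sort and merge sort.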

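Peters designed Timsort to exploit the already-ordered stretches of elements that exist in most real-world data sets. It calls these already-ordered stretches "natural runs". It iterates over the data, collecting the elements into runs and merging those runs together as it goes.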

The array has fewer than 64 elements in it

If the array we are trying to sort has fewer than 64 elements in it, Timsort will execute an insertion sort.

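An insertion sort is a simple sort that is most efficient on small lists. It is quite slow on larger lists, but very fast on small ones. The idea of an insertion sort is as follows: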

- Look at the elements one by one
- Build up a sorted list by inserting each element at its correct location

In this case, we insert each newly sorted element into a new sub-array, which starts at the beginning of the array.

[Animation: insertion sort in action]
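The two steps above can be sketched in a few lines of Python. This is a plain textbook insertion sort for illustration, not the routine CPython uses internally, and the name `simple_insertion_sort` is made up for this example.

```python
def simple_insertion_sort(items):
    """Return a sorted copy of items using a textbook insertion sort."""
    result = list(items)  # work on a copy so the caller's list is untouched
    for i in range(1, len(result)):
        value = result[i]
        j = i - 1
        # Shift every larger element one slot to the right...
        while j >= 0 and result[j] > value:
            result[j + 1] = result[j]
            j -= 1
        # ...and drop the current element into the gap that opened up.
        result[j + 1] = value
    return result


print(simple_insertion_sort([5, 2, 4, 6, 1, 3]))
```

Because elements are only shifted past strictly larger neighbours, equal elements keep their original order.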

More about runs

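If the list has 64 or more elements, the algorithm makes a first pass through the list looking for parts that are either non-decreasing or strictly decreasing. If a part is decreasing, Timsort reverses it in place. Requiring the decrease to be strict is what keeps that reversal from reordering equal elements.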

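So if the run is decreasing, it will look like this (the run is shown in bold):

[Image: a decreasing run]

If the run is not decreasing, it will look like this:

[Image: an increasing run]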
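As a rough illustration of run detection, the sketch below finds the natural run that starts at a given index and reverses it if it is descending. The function name `count_run` and its signature are assumptions made for this example. They mirror the idea described above, not CPython's actual C code.

```python
def count_run(arr, start=0):
    """Return the exclusive end index of the natural run starting at `start`.

    A strictly descending run is reversed in place so that every run
    ends up in ascending order.
    """
    end = start + 1
    if end >= len(arr):
        return len(arr)  # zero or one element left: trivially a run
    if arr[end] < arr[start]:
        # Strictly descending run: extend it, then reverse it in place.
        while end < len(arr) and arr[end] < arr[end - 1]:
            end += 1
        arr[start:end] = arr[start:end][::-1]
    else:
        # Non-descending run: equal neighbours are allowed.
        while end < len(arr) and arr[end] >= arr[end - 1]:
            end += 1
    return end


data = [5, 4, 3, 8, 9]
print(count_run(data), data)
```

Here the first run is the descending stretch 5, 4, 3, so the function reverses those three elements and reports that the run ends at index 3.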

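The minrun is a size that is determined by the size of the array. The algorithm selects it so that most runs in a random array are, or become, minrun in length. Merging is most efficient when the number of runs is equal to, or slightly less than, a power of two. Timsort chooses minrun to try to ensure this efficiency, by making sure that the array length divided by minrun is equal to or slightly less than a power of two.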

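The algorithm chooses minrun from the range 32 to 64 inclusive. It chooses minrun so that the length of the original array, divided by minrun, is equal to or slightly less than a power of two.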
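One way to compute such a value, modelled on the approach described in Tim Peters' notes, is to take the six most significant bits of the array length and add one if any of the remaining bits are set. The Python function below is a sketch of that idea, and the name `compute_minrun` is chosen for this example.

```python
def compute_minrun(n):
    """Pick a minrun so that n / minrun is at or just below a power of two."""
    r = 0  # becomes 1 if any bit shifted off the low end was set
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r


for length in (63, 64, 65, 1000, 2112):
    print(length, compute_minrun(length))
```

For arrays shorter than 64 elements the function returns the length itself, which matches the earlier rule that such arrays are handled by a single insertion sort. For 2112 elements it returns 33, and 2112 / 33 is exactly 64 runs.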

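If the length of a run is less than minrun, you calculate how far short of minrun it falls. You then take that many items from immediately after the run and perform an insertion sort to fold them in, creating a new, longer run.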

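So if minrun is 63 and the length of the run is 33, you do 63 - 33 = 30. You then take the 30 elements that follow the end of the run, that is, the 30 items starting at run[33], and perform an insertion sort to create a new run of 63 elements.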

After this part has completed, we should have a bunch of sorted runs in the list.
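The arithmetic above can be sketched as follows. The helper `extend_run` and its parameters are hypothetical names for this illustration, and it uses a plain insertion sort where CPython uses a binary insertion sort.

```python
def extend_run(arr, start, run_end, minrun):
    """Grow the sorted slice arr[start:run_end] to minrun elements.

    The elements that follow the run are inserted one at a time. Returns
    the new exclusive end index of the run.
    """
    target = min(start + minrun, len(arr))  # never read past the array
    for i in range(run_end, target):
        value = arr[i]
        j = i - 1
        # Shift larger elements of the run to the right to make room.
        while j >= start and arr[j] > value:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = value
    return max(run_end, target)


data = [1, 5, 9, 4, 2, 7, 0]
new_end = extend_run(data, start=0, run_end=3, minrun=5)
print(new_end, data)
```

With a run of length 3 and a minrun of 5, the helper borrows 5 - 3 = 2 elements (4 and 2) and leaves the rest of the array untouched.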

Merging

Timsort now performs a merge sort to merge the runs together. However, Timsort makes sure to maintain stability and merge balance while merging.

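To maintain stability we should not exchange two numbers of equal value. This not only keeps their original positions in the list but also makes the algorithm faster. We will discuss merge balance shortly.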

As Timsort finds runs, it adds them to a stack. A simple stack looks like this:

[Image: a simple stack]

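Imagine a stack of plates. You cannot take plates from the bottom, so you have to take them from the top. The same is true of a stack.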

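Timsort tries to balance two competing needs while merging. On the one hand, we would like to delay merging as long as possible in order to exploit patterns that may come up later. On the other hand, we would like even more to merge as soon as possible, to exploit the fact that the run just found is still high in the memory hierarchy. We also cannot delay merging "too long", because remembering the runs that are still unmerged consumes memory, and the stack has a fixed size.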

To strike this compromise, Timsort keeps track of the three most recent items on the stack and enforces two laws that must hold true for those items:

A > B + C
B > C

Here A, B and C are the three most recent items on the stack.

In the words of Tim Peters himself:

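What turned out to be a good compromise maintains two invariants on the stack entries, where A, B and C are the lengths of the three rightmost not-yet merged slices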
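To make the two invariants concrete, here is a sketch that tracks only the lengths of the runs on the stack and merges neighbours until both laws hold. It is modelled on my reading of the merge-collapse step in CPython, including the extra check one level deeper that later versions added. The name `merge_collapse` is borrowed from there, but the list-of-lengths representation is a simplification for this example.

```python
def merge_collapse(run_lengths):
    """Merge adjacent runs (represented by length only) until
    A > B + C and B > C hold for the three topmost stack entries."""
    stack = list(run_lengths)  # last element is the top of the stack
    while len(stack) > 1:
        n = len(stack) - 2
        breaks_sum_rule = (
            (n > 0 and stack[n - 1] <= stack[n] + stack[n + 1])
            or (n > 1 and stack[n - 2] <= stack[n - 1] + stack[n])
        )
        if breaks_sum_rule:
            # Merge B with the smaller of its neighbours A and C.
            if stack[n - 1] < stack[n + 1]:
                n -= 1
            stack[n:n + 2] = [stack[n] + stack[n + 1]]
        elif stack[n] <= stack[n + 1]:
            # B > C is violated: merge the top two runs.
            stack[n:n + 2] = [stack[n] + stack[n + 1]]
        else:
            break  # both invariants hold
    return stack


print(merge_collapse([100, 40, 20]))  # invariants already hold
print(merge_collapse([30, 20, 10]))   # 30 > 20 + 10 fails, so runs are merged
```

A real implementation would merge the underlying slices of the array at the same moment it combines the lengths.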

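Usually, merging adjacent runs of different lengths in place is hard. What makes it even harder is that we have to maintain stability. To get around this, Timsort sets aside temporary memory. It copies the smaller of the two runs (call the runs A and B) into that temporary memory.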
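A minimal sketch of this idea follows, assuming the left run is the smaller one. A full implementation also has a mirror-image routine for the case where the right run is smaller. The function name `merge_lo` and its index parameters are illustrative.

```python
def merge_lo(arr, start, mid, end):
    """Stably merge the sorted runs arr[start:mid] and arr[mid:end] in place,
    using a temporary copy of the left run only."""
    temp = arr[start:mid]  # temporary memory holds the (smaller) left run
    i, j, k = 0, mid, start
    while i < len(temp) and j < end:
        # Take from the right run only when it is strictly smaller,
        # so equal elements keep their original order.
        if arr[j] < temp[i]:
            arr[k] = arr[j]
            j += 1
        else:
            arr[k] = temp[i]
            i += 1
        k += 1
    # Any leftovers of the right run are already in place; copy back
    # whatever remains of the left run.
    arr[k:k + len(temp) - i] = temp[i:]


data = [1, 4, 7, 2, 3, 9]
merge_lo(data, 0, 3, 6)
print(data)
```

Only the left run is copied, so the extra memory needed is bounded by the length of the smaller run, not by the combined length.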

Galloping

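While Timsort is merging A and B, it may notice that one run has been "winning" many times in a row. If it turned out that run A consisted entirely of smaller numbers than run B, then run A would end up back in its original place. Merging the two runs element by element would involve a lot of work to achieve nothing.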

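More often than not, data will have some pre-existing internal structure. Timsort assumes that if a lot of run A's values are lower than run B's values, then A is likely to continue having smaller values than B.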

[Image: two example runs, A and B. Runs have to be increasing or strictly decreasing, which is why these numbers were picked.]

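Timsort will then enter galloping mode. Instead of checking A[0] and B[0] against each other, Timsort performs a binary search for the appropriate position of B[0] in A. This way, Timsort can move a whole section of A into place at once. Then Timsort searches for the appropriate location of A[0] in B, and moves a whole section of B into place at once.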

Let's see this in action. Timsort takes B[0] and uses a binary search to look for its correct location in A.

Well, B[0] belongs at the back of A. Now Timsort looks for the correct location of A[0] in B. So we're looking to see where the number 1 goes. This number goes at the start of B. We now know that B belongs at the end of A and A belongs at the start of B.

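It turns out that this operation is not worth it if the appropriate location for B[0] is very close to the beginning of A (or vice versa), so galloping mode quickly exits if it isn't paying off. Additionally, Timsort takes note and makes it harder to enter galloping mode later, by increasing the number of consecutive A-only or B-only wins required to enter. If galloping mode is paying off, Timsort makes it easier to re-enter.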
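The block-moving idea can be sketched with Python's standard `bisect` module. This is a deliberate simplification: the function `gallop_merge` is made up for this example, it gallops on every step, and it skips both the win-counting threshold and the exponential probing that a real Timsort performs before its binary search.

```python
from bisect import bisect_left, bisect_right


def gallop_merge(a, b):
    """Merge two sorted lists by moving whole blocks located via binary search."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        # Everything in a[i:cut] is <= b[j], so it can be moved as one block.
        # bisect_right puts ties from `a` first, which keeps the merge stable.
        cut = bisect_right(a, b[j], i)
        result.extend(a[i:cut])
        i = cut
        if i == len(a):
            break
        # Everything in b[j:cut] is < a[i], so it goes next as one block.
        cut = bisect_left(b, a[i], j)
        result.extend(b[j:cut])
        j = cut
    # At most one of these two tails is non-empty.
    result.extend(a[i:])
    result.extend(b[j:])
    return result


print(gallop_merge([1, 2, 3, 4, 5], [6, 7, 8]))
print(gallop_merge([1, 5, 9], [2, 3, 10]))
```

In the first call, a single binary search is enough to discover that all of A belongs before B, so the merge finishes without comparing the elements pair by pair.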

In short, Timsort does two things very well:

- Great performance on arrays with pre-existing internal structure
- Being able to maintain a stable sort

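Previously, in order to achieve a stable sort, you had to zip the items in your list up with integers and sort the result as an array of tuples.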
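The difference is easy to see with a small example. The `records` data below is invented for illustration. The first sort relies on the stability of Python's built-in `sorted`, and the second shows the older workaround of attaching each item's original index.

```python
records = [("apple", 3), ("pear", 1), ("fig", 3), ("kiwi", 1)]

# With a stable sort, items with equal counts keep their original order.
by_count = sorted(records, key=lambda record: record[1])
print(by_count)

# The older workaround: decorate each item with its original index so that
# ties are broken by position, sort the tuples, then strip the index again.
decorated = sorted(
    (count, index, name) for index, (name, count) in enumerate(records)
)
undecorated = [(name, count) for count, index, name in decorated]
print(undecorated)
```

Both approaches give the same order, but the stable sort does it without building and sorting an extra list of tuples.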

Code

If you're not interested in the code, feel free to skip this part. There is more information below this section.

The source code below is based on my own and Nanda Javarma's work. The source code is not complete, nor is it similar to Python's official sorted() source code. This is just a dumbed-down Timsort I implemented to get a general feel for Timsort. If you want to see Timsort's original source code in all its glory, check it out here. Timsort is officially implemented in C, not Python.

# based off of this code https://gist.github.com/nandajavarma/a3a6b62f34e74ec4c31674934327bbd3
# Brandon Skerritt
# https://skerritt.tech

def binary_search(the_array, item, start, end):
    """Return the index at which item should be inserted into the sorted
    slice the_array[start:end + 1]. Equal items are placed after existing
    ones, which keeps the insertion sort below stable."""
    while start <= end:
        # integer midpoint; round() could give surprising results on .5
        mid = (start + end) // 2
        if the_array[mid] <= item:
            # item belongs somewhere to the right of mid
            start = mid + 1
        else:
            # item belongs at mid or to its left
            end = mid - 1
    return start

def insertion_sort(the_array):
    """
    Insertion sort that timsort uses if the array size is small or if
    the size of the "run" is small
    """
    l = len(the_array)
    for index in range(1, l):
        value = the_array[index]
        # find where value belongs among the already sorted prefix
        pos = binary_search(the_array, value, 0, index - 1)
        the_array = the_array[:pos] + [value] + the_array[pos:index] + the_array[index+1:]
    return the_array

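def merge(left, right):
    """Takes two sorted lists and returns a single sorted list by comparing the
    elements one at a time, e.g. merge([1, 3, 5], [2, 4, 6]) gives
    [1, 2, 3, 4, 5, 6].
    """
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        # <= takes from the left list on ties, which keeps the merge stable
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # one list is exhausted; the rest of the other is already sorted
    return merged + left[i:] + right[j:]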

def timsort(the_array):
    """Simplified Timsort: split the input into ascending runs, sort each
    run, then merge the runs together. Returns a new sorted list."""
    if not the_array:
        return []

    runs = []
    new_run = [the_array[0]]

    # for every i in the range of 1 to length of array
    for i in range(1, len(the_array)):
        # if the i'th element is less than the one before it, the current
        # run has ended: store it and start a new run with this element
        if the_array[i] < the_array[i-1]:
            runs.append(new_run)
            new_run = [the_array[i]]
        # else if its equal to or more than, the run continues
        else:
            new_run.append(the_array[i])
    # store the final run
    runs.append(new_run)

    # for every item in runs, append it using insertion sort
    sorted_runs = [insertion_sort(item) for item in runs]

    # for every run in sorted_runs, merge them
    sorted_array = []
    for run in sorted_runs:
        sorted_array = merge(sorted_array, run)

    return sorted_array

print(timsort([2, 3, 1, 5, 6, 7]))

Timsort is actually built right into Python, so this code only serves as an explainer. To use Timsort, simply write:

list.sort()

Or

sorted(list)
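As a short usage sketch, the example below uses a variable called `numbers` (chosen here to avoid shadowing the built-in `list` type) to show the difference between the two calls: `sorted` returns a new list, while the `sort` method reorders the list in place.

```python
numbers = [2, 3, 1, 5, 6, 7]

ordered = sorted(numbers)    # returns a new sorted list; numbers is unchanged
print(ordered, numbers)

numbers.sort(reverse=True)   # sorts in place and returns None
print(numbers)
```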

If you want to master how Timsort works and get a feel for it, I highly suggest you try to implement it yourself!

This article is based on Tim Peters’ original introduction to Timsort, found here.


from: https://dev.to//brandonskerritt/timsort-the-fastest-sorting-algorithm-you-ve-never-heard-of-2ake