How to get embarrassingly fast random subset sampling with Python

by Kirill Dubovikov

Imagine that you are developing a machine learning model to classify articles. You have managed to get an unreasonably large text file which contains millions of identifiers of similar articles that belong to the same class. You are unsure whether identifiers that are close to each other are independent.

For example, a parser could write identifiers of articles from a single site together. So now you want to get a large number of random samples from an array of several million elements to create a training dataset or count some empirical statistics. This situation can come up in practice more frequently than you think.

Thanks to Binder, you can play with the code online without installing anything locally. Or you can clone the Github repository. Please note that all benchmarks may differ from machine to machine.

Well, what’s the matter? Let’s use numpy!
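The original code snippet is not embedded in this copy of the article, but a minimal sketch of the naive approach, with an array size and sample size chosen purely for illustration, might look like this:

```
import numpy as np

# Illustrative sizes: a pool of a few million identifiers, samples of 1,000
arr = np.arange(3_000_000)
sample_size = 1_000

def naive_sample(pool, size):
    # replace=False makes numpy permute the whole pool on every call,
    # which is what makes this slow on multi-million element arrays
    return np.random.choice(pool, size=size, replace=False)

# In a notebook: %timeit naive_sample(arr, sample_size)
```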

On a MacBook Pro, this code runs for around 1.4 s per loop. If you want to get 100,000 samples, that is 1.4 s × 100,000 ≈ 140,000 s, which works out to about a day and a half. Ouch!

Getting up to speed ☄️

What happened there? To generate a random sample, numpy.random.choice permutes the array each time we call it. When our sample size is only a fraction of the whole array length, we do not need to shuffle the array each time we want to take a sample. Let’s just shuffle it once and take samples from the start of the shuffled array.

When we reach the last element, we must shuffle the array again. This optimization also has a very nice side effect: we will have fewer collisions (repeated samples).

Now it is time to code this up:
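The original snippet is likewise missing from this copy; a minimal sketch of the idea, with hypothetical names, could look like this:

```
import numpy as np

class ShuffledSampler:
    """Shuffle the pool once, then hand out consecutive slices,
    reshuffling only when the pool is exhausted (illustrative name,
    not the author's original code)."""

    def __init__(self, pool):
        self._pool = np.array(pool)
        np.random.shuffle(self._pool)
        self._pos = 0

    def sample(self, size):
        # Reshuffle and start over once the remaining tail is too short
        if self._pos + size > len(self._pool):
            np.random.shuffle(self._pool)
            self._pos = 0
        out = self._pool[self._pos:self._pos + size]
        self._pos += size
        return out

# sampler = ShuffledSampler(arr)
# %timeit sampler.sample(sample_size)
```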

This time we get 21.1 µs ± 979 ns per loop, which is faster by several orders of magnitude.

Even faster?

Can we do it even faster? Yes, but we need to go native. Cython translates Python-like code to optimized native C or C++ which can be compiled and used as a friendly and familiar Python module afterwards.

You can play with Cython in Jupyter notebooks by loading the Cython extension with %load_ext Cython, and using the %%cython magic as the first statement in a cell that contains Cython code.
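For reference, a minimal notebook session (the function is just a placeholder) might look like this:

```
# One cell: load the Cython extension that ships with the cython package
%load_ext Cython
```

Then, in a separate cell:

```
%%cython
# The whole cell is translated to C and compiled when it is run
def square(long x):
    return x * x
```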

Almost all Python code is valid Cython code. But to get the most out of it, we need to make use of the extensions provided by Cython (a sketch of the resulting code follows the list below):

  • We statically annotate all types for function signatures and variable definitions to use native C variables instead of slow Python objects where possible.

  • We use the cdef keyword for functions that do not need to be exported as a Python API. cdef calls are much faster.

  • We disable negative indexing and array bounds checking with @cython.wraparound and @cython.boundscheck to get more speed.
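Here is a minimal sketch of what such a Cython version might look like, written for the %%cython cell magic; the class and attribute names are illustrative, and the pool is assumed to be a one-dimensional int64 numpy array:

```
%%cython
import numpy as np
cimport cython

cdef class FastSampler:
    # Statically typed attributes are plain C fields, not Python objects
    cdef long long[:] pool      # typed memoryview over the shuffled pool
    cdef Py_ssize_t pos         # current read position

    def __init__(self, pool):
        self.pool = pool                        # expects a 1-D int64 array
        np.random.shuffle(np.asarray(self.pool))
        self.pos = 0

    @cython.boundscheck(False)  # skip bounds checks on the slice below
    @cython.wraparound(False)   # negative indices are never used here
    cpdef sample(self, Py_ssize_t size):
        cdef long long[:] out
        if self.pos + size > self.pool.shape[0]:
            # Pool exhausted: reshuffle in place and start over
            np.random.shuffle(np.asarray(self.pool))
            self.pos = 0
        out = self.pool[self.pos:self.pos + size]
        self.pos += size
        return np.asarray(out)
```

Once the cell compiles, FastSampler can be used from regular Python code like any other class.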

This minor refactoring is sufficient to get a reasonable speedup (2x on my laptop) compared to the Python version.

I am obliged to say that Cython is much more than an optimized Python-to-C translator. This awesome tool can do a lot more than what we used here.

What about collisions?

Sampling collisions occur when we get a repeating element while sampling the array. For simplicity, let’s suppose that the array does not contain duplicates.

We’ll compare the two algorithms in terms of collisions. We can collect a large number of samples from the same array for each of the algorithms, and then count up the total number of collisions.

When we repeat this process several times and record the results, we are actually collecting a random sample of collision counts for both algorithms.

Having those samples at hand, we can apply statistics to compare them. In this case, we will use a t-test (you can read more about the t-distribution in my previous post and more about the t-test here).
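A sketch of that experiment, reusing arr and the ShuffledSampler sketch from above and scipy.stats.ttest_ind for the test (the repetition counts are arbitrary choices):

```
from collections import Counter

import numpy as np
from scipy import stats

def count_collisions(draw, n_draws=1_000, size=100):
    # Count every repeated occurrence of an element across all draws
    seen = Counter()
    for _ in range(n_draws):
        seen.update(draw(size))
    return sum(c - 1 for c in seen.values() if c > 1)

pool_sampler = ShuffledSampler(arr)

naive_counts = [
    count_collisions(lambda size: np.random.choice(arr, size, replace=False))
    for _ in range(30)
]
fast_counts = [count_collisions(pool_sampler.sample) for _ in range(30)]

# Two-sample t-test on the collision counts of the two algorithms
print(stats.ttest_ind(naive_counts, fast_counts))
```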

The p-value we get is effectively 0, which means the difference we observe between the two algorithms is statistically significant.

Let’s make a plot and see the difference:
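A plotting sketch, assuming the naive_counts and fast_counts lists from the experiment above:

```
import matplotlib.pyplot as plt

# Overlaid histograms of collision counts per run for both algorithms
plt.hist(naive_counts, bins=15, alpha=0.6, label="numpy.random.choice")
plt.hist(fast_counts, bins=15, alpha=0.6, label="shuffled pool")
plt.xlabel("collisions per run")
plt.ylabel("runs")
plt.legend()
plt.show()
```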

As you can see, we get way lower collision numbers as a bonus.

Conclusion

Thanks a lot for reading to the end! Give me a few claps if you found this material helpful. It will help to spread the word.

Translated from: https://www.freecodecamp.org/news/how-to-get-embarrassingly-fast-random-subset-sampling-with-python-da9b27d494d9/
