This function can make your Pandas code significantly faster

Pandas is an awesome library, but it's not the fastest. You can make it faster, though, by using the right tool for the right task. Today we'll explore one of these tools, and it will make everyday tasks significantly faster.

There's no point in making the introduction any longer, so let's take a quick look at the article structure and proceed with the fun stuff. The article is split into several sections:

  • Dataset and problem overview
  • The no-go solution
  • The go-to solution
  • Conclusion

So without further ado, let's get started!

Dataset and problem overview

We'll need two libraries for the demonstration: NumPy and Pandas. The dataset is completely made up and shows sales for 3 products on a particular date. Dates are split into 3 columns (Year, Month, and Day), just to make everything a bit harder and heavier for the computer.

Anyway, here's the dataset:

[Image: the dataset definition]
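The original image is not reproduced here, so the following is only a minimal sketch of how a dataset like this could be built. The product column names (`A_Sold`, `B_Sold`, `C_Sold`), the value ranges, and the random seed are assumptions made for illustration, not the author's actual code.

```python
import numpy as np
import pandas as pd

N_ROWS = 100_000  # 100K rows, as in the article

# Seed chosen arbitrarily so the sketch is reproducible (assumption).
rng = np.random.default_rng(seed=42)

df = pd.DataFrame({
    # The date is deliberately split across three integer columns.
    "Year": rng.integers(2015, 2021, size=N_ROWS),   # 2015..2020
    "Month": rng.integers(1, 13, size=N_ROWS),       # 1..12
    "Day": rng.integers(1, 29, size=N_ROWS),         # 1..28, so every date is valid
    # Hypothetical column names for units sold of each of the 3 products.
    "A_Sold": rng.integers(0, 100, size=N_ROWS),
    "B_Sold": rng.integers(0, 100, size=N_ROWS),
    "C_Sold": rng.integers(0, 100, size=N_ROWS),
})

print(df.shape)
print(df.head())
```

Capping `Day` at 28 is a simplification that avoids generating impossible dates such as February 30.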

Nothing special here, but we have a decent amount of data: 100K rows. Here's what the first couple of rows look like:

[Image: the first rows of the dataset]

We know how many items were sold on a particular day, but we don't know the unit price, so let's declare that quickly:

[Image: the unit price declaration]
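One simple way to declare unit prices is a plain dictionary keyed by product column. The keys and prices below are hypothetical values chosen for illustration; the article's actual figures are not recoverable from the text.

```python
# Hypothetical unit prices, keyed by the (assumed) sales column of each product.
unit_prices = {"A_Sold": 5.5, "B_Sold": 7.0, "C_Sold": 3.25}

# Example: profit for one day, computed the way the article describes it
# (unit price multiplied by the amount sold, summed over the products).
sold_today = {"A_Sold": 10, "B_Sold": 4, "C_Sold": 8}
profit = sum(unit_prices[product] * qty for product, qty in sold_today.items())

print(profit)  # 109.0
```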

Awesome! Here's our end goal (for every row):

  • Combine Year, Month, and Day into a single variable
  • Calculate the profit by day by multiplying the unit price by the amount sold
  • Append those two variables as a key-value pair to a list

We also want to do this fast. There are multiple ways to approach this problem, and of the two row-iteration methods compared here, one is clearly the better choice.

Let's explore what not to do before we introduce the go-to approach for this type of task.

The no-go solution

We have quite a bit of work to do, as discussed in the previous section. That doesn't mean the computer should need a lot of time to finish it. You'd be surprised how quickly this can be done.

But first, let's explore the worse of the two options: using the iterrows method to iterate through DataFrame rows and perform the calculations.

Here's the code:

[Image: the iterrows code and its timing]
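Since the original listing is an image, here is a hedged sketch of what an iterrows-based version of the task might look like. The column names, unit prices, and the use of `datetime.date` for the combined date are assumptions. The frame is kept at 10,000 rows so the sketch runs quickly; raise `n_rows` to 100_000 to match the article's scale.

```python
import time
from datetime import date

import numpy as np
import pandas as pd

# Small stand-in dataset (hypothetical column names and value ranges).
n_rows = 10_000
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "Year": rng.integers(2015, 2021, size=n_rows),
    "Month": rng.integers(1, 13, size=n_rows),
    "Day": rng.integers(1, 29, size=n_rows),
    "A_Sold": rng.integers(0, 100, size=n_rows),
    "B_Sold": rng.integers(0, 100, size=n_rows),
    "C_Sold": rng.integers(0, 100, size=n_rows),
})
unit_prices = {"A_Sold": 5.5, "B_Sold": 7.0, "C_Sold": 3.25}  # hypothetical

start = time.perf_counter()
results = []
# iterrows yields (index, Series) pairs, so each row is built as a full
# pandas Series; values are read with bracket notation by column name.
for _, row in df.iterrows():
    day = date(int(row["Year"]), int(row["Month"]), int(row["Day"]))
    profit = sum(row[col] * price for col, price in unit_prices.items())
    results.append({day: profit})
elapsed = time.perf_counter() - start

print(f"iterrows: {elapsed:.3f} s for {len(df)} rows")
```

Building a Series for every row is the main cost here, which is why this approach slows down noticeably as the row count grows.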

So, we've completed all of the tasks described in the previous section, and it took almost 8 seconds.

You might think that isn't so bad, but just wait until you see what we'll do next.

The go-to solution

If you think that iterrows is good, wait until you meet itertuples. It's a similar method, also used to iterate through DataFrame rows, but it does so much faster.

We'll perform the same task and compare the execution times. itertuples works with, well, tuples (named tuples, by default), so we won't be able to access the DataFrame values by column name using bracket notation; we use attribute access such as row.Year instead. That's the only difference.

Anyway, here's the code:

[Image: the itertuples code and its timing]
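As with the previous listing, the original is an image, so this is only a sketch of an itertuples-based version under the same assumptions (hypothetical column names and unit prices, `datetime.date` for the combined date, and a 10,000-row frame to keep the run short).

```python
import time
from datetime import date

import numpy as np
import pandas as pd

# Small stand-in dataset (hypothetical column names and value ranges).
n_rows = 10_000
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "Year": rng.integers(2015, 2021, size=n_rows),
    "Month": rng.integers(1, 13, size=n_rows),
    "Day": rng.integers(1, 29, size=n_rows),
    "A_Sold": rng.integers(0, 100, size=n_rows),
    "B_Sold": rng.integers(0, 100, size=n_rows),
    "C_Sold": rng.integers(0, 100, size=n_rows),
})
unit_prices = {"A_Sold": 5.5, "B_Sold": 7.0, "C_Sold": 3.25}  # hypothetical

start = time.perf_counter()
results = []
# itertuples yields one named tuple per row; fields are read as attributes
# (row.Year) or via getattr, not with row["Year"].
for row in df.itertuples(index=False):
    day = date(int(row.Year), int(row.Month), int(row.Day))
    profit = sum(getattr(row, col) * price for col, price in unit_prices.items())
    results.append({day: profit})
elapsed = time.perf_counter() - start

print(f"itertuples: {elapsed:.3f} s for {len(df)} rows")
```

Attribute access works here because the assumed column names are valid Python identifiers; columns with spaces or other invalid characters are renamed positionally by pandas, so they would need to be accessed by position instead.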

This took only 0.218 seconds to complete! That is roughly a 35x reduction in run time, which is quite significant. It may not matter much in this toy example, as 8 seconds isn't long to wait, but the gap grows quickly once you scale to millions or tens of millions of rows.

Remember: always reach for itertuples rather than iterrows when performing tasks similar to this one.

Before you go

This is the second time I've covered this topic, but this time the dataset is heavier for the computer and the task takes more work for the developer to code, so I find it worth sharing.

Don't take this 35x speed improvement as guaranteed, as the results may vary depending on the type of task and your hardware. Nevertheless, itertuples should outperform iterrows every time. That's what matters.

Source: https://towardsdatascience.com/this-function-can-make-your-pandas-code-significantly-faster-d018fb5045a9
