Comprehensive Guide to Optimizing Your Pandas Code


Data Science

In this guide, I am going to show you some of the most common pitfalls that can cause otherwise perfectly good Pandas code to be too slow for any time-sensitive application, and walk through a set of tips and tricks to avoid them.

Let's remind ourselves what pandas is, apart from a cute animal 🐼. It's a widely used library for data analysis and manipulation that loads all the data into RAM.

In this article, I am going to use a dataset that contains meal invoices (one million rows) 📉.

df = load_dataset()
df.head()

Why Performance 🤨

  • Fast is better than slow - because no one loves to wait for their code to finish executing 🐇.

  • Memory efficiency is good - because "Out of memory" exceptions are scary 💾.

  • Saving money is awesome - by using less powerful machines, we can reduce our AWS/GCP costs 💸.

  • Hardware will only take you so far - there is a limit to hardware performance 💻.

Ok, now that I have "sold" you on why we should care about performance, the next question I want to tackle is when we should optimize our code. Spoiler alert, the surprising answer is: "NOT ALWAYS."

Image: https://www.pinterest.ie/pin/764415736723944164/

When to Optimize ⏰

Since program readability is our top priority and we aim to make the programmer's life easier, we should only optimize our code when needed. In other words, "all optimizations are premature unless":

  • The program doesn't meet requirements - whether it's too slow for the user or taking too much memory 🚔.

  • Program execution affects development pace - if the program is slow, it hurts developer productivity, and each feature takes much longer to develop 👷.

Since optimizing our code can be time-consuming, we should refactor only the problematic parts.

This can be done by profiling our program to identify the bottlenecks. Since this is a huge topic, I won't delve into details and will use the following profilers (a short usage sketch follows the list):

  • %time - times the execution of a single statement ⌛.

  • %timeit - like %time, but repeats the statement for more accuracy ⌛.

  • %memit - measures the memory use of a single statement 💾.

  • %mprun - runs code with the line-by-line memory profiler 💾.
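For reference, here is a minimal sketch of how these profilers are typically invoked in a Jupyter/IPython notebook. The %memit/%mprun magics come from the memory_profiler package, and my_function below is just a hypothetical stand-in for whatever code you want to profile:

%load_ext memory_profiler

# Time a single statement once.
%time df["meal_price"].sum()

# Repeat the statement for a more stable estimate.
%timeit df["meal_price"].sum()

# Peak memory of a single statement.
%memit df["meal_price"].astype(float)

# Line-by-line memory profile of my_function (hypothetical, defined elsewhere).
%mprun -f my_function my_function(df)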

Apart from finding the part of the code that needs refactoring, we want the refactoring itself to be safe. Like every refactoring task, you want to keep the same behavior and return the same result. The best way to achieve this is to make sure the code is Well Tested.

The next question I am going to tackle is: "is it possible to optimize our Python code? Python is a dynamic language and lacks a lot of compilation optimizations."

Is it Possible? 🦾

People tend to think the problem lies in Python itself, and that there is nothing we can do about it.

Image: https://www.reddit.com/r/ProgrammerHumor/comments/9cdj7z/nah_dude_i_dont_think_python_is_slow_my_app_runs/

But this is not the reality, and this article aims to show how to write optimized pandas code.

Next, we will tackle the million-dollar question: "how can we optimize our pandas code?"

How 👀

Important note: Every technique has an icon that indicates whether it should improve performance ⌛ and/or memory footprint 💾.

Here are the techniques I am going to cover today:

  • Use What You Need 💾⌛
  • Don’t Reinvent the Wheel ⌛💾
  • Avoid Loops ⌛
  • Picking the Right Types 💾⌛
  • Pandas Usage ⌛💾
  • Compiled Code ⌛
  • General Python Optimisations ⌛💾
  • Pandas Alternatives ⌛💾

Before I begin, it's important to state that all of these optimizations depend on the characteristics of your dataset. For example, for small datasets, some of them might be irrelevant.

Use What You Need 🧑

  • Load needed columns only - removing all the columns we don't use in our data analysis/manipulation can be a huge memory saver (see the sketch after this list).

  • Load needed rows only - removing all the rows we don't use in our data analysis/manipulation can be a huge memory and execution time saver.
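Here is a minimal sketch of what this looks like in practice, assuming the invoices live in a hypothetical invoices.csv file:

import pandas as pd

# Load only the columns we actually use in the analysis.
needed_columns = ["order_id", "meal_price", "meal_tip"]
df = pd.read_csv("invoices.csv", usecols=needed_columns)

# Keep only the rows that are relevant, e.g. meals above a certain price.
df = df[df["meal_price"] > 10]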

Although this seems basic, it can have tremendous effects, and people seem to skip it. You should be like this cute fellow and only use what you need.

Image: https://knowyourmeme.com/photos/1092618-image-macros

Don’t Reinvent the Wheel 🎡

  • Vast ecosystem - there are endless related packages and tutorials. Someone has probably already done what you are looking for.

  • Use existing solutions - there are a lot of mature packages out there, and using them will result in fewer bugs and more optimized code (they are written in highly optimized C/Fortran with just Python bindings).

For example, instead of implementing the mean yourself, you should use the scipy or scikit-learn implementation.
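As a minimal illustration of the point (my own example, not the article's original snippet), compare a hand-rolled mean with the optimized built-ins:

import numpy as np

def naive_mean(values):
    # Pure-Python loop: slow for large arrays and easy to get wrong.
    total = 0
    for v in values:
        total += v
    return total / len(values)

prices = df["meal_price"].to_numpy()
naive_mean(prices)          # reinventing the wheel
np.mean(prices)             # optimized C implementation
df["meal_price"].mean()     # pandas' own optimized aggregation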

Image: https://imgflip.com/i/1px0z1

Avoid Loops ♾

Pandas is designed for vector manipulation. Vectorization is the process of executing operations on entire arrays, which makes loops inefficient.

Bad Option 😈

A rookie mistake in pandas is to just loop over all the rows, either by using iterrows or regular loops.

In the following snippet, we are calculating the original meal price (without the tip) by subtracting the tip from the meal price itself.

def iterrows_original_meal_price(df):
    for i, row in df.iterrows():
        row["orig_meal_price"] = row["meal_price"] - row["meal_tip"]
    return df

%%timeit -r 1 -n 1
iterrows_original_meal_price(df)

35min 13s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

As you can see, the execution time is around 35 minutes, an unsatisfying result indeed. But don't worry; as I said, using iterrows is a rookie mistake, and I am here to show you a much better approach.

Better Option 🤵

Fortunately, there is a much nicer way, using apply. Apply accepts any user-defined function and applies it as a transformation/aggregation over the DataFrame (iteratively).

def calc_original_meal_price(row):
    return row['meal_price'] - row['meal_tip']

def apply_original_meal_price(df):
    # Apply the row-wise helper across all rows (axis=1).
    df["orig_meal_price"] = df.apply(calc_original_meal_price, axis=1)
    return df

%%timeit
apply_original_meal_price(df)

22.5 s ± 170 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As we can see, the performance boost here is insane: instead of 35 minutes, the same program took about 22 seconds, which is much better. I will gladly take the ~100x improvement in execution time over iterrows ⌛. So the lesson is that iterrows is pure evil 😈.

But is that all we can do? Can't we make the same simple code extremely fast? Well, it can be done, and now I am going to show you the best way, aka vectorization.

Best Option 👼

As a reminder, vectorization is the process of executing operations on entire arrays. Pandas/NumPy/SciPy include a generous collection of vectorized functions, from mathematical operations to aggregations.

In the following snippet, we are going to subtract the entire meal_tip column from the entire meal_price column.

def vectorized_original_meal_price(df):
    df["orig_meal_price"] = df["meal_price"] - df["meal_tip"]
    return df

%%timeit
vectorized_original_meal_price(df)

2.46 ms ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's insane. We can see the benefit of the vectorized function right away: we got down to about 2.5 milliseconds, roughly an 8,000x improvement in execution time over the apply method ⌛. So the lesson is that vectorized operations rule 😇.

Picking the Right Type 🌈

I am going to show how picking the right type can reduce our memory footprint 🏆.

First, I am going to create an array that contains the number 1 repeated 5,000 times.

ones = np.ones(shape=5000)
ones

array([1., 1., 1., ..., 1., 1., 1.])

Then I am going to cast it to different types and show how a simple type change can drastically alter the memory footprint.

types = ['object', 'float64', 'int64', 'int16', 'int8', 'bool']
df = pd.DataFrame({t: ones.astype(t) for t in types})
df.memory_usage(index=False, deep=True)

object     160000
float64     40000
int64       40000
int16       10000
int8         5000
bool         5000
dtype: int64

As you can see, picking the right type can bring us a 32x improvement in memory (object vs. int8/bool), which is pretty insane. I hope it's clear that the type of a column affects its memory footprint.

Now that we understand the motivation to optimize our dataset types, let's look at our entire dataset's memory footprint and how it is distributed column by column 🌈.

df.memory_usage(deep=True).sum()

478844140

df.memory_usage(deep=True)

Index                  8002720
order_id              73024820
date                  67022780
date_of_meal          82027880
participants          84977580
meal_price            36012240
type_of_meal          63688760
heroes_adjustment     32076480
meal_tip              32010880
dtype: int64

So it's pretty obvious that we should aim for the type that has the lowest memory footprint while providing the same functionality. Let me describe the supported types.

Supported Types 🌈

  • int - integer numbers.

  • float - floating-point numbers.

  • bool - boolean True and False values.

  • object - strings or mixed types.

  • string - strings (new in version 1.0.0).

  • datetime - date and time values.

  • timedelta - the time difference between two datetimes.

  • category - a limited list of values stored as a memory-efficient lookup. It is good when the same elements occur over and over again (new in version 0.23).

  • Sparse types - good when most of the array consists of nulls (new in version 0.24.0).

  • Nullable Integer / Nullable Boolean - good when the elements are integers/booleans and include nulls. This is because NaN is a float, so it forces the entire array to be cast to float and thus gives it a bigger memory footprint (new in version 0.24.0). A short sketch follows below.
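To make the NaN point concrete, here is a minimal sketch (my own example, not from the original article) of how a single null forces a float dtype, and how the nullable and category types behave:

import pandas as pd

pd.Series([1, 2, None]).dtype                    # float64 - NaN forces the cast to float
pd.Series([1, 2, None], dtype="Int64").dtype     # Int64 - nullable integer, keeps ints plus <NA>
pd.Series(["Breakfast", "Lunch", "Breakfast"], dtype="category").dtype   # category - repeated values stored once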

Since pandas uses NumPy arrays as its backend structures, ints and floats can be differentiated into more memory-efficient subtypes like int8, int16, int32, int64, uint8, uint16, uint32, and uint64, as well as float32 and float64.

If that's not enough, one can create custom types using extension arrays, though that requires a lot of effort and skill 🦸🏼🦸‍♀️. For the standard routes, you can use any of the following options:

Our Options for Optimizing Types 🌈

  • Loading DataFrames with specific types (best way).
  • Using the astype method.
  • Using a to_x method (e.g. to_numeric) with the downcast parameter.

df = df.astype({'order_id': 'category',
                'date': 'category',
                'date_of_meal': 'category',
                'participants': 'category',
                'meal_price': 'int16',
                'type_of_meal': 'category',
                'heroes_adjustment': 'bool',
                'meal_tip': 'float32'})

df.memory_usage(deep=True).sum()

36999962
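The other two options look roughly like this - a sketch only, assuming a hypothetical invoices.csv whose columns actually fit the declared types:

import pandas as pd

# Option 1: declare dtypes at load time, so the wide default types are never allocated.
dtypes = {"type_of_meal": "category", "meal_price": "int16"}
df = pd.read_csv("invoices.csv", dtype=dtypes)

# Option 3: downcast after loading, e.g. float64 -> float32 where the values allow it.
df["meal_tip"] = pd.to_numeric(df["meal_tip"], downcast="float")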

That's insane: picking the right types got us down to about 37 megabytes, roughly a 13x improvement in memory over the naive types 💾.

Not only that, we also get execution-time improvements for some of the mathematical methods, like mean/sum/mode/min 🧮.

%%timeit
df["meal_price_with_tip"].astype(object).mean()

96 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
df["meal_price_with_tip"].astype(float).mean()

4.27 ms ± 34.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Again, picking the right types got us from 96 ms down to 4.3 ms, which is a ~20x improvement in execution time over the naive types ⌛.

Image: https://knowyourmeme.com/photos/549339-rule-34

Pandas Usage 🐼

1) Concat vs Append ➕ — every append creates a new DataFrame object, so multiple appends become inefficient and one should use concat; for a few changes, append might be faster. A sketch follows below.
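A minimal sketch of the difference: collect the pieces in a plain Python list and concatenate once at the end, instead of appending to the DataFrame inside the loop:

import pandas as pd

frames = []
for i in range(10):
    # Each piece is built independently; nothing is copied yet.
    frames.append(pd.DataFrame({"meal_price": [i], "meal_tip": [i * 0.1]}))

# One concat allocates the result a single time.
result = pd.concat(frames, ignore_index=True)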

2) Sorting optimization 📟 — pandas sort_values has a kind argument that selects which algorithm to use; with a GPU, PyTorch/TensorFlow sorting will probably be faster:

%%timeit
df.sort_values(["meal_price_with_tip", "meal_tip", "type_of_meal"], kind='quicksort')

147 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
df.sort_values(["meal_price_with_tip", "meal_tip", "type_of_meal"], kind='mergesort')

147 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
df.sort_values(["meal_price_with_tip", "meal_tip", "type_of_meal"], kind='heapsort')

147 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

3) Chunks 🍰 — splitting large datasets into smaller parts allows us to work with datasets much larger than memory, as long as there are not too many interactions between the chunks.

def process_file(huge_file_path, chunksize=10 ** 6):
    for chunk in pd.read_csv(huge_file_path, chunksize=chunksize):
        process(chunk)  # process() stands for whatever per-chunk logic you need

4) GroupBy optimizations 👩‍👩‍👧 (a sketch follows this list):

  • Filter early.
  • Custom functions are slow.
  • Extract the logic from custom functions into built-in operations when possible.
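Here is a minimal sketch of these three tips, assuming the invoices DataFrame from above:

# Filter early: group only the rows you care about, not the whole frame.
subset = df[df["meal_price"] > 10]

# Slow: a custom Python function is called once per group.
subset.groupby("type_of_meal")["meal_price"].apply(lambda s: s.max() - s.min())

# Faster: extract the logic into built-in, vectorized aggregations.
grouped = subset.groupby("type_of_meal")["meal_price"]
grouped.max() - grouped.min()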

5) Merge optimization 🔍.

6) DataFrame serialization 🏋 — various file formats have different advantages, including saving and loading times.

Image (serialization benchmark): https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
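As a minimal sketch (file names are hypothetical), compare the default CSV round-trip with a binary format like parquet, which is typically faster and keeps the dtypes:

df.to_csv("invoices.csv", index=False)       # human-readable, but slow and large
df = pd.read_csv("invoices.csv")

df.to_parquet("invoices.parquet")            # requires pyarrow or fastparquet
df = pd.read_parquet("invoices.parquet")     # faster load, dtypes preserved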

7) Query/Eval 🧬

  • Uses numexpr if installed.

  • Improved execution time - the expected behavior is up to 2x faster 👍.

  • Improved memory - NumPy allocates memory for every intermediate step, while numexpr computes the same expressions without allocating full intermediate arrays 👍.

  • Not all operations are supported 👎.

%%timeit
df[df.type_of_meal=="Breakfast"]

103 ms ± 348 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
df.query("type_of_meal=='Breakfast'")

82.4 ms ± 223 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pretty cool, a 20% performance improvement ⌛. Important note: the pandas documentation recommends using query/eval only on medium-sized datasets or bigger (more than 10,000 rows), as the traditional method is faster for smaller arrays 🧞.

Compiled Code 🤯

Due to its dynamic nature, pure Python code performs some operations very slowly. This is because the sequences of operations cannot be compiled down to efficient machine code as in languages like C and Fortran.

To show the performance boost, I am going to create a pure Python function called pure_python_foo, which sums all the numbers up to a given number.

def pure_python_foo(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator

%%timeit
df.meal_price_with_tip.map(pure_python_foo)

17.9 s ± 25.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As we can see, this simple method takes ~18 seconds to run, which is simply unacceptable. Fortunately, there are various attempts to add some compilation magic to address this weakness 🧙:

  • Cython - converts Python code to compatible C code:

    - Up to 50x speedup over pure Python 👍.

    - Learning curve 👎.

    - Requires additional work to integrate into the code due to the separate compilation step 👎.

    - There is no compilation overhead at runtime due to the separate compilation step 👍.

  • Numba - converts Python code to fast LLVM byte-code:

    - Up to 200x speedup over pure Python 👍.

    - Easy - simply add a decorator to a method 👍.

    - Highly configurable 👍.

    - Mostly numeric 👎.

    - Debugging the business logic is easy, since we can just remove the decorator and debug it as regular Python code 👍.

These are the Cython and Numba examples:

%%cython
# requires the Cython Jupyter extension (%load_ext cython)
def cython_foo(long N):
    cdef long accumulator
    accumulator = 0
    cdef long i
    for i in range(N):
        accumulator += i
    return accumulator

%%timeit
df.meal_price_with_tip.map(cython_foo)

365 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

from numba import jit

@jit(nopython=True)
def numba_foo(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator

%%timeit
df.meal_price_with_tip.map(numba_foo)

414 ms ± 596 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

As you can see, we got a ~49x performance improvement with Cython and a ~43x performance improvement with Numba ⌛.

As a general rule of thumb, you should first try vectorized methods, then Numba if that's not enough, and only then Cython.

General Python Optimisations 🐍

Since we use Python methods when writing pandas code, knowing regular Python optimizations can lead to nice improvements.

Since it's a huge topic, I will only give a bird's-eye view of some techniques. For more, go read the "High Performance Python" book 📖.

Caching 🏎
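The original snippet is not included here, so this is just a minimal sketch of the idea using functools.lru_cache: repeated calls with the same argument return the stored result instead of recomputing it (only valid while df does not change).

from functools import lru_cache

@lru_cache(maxsize=None)
def mean_price_for(meal_type):
    # Hypothetical expensive computation over the invoices DataFrame.
    return df[df["type_of_meal"] == meal_type]["meal_price"].mean()

mean_price_for("Breakfast")   # computed once
mean_price_for("Breakfast")   # served from the cache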

Generators
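Again, a minimal sketch of the idea: a generator expression yields items lazily, so the full intermediate list is never materialized in memory.

prices_with_fee = [p * 1.2 for p in df["meal_price"]]    # list: builds everything in RAM
prices_with_fee = (p * 1.2 for p in df["meal_price"])    # generator: lazy, constant memory
total = sum(prices_with_fee)                             # consumed one item at a time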

Intermediate Variables 👩‍👩‍👧‍👧

  • Intermediate calculations keep extra objects alive in memory.
  • At the peak, the memory footprint includes both the input object and the result object.
  • Smarter variable allocation (reassigning to the same name instead of nesting calls) lets intermediates be freed earlier, as the snippet below shows.

def another_foo(data):
    return data * 2

def foo(data):
    return data + 10

%reload_ext memory_profiler

def load_data():
    return np.ones((2 ** 30), dtype=np.uint8)

%%memit
def proccess():
    data = load_data()
    return another_foo(foo(data))
proccess()

peak memory: 8106.62 MiB, increment: 3042.64 MiB

%%memit
def proccess():
    data = load_data()
    data = foo(data)
    data = another_foo(data)
    return data
proccess()

peak memory: 7102.64 MiB, increment: 2038.66 MiB

Concurrency and Parallelism 🎸🎺🎻🎷

  • Pandas methods use a single process.
  • CPU-bound work can benefit from parallelism instead of sequential execution (see the sketch below).
  • IO-bound work can benefit from concurrency, either multithreading or asynchronous execution.
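Here is a minimal sketch of the CPU-bound case using only the standard library (process_chunk and parallel_process are my own illustrative helpers, not from the original article): split the frame into chunks and let each worker process handle one.

import numpy as np
import pandas as pd
from multiprocessing import Pool

def process_chunk(chunk):
    chunk["orig_meal_price"] = chunk["meal_price"] - chunk["meal_tip"]
    return chunk

def parallel_process(df, n_workers=4):
    # Split along the rows and process each piece in its own process.
    chunks = np.array_split(df, n_workers)
    with Pool(n_workers) as pool:
        # On some platforms this must run under `if __name__ == "__main__":`.
        return pd.concat(pool.map(process_chunk, chunks))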

Pandas Alternatives 🐨🐻

If all of these techniques don't suffice, you should probably use a different DataFrame API:

  • cudf - a DataFrame API that runs on the GPU.

  • pyspark - Python bindings for Apache Spark.

  • modin - an abstraction over dask or ray that parallelizes pandas across multiple cores and machines (see the sketch below).
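For instance, modin is meant to be a drop-in replacement - a minimal sketch, assuming modin with one of its engines (e.g. ray) is installed and a hypothetical invoices.csv exists:

import modin.pandas as pd      # instead of `import pandas as pd`

df = pd.read_csv("invoices.csv")                   # same API, work distributed across cores
df.groupby("type_of_meal")["meal_price"].mean()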

Like everything in life, there is no free lunch 🥢. Every one of these has its limitations, and before you pick one over the other, you should do your homework.

Last Words

In this article, we reviewed some of the most common pitfalls that can cause otherwise perfectly good Pandas code to be too slow for any time-sensitive application, and walked through a set of tips and tricks to avoid them. Due to the extent of the topic, there are many things I covered only briefly. For this reason, I have added additional resources at the end if you want to go the extra mile.

I hope I was able to share my enthusiasm for this fascinating topic and that you find it useful, and as always I am open to any kind of constructive feedback.

Translated from: https://medium.com/towards-artificial-intelligence/comprehensive-guide-to-optimize-your-pandas-code-62980f8c0e64
