Comprehensive Guide to Optimizing Your Pandas Code


Data Science

In this guide, I am going to show you some of the most common pitfalls that can cause otherwise perfectly good Pandas code to be too slow for any time-sensitive application, and walk through a set of tips and tricks to avoid them.

Let's remind ourselves what pandas is, apart from a cute animal 🐼. It's a widely used library for data analysis and manipulation that loads all the data into RAM.

In this article, I am going to use a dataset that contains meal invoices (one million rows) 📉.

df = load_dataset()
df.head()

Why Performance 🤨

  • Fast is better than slow - because no one loves to wait for their code to finish executing 🐇.

  • Memory efficiency is good - because "Out of memory" exceptions are scary 💾.

  • Saving money is awesome - by using less powerful machines, we can reduce our AWS/GCP costs 💸.

  • Hardware will only take you so far - there is a limit to hardware performance 💻.

Ok, now that I have "sold" you on why we should care about performance, the next question I want to tackle is when we should optimize our code. Spoiler alert, the surprising answer is: "NOT ALWAYS."

Image: https://www.pinterest.ie/pin/764415736723944164/

When to Optimize ⏰

Since program readability is our top priority and we aim to make the programmer's life easier, we should only optimize our code when needed. In other words, "all optimizations are premature unless":

  • The program doesn't meet requirements - whether it's too slow for the user or taking too much memory 🚔.

  • Program execution affects development pace - if the program is slow, it hurts developer productivity, and each feature takes much longer to develop 👷.

Since optimizing our code can be time-consuming, we should refactor only the problematic parts.

This can be done by profiling our program to identify the bottlenecks. Since this is a huge topic, I won't delve into details and will use the following profilers (a short usage sketch follows the list):

  • %time - times the execution of a single statement ⌛.

  • %timeit - like %time, but repeats the statement for more accuracy ⌛.

  • %memit - measures the memory use of a single statement 💾.

  • %mprun - runs code with the line-by-line memory profiler 💾.
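For reference, here is a minimal sketch of how these profilers are typically invoked in a Jupyter/IPython notebook. The %memit/%mprun magics come from the memory_profiler package, and my_function below is just a hypothetical stand-in for whatever code you want to profile:

%load_ext memory_profiler

# Time a single statement once.
%time df["meal_price"].sum()

# Repeat the statement for a more stable estimate.
%timeit df["meal_price"].sum()

# Peak memory of a single statement.
%memit df["meal_price"].astype(float)

# Line-by-line memory profile of my_function (hypothetical, defined elsewhere).
%mprun -f my_function my_function(df)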

Apart from finding the part of the code that needs refactoring, we want the refactoring itself to be safe. Like every refactoring task, you want to keep the same behavior and return the same result. The best way to achieve this is to make sure the code is Well Tested.

The next question I am going to tackle is: "is it possible to optimize our Python code? Python is a dynamic language and lacks a lot of compilation optimizations."

Is it Possible? 🦾

People tend to think the problem lies in Python itself, and that there is nothing we can do about it.

Image: https://www.reddit.com/r/ProgrammerHumor/comments/9cdj7z/nah_dude_i_dont_think_python_is_slow_my_app_runs/

But this is not the reality, and this article aims to show how to write optimized pandas code.

Next, we will tackle the million-dollar question: "how can we optimize our pandas code?"

How 👀

Important note: Every technique has an icon that indicates whether it should improve performance ⌛ and/or memory footprint 💾.

Here are the techniques I am going to cover today:

  • Use What You Need 💾⌛
  • Don’t Reinvent the Wheel ⌛💾
  • Avoid Loops ⌛
  • Picking the Right Types 💾⌛
  • Pandas Usage ⌛💾
  • Compiled Code ⌛
  • General Python Optimisations ⌛💾
  • Pandas Alternatives ⌛💾

Before I begin, it's important to state that all of these optimizations depend on the characteristics of your dataset. For example, for small datasets, some of them might be irrelevant.

Use What You Need 🧑

  • Load needed columns only - removing all the columns we don't use in our data analysis/manipulation can be a huge memory saver (see the sketch after this list).

  • Load needed rows only - removing all the rows we don't use in our data analysis/manipulation can be a huge memory and execution time saver.
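Here is a minimal sketch of what this looks like in practice, assuming the invoices live in a hypothetical invoices.csv file:

import pandas as pd

# Load only the columns we actually use in the analysis.
needed_columns = ["order_id", "meal_price", "meal_tip"]
df = pd.read_csv("invoices.csv", usecols=needed_columns)

# Keep only the rows that are relevant, e.g. meals above a certain price.
df = df[df["meal_price"] > 10]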

Although this seems basic, it can have tremendous effects, and people seem to skip it. You should be like this cute fellow and only use what you need.

Image: https://knowyourmeme.com/photos/1092618-image-macros

Don’t Reinvent the Wheel 🎡

  • Vast ecosystem - there are endless related packages and tutorials. Someone has probably already done what you are looking for.

  • Use existing solutions - there are a lot of mature packages out there, and using them will result in fewer bugs and more optimized code (they are written in highly optimized C/Fortran with just Python bindings).

For example, instead of implementing the mean yourself, you should use the scipy or scikit-learn implementation.
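As a minimal illustration of the point (my own example, not the article's original snippet), compare a hand-rolled mean with the optimized built-ins:

import numpy as np

def naive_mean(values):
    # Pure-Python loop: slow for large arrays and easy to get wrong.
    total = 0
    for v in values:
        total += v
    return total / len(values)

prices = df["meal_price"].to_numpy()
naive_mean(prices)          # reinventing the wheel
np.mean(prices)             # optimized C implementation
df["meal_price"].mean()     # pandas' own optimized aggregation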

Image: https://imgflip.com/i/1px0z1

Avoid Loops ♾

Pandas is designed for vector manipulation. Vectorization is the process of executing operations on entire arrays, which makes loops inefficient.

Bad Option 😈

A rookie mistake in pandas is to just loop over all the rows, either by using iterrows or regular loops.

In the following snippet, we are calculating the original meal price (without the tip) by subtracting the tip from the meal price itself.

def iterrows_original_meal_price(df):
    for i, row in df.iterrows():
        row["orig_meal_price"] = row["meal_price"] - row["meal_tip"]
    return df

%%timeit -r 1 -n 1
iterrows_original_meal_price(df)

35min 13s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

As you can see, the execution time is around 35 minutes, an unsatisfying result indeed. But don't worry; as I said, using iterrows is a rookie mistake, and I am here to show you a much better approach.

Better Option 🤵

Fortunately, there is a much nicer way, using apply. Apply accepts any user-defined function and applies it as a transformation/aggregation over the DataFrame (iteratively).

def calc_original_meal_price(row):
    return row['meal_price'] - row['meal_tip']

def apply_original_meal_price(df):
    # Apply the row-wise helper across all rows (axis=1).
    df["orig_meal_price"] = df.apply(calc_original_meal_price, axis=1)
    return df

%%timeit
apply_original_meal_price(df)

22.5 s ± 170 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As we can see, the performance boost here is insane: instead of 35 minutes, the same program took about 22 seconds, which is much better. I will gladly take the ~100x improvement in execution time over iterrows ⌛. So the lesson is that iterrows is pure evil 😈.

But is that all we can do? Can't we make the same simple code extremely fast? Well, it can be done, and now I am going to show you the best way, aka vectorization.

Best Option 👼

As a reminder, vectorization is the process of executing operations on entire arrays. Pandas/NumPy/SciPy include a generous collection of vectorized functions, from mathematical operations to aggregations.

In the following snippet, we are going to subtract the entire meal_tip column from the entire meal_price column.

def vectorized_original_meal_price(df):
    df["orig_meal_price"] = df["meal_price"] - df["meal_tip"]
    return df

%%timeit
vectorized_original_meal_price(df)

2.46 ms ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's insane. We can see the benefit of the vectorized function right away: we got down to about 2.5 milliseconds, roughly an 8,000x improvement in execution time over the apply method ⌛. So the lesson is that vectorized operations rule 😇.

Picking the Right Type 🌈

I am going to show how picking the right type can reduce our memory footprint 🏆.

First, I am going to create an array that contains the number 1 repeated 5,000 times.

ones = np.ones(shape=5000)
ones

array([1., 1., 1., ..., 1., 1., 1.])

Then I am going to cast it to different types and show how a simple type change can drastically alter the memory footprint.

types = ['object', 'float64', 'int64', 'int16', 'int8', 'bool']
df = pd.DataFrame({t: ones.astype(t) for t in types})
df.memory_usage(index=False, deep=True)

object     160000
float64     40000
int64       40000
int16       10000
int8         5000
bool         5000
dtype: int64

As you can see, picking the right type can bring us a 32x improvement in memory (object vs. int8/bool), which is pretty insane. I hope it's clear that the type of a column affects its memory footprint.

Now that we understand the motivation to optimize our dataset types, let's look at our entire dataset's memory footprint and how it is distributed column by column 🌈.

df.memory_usage(deep=True).sum()

478844140

df.memory_usage(deep=True)

Index                  8002720
order_id              73024820
date                  67022780
date_of_meal          82027880
participants          84977580
meal_price            36012240
type_of_meal          63688760
heroes_adjustment     32076480
meal_tip              32010880
dtype: int64

So it's pretty obvious that we should aim for the type that has the lowest memory footprint while providing the same functionality. Let me describe the supported types.

Supported Types 🌈

  • int - integer numbers.

  • float - floating-point numbers.

  • bool - boolean True and False values.

  • object - strings or mixed types.

  • string - strings (new in version 1.0.0).

  • datetime - date and time values.

  • timedelta - the time difference between two datetimes.

  • category - a limited list of values stored as a memory-efficient lookup. It is good when the same elements occur over and over again (new in version 0.23).

  • Sparse types - good when most of the array consists of nulls (new in version 0.24.0).

  • Nullable Integer / Nullable Boolean - good when the elements are integers/booleans and include nulls. This is because NaN is a float, so it forces the entire array to be cast to float and thus gives it a bigger memory footprint (new in version 0.24.0). A short sketch follows below.
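To make the NaN point concrete, here is a minimal sketch (my own example, not from the original article) of how a single null forces a float dtype, and how the nullable and category types behave:

import pandas as pd

pd.Series([1, 2, None]).dtype                    # float64 - NaN forces the cast to float
pd.Series([1, 2, None], dtype="Int64").dtype     # Int64 - nullable integer, keeps ints plus <NA>
pd.Series(["Breakfast", "Lunch", "Breakfast"], dtype="category").dtype   # category - repeated values stored once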

Since pandas uses NumPy arrays as its backend structures, ints and floats can be differentiated into more memory-efficient subtypes like int8, int16, int32, int64, uint8, uint16, uint32, and uint64, as well as float32 and float64.

If that's not enough, one can create custom types using extension arrays, though that requires a lot of effort and skill 🦸🏼🦸‍♀️. For the standard routes, you can use any of the following options:

Our Options for Optimizing Types 🌈

  • Loading DataFrames with specific types (best way).
  • Using the astype method.
  • Using a to_x method (e.g. to_numeric) with the downcast parameter.

df = df.astype({'order_id': 'category',
                'date': 'category',
                'date_of_meal': 'category',
                'participants': 'category',
                'meal_price': 'int16',
                'type_of_meal': 'category',
                'heroes_adjustment': 'bool',
                'meal_tip': 'float32'})

df.memory_usage(deep=True).sum()

36999962
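The other two options look roughly like this - a sketch only, assuming a hypothetical invoices.csv whose columns actually fit the declared types:

import pandas as pd

# Option 1: declare dtypes at load time, so the wide default types are never allocated.
dtypes = {"type_of_meal": "category", "meal_price": "int16"}
df = pd.read_csv("invoices.csv", dtype=dtypes)

# Option 3: downcast after loading, e.g. float64 -> float32 where the values allow it.
df["meal_tip"] = pd.to_numeric(df["meal_tip"], downcast="float")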

That's insane: picking the right types got us down to about 37 megabytes, roughly a 13x improvement in memory over the naive types 💾.

Not only that, we also get execution-time improvements for some of the mathematical methods, like mean/sum/mode/min 🧮.

%%timeit
df["meal_price_with_tip"].astype(object).mean()

96 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
df["meal_price_with_tip"].astype(float).mean()

4.27 ms ± 34.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Again, picking the right types got us from 96 ms down to 4.3 ms, which is a ~20x improvement in execution time over the naive types ⌛.

Image: https://knowyourmeme.com/photos/549339-rule-34

Pandas Usage 🐼

1) Concat vs Append ➕ — every append creates a new DataFrame object, so multiple appends become inefficient and one should use concat; for a few changes, append might be faster. A sketch follows below.
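A minimal sketch of the difference: collect the pieces in a plain Python list and concatenate once at the end, instead of appending to the DataFrame inside the loop:

import pandas as pd

frames = []
for i in range(10):
    # Each piece is built independently; nothing is copied yet.
    frames.append(pd.DataFrame({"meal_price": [i], "meal_tip": [i * 0.1]}))

# One concat allocates the result a single time.
result = pd.concat(frames, ignore_index=True)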

2) Sorting optimization 📟 — pandas sort_values has a kind argument that selects which algorithm to use; with a GPU, PyTorch/TensorFlow sorting will probably be faster:

%%timeit
df.sort_values(["meal_price_with_tip", "meal_tip", "type_of_meal"], kind='quicksort')

147 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
df.sort_values(["meal_price_with_tip", "meal_tip", "type_of_meal"], kind='mergesort')

147 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
df.sort_values(["meal_price_with_tip", "meal_tip", "type_of_meal"], kind='heapsort')

147 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

3) Chunks 🍰 — splitting large datasets into smaller parts allows us to work with datasets much larger than memory, as long as there are not too many interactions between the chunks.

def process_file(huge_file_path, chunksize=10 ** 6):
    for chunk in pd.read_csv(huge_file_path, chunksize=chunksize):
        process(chunk)  # process() stands for whatever per-chunk logic you need

4) GroupBy optimizations 👩‍👩‍👧 (a sketch follows this list):

  • Filter early.
  • Custom functions are slow.
  • Extract the logic from custom functions into built-in operations when possible.
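Here is a minimal sketch of these three tips, assuming the invoices DataFrame from above:

# Filter early: group only the rows you care about, not the whole frame.
subset = df[df["meal_price"] > 10]

# Slow: a custom Python function is called once per group.
subset.groupby("type_of_meal")["meal_price"].apply(lambda s: s.max() - s.min())

# Faster: extract the logic into built-in, vectorized aggregations.
grouped = subset.groupby("type_of_meal")["meal_price"]
grouped.max() - grouped.min()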

5) Merge optimization 🔍.

6) DataFrame serialization 🏋 — various file formats have different advantages, including saving and loading times.

Image (serialization benchmark): https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
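As a minimal sketch (file names are hypothetical), compare the default CSV round-trip with a binary format like parquet, which is typically faster and keeps the dtypes:

df.to_csv("invoices.csv", index=False)       # human-readable, but slow and large
df = pd.read_csv("invoices.csv")

df.to_parquet("invoices.parquet")            # requires pyarrow or fastparquet
df = pd.read_parquet("invoices.parquet")     # faster load, dtypes preserved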

7) Query/Eval 🧬

  • Uses numexpr if installed.

  • Improved execution time - the expected behavior is up to 2x faster 👍.

  • Improved memory - NumPy allocates memory for every intermediate step, while numexpr computes the same expressions without allocating full intermediate arrays 👍.

  • Not all operations are supported 👎.

%%timeit
df[df.type_of_meal=="Breakfast"]

103 ms ± 348 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
df.query("type_of_meal=='Breakfast'")

82.4 ms ± 223 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pretty cool, a 20% performance improvement ⌛. Important note: the pandas documentation recommends using query/eval only on medium-sized datasets or bigger (more than 10,000 rows), as the traditional method is faster for smaller arrays 🧞.

Compiled Code 🤯

Due to its dynamic nature, pure Python code performs some operations very slowly. This is because the sequences of operations cannot be compiled down to efficient machine code as in languages like C and Fortran.

To show the performance boost, I am going to create a pure Python function called pure_python_foo, which sums all the numbers up to a given number.

def pure_python_foo(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator

%%timeit
df.meal_price_with_tip.map(pure_python_foo)

17.9 s ± 25.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As we can see, this simple method takes ~18 seconds to run, which is simply unacceptable. Fortunately, there are various attempts to add some compilation magic to address this weakness 🧙:

  • Cython - converts Python code to compatible C code:

    - Up to 50x speedup over pure Python 👍.

    - Learning curve 👎.

    - Requires additional work to integrate into the code due to the separate compilation step 👎.

    - There is no compilation overhead at runtime due to the separate compilation step 👍.

  • Numba - converts Python code to fast LLVM byte-code:

    - Up to 200x speedup over pure Python 👍.

    - Easy - simply add a decorator to a method 👍.

    - Highly configurable 👍.

    - Mostly numeric 👎.

    - Debugging the business logic is easy, since we can just remove the decorator and debug it as regular Python code 👍.

These are the Cython and Numba examples:

%%cython
# requires the Cython Jupyter extension (%load_ext cython)
def cython_foo(long N):
    cdef long accumulator
    accumulator = 0
    cdef long i
    for i in range(N):
        accumulator += i
    return accumulator

%%timeit
df.meal_price_with_tip.map(cython_foo)

365 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

from numba import jit

@jit(nopython=True)
def numba_foo(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator

%%timeit
df.meal_price_with_tip.map(numba_foo)

414 ms ± 596 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

As you can see, we got a ~49x performance improvement with Cython and a ~43x performance improvement with Numba ⌛.

As a general rule of thumb, you should first try vectorized methods, then Numba if that's not enough, and only then Cython.

General Python Optimisations 🐍

Since we use Python methods when writing pandas code, knowing regular Python optimizations can lead to nice improvements.

Since it's a huge topic, I will only give a bird's-eye view of some techniques. For more, go read the "High Performance Python" book 📖.

Caching 🏎
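The original snippet is not included here, so this is just a minimal sketch of the idea using functools.lru_cache: repeated calls with the same argument return the stored result instead of recomputing it (only valid while df does not change).

from functools import lru_cache

@lru_cache(maxsize=None)
def mean_price_for(meal_type):
    # Hypothetical expensive computation over the invoices DataFrame.
    return df[df["type_of_meal"] == meal_type]["meal_price"].mean()

mean_price_for("Breakfast")   # computed once
mean_price_for("Breakfast")   # served from the cache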

Generators
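Again, a minimal sketch of the idea: a generator expression yields items lazily, so the full intermediate list is never materialized in memory.

prices_with_fee = [p * 1.2 for p in df["meal_price"]]    # list: builds everything in RAM
prices_with_fee = (p * 1.2 for p in df["meal_price"])    # generator: lazy, constant memory
total = sum(prices_with_fee)                             # consumed one item at a time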

Intermediate Variables 👩‍👩‍👧‍👧

  • Intermediate calculations keep extra objects alive in memory.
  • At the peak, the memory footprint includes both the input object and the result object.
  • Smarter variable allocation (reassigning to the same name instead of nesting calls) lets intermediates be freed earlier, as the snippet below shows.

def another_foo(data):
    return data * 2

def foo(data):
    return data + 10

%reload_ext memory_profiler

def load_data():
    return np.ones((2 ** 30), dtype=np.uint8)

%%memit
def proccess():
    data = load_data()
    return another_foo(foo(data))
proccess()

peak memory: 8106.62 MiB, increment: 3042.64 MiB

%%memit
def proccess():
    data = load_data()
    data = foo(data)
    data = another_foo(data)
    return data
proccess()

peak memory: 7102.64 MiB, increment: 2038.66 MiB

Concurrency and Parallelism 🎸🎺🎻🎷

  • Pandas methods use a single process.
  • CPU-bound work can benefit from parallelism instead of sequential execution (see the sketch below).
  • IO-bound work can benefit from concurrency, either multithreading or asynchronous execution.
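Here is a minimal sketch of the CPU-bound case using only the standard library (process_chunk and parallel_process are my own illustrative helpers, not from the original article): split the frame into chunks and let each worker process handle one.

import numpy as np
import pandas as pd
from multiprocessing import Pool

def process_chunk(chunk):
    chunk["orig_meal_price"] = chunk["meal_price"] - chunk["meal_tip"]
    return chunk

def parallel_process(df, n_workers=4):
    # Split along the rows and process each piece in its own process.
    chunks = np.array_split(df, n_workers)
    with Pool(n_workers) as pool:
        # On some platforms this must run under `if __name__ == "__main__":`.
        return pd.concat(pool.map(process_chunk, chunks))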

Pandas Alternatives 🐨🐻

If all of these techniques don't suffice, you should probably use a different DataFrame API:

  • cudf - a DataFrame API that runs on the GPU.

  • pyspark - Python bindings for Apache Spark.

  • modin - an abstraction over dask or ray that parallelizes pandas across multiple cores and machines (see the sketch below).
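For instance, modin is meant to be a drop-in replacement - a minimal sketch, assuming modin with one of its engines (e.g. ray) is installed and a hypothetical invoices.csv exists:

import modin.pandas as pd      # instead of `import pandas as pd`

df = pd.read_csv("invoices.csv")                   # same API, work distributed across cores
df.groupby("type_of_meal")["meal_price"].mean()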

Like everything in life, there is no free lunch 🥢. Every one of these has its limitations, and before you pick one over the other, you should do your homework.

Last Words

In this article, we reviewed some of the most common pitfalls that can cause otherwise perfectly good Pandas code to be too slow for any time-sensitive application, and walked through a set of tips and tricks to avoid them. Due to the extent of the topic, there are many things I covered only briefly. For this reason, I have added additional resources at the end if you want to go the extra mile.

I hope I was able to share my enthusiasm for this fascinating topic and that you find it useful, and as always I am open to any kind of constructive feedback.

Translated from: https://medium.com/towards-artificial-intelligence/comprehensive-guide-to-optimize-your-pandas-code-62980f8c0e64
