Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects

If you work with big data sets, you probably remember the “aha” moment along your Python journey when you discovered the Pandas library. Pandas is a game-changer for data science and analytics, particularly if you came to Python because you were searching for something more powerful than Excel and VBA.

So what is it about Pandas that has data scientists, analysts, and engineers like me raving? Well, the Pandas documentation says that it uses:

fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

Fast, flexible, easy, and intuitive? That sounds great! If your job involves building complicated data models, you don’t want to spend half of your development hours waiting for modules to churn through big data sets. You want to dedicate your time and brainpower to interpreting your data, rather than painstakingly fumbling around with less powerful tools.

But I Heard That Pandas Is Slow…

When I first started using Pandas, I was advised that, while it was a great tool for dissecting data, Pandas was too slow to use as a statistical modeling tool. Starting out, this proved true. I spent more than a few minutes twiddling my thumbs, waiting for Pandas to churn through data.

But then I learned that Pandas is built on top of the NumPy array structure, and so many of its operations are carried out in C, either via NumPy or through Pandas’ own library of Python extension modules that are written in Cython and compiled to C. So, shouldn’t Pandas be fast too?

It absolutely should be, if you use it the way it was intended!

The paradox is that what may otherwise “look like” Pythonic code can be suboptimal in Pandas as far as efficiency is concerned. Like NumPy, Pandas is designed for vectorized operations that operate on entire columns or datasets in one sweep. Thinking about each “cell” or row individually should generally be a last resort, not a first.

This Tutorial

To be clear, this is not a guide about how to over-optimize your Pandas code. Pandas is already built to run quickly if used correctly. Also, there’s a big difference between optimization and writing clean code.

This is a guide to using Pandas Pythonically to get the most out of its powerful and easy-to-use built-in features. Additionally, you will learn a couple of practical time-saving tips, so you won’t be twiddling those thumbs every time you work with your data.

In this tutorial, you’ll cover the following:

  • Advantages of using datetime data with time series
  • The most efficient route to doing batch calculations
  • Saving time by storing data with HDFStore

To demonstrate these topics, I’ll take an example from my day job that looks at a time series of electricity consumption. After loading the data, you’ll successively progress through more efficient ways to get to the end result. One adage that holds true for most of Pandas is that there is more than one way to get from A to B. This doesn’t mean, however, that all of the available options will scale equally well to larger, more demanding datasets.

Assuming that you already know how to do some basic data selection in Pandas, let’s get started.

The Task at Hand

The goal of this example will be to apply time-of-use energy tariffs to find the total cost of energy consumption for one year. That is, at different hours of the day, the price for electricity varies, so the task is to multiply the electricity consumed for each hour by the correct price for the hour in which it was consumed.

Let’s read our data from a CSV file that has two columns: one for date plus time and one for electrical energy consumed in kilowatt hours (kWh):

(Image: a preview of the raw CSV data)

Each row contains the electricity used in that hour, so there are 365 x 24 = 8760 rows for the whole year. Each row indicates the usage for the “hour starting” at that time, so 1/1/13 0:00 indicates the usage for the first hour of January 1st.

Saving Time With Datetime Data

The first thing you need to do is to read your data from the CSV file with one of Pandas’ I/O functions:

>>> import pandas as pd
>>> pd.__version__
'0.23.1'

# Make sure that `demand_profile.csv` is in your
# current working directory.
>>> df = pd.read_csv('demand_profile.csv')
>>> df.head()
     date_time  energy_kwh
0  1/1/13 0:00       0.586
1  1/1/13 1:00       0.580
2  1/1/13 2:00       0.572
3  1/1/13 3:00       0.596
4  1/1/13 4:00       0.592

This looks okay at first glance, but there’s a small issue. Pandas and NumPy have a concept of dtypes (data types). If no arguments are specified, date_time will take on an object dtype:
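
>>> df['date_time'].dtype
dtype('O')

(Here 'O' is NumPy’s shorthand for a generic Python object.)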

This is not ideal. object is a container for not just str, but any column that can’t neatly fit into one data type. It would be arduous and inefficient to work with dates as strings. (It would also be memory-inefficient.)

For working with time series data, you’ll want the date_time column to be formatted as an array of datetime objects. (Pandas calls this a Timestamp.) Pandas makes each step here rather simple:

>>> df['date_time'] = pd.to_datetime(df['date_time'])
>>> df['date_time'].dtype
datetime64[ns]

(Note that you could alternatively use a Pandas PeriodIndex in this case.)

You now have a DataFrame called df that looks much like our CSV file. It has two columns and a numerical index for referencing the rows.

The code above is simple and easy, but how fast is it? Let’s put it to the test using a timing decorator, which I have unoriginally called @timeit. This decorator largely mimics timeit.repeat() from Python’s standard library, but it allows you to return the result of the function itself and print its average runtime from multiple trials. (Python’s timeit.repeat() returns the timing results, not the function result.)
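
A minimal sketch of such a decorator, matching the output format you’ll see below (the exact implementation is an assumption and may differ in its details):

import functools
import time

def timeit(repeat=3, number=100):
    """Print the average runtime over `repeat` trials of `number`
    calls each, then return the decorated function's own result."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            trial_averages = []
            for _ in range(repeat):
                start = time.perf_counter()
                for _ in range(number):
                    result = func(*args, **kwargs)
                trial_averages.append((time.perf_counter() - start) / number)
            print(f'Best of {repeat} trials with {number} function calls per trial:')
            print(f'Function `{func.__name__}` ran in average of '
                  f'{min(trial_averages):.3f} seconds.')
            return result
        return wrapper
    return decorator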

Creating a function and placing the @timeit decorator directly above it will mean that every time the function is called, it will be timed. The decorator runs an outer loop and an inner loop:

>>> @timeit(repeat=3, number=10)
... def convert(df, column_name):
...     return pd.to_datetime(df[column_name])

>>> # Read in again so that we have `object` dtype to start
>>> df['date_time'] = convert(df, 'date_time')
Best of 3 trials with 10 function calls per trial:
Function `convert` ran in average of 1.610 seconds.

The result? 1.6 seconds for 8760 rows of data. “Great,” you might say, “that’s no time at all.” But what if you encounter larger data sets—say, one year of electricity use at one-minute intervals? That’s 60 times more data, so you’ll end up waiting around one and a half minutes. That’s starting to sound less tolerable.

In actuality, I recently analyzed 10 years of hourly electricity data from 330 sites. Do you think I waited 88 minutes to convert datetimes? Absolutely not!

How can you speed this up? As a general rule, Pandas will be far quicker the less it has to interpret your data. In this case, you will see huge speed improvements just by telling Pandas what your time and date data looks like, using the format parameter. You can do this by using the strftime codes found here and entering them like this:
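
A sketch of that call, assuming day-first dates as in the sample above (the function name convert_with_format and the exact format string are illustrative):

>>> @timeit(repeat=3, number=100)
... def convert_with_format(df, column_name):
...     # Specifying the exact format skips slow per-string inference
...     return pd.to_datetime(df[column_name],
...                           format='%d/%m/%y %H:%M')

>>> df['date_time'] = convert_with_format(df, 'date_time')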

The new result? 0.032 seconds, which is 50 times faster! So you’ve just saved about 86 minutes of processing time for my 330 sites. Not a bad improvement!

One finer detail is that the datetimes in the CSV are not in ISO 8601 format: you’d need YYYY-MM-DD HH:MM. If you don’t specify a format, Pandas will use the dateutil package to convert each string to a date.

Conversely, if the raw datetime data is already in ISO 8601 format, Pandas can immediately take a fast route to parsing the dates. This is one reason why being explicit about the format is so beneficial here. Another option is to pass the infer_datetime_format=True parameter, but it generally pays to be explicit.

Simple Looping Over Pandas Data

Now that your dates and times are in a convenient format, you are ready to get down to the business of calculating your electricity costs. Remember that cost varies by hour, so you will need to conditionally apply a cost factor to each hour of the day. In this example, the time-of-use costs will be defined as follows:

Tariff Type    Cents per kWh    Time Range
Peak           28               17:00 to 24:00
Shoulder       20               7:00 to 17:00
Off-Peak       12               0:00 to 7:00

If the price were a flat 28 cents per kWh for every hour of the day, most people familiar with Pandas would know that this calculation could be achieved in one line:

>>> df['cost_cents'] = df['energy_kwh'] * 28

This will result in the creation of a new column with the cost of electricity for that hour:
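
With the sample rows from earlier, that looks like this (the costs are computed from the kWh values shown above):

>>> df.head()
            date_time  energy_kwh  cost_cents
0 2013-01-01 00:00:00       0.586      16.408
1 2013-01-01 01:00:00       0.580      16.240
2 2013-01-01 02:00:00       0.572      16.016
3 2013-01-01 03:00:00       0.596      16.688
4 2013-01-01 04:00:00       0.592      16.576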

But our cost calculation is conditional on the time of day. This is where you will see a lot of people using Pandas the way it was not intended: by writing a loop to do the conditional calculation.

For the rest of this tutorial, you’ll start from a less-than-ideal baseline solution and work up to a Pythonic solution that fully leverages Pandas.

But what is Pythonic in the case of Pandas? The irony is that it is those who are experienced in other (less user-friendly) coding languages such as C++ or Java that are particularly susceptible to this because they instinctively “think in loops.”

Let’s look at a loop approach that is not Pythonic and that many people take when they are unaware of how Pandas is designed to be used. We will use @timeit again to see how fast this approach is.

First, let’s create a function to apply the appropriate tariff to a given hour:

def apply_tariff(kwh, hour):
    """Calculates cost of electricity for given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh

Here’s the loop that isn’t Pythonic, in all its glory:
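
(A sketch reconstructed from the description below; the function name and its runtime reappear in the summary table at the end of this section.)

@timeit(repeat=3, number=100)
def apply_tariff_loop(df):
    """Calculate costs in a loop. Modifies `df` in place."""
    energy_cost_list = []
    for i in range(len(df)):
        # Get electricity used and hour of day (note the chained indexing)
        energy_used = df.iloc[i]['energy_kwh']
        hour = df.iloc[i]['date_time'].hour
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)
    df['cost_cents'] = energy_cost_list

>>> apply_tariff_loop(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_loop` ran in average of 3.152 seconds.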

For people who picked up Pandas after having written “pure Python” for some time, this design might seem natural: you have a typical “for each x, conditional on y, do z.”

However, this loop is clunky. You can consider the above to be an “antipattern” in Pandas for several reasons. Firstly, it needs to initialize a list in which the outputs will be recorded.

Secondly, it uses the opaque object range(0, len(df)) to loop over, and then after applying apply_tariff(), it has to append the result to a list that is used to make the new DataFrame column. It also does what is called chained indexing with df.iloc[i]['date_time'], which often leads to unintended results.

But the biggest issue with this approach is the time cost of the calculations. On my machine, this loop took over 3 seconds for 8760 rows of data. Next, you’ll look at some improved solutions for iteration over Pandas structures.

Looping with .itertuples() and .iterrows()

What other approaches can you take? Well, Pandas has actually made the for i in range(len(df)) syntax redundant by introducing the DataFrame.itertuples() and DataFrame.iterrows() methods. These are both generator methods that yield one row at a time.

.itertuples() yields a namedtuple for each row, with the row’s index value as the first element of the tuple. A namedtuple is a data structure from Python’s collections module that behaves like a Python tuple but has fields accessible by attribute lookup.

.iterrows() yields pairs (tuples) of (index, Series) for each row in the DataFrame.

While .itertuples() tends to be a bit faster, let’s stay in Pandas and use .iterrows() in this example, because some readers might not have run across namedtuple. Let’s see what this achieves:

>>> @timeit(repeat=3, number=100)
... def apply_tariff_iterrows(df):
...     energy_cost_list = []
...     for index, row in df.iterrows():
...         # Get electricity used and hour of day
...         energy_used = row['energy_kwh']
...         hour = row['date_time'].hour
...         # Append cost list
...         energy_cost = apply_tariff(energy_used, hour)
...         energy_cost_list.append(energy_cost)
...     # Create new column with cost list
...     df['cost_cents'] = energy_cost_list
...
>>> apply_tariff_iterrows(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_iterrows` ran in average of 0.713 seconds.

Some marginal gains have been made. The syntax is more explicit, and there is less clutter in your row value references, so it’s more readable. In terms of time gains, it’s almost 5 times quicker!

However, there is more room for improvement. You’re still using some form of a Python for-loop, meaning that each and every function call is done in Python when it could ideally be done in a faster language built into Pandas’ internal architecture.

Pandas’ .apply()

You can further improve this operation using the .apply() method instead of .iterrows(). Pandas’ .apply() method takes functions (callables) and applies them along an axis of a DataFrame (all rows, or all columns). In this example, a lambda function will help you pass the two columns of data into apply_tariff():
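
A sketch of that approach (the function name and runtime below come from the summary table at the end of this section):

@timeit(repeat=3, number=100)
def apply_tariff_withapply(df):
    df['cost_cents'] = df.apply(
        lambda row: apply_tariff(
            kwh=row['energy_kwh'],
            hour=row['date_time'].hour),
        axis=1)

>>> apply_tariff_withapply(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_withapply` ran in average of 0.272 seconds.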

The syntactic advantages of .apply() are clear, with a significant reduction in the number of lines and very readable, explicit code. In this case, the time taken was roughly half that of the .iterrows() method.

However, this is not yet “blazingly fast.” One reason is that .apply() will try internally to loop over Cython iterators. But in this case, the lambda that you passed isn’t something that can be handled in Cython, so it’s called in Python, which is consequently not all that fast.

If you were to use .apply() for my 10 years of hourly data for 330 sites, you’d be looking at around 15 minutes of processing time. If this calculation were intended to be a small part of a larger model, you’d really want to speed things up. That’s where vectorized operations come in handy.

Selecting Data With .isin()

Earlier, you saw that if there were a single electricity price, you could apply that price across all the electricity consumption data in one line of code (df['energy_kwh'] * 28). This particular operation was an example of a vectorized operation, and it is the fastest way to do things in Pandas.

But how can you apply condition calculations as vectorized operations in Pandas? One trick is to select and group parts of the DataFrame based on your conditions and then apply a vectorized operation to each selected group.

In this next example, you will see how to select rows with Pandas’ .isin() method and then apply the appropriate tariff in a vectorized operation. Before you do this, it will make things a little more convenient if you set the date_time column as the DataFrame’s index:

df.set_index('date_time', inplace=True)

@timeit(repeat=3, number=100)
def apply_tariff_isin(df):
    # Define hour range Boolean arrays
    peak_hours = df.index.hour.isin(range(17, 24))
    shoulder_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))

    # Apply tariffs to hour ranges
    df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
    df.loc[shoulder_hours, 'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
    df.loc[off_peak_hours, 'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12

Let’s see how this compares:
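
>>> apply_tariff_isin(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_isin` ran in average of 0.010 seconds.

(The runtime shown is taken from the summary table later in this article.)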

To understand what’s happening in this code, you need to know that the .isin() method is returning an array of Boolean values that looks like this:

[False, False, False, ..., True, True, True]

These values identify which DataFrame indices (datetimes) fall within the hour range specified. Then, when you pass these Boolean arrays to the DataFrame’s .loc indexer, you get a slice of the DataFrame that only includes rows that match those hours. After that, it is simply a matter of multiplying the slice by the appropriate tariff, which is a speedy vectorized operation.

How does this compare to our looping operations above? Firstly, you may notice that you no longer need apply_tariff(), because all the conditional logic is applied in the selection of the rows. So there is a huge reduction in the lines of code you have to write and in the Python code that is called.

What about the processing time? 315 times faster than the loop that wasn’t Pythonic, around 71 times faster than .iterrows(), and 27 times faster than .apply(). Now you are moving at the kind of speed you need to get through big data sets nicely and quickly.

Can We Do Better?

In apply_tariff_isin(), we are still admittedly doing some “manual work” by calling df.loc and df.index.hour.isin() three times each. You could argue that this solution isn’t scalable if we had a more granular range of time slots. (A different rate for each hour would require 24 .isin() calls.) Luckily, you can do things even more programmatically with Pandas’ pd.cut() function in this case:
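
A sketch of that approach (the bins follow the tariff table above; the .astype(int) conversion is added here so the categorical labels multiply as plain numbers):

@timeit(repeat=3, number=100)
def apply_tariff_cut(df):
    # Bin each hour of the index and label each bin with its price
    cents_per_kwh = pd.cut(x=df.index.hour,
                           bins=[0, 7, 17, 24],
                           include_lowest=True,
                           labels=[12, 20, 28])
    df['cost_cents'] = cents_per_kwh.astype(int) * df['energy_kwh'].values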

apply_tariff_isin() ,我们仍然可以通过分别调用df.locdf.index.hour.isin() 3次来做一些“手工工作”。 您可能会争辩说,如果我们的时隙范围更细,则该解决方案是不可扩展的。 (每小时的费率不同,则需要24个.isin()调用。)幸运的是,在这种情况下,您可以使用Pandas的pd.cut()函数以编程方式进行更多操作:

Let’s take a second to see what’s going on here. pd.cut() is applying an array of labels (our costs) according to which bin each hour belongs in. Note that the include_lowest parameter indicates whether the first interval should be left-inclusive or not. (You want to include time=0 in a group.)

This is a fully vectorized way to get to your intended result, and it comes out on top in terms of timing:

>>> apply_tariff_cut(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_cut` ran in average of 0.003 seconds.

So far, you’ve built up from taking potentially over an hour to under a second to process the full 330-site dataset. Not bad! There is one last option, though, which is to use NumPy functions to manipulate the underlying NumPy arrays for each DataFrame, and then to integrate the results back into Pandas data structures.

Don’t Forget NumPy!

One point that should not be forgotten when you are using Pandas is that Pandas Series and DataFrames are designed on top of the NumPy library. This gives you even more computation flexibility, because Pandas works seamlessly with NumPy arrays and operations.

In this next case you’ll use NumPy’s digitize() function. It is similar to Pandas’ cut() in that the data will be binned, but this time it will be represented by an array of indexes representing which bin each hour belongs to. These indexes are then applied to a prices array:
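
A sketch of that approach (np.digitize maps each hour to an index into the prices array):

import numpy as np

@timeit(repeat=3, number=100)
def apply_tariff_digitize(df):
    prices = np.array([12, 20, 28])
    # hour < 7 -> 0, 7 <= hour < 17 -> 1, 17 <= hour < 24 -> 2
    bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
    df['cost_cents'] = prices[bins] * df['energy_kwh'].values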

Like the cut() function, this syntax is wonderfully concise and easy to read. But how does it compare in speed? Let’s see:

cut()函数一样,此语法非常简洁且易于阅读。 但是它的速度如何比较? 让我们来看看:

>>> apply_tariff_digitize(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_digitize` ran in average of 0.002 seconds.

At this point, there’s still a performance improvement, but it’s becoming more marginal in nature. This is probably a good time to call it a day on hacking away at code improvement and think about the bigger picture.

With Pandas, it can help to maintain a “hierarchy,” if you will, of preferred options for doing batch calculations like you’ve done here. These will usually rank from fastest to slowest (and most to least flexible):

  1. Use vectorized operations: Pandas methods and functions with no for-loops.
  2. Use the .apply() method with a callable.
  3. Use itertuples(): iterate over DataFrame rows as namedtuples from Python’s collections module.
  4. Use iterrows(): iterate over DataFrame rows as (index, pd.Series) pairs. While a Pandas Series is a flexible data structure, it can be costly to construct each row into a Series and then access it.
  5. Use “element-by-element” for loops, updating each cell or row one at a time with df.loc or df.iloc.

Don’t Take My Word For It: The order of precedence above is a suggestion straight from a core Pandas developer.

Here’s the “order of precedence” above at work, with each function you’ve built here:

Function                      Runtime (seconds)
apply_tariff_loop()           3.152
apply_tariff_iterrows()       0.713
apply_tariff_withapply()      0.272
apply_tariff_isin()           0.010
apply_tariff_cut()            0.003
apply_tariff_digitize()       0.002

Prevent Reprocessing with HDFStore

Now that you have looked at quick data processes in Pandas, let’s explore how to avoid reprocessing time altogether with HDFStore, which was recently integrated into Pandas.

Often when you are building a complex data model, it is convenient to do some pre-processing of your data. For example, if you had 10 years of minute-frequency electricity consumption data, simply converting the date and time to datetime might take 20 minutes, even if you specify the format parameter. You really only want to have to do this once, not every time you run your model, for testing or analysis.

A very useful thing you can do here is pre-process and then store your data in its processed form to be used when needed. But how can you store data in the right format without having to reprocess it again? If you were to save as CSV, you would simply lose your datetime objects and have to re-process it when accessing again.

Pandas has a built-in solution for this which uses HDF5 , a high-performance storage format designed specifically for storing tabular arrays of data. Pandas’ HDFStore class allows you to store your DataFrame in an HDF5 file so that it can be accessed efficiently, while still retaining column types and other metadata. It is a dictionary-like class, so you can read and write just as you would for a Python dict object.

Here’s how you would go about storing your pre-processed electricity consumption DataFrame, df, in an HDF5 file:
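
# Create storage object with filename `processed_data.h5`
data_store = pd.HDFStore('processed_data.h5')

# Put DataFrame into the object, setting the key as 'preprocessed_df'
data_store['preprocessed_df'] = df
data_store.close()

(The file name and key here match the retrieval snippet below.)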

Now you can shut your computer down and take a break knowing that you can come back and your processed data will be waiting for you when you need it. No reprocessing required. Here’s how you would access your data from the HDF5 file, with data types preserved:

# Access data store
data_store = pd.HDFStore('processed_data.h5')

# Retrieve data using key
preprocessed_df = data_store['preprocessed_df']
data_store.close()

A data store can house multiple tables, with the name of each as a key.

Just a note about using the HDFStore in Pandas: you will need to have PyTables >= 3.0.0 installed, so after you have installed Pandas, make sure to update PyTables like this:
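
pip install --upgrade tables

(PyTables is published on PyPI under the package name tables.)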

Conclusions

If you don’t feel like your Pandas project is fast, flexible, easy, and intuitive, consider rethinking how you’re using the library.

The examples you’ve explored here are fairly straightforward but illustrate how the proper application of Pandas features can make vast improvements to runtime and code readability to boot. Here are a few rules of thumb that you can apply next time you’re working with large data sets in Pandas:

  • Try to use vectorized operations where possible rather than approaching problems with the for x in df... mentality. If your code is home to a lot of for loops, it might be better suited to working with native Python data structures, because Pandas otherwise comes with a lot of overhead.

  • If you have more complex operations where vectorization is simply impossible or too difficult to work out efficiently, use the .apply() method.

  • If you do have to loop over your array (which does happen), use .itertuples() or .iterrows() to improve speed and syntax.

  • Pandas has a lot of optionality, and there are almost always several ways to get from A to B. Be mindful of this, compare how different routes perform, and choose the one that works best in the context of your project.

  • Once you’ve got a data cleaning script built, avoid reprocessing by storing your intermediate results with HDFStore.

  • Integrating NumPy into Pandas operations can often improve speed and simplify syntax.


Translated from: https://www.pybloggers.com/2018/07/fast-flexible-easy-and-intuitive-how-to-speed-up-your-pandas-projects/
