Many data analysis tasks are still performed on a laptop. This speeds up the analysis, as you have your familiar work environment prepared with all of your tools. But chances are your laptop is not "the latest beast" with x GB of main memory.
Then a MemoryError surprises you!
What should you do? Use Dask? You have never worked with it, and these tools usually have some quirks. Should you request a Spark cluster? Or would Spark be overkill at this point?
Calm down… breathe.
Before you think about using another tool, ask yourself the following question:
Do I need all the rows and columns to perform the analysis?
Tip 1: Filter rows while reading
In case you don't need all the rows, you can read the dataset in chunks and filter out the unnecessary rows to reduce memory usage:
import pandas as pd  # 'field' and 'constant' below are placeholders for your filter column and threshold
iter_csv = pd.read_csv('dataset.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])
Reading a dataset in chunks is slower than reading it all at once. I would recommend this approach only for bigger-than-memory datasets.
Tip 2: Filter columns while reading
In case you don't need all the columns, you can specify the required columns with the "usecols" argument when reading the dataset:
df = pd.read_csv('dataset.csv', usecols=['col1', 'col2'])
This approach generally speeds up reading and reduces memory consumption, so I would recommend it for every dataset.
Tip 3: Combine both approaches
The great thing about these two approaches is that you can combine them: filter which rows to read while also limiting the columns, as in the sketch below.
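For illustration, here is a minimal sketch that combines both tips, reusing the placeholder file name, column names, and threshold from the snippets above:

import pandas as pd

# Placeholders: 'dataset.csv', 'col1', 'col2', and the threshold value are illustrative.
threshold = 0
iter_csv = pd.read_csv('dataset.csv', usecols=['col1', 'col2'], iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['col1'] > threshold] for chunk in iter_csv])

Limiting the columns shrinks each chunk, so the final concatenated DataFrame holds only the rows and columns you actually need.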
To step up your Pandas game, take a look at my pandas series:
But I need all the columns and rows for the analysis!
Then you should try Vaex!
Vaex is a high-performance Python library for lazy, out-of-core DataFrames (similar to Pandas) that lets you visualize and explore big tabular datasets. It can calculate basic statistics for more than a billion rows per second, and it supports multiple visualizations for interactive exploration of big data.
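To give a taste of the API, here is a minimal sketch; 'dataset.csv' and 'col1' are placeholder names assumed for illustration:

import vaex

# convert=True writes a memory-mappable copy of the CSV on the first run,
# so the full dataset never has to fit into RAM.
df = vaex.from_csv('dataset.csv', convert=True)
print(df.mean(df.col1))  # statistics are evaluated lazily and out-of-core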
Before you go
I am building an online business focused on Data Science. I tweet about how I’m doing it. Follow me there to join me on my journey.
These are a few links that might interest you:
- Data Science Nanodegree Program
- AI for Healthcare
- Autonomous Systems
- Your First Machine Learning Model in the Cloud
- 5 lesser-known pandas tricks
- How NOT to write pandas code
- Parallels Desktop 50% off
Translated from: https://towardsdatascience.com/these-3-tricks-will-make-pandas-more-memory-efficient-455f9b672e00