5 Essential Pandas Tricks You Didn't Know About

Latest recommended article published on 2024-08-10 10:46:00

weixin_26746401


Article tags: python, java, design patterns

Original link: https://towardsdatascience.com/5-essential-pandas-tricks-you-didnt-know-about-2d1a5b6f2e7


I’ve been using pandas for years, and each time I feel I am typing too much, I google it and usually find a new pandas trick! I learned about these functions recently, and I consider them essential because they are so easy to use.

1. between function

I’ve been using the “between” function in SQL for years, but I only recently discovered it in pandas.

Let’s say we have a DataFrame with prices and we would like to keep only the prices between 2 and 4.

import pandas as pd

df = pd.DataFrame({'price': [1.99, 3, 5, 0.5, 3.5, 5.5, 3.9]})

With the between function, you can reduce this filter:

df[(df.price >= 2) & (df.price <= 4)]

To this:


df[df.price.between(2, 4)]

It might not seem like much, but those parentheses are annoying when writing many filters. The filter with the between function is also more readable.

The between function selects the closed interval left <= series <= right, so both endpoints are included by default.
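As a small sketch of the interval behaviour, the snippet below reuses the price column from above. It assumes pandas 1.3 or newer, where the `inclusive` argument takes a string; older versions accepted a boolean instead.

```python
import pandas as pd

df = pd.DataFrame({'price': [1.99, 3, 5, 0.5, 3.5, 5.5, 3.9]})

# Default: both endpoints are included (left <= price <= right).
closed = df[df.price.between(3, 3.9)]

# Assumption: pandas >= 1.3, where `inclusive` accepts
# 'both', 'neither', 'left' or 'right'.
left_only = df[df.price.between(3, 3.9, inclusive='left')]

print(closed.price.tolist())     # [3.0, 3.5, 3.9]
print(left_only.price.tolist())  # [3.0, 3.5]
```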

2. Fix the order of the rows with reindex function

The reindex function conforms a Series or a DataFrame to a new index. I resort to the reindex function when making reports with columns that have a predefined order.

Let’s add T-shirt sizes to our DataFrame. The goal of the analysis is to calculate the mean price for each size:

df = pd.DataFrame({'price': [1.99, 3, 5], 'size': ['medium', 'large', 'small']})
df_avg = df.groupby('size').price.mean()
df_avg

In the result above, the sizes appear in alphabetical order (large, medium, small), which is not a meaningful order for a report. They should be ordered small, medium, large. Because the sizes are strings, sorting would only order them alphabetically, so the sort_values function does not help. Here the reindex function comes to the rescue:

df_avg.reindex(['small', 'medium', 'large'])
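Put together, the reordering can be sketched as a self-contained script. It uses the same three-row T-shirt table as above; the variable name `ordered` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'price': [1.99, 3, 5], 'size': ['medium', 'large', 'small']})

# groupby sorts the group keys alphabetically: large, medium, small.
df_avg = df.groupby('size').price.mean()

# reindex returns a new Series with the rows in the requested order.
ordered = df_avg.reindex(['small', 'medium', 'large'])

print(ordered.index.tolist())  # ['small', 'medium', 'large']
```

Note that reindex does not raise an error for a label that is missing from the index. The row is added with a NaN value instead, so a typo in a size name shows up as a missing value in the report.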


3. Describe on steroids

The describe function is an essential tool for Exploratory Data Analysis. It shows basic summary statistics for the numeric columns in a DataFrame.

df.price.describe()

What if we would like to calculate 10 quantiles instead of the default 3 (25%, 50% and 75%)?

import numpy as np

df.price.describe(percentiles=np.arange(0, 1, 0.1))

The describe function takes a percentiles argument. We can generate the percentiles with NumPy's arange function to avoid typing each one by hand.

This feature becomes really useful when combined with the groupby function:

df.groupby('size').describe(percentiles=np.arange(0, 1, 0.1))
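A minimal sketch of the grouped version follows. The six-row table is made up for illustration, so that each size has more than one price, and the `price` column is selected explicitly to keep the output narrow.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two prices per size.
df = pd.DataFrame({
    'price': [1.99, 3, 5, 0.5, 3.5, 5.5],
    'size': ['medium', 'large', 'small', 'small', 'medium', 'large'],
})

# Deciles from 0% to 90%; describe always reports the median (50%) as well.
percentiles = np.arange(0, 1, 0.1)

# One row per size, one column per statistic.
summary = df.groupby('size').price.describe(percentiles=percentiles)

print(summary[['count', 'min', '50%', 'max']])
```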

4. Text search with regex

Our T-shirt dataset has 3 sizes. Let’s say we would like to keep only the small and medium sizes. A cumbersome way of filtering is:

df[(df['size'] == 'small') | (df['size'] == 'medium')]

This is bad because we usually combine it with other filters, which makes the whole expression unreadable. Is there a better way?

pandas string columns have a “str” accessor, which implements many functions that simplify string manipulation. One of them is the “contains” function, which supports searching with regular expressions.

df[df['size'].str.contains('small|medium')]

The filter with the “contains” function is more readable and easier to extend and combine with other filters.
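As a rough illustration of combining the regex filter with another condition, the sketch below pairs it with a price range on the same T-shirt table. The variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'price': [1.99, 3, 5],
    'size': ['medium', 'large', 'small'],
})

# Regex alternation: rows whose size contains "small" or "medium".
is_small_or_medium = df['size'].str.contains('small|medium')

# Combine with another boolean filter using &.
cheap_small_or_medium = df[is_small_or_medium & df.price.between(1, 4)]

print(cheap_small_or_medium['size'].tolist())  # ['medium']
```

Keep in mind that contains matches substrings, so a value such as 'x-small' would also pass this filter. When only exact matches are wanted, `df['size'].isin(['small', 'medium'])` is an alternative.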

5. Bigger than memory datasets with pandas

pandas cannot read datasets that are bigger than main memory: it raises a MemoryError or the Jupyter kernel crashes. But to process a big dataset you don’t need Dask or Vaex. You just need some ingenuity. Sounds too good to be true?

(I covered Dask and Vaex for bigger than main memory datasets in a separate article.)

When doing an analysis, you usually don’t need all rows or all columns in the dataset.

If you don’t need all rows, you can read the dataset in chunks and filter out the unnecessary rows to reduce memory usage:

iter_csv = pd.read_csv('dataset.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])

Reading a dataset in chunks is slower than reading it all at once. I would recommend using this approach only with bigger than memory datasets.

If you don’t need all columns, you can specify the required columns with the “usecols” argument when reading a dataset:

df = pd.read_csv('dataset.csv', usecols=['col1', 'col2'])

The great thing about these two approaches is that you can combine them.
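A minimal sketch of the combination is shown below. To stay runnable it reads a tiny in-memory CSV instead of a large file on disk; the column names, the data and the `threshold` value are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_data = io.StringIO(
    "col1,col2,col3\n"
    "1,a,x\n"
    "5,b,y\n"
    "7,c,z\n"
    "2,d,w\n"
)
threshold = 3  # hypothetical filter constant

# Read only two columns, two rows at a time, and keep the rows
# whose col1 value is above the threshold.
chunks = pd.read_csv(csv_data, usecols=['col1', 'col2'], chunksize=2)
df = pd.concat([chunk[chunk['col1'] > threshold] for chunk in chunks])

print(df)
```

Only the filtered rows of the two selected columns are kept, so peak memory depends on the chunk size and on how many rows pass the filter rather than on the size of the whole file.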

Before you go

These are a few links that might interest you:

- Your First Machine Learning Model in the Cloud
- AI for Healthcare
- Parallels Desktop 50% off
- School of Autonomous Systems
- Data Science Nanodegree Program
- 5 lesser-known pandas tricks
- How NOT to write pandas code
