5 Essential Pandas Tips for Easier Data Manipulation

weixin_26750481

Original link: https://towardsdatascience.com/5-essential-pandas-tips-for-easier-data-manipulation-4c2968d37a79

Introduction

Pandas for Python is a piece of software that needs no introduction. Whether you are entirely new to Data Science with Python or have been in the field for an extended period of time, you have likely heard of the Pandas Python module. The library is widely used in the industry for data manipulation, and is a go-to tool for any aspiring Data Scientist who wants to work with data frames and NumPy. Many Data Scientists use Pandas every single day, and it is widely considered an essential tool for manipulating data with Python.

Although Pandas is rather easy to use and has a lot of convenient methods at its disposal, it has many parts, some of which go entirely ignored most of the time. Pandas is a complex beast, and it could take months, or even years, to learn to use it to its full potential. That being said, Pandas offers some basic features that can be used effectively in most situations right now.

Conditional Masking

One feature that sets Pandas apart from a plain dictionary, and from many competing tools, is conditional masking. Conditional masking lets the user filter out values that don't meet a requirement with a simple conditional statement. This is incredibly convenient. Whereas in Julia, for example, we would typically use the filter!() method with a conditional to manage our data, Pandas makes filtering data easy by using what is called a conditional mask.

A conditional mask compares every value in a column to a preset condition and produces a boolean Series, with True wherever the condition holds. Indexing the data frame with that Series returns a filtered data frame containing only the rows that satisfy the condition.

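import pandas as pd

df = pd.DataFrame({"NA": [0, 1, 0, 1, 0, 1], "Label": ["TN", "GA", "TN", "MN", "CA", "CA"]})

# Boolean mask: True where the "NA" column equals 1
ones_only = df["NA"] == 1

# Index the frame with the mask to keep only the matching rows
filtered = df[ones_only]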
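Masks can also be combined. The following is a minimal sketch that reuses the same example frame; the variable names flagged_ca, tn_or_ga, and not_flagged are illustrative and not part of the original article.

```python
import pandas as pd

df = pd.DataFrame({
    "NA": [0, 1, 0, 1, 0, 1],
    "Label": ["TN", "GA", "TN", "MN", "CA", "CA"],
})

# Each comparison yields a boolean Series; & means "and", | means "or".
# Parentheses are required because & and | bind tighter than ==.
flagged_ca = df[(df["NA"] == 1) & (df["Label"] == "CA")]

# isin() builds a mask from a list of accepted values.
tn_or_ga = df[df["Label"].isin(["TN", "GA"])]

# ~ inverts a mask: every row whose "NA" value is not 1.
not_flagged = df[~(df["NA"] == 1)]

print(flagged_ca)
print(tn_or_ga)
print(not_flagged)
```

Each result keeps the original row labels, so it is easy to trace a filtered row back to its position in the source frame.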

Coloring

Sometimes data can be hard to read. This might not be a big deal for a Data Scientist, but often data needs to be presented and made more legible. Pandas makes it rather easy to change the color of different values inside a data frame. Let's consider an example where we want to show which values are above a certain threshold, and which ones fall below it.

df = pd.DataFrame({"Store ID": [1,8,7,4,11,2], "Income": [-12, 56, -2, 23, 7, 16]})

First, we will need to write a little function to color our values based on our conditional. For this example, we are going to map the negative values to red, so that stores with losses stand out from stores with gains.

def negatives(value):
    # Return a CSS rule: red text for negative values, black otherwise
    color = 'red' if value < 0 else 'black'
    return 'color: %s' % color

This is what is known as a style map. A style map is a function that Pandas calls on each cell to determine how a data frame should be presented. We can apply a style map with df.style.applymap() (newer Pandas releases rename this method to df.style.map()).

df.style.applymap(negatives)
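To see what the styler receives from a style map, the function can be called on a column directly. This is only a sketch for inspection: it uses Series.map rather than the styler, and the name css_per_cell is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Store ID": [1, 8, 7, 4, 11, 2],
                   "Income": [-12, 56, -2, 23, 7, 16]})

def negatives(value):
    # Same rule as the style map above: red for losses, black otherwise.
    color = "red" if value < 0 else "black"
    return f"color: {color}"

# Series.map calls the function once per value, which mirrors how the
# styler calls it once per cell.
css_per_cell = df["Income"].map(negatives)
print(css_per_cell.tolist())
```

Since store IDs are never negative, it can make sense to restrict the styling to the Income column by passing subset=["Income"] to the styler method.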

When the result is rendered, for example in a Jupyter notebook, the negative values appear in red. Pandas data frames use Cascading Style Sheets (CSS) for styling, so the presentation is easy to alter. A new style sheet can be applied using df.style.set_table_styles().

Summations and cross-tabulations

One thing that is useful in a lot of situations when exploring large data sets is getting a total across certain columns for each row, which we can compute using apply with a lambda.

df = pd.DataFrame({"Group A": [7,9,11,12,16], "Group B": [12,14, 3, 7, 2]})

df['total'] = df.apply(lambda x: x.sum(), axis=1)

Cross-tabulations serve a related but different purpose. A cross-tabulation counts how often each combination of values occurs in two columns, whereas the method above sums the values across the columns of each row.

cross = pd.crosstab(index = df["Group A"], columns = df["Group B"])
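The difference between the two operations is easier to see on data with repeated values. The sketch below computes the row totals without a lambda and then builds a cross-tabulation on a small hypothetical orders table (the region and status columns are invented for illustration).

```python
import pandas as pd

df = pd.DataFrame({"Group A": [7, 9, 11, 12, 16],
                   "Group B": [12, 14, 3, 7, 2]})

# Row totals without a lambda: sum across the columns of each row.
df["total"] = df[["Group A", "Group B"]].sum(axis=1)

# Hypothetical categorical data where frequency counts are meaningful.
orders = pd.DataFrame({
    "region": ["East", "West", "East", "East", "West"],
    "status": ["paid", "paid", "late", "paid", "late"],
})

# Rows are regions, columns are statuses, cells count co-occurrences.
counts = pd.crosstab(index=orders["region"], columns=orders["status"])

print(df)
print(counts)
```

Selecting the two group columns explicitly keeps the total correct even if the frame later gains other numeric columns.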

Configuration

Another cool thing about Pandas is that it is highly configurable. Pandas allows for the creation of a "configuration file" of sorts at runtime, which sets library options that determine how it behaves. This can be useful in many different situations. Consider this display configuration, for example:

def load_config():
    options = {
        'display': {
            'max_columns': None,
            'max_colwidth': 25,
            'expand_frame_repr': False,
            'max_rows': 14,
            'max_seq_items': 50,
            'precision': 4,
            'show_dimensions': False
        },
        'mode': {
            'chained_assignment': None
        }
    }
    for category, option in options.items():
        for op, value in option.items():
            pd.set_option(f'{category}.{op}', value)

This is of course done with the set_option method, which takes an option name of the form category.option and the value to assign to it.
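Options set with set_option stay in effect for the rest of the session. When a setting is only needed temporarily, a context manager is one alternative. The following is a minimal sketch using pd.option_context, pd.get_option, and pd.reset_option; the variable names and the sample frame are illustrative.

```python
import pandas as pd

# Read the current value so it can be compared afterwards.
default_precision = pd.get_option("display.precision")

df = pd.DataFrame({"x": [1.23456789, 2.3456789]})

# option_context applies the settings only inside the with-block
# and restores the previous values on exit.
with pd.option_context("display.precision", 2, "display.max_rows", 14):
    inside = pd.get_option("display.precision")
    print(df)

after = pd.get_option("display.precision")

# reset_option returns a single option to its library default.
pd.set_option("display.max_colwidth", 25)
pd.reset_option("display.max_colwidth")
```

This keeps display tweaks local to one block of code, which helps in shared notebooks where a global change could surprise other cells.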

Accessors

One great thing that the Series type has to offer is the ability to use accessors. Pandas provides several accessors, including the following three:

.str maps to StringMethods.
.dt maps to CombinedDatetimelikeProperties.
.cat routes to CategoricalAccessor.

These are all individual standalone classes that are connected to the Series class using a CachedAccessor. They each come with their own unique methods that can be incredibly useful at times. Consider the following example:

locations = pd.Series(['Cleveland, TN 37311','Brooklyn, NY 11211-1755','East Moline, IL 61275','Pittsburgh, PA 15211'])

Let's say for this example that we wanted to count how many digits are in each zip code. Since the zip code is the only part of each string that contains digits, we could do so with the string accessor like this:

locations.str.count(r'\d')
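The other accessors follow the same pattern. The sketch below touches all three on small inputs; the regular expression, the sample dates, and the variable names are assumptions made for illustration.

```python
import pandas as pd

locations = pd.Series([
    "Cleveland, TN 37311",
    "Brooklyn, NY 11211-1755",
    "East Moline, IL 61275",
    "Pittsburgh, PA 15211",
])

# .str: pull the two-letter state code out of each string.
# expand=False returns a Series because the pattern has one group.
states = locations.str.extract(r",\s([A-Z]{2})\s", expand=False)

# .dt: datetime components of a datetime Series (hypothetical dates).
dates = pd.Series(pd.to_datetime(["2020-01-15", "2020-06-30"]))
months = dates.dt.month

# .cat: category labels and integer codes of a categorical Series.
labels = states.astype("category")
codes = labels.cat.codes

print(states.tolist())
print(months.tolist())
print(labels.cat.categories.tolist(), codes.tolist())
```

Each accessor is only available on a Series of the matching type, so calling .dt on a Series of plain strings raises an error.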

Pretty cool, right?


Conclusion

Pandas is a great library for handling data in Python, with a lot of really useful features that make data manipulation far easier than it would be otherwise. From accessor classes and conditional masks to simple styling and fully dynamic option sets, Pandas is a very flexible library that can be used for a wide range of operations. This makes Pandas rather hard to compete with, and for many analytics tasks in Python it is hard to justify reaching for anything else.