Ready, Set, Subset

That Week in Data Science

Even if you are fairly new to working with the pandas software library for data science, you likely already know how versatile it can be for searching and selecting within datasets. You may also have come upon the need to narrow your view, from a large number of columns or rows to a more targeted subset of data.

This article will take a look at how we use methods built into pandas to easily view data subsets.

Indexers

The .loc and .iloc indexers make it possible to explore subsets of data (in columns, rows, summary groups…) and to quickly adjust and manipulate those subsets and views on a variety of levels. Indexers are often used when cleaning data that has just been read into pandas, or to drill down to the level best suited for visual exploration. [We will not delve into plotting in this article, as the number of code examples included should give us more than enough to look at.]

Given an appropriate data source, you can use indexers to thoughtfully compose a single line of Python code to limit your view of a dataframe. Before we take a look at some examples, we can consider the following:

  • Given a dataframe listing the name, height, weight, active years, first professional year played, last professional year played, jersey number, yearly contract value, and shoe size for athletes in the NBA since 1975, you may only want to view jersey numbers for players who were active in 2006.
  • Or, you might identify playgrounds with a swing set located within one mile of a city’s waterfront, from a dataframe of all the city’s public spaces (given that the data includes distance from the waterfront or a way to calculate it).
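
To make the first scenario concrete, here is a minimal sketch. The table, its column names (name, first_year, last_year, jersey_number), and every value in it are invented for illustration; no real NBA data is involved.

```python
import pandas as pd

# Hypothetical player table; every name and value here is invented.
players = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "first_year": [1998, 2007, 2003],
    "last_year": [2010, 2015, 2005],
    "jersey_number": [23, 7, 11],
})

# Boolean mask: 2006 falls between the first and last year played.
active_2006 = (players["first_year"] <= 2006) & (players["last_year"] >= 2006)

# One line with .loc: rows where the mask is True, two columns only.
jerseys_2006 = players.loc[active_2006, ["name", "jersey_number"]]
print(jerseys_2006)
```

The row selection and the column selection both happen inside a single .loc call, which is what keeps the view down to one line.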

Obviously, options will depend on your dataset. My point is that indexers provide a lot of opportunity to gain better views over data. One of the most important reasons for their flexibility lies right in front of our eyes.

Image of two square-top cafe tables placed together to make one longer table.
Image credit.

Some of pandas’s built-in methods enable you to select subsets of data and return the subsets themselves as dataframe objects. That means you can operate on them in the same ways you might on any other dataframe to gain meaningful insight. Let’s see what that looks like, with a few excerpts from a Python notebook I built for just this purpose.

Code Examples

I will use a dataset from a 2014 FiveThirtyEight repository containing data and code for a story on the economics of picking a college major. The ‘recent-grads.csv’ file contains a detailed breakdown of labor force information for recent college graduates.

We read the CSV data directly into our Jupyter notebook (Google Colaboratory, in this case) as a pandas dataframe and assign it to the variable ‘df’.
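
A minimal sketch of that step follows. To keep it runnable anywhere, it reads a few made-up rows from an in-memory string rather than the real file, and the index_col="Rank" argument is an assumption based on ‘Rank’ appearing as the index later on.

```python
import io

import pandas as pd

# Stand-in for the real file: a few real column names, made-up rows.
csv_text = """Rank,Major_code,Major,Total,Major_category
1,1001,MAJOR A,2000,Engineering
2,1002,MAJOR B,800,Engineering
3,1003,MAJOR C,900,Arts
"""

# With the real data, pass the path or URL of recent-grads.csv instead.
# index_col="Rank" is an assumption, not confirmed by the article.
df = pd.read_csv(io.StringIO(csv_text), index_col="Rank")
print(type(df))
```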

First, we’ll get an idea of how many rows and columns we are dealing with:

Since .shape returns a tuple, in this case with 173 in the 0th position and 20 in the 1st position, we can retrieve each of its values by index.

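
As a minimal sketch, assuming a small stand-in dataframe with made-up values (so the counts are 4 and 4 here rather than 173 and 20):

```python
import pandas as pd

# Stand-in for recent-grads.csv: a few real column names, made-up values.
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Major_code": [1001, 1002, 1003, 1004],
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C", "MAJOR D"],
    "Total": [2000, 800, 900, 1300],
    "Major_category": ["Engineering", "Engineering", "Arts", "Education"],
}).set_index("Rank")

shape = df.shape      # a (rows, columns) tuple
n_rows = df.shape[0]  # 0th position: number of rows
n_cols = df.shape[1]  # 1st position: number of columns
print(shape, n_rows, n_cols)
```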

Checking the first few rows using pandas’s built-in .head() method returns a dataframe object.
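
A minimal sketch, again on a made-up stand-in dataframe (six rows, so that the default of five is visible):

```python
import pandas as pd

# Stand-in for recent-grads.csv: a few real column names, made-up values.
df = pd.DataFrame({
    "Rank": range(1, 7),
    "Major_code": range(1001, 1007),
    "Major": [f"MAJOR {c}" for c in "ABCDEF"],
    "Total": [2000, 800, 900, 1300, 3200, 2600],
}).set_index("Rank")

first_rows = df.head()   # defaults to the first five rows
print(type(first_rows))  # the result is itself a DataFrame
print(first_rows)
```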

These are the first five rows of our ‘df’ dataframe.

Inspecting the data, we see that the dataset starts off with a lot of engineering majors. I am already curious as to how many of the entries are engineering majors. That is a question we can answer rather easily with pandas.

Let’s stay in the shallow end of this concept. Just as we viewed the first few rows using the .head() method, we can view the last few rows using .tail(). Let’s inspect only the last two rows:
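
A minimal sketch, assuming a made-up stand-in dataframe in place of the real file:

```python
import pandas as pd

# Stand-in for recent-grads.csv: a few real column names, made-up values.
df = pd.DataFrame({
    "Rank": range(1, 7),
    "Major_code": range(1001, 1007),
    "Major": [f"MAJOR {c}" for c in "ABCDEF"],
    "Total": [2000, 800, 900, 1300, 3200, 2600],
}).set_index("Rank")

last_two = df.tail(2)  # the argument sets how many rows to return
print(last_two)
```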

Clearly we are not only dealing with engineers.

We can dip our toes a little deeper and recreate .tail() as follows, using the .iloc indexer:
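
One way to write it, sketched on a made-up stand-in dataframe, is a negative row slice:

```python
import pandas as pd

# Stand-in for recent-grads.csv: a few real column names, made-up values.
df = pd.DataFrame({
    "Rank": range(1, 7),
    "Major_code": range(1001, 1007),
    "Major": [f"MAJOR {c}" for c in "ABCDEF"],
    "Total": [2000, 800, 900, 1300, 3200, 2600],
}).set_index("Rank")

# -2: means "from the second-to-last row through the end".
last_two = df.iloc[-2:]
print(last_two)
print(last_two.equals(df.tail(2)))
```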

Since the indexer returns a dataframe subset, we can also use an indexer to subset even further. For example, if we only want the ‘Major_code’, ‘Major’, and ‘Total’ columns of those same last two rows:
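
A minimal sketch, assuming a made-up stand-in dataframe whose first three columns after the ‘Rank’ index are the ones named above:

```python
import pandas as pd

# Stand-in for recent-grads.csv: a few real column names, made-up values.
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Major_code": [1001, 1002, 1003, 1004],
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C", "MAJOR D"],
    "Total": [2000, 800, 900, 1300],
    "Men": [1500, 300, 400, 600],
}).set_index("Rank")

# All rows of the .tail() result, columns at positions 0, 1, and 2.
subset = df.tail(2).iloc[:, 0:3]
print(subset)
```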

Note: ‘Rank’ is returned as an index and is not actually part of the request.

We used .iloc to slice the dataframe returned by .tail(), accessing all of its rows (leaving both sides of the first colon empty selects every row from the beginning through the end) and the columns from position 0 up to, but not including, position 3.

Image of a hand holding a clear bowl with two quarters of melon that have been sliced into segments, resting on their rinds.
Photo by AI FENG Hsiung on Unsplash

Recall that the purpose of this example is to show that you can operate on the results of these built-in methods just as you would on any other dataframe.

Granted, these examples are not necessarily simpler. Let’s see what it looks like when we retrieve the same data, without .tail():

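
A minimal sketch of the .tail()-free version, assuming a made-up stand-in dataframe: a single .iloc call handles both the rows and the columns.

```python
import pandas as pd

# Stand-in for recent-grads.csv: a few real column names, made-up values.
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Major_code": [1001, 1002, 1003, 1004],
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C", "MAJOR D"],
    "Total": [2000, 800, 900, 1300],
    "Men": [1500, 300, 400, 600],
}).set_index("Rank")

# Last two rows, first three columns, in one indexer call.
subset = df.iloc[-2:, 0:3]
print(subset)
```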

Exceptions

Not all built-in functions and methods return dataframe objects. The .info() method in pandas prints a vertical list of column names along with the data type of each column:
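
A minimal sketch on a made-up stand-in dataframe. The second call uses the method's buf parameter to capture the same text in a string, which is only there to show that the output is plain text.

```python
import io

import pandas as pd

# Stand-in for recent-grads.csv: a few real column names, made-up values.
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Major_code": [1001, 1002, 1003, 1004],
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C", "MAJOR D"],
    "Total": [2000, 800, 900, 1300],
    "Major_category": ["Engineering", "Engineering", "Arts", "Education"],
}).set_index("Rank")

df.info()  # prints the summary to standard output

# Capture the same summary as text instead of printing it.
buffer = io.StringIO()
df.info(buf=buffer)
summary_text = buffer.getvalue()
```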

The .info() method produces what is essentially a printout. It is not a dataframe object. If we check the type of its result, it prints out the same info again.

At least that time it admits to being ‘NoneType’, so we clearly cannot perform any operations on it.

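
A minimal sketch of that check, on a made-up stand-in dataframe:

```python
import pandas as pd

# Stand-in for recent-grads.csv: a few real column names, made-up values.
df = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Major_code": [1001, 1002, 1003],
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C"],
    "Total": [2000, 800, 900],
}).set_index("Rank")

result = df.info()   # prints the summary as a side effect
print(type(result))  # the return value itself is None
```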

We learn from .info() that all of the columns in the dataframe are numeric, with the exception of ‘Major’ and ‘Major_category’. That brings us to our next method, which frankly is the real reason we looked at .info().

.describe()

The .describe() method in pandas returns basic statistics for the numeric columns of a dataframe object.
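
A minimal sketch, assuming a made-up stand-in dataframe with one text column and two numeric ones:

```python
import pandas as pd

# Stand-in for recent-grads.csv: a few real column names, made-up values.
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Major_code": [1001, 1002, 1003, 1004],
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C", "MAJOR D"],
    "Total": [2000, 800, 900, 1300],
}).set_index("Rank")

stats = df.describe()  # the text column 'Major' is left out by default
print(type(stats))
print(stats)
```

The rows of the result are labeled count, mean, std, min, 25%, 50%, 75%, and max.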

If you think the output of the .describe() method looks a lot like a dataframe, you are correct. And we can access each of its columns with dot notation:
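
A minimal sketch, using ‘Total’ as the example column on a made-up stand-in dataframe:

```python
import pandas as pd

# Stand-in for recent-grads.csv: a few real column names, made-up values.
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Major_code": [1001, 1002, 1003, 1004],
    "Total": [2000, 800, 900, 1300],
}).set_index("Rank")

stats = df.describe()
total_stats = stats.Total  # same column as stats["Total"]
print(total_stats)
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute; bracket notation works for any name.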

…Or, we can do what we came here for and access information from the dataframe’s .describe() output, using our indexers.
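
A minimal sketch with .iloc, assuming a made-up stand-in dataframe. Passing the row position as a one-item list is a choice made here so that the result stays a dataframe; a bare 4 would return a Series instead.

```python
import pandas as pd

# Stand-in for recent-grads.csv: a few real column names, made-up values.
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Major_code": [1001, 1002, 1003, 1004],
    "Total": [2000, 800, 900, 1300],
    "Men": [1500, 300, 400, 600],
}).set_index("Rank")

# Row position 4 of the describe() output is the 25% row;
# :2 keeps the columns at positions 0 and 1.
q25 = df.describe().iloc[[4], :2]
print(q25)
```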

We retrieved only the 25th percentile statistic, which sits in the fifth row (index number 4) of the .describe() output, and we limited that statistic to the columns from the start of the dataframe up to, but not including, index 2.

Of course, we can view the same data using .loc:

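
A minimal sketch with labels instead of positions, on a made-up stand-in dataframe. Note that a .loc label slice includes its end label, so :"Total" stops after ‘Total’ rather than before it.

```python
import pandas as pd

# Stand-in for recent-grads.csv: a few real column names, made-up values.
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Major_code": [1001, 1002, 1003, 1004],
    "Total": [2000, 800, 900, 1300],
    "Men": [1500, 300, 400, 600],
}).set_index("Rank")

stats = df.describe()

# Row label '25%', columns from the start through 'Total' (inclusive).
q25 = stats.loc[["25%"], :"Total"]
print(q25)
```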

How’s that for idiosyncratic granularity, eh?

Ahem!

If you recall the row labels, you can limit your view to a specific subset of statistics:

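
A minimal sketch, assuming a made-up stand-in dataframe and picking min, mean, and max as the statistics of interest:

```python
import pandas as pd

# Stand-in for recent-grads.csv: a few real column names, made-up values.
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Major_code": [1001, 1002, 1003, 1004],
    "Total": [2000, 800, 900, 1300],
}).set_index("Rank")

# A list of row labels returns those rows, in the order listed.
subset = df.describe().loc[["min", "mean", "max"]]
print(subset)
```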

And they are returned as dataframe objects; so you can further reduce the views to include specific columns. Below, I skip one column, then two, then one…:

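
The exact columns chosen in the notebook are not recoverable here, so the following is only a sketch of the pattern. It assumes a made-up stand-in dataframe with eight numeric columns and picks positions 0, 2, 5, and 7.

```python
import pandas as pd

# Stand-in numeric columns (real names, made-up values).
numeric_cols = ["Major_code", "Total", "Men", "Women",
                "Employed", "Full_time", "Part_time", "Unemployed"]
# Row i, column j holds 100 * (i + 1) + j.
df = pd.DataFrame(
    [[100 * (i + 1) + j for j in range(len(numeric_cols))] for i in range(4)],
    columns=numeric_cols,
)

# Keep 0, skip 1, keep 2, skip 3 and 4, keep 5, skip 6, keep 7.
subset = df.describe().loc[["min", "mean", "max"]].iloc[:, [0, 2, 5, 7]]
print(subset)
```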

Finally, we conclude our subsetting exercises with a more meaningful and, without too much difficulty, consumable view of the minimum, mean, and maximum counts of workforce employed and unemployed, full-time and part-time jobs, jobs requiring and not requiring college education, low-wage service jobs, and the number of men and women in the dataset.
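
A minimal sketch of such a view, assuming a made-up stand-in dataframe that borrows the relevant column names from the real file. Here .loc takes a list of row labels and a list of column labels at once.

```python
import pandas as pd

cols = ["Employed", "Unemployed", "Full_time", "Part_time",
        "College_jobs", "Non_college_jobs", "Low_wage_jobs", "Men", "Women"]
# Made-up values: row i, column j holds 100 * (i + 1) + j.
df = pd.DataFrame(
    [[100 * (i + 1) + j for j in range(len(cols))] for i in range(4)],
    columns=cols,
)

# Three statistics (rows) across nine chosen columns.
summary = df.describe().loc[["min", "mean", "max"], cols]
print(summary)
```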

That’s what you were looking for, right?

Conclusion

In this article, we discussed a few ways to subset data in pandas. Slicing dataframes can help analysts focus on particular elements within data. Built-in methods such as .head(), .tail(), and .describe(), along with the .loc and .iloc indexers, are useful for such efforts; the subsets they return can be operated upon using the same methods as the full dataframe.

Sometimes, recognizing the results of built-in pandas methods as dataframe objects can give you a sharper view of your data. Other times, it might just give you an appreciation for their understated elegance.

It only matters how you slice it.

Translated from: https://medium.com/@jameld.pro/ready-set-subset-666b8ede7024
