[011] sidetable, a hidden "Easter egg" add-on for Pandas that greatly boosts productivity

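Haha, after finishing the first 10 recommended articles in this column I gave myself a two-week break, and today the updates resume! This time I am recommending sidetable, a Pandas add-on that can greatly improve our data analysis efficiency. It offers four main features: frequency tables, count summaries, missing-value summaries, and subtotals. We could get the same results directly with groupby, but the code would be comparatively complex. I hope the introduction and code in this article will be of some help!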

Pandas is a very powerful and versatile Python data analysis library that expedites the preprocessing steps of data science projects. It provides numerous functions and methods that are quite useful in data analysis.

Although the built-in functions of Pandas are capable of performing efficient data analysis, custom-made functions and libraries add further value to Pandas.

Sidetable is one of these add-ons; it makes it easier to create summaries of dataframes. It can be considered a combination of the value_counts and crosstab functions.

In some cases, sidetable can stand in for the groupby function. It can also be combined with groupby to produce more informative results.

Sidetable was created by Chris Moffitt. It has been quite useful for me in my daily analyses. In this post, I will walk you through examples that show how to make the best use of sidetable.

Once installed, sidetable can be used as an accessor on dataframes just like dt and str accessors. Installation is straightforward.

$ python -m pip install -U sidetable    # from the terminal
!pip install sidetable                  # from a Jupyter notebook

We can import it along with pandas and start using it.

import pandas as pd
import sidetable

I will be using direct marketing and US elections datasets for examples. Both datasets are available on Kaggle.

Marketing dataframe (image by author)

Elections dataframe (image by author)

Sidetable provides functions that are used with the stb accessor. The functions we will cover are:

  • Freq function

  • Counts function

  • Missing function

  • Subtotal function

Freq function

The freq function returns a dataframe that conveys three pieces of information.

  • The number of observations (i.e. rows) for each category (value_counts()).

  • The percentage of each category in the entire column (value_counts(normalize=True)).

  • The cumulative versions of the two above.

Here is an example.

marketing.stb.freq(['Age'])

The “Age” column has three categories (Middle, Young, Old). For each category, we see the number of rows and the percentage. Each row of the cumulative columns contains the running total up to and including that row. For instance, the second row of the cumulative columns shows the combined count and percentage of the Middle and Young categories.
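To make the four output columns concrete, here is a rough pandas-only sketch of the same kind of table. It is not sidetable's implementation; the small dataframe and the column names `cumulative_count` and `cumulative_percent` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the marketing data (values are made up).
df = pd.DataFrame({"Age": ["Middle", "Young", "Middle", "Old", "Young", "Middle"]})

# Rows per category, largest first, and each category's share in percent.
counts = df["Age"].value_counts()
percent = df["Age"].value_counts(normalize=True) * 100

# Running totals give the cumulative columns.
summary = pd.DataFrame({
    "count": counts,
    "percent": percent,
    "cumulative_count": counts.cumsum(),
    "cumulative_percent": percent.cumsum(),
})
print(summary)
```

The last row of the cumulative columns always covers the whole column, which is a quick sanity check on the table.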

The freq function counts the number of rows by default. If we pass another column using the value parameter, it will return the sum of values in that column. Let’s do an example.

marketing.stb.freq(['Age'], value='AmountSpent')

As you can see, the name of the column changed from “count” to the name of the column passed to the value parameter. What we see in the returned table is the sum of the “AmountSpent” column for each category. The other columns contain the data (percentage, cumulative) based on the values in the “AmountSpent” column.
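As a hedged illustration of what passing a value column amounts to, the sketch below sums a column per category with plain pandas and derives the percentage and cumulative columns from those sums. The data is invented and the output column names are assumptions, not necessarily the labels sidetable uses.

```python
import pandas as pd

# Hypothetical data: spending per customer, grouped by age category.
df = pd.DataFrame({
    "Age": ["Middle", "Young", "Middle", "Old", "Young"],
    "AmountSpent": [500, 100, 300, 250, 100],
})

# Sum the value column per category instead of counting rows.
totals = df.groupby("Age")["AmountSpent"].sum().sort_values(ascending=False)
percent = totals / totals.sum() * 100

summary = pd.DataFrame({
    "AmountSpent": totals,
    "percent": percent,
    "cumulative_AmountSpent": totals.cumsum(),
    "cumulative_percent": percent.cumsum(),
})
print(summary)
```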

The freq function can also take multiple columns as arguments. It is similar to the groupby function with the count method.

marketing.stb.freq(['Age','Gender'])

We have 6 categories, which are the combinations of the categories in the “Age” and “Gender” columns. Another useful feature of sidetable is that the values are sorted by default.

We can achieve the same result (except for the cumulative part) with the groupby function.

marketing[['Age','Gender','Salary']]\
.groupby(['Age','Gender'], as_index=False)\
.count().sort_values(by='Salary', ascending=False)\
.rename(columns={'Salary':'count'})

It is clear that sidetable provides a much simpler syntax.

One advantage of having cumulative values is that we can display only the larger categories.

Let’s do an example on the elections dataset. We want to see the total number of votes in the states that make up 40% of all votes.

elections.stb.freq(['state'], value='total_votes', thresh=40)

The states are sorted based on the total number of votes. Once the cumulative percentage reaches the threshold (40% here), the remaining states are collapsed into one row labelled “others”. We can change the label name by using the other_label parameter.
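The thresholding idea can be sketched with plain pandas as follows. This is only an approximation under stated assumptions: the states and vote counts are invented, and the exact boundary handling in sidetable (whether a row that lands exactly on the threshold is kept) may differ from the `<=` comparison used here.

```python
import pandas as pd

# Hypothetical county-level rows; state names and vote counts are made up.
elections = pd.DataFrame({
    "state": ["A", "A", "B", "C", "D", "E", "F", "G", "H"],
    "total_votes": [150, 100, 140, 130, 120, 110, 100, 80, 70],
})

thresh = 40  # cumulative-percent cutoff, as in the call above

# Total votes per state, largest first, with the running share in percent.
totals = elections.groupby("state")["total_votes"].sum().sort_values(ascending=False)
cum_pct = totals.cumsum() / totals.sum() * 100

# Keep states within the threshold and collapse the rest into one row.
top = totals[cum_pct <= thresh]
rest = totals[cum_pct > thresh].sum()
result = pd.concat([top, pd.Series({"others": rest})])
print(result)
```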

Counts function

Another highly useful function of sidetable is the counts function. It returns the number of unique values in each column along with some other measures.

  • The number of non-missing values in each column

  • The number of unique categories in each column

  • The most and least frequent categories in each column

  • The number of values that belong to the most and least frequent categories

Let’s apply it on the marketing dataframe.

marketing.stb.counts()

It is quite an informative table. For each column, we can see the number of unique values and the most and least frequent categories.

As you can see, the table includes all the features. We can select a specific data type using the exclude or include parameters. For instance, the following syntax will exclude the numeric columns.

marketing.stb.counts(exclude='number')
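For readers who want to see what goes into such a summary, here is a minimal pandas-only sketch that computes similar measures for the non-numeric columns. The data is invented, and the column names `most_freq`, `least_freq` and so on are assumptions for illustration rather than a statement of sidetable's exact output.

```python
import pandas as pd

# Hypothetical data with one missing value in "Age".
df = pd.DataFrame({
    "Age": ["Middle", "Middle", "Middle", "Young", "Young", "Old", None],
    "Gender": ["F", "M", "F", "F", "M", "F", "M"],
    "Salary": [10, 20, 30, 40, 50, 60, 70],
})

# Drop numeric columns, mirroring the exclude='number' idea.
non_numeric = df.select_dtypes(exclude="number")

rows = {}
for col in non_numeric.columns:
    vc = non_numeric[col].value_counts()  # sorted descending, missing values ignored
    rows[col] = {
        "count": non_numeric[col].count(),     # non-missing values
        "unique": non_numeric[col].nunique(),  # distinct categories
        "most_freq": vc.index[0],
        "most_freq_count": vc.iloc[0],
        "least_freq": vc.index[-1],
        "least_freq_count": vc.iloc[-1],
    }

summary = pd.DataFrame.from_dict(rows, orient="index")
print(summary)
```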

Missing function

The missing function is pretty simple. It returns the count and percentage of missing values in each column.

marketing.stb.missing()

This dataframe does not have many missing values. However, it comes in handy when we work with dataframes that contain missing values in most columns.
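A comparable table can be put together with plain pandas, as in the sketch below. The dataframe is made up and the column names `missing`, `total` and `percent` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with a few missing entries.
df = pd.DataFrame({
    "Age": ["Middle", None, "Old", "Young"],
    "Salary": [50000.0, None, None, 42000.0],
    "Gender": ["F", "M", "F", "M"],
})

# Count missing values per column and express them as a share of all rows.
missing = df.isna().sum()
summary = pd.DataFrame({
    "missing": missing,
    "total": len(df),
    "percent": missing / len(df) * 100,
}).sort_values("missing", ascending=False)
print(summary)
```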

Subtotal function

The subtotal function is best used with the groupby function of Pandas. It adds a subtotal for levels of the grouping.

Let’s first do a groupby example without the subtotal function of sidetable.

marketing[['Age','OwnHome','AmountSpent']]\
.groupby(['Age','OwnHome']).sum()

We have 2 levels and 6 categories as the result of grouping. The levels are the “Age” and “OwnHome” columns. For each category, the sum of the “AmountSpent” column is shown. In some cases, it would be better to also see the subtotal for each level.

Adding subtotals for levels is pretty simple with sidetable.

marketing[['Age','OwnHome','AmountSpent']]\
.groupby(['Age','OwnHome']).sum()\
.stb.subtotal()

In addition to the subtotals, we also see the grand total for the aggregated columns.
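To show what the subtotal and grand total rows represent, the following pandas-only sketch computes them as separate objects instead of inserting them into the grouped table. The data is invented, and this is not how sidetable builds its output.

```python
import pandas as pd

# Hypothetical marketing rows (values are made up).
df = pd.DataFrame({
    "Age": ["Middle", "Middle", "Old", "Old", "Young"],
    "OwnHome": ["Own", "Rent", "Own", "Rent", "Rent"],
    "AmountSpent": [500, 300, 400, 100, 200],
})

# Two-level grouping, as in the example above.
grouped = df.groupby(["Age", "OwnHome"])["AmountSpent"].sum()

# Subtotal per "Age" category: sum over the inner "OwnHome" level.
subtotals = grouped.groupby(level="Age").sum()

# Grand total over all groups.
grand_total = grouped.sum()

print(grouped)
print(subtotals)
print(grand_total)
```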

If we have more than two levels, the subtotals will be added to each level except for the last one. However, it can be changed using the sub_level parameter.

Let’s assume we have 3 levels (Age, OwnHome, Gender) in the groupby function:

  • sub_level = 1 : Subtotals for categories in Age column are shown

  • sub_level = 2 : Subtotals for categories in OwnHome column are shown

  • sub_level = [1,2] : All subtotals are shown.

Conclusion

Sidetable is a great tool for creating summary tables, which are quite useful in exploratory data analysis. We can also use them to deliver analysis results.

What sidetable offers can also be created using Pandas’ own functions and methods. However, the syntax and simplicity of sidetable make it my first choice in many cases.

Thank you for reading. Please let me know if you have feedback.
