Speed Up Your Data Cleaning and Preprocessing with klib


TL;DR: The klib package provides a number of easily applicable functions with sensible default values that can be used on virtually any DataFrame to assess data quality, gain insight, perform cleaning operations and create visualizations, resulting in a much lighter Pandas DataFrame that is more convenient to work with.

Over the past couple of months I’ve implemented a range of functions which I frequently use for virtually any data analysis and preprocessing task, irrespective of the dataset or ultimate goal.

These functions require nothing but a Pandas DataFrame of any size and any datatypes, and can be accessed through simple one-line calls to gain insight into your data, clean up your DataFrames and visualize relationships between features. It is up to you whether you stick to the sensible, yet sometimes conservative, default parameters or customize the experience by adjusting them to your needs.

This package is not meant to provide an Auto-ML style API. Rather it is a collection of functions which you can — and probably should — call every time you start working on a new project or dataset. Not only for your own understanding of what you are dealing with, but also to produce plots you can show to supervisors, customers or anyone else looking to get a higher level representation and explanation of the data.

Installation Instructions

Install klib using pip:

pip install --upgrade klib

Alternatively, to install with conda run:

conda install -c conda-forge klib

What follows is a workflow and set of best practices which I repeatedly apply when facing new datasets.

Quick Outline

  • Assessing Data Quality
  • Data Cleaning
  • Visualizing Relationships

The data used in this guide is a slightly truncated version of the NFL Dataset found on Kaggle. You can download it from Kaggle or use any dataset you like to follow along.

Assessing the Data Quality

Determining data quality before starting to work on a dataset is crucial. A quick way to achieve that is to use the missing value visualization of klib, which can be called as easily as follows:
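
A minimal sketch of that call, assuming the dataset has been loaded into a Pandas DataFrame (the file name is purely illustrative):

import pandas as pd
import klib

df = pd.read_csv("nfl_dataset.csv")  # illustrative path; use your own copy of the data

klib.missingval_plot(df)  # missing values per column, with summary statistics at the top and right margins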

[Figure: Default representation of missing values]

This single plot already shows us a number of important things. Firstly, we can identify columns where all or most of the values are missing. These are candidates for dropping, while those with fewer missing values might benefit from imputation.

Secondly, we can often see patterns of missing rows stretching across many features. We might want to eliminate those rows first before thinking about dropping potentially relevant features.

And lastly, the additional statistics at the top and the right side give us valuable information regarding thresholds we can use for dropping rows or columns with many missing values. In our example we can see that if we drop rows with more than 30 missing values, we only lose a few entries. At the same time, if we eliminate columns where more than 80% of the values are missing, the four most affected columns are removed.

A quick note on performance: Despite going through about 2 million entries with 66 features each, the plot takes only seconds to create.

Data Cleaning

With this insight, we can go ahead and start cleaning the data. With klib this is as simple as calling klib.data_cleaning(), which performs the following operations:

  • cleaning the column names: This unifies the column names by formatting them, splitting CamelCase into camel_case, removing special characters as well as leading and trailing white-space, and formatting all column names as lowercase_and_underscore_separated. It also checks for and fixes duplicate column names, which you sometimes get when reading data from a file.

  • dropping empty and virtually empty columns: You can use the parameters drop_threshold_cols and drop_threshold_rows to adjust the dropping to your needs. The default is to drop columns and rows with more than 90% of the values missing.

  • dropping single-valued columns: As the name states, this removes columns in which every cell contains the same value. This comes in handy when columns such as "year" are included while you are only looking at a single year. Other examples are "download_date" or indicator variables that are identical for all entries.

  • dropping duplicate rows: This is a straightforward drop of entirely duplicate rows. If you are dealing with data where duplicates add value, consider setting drop_duplicates=False.

  • Lastly, and oftentimes most importantly, especially for reducing memory usage and therefore speeding up the subsequent steps in your workflow, klib.data_cleaning() also optimizes the datatypes, as we can see below.
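
A minimal sketch of the cleaning call, with the documented default thresholds written out explicitly; it reports a change summary like the one shown below:

# Clean column names, drop near-empty rows/columns and exact duplicates, and optimize dtypes
df_cleaned = klib.data_cleaning(df, drop_threshold_cols=0.9, drop_threshold_rows=0.9, drop_duplicates=True)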

Shape of cleaned data: (183337, 62) - Remaining NAs: 1754608
Changes:
Dropped rows: 123
of which 123 duplicates. (Rows: [22257, 25347, 26631, 30310, 33558, 35164, 35777, ..., 182935, 182942, 183058, 183368, 183369])
Dropped columns: 4
of which 1 single valued. (Column: ['play_attempted'])
Dropped missing values: 523377
Reduced memory by at least: 63.69 MB (-68.94%)

You can change the verbosity of the output using the parameter show=None, show="changes" or show="all". Please note that the reported memory reduction is a very conservative value (i.e. less reduction than is actually achieved), as it only performs a shallow memory check. A deep memory analysis slows down the function for larger datasets, but if you are curious about the "true" reduction in size, you can use the df.info() method as shown below.

df.info(memory_usage='deep')

dtypes: float64(25), int64(20), object(21)
memory usage: 256.7 MB

As we can see, pandas assigns 64 bits of storage for each float and int. Additionally, 21 columns are of type “object”, which is a rather inefficient way to store data. After data cleaning, the memory usage drops to only 58.4 MB, a reduction of almost 80%! This is achieved by converting, where possible, float64 to float32, and int64 to int8. Also, the dtypes string and category are utilized. The available parameters such as convert_dtypes, category, cat_threshold and many more allow you to tune the function to your needs.

df_cleaned.info(memory_usage='deep')

dtypes: category(17), float32(25), int8(19), string(1)
memory usage: 58.4 MB
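
If the defaults do not fit your data, the parameters mentioned above can be adjusted. A hedged sketch, with illustrative values and assumed semantics:

df_cleaned = klib.data_cleaning(df, convert_dtypes=False)  # assumed: opt out of the dtype conversion step
df_cleaned = klib.data_cleaning(df, cat_threshold=0.05)    # assumed: threshold governing conversion of object columns to category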

Lastly, we take a look at the column names, which were actually already quite well formatted in the original dataset. After the cleaning process, however, you can rely on lowercase, underscore-separated column names. While attribute access is not always advisable, since it can be ambiguous, this now allows you to use df.yards_gained instead of df["Yards.Gained"], which can be really useful for quick lookups or when exploring the data for the first time.

Some column name examples:
Yards.Gained --> yards_gained
PlayAttempted --> play_attempted
Challenge.Replay --> challenge_replay

Ultimately, and to sum it all up: we find that not only have the column names been neatly formatted and unified, but also that the features have been converted to more efficient datatypes. With the relatively mild default settings, only 123 rows and 4 columns, of which one column was single-valued, have been eliminated. This leaves us with a lightweight DataFrame of shape (183337, 62) and 58 MB of memory usage.

Correlation Plots

Once the initial data cleaning is done, it makes sense to take a look at the relationships between the features. For this we employ the function klib.corr_plot(). Setting the split parameter to "pos", "neg", "high" or "low", optionally combined with a threshold, allows us to dig deeper and highlight the most important aspects.
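
A few hedged sketches of these calls, continuing with df_cleaned from above (the name of the threshold keyword is an assumption):

klib.corr_plot(df_cleaned, split='pos')                  # show only positive correlations
klib.corr_plot(df_cleaned, split='neg')                  # show only negative correlations
klib.corr_plot(df_cleaned, split='high', threshold=0.5)  # focus on strong correlations; keyword name assumed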

[Figure: Correlation plots showing high and low correlations]

At a glance, we can identify a number of interesting relations. Similarly, we can easily zoom in on correlations above any given threshold, let’s say |0.5|. Not only does this allow us to spot features which might be causing trouble later on in our analysis, it also shows us that there are quite a few highly negatively correlated features in our data. Given sufficient domain expertise, this can be a great starting point for some feature engineering!

[Figure: Plot of high absolute correlations]

Further, using the same function, we can take a look at the correlations between features and a chosen target. The target column can be supplied as a column name of the current DataFrame, as a separate pd.Series, a np.ndarray or simply as a list.
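
A minimal sketch of such a call; the target column used here is one of the renamed columns shown earlier and is purely illustrative:

klib.corr_plot(df_cleaned, target='yards_gained')  # correlations of every feature with the chosen target column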

[Figure: Plot of correlations with the target / label]

Just as before, it is possible to use a wide range of parameters for customization, such as removing annotations, changing the correlation method or changing the colormap to match your preferred style or corporate identity.

Categorical Data

In a last step in this guide, we take a quick look at the capabilities for visualizing categorical columns. The function klib.cat_plot() displays the most and/or least frequent values in each column. This gives us an idea of the distribution of values in the dataset, which is very helpful when considering whether to combine less frequent values into a separate category before applying one-hot encoding or similar functions. In this example we can see that for the column "play_type" roughly 75% of all entries are made up of the three most frequent values. Further, we can immediately see that "Pass" and "Run" are by far the most frequent values (75k and 55k). Conversely, the plot also shows us that "desc" is made up of 170384 unique strings.
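
A hedged sketch of the call; the keyword names for how many of the most and least frequent values to display are assumptions based on the description above:

klib.cat_plot(df_cleaned, top=4, bottom=4)  # top/bottom keyword names assumed; shows the 4 most and 4 least frequent values per column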

[Figure: Categorical data plot]

The klib package includes many more helpful functions for data analysis and cleaning, not to mention some customized sklearn pipelines, which you can easily stack together using a FeatureUnion and then use in GridSearchCV or similar. So if you intend to take a shortcut, simply call klib.data_cleaning() and plug the resulting DataFrame into that pipeline. You will likely already get a very decent result!

Conclusion

All of these functions make for very convenient data cleaning and visualization and come with many more features and settings than described here. They are by no means a one-size-fits-all solution, but they should be very helpful in your data preparation process. klib also includes various other functions, most notably pool_duplicate_subsets(), to pool subsets of the data across different features as a means of dimensionality reduction, dist_plot(), to visualize distributions of numerical features, as well as mv_col_handling(), which provides a sophisticated three-step process that attempts to identify any remaining information in columns with many missing values instead of simply dropping them right away.
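
Hedged sketches of these additional functions; the exact signatures are assumptions, so check help(klib.pool_duplicate_subsets) and friends before relying on them:

klib.dist_plot(df_cleaned)                            # distribution plots for the numerical features
df_pooled = klib.pool_duplicate_subsets(df_cleaned)   # pool duplicate subsets of features as a means of dimensionality reduction
df_reduced = klib.mv_col_handling(df_cleaned)         # try to salvage information from columns with many missing values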

Note: Please let me know what you would like to see next and which functions you feel are missing, either in the comments below or by opening an issue on GitHub. Also let me know if you would like to see some examples on the handling of missing values, subset pooling or the customized sklearn pipelines.

Translated from: https://towardsdatascience.com/speed-up-your-data-cleaning-and-preprocessing-with-klib-97191d320f80
