How to Prepare a Dataset for Machine Learning: Data Preparation Techniques and Their Importance

What is Data?

Data refers to examples or cases from the domain that characterize the problem you want to solve, and the choice of data depends on the objective you want to satisfy. Here are some commonly used websites where open-source data is available on almost every topic, so that you can build your own machine learning application and contribute to global success.

Kaggle — An organized platform where every learner will love to spend time. The datasets are remarkably rich and current, and with the help of kernels, you can process everything on the platform without even downloading the data.

UCI Machine Learning Repository — It maintains a large collection of diverse datasets as a service to the machine learning community.

Data.gov — You can download open data published by multiple government ministries and agencies. Data can range from government budgets to school performance scores.

CMU Libraries — High-quality datasets from various domains, an initiative by Carnegie Mellon University.

Google Dataset Search — This dataset search lets you find datasets wherever they’re hosted, whether it’s a publisher’s site, a digital library, or an author’s web page.

After collecting the data, the first task is to transform it to meet the requirements of individual machine learning algorithms. The most challenging part of any machine learning project is preparing the one thing that is unique to the project: the data used for modelling.

Data preparation is the transformation of raw data into a form more suitable for modelling, because “the quality of data is more important than using complicated algorithms”. To make the raw data more informative and self-explanatory, you need to perform the same types of data preparation tasks for almost any modelling problem.

In this article, we will walk you through how to apply data preparation techniques, using the Car Price Prediction dataset as an example.

The data preparation tasks are:

  • Data Cleaning
  • Feature Engineering
  • Data Transformation
  • Feature Extraction

Data Cleaning:

This is one of the hardest steps, as most real-world data may contain incorrect values: misleading observations, wrongly entered data, rows storing invalid values, and more. To clean the data and create a reliable dataset, you need domain expertise, which helps you identify and observe abnormalities within attributes.

Needless to say, data cleaning is a time-consuming process, and you will spend an enormous amount of time enhancing the quality of the data. However, there are various methods for performing data cleaning operations.

  1. Detecting the number of null values or missing rows
  2. Identifying duplicate rows within the data and removing them
  3. Using domain knowledge to detect outliers based on some statistical technique
  4. Identifying the distribution of columns and eliminating columns that have no variance

Here is a code snippet for cleaning the “car price prediction” dataset.
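The original snippet is not reproduced here; the sketch below illustrates the four cleaning steps on a small hypothetical car-price DataFrame (the column names `price`, `horsepower`, and `fuel_type` are illustrative, not taken from the actual dataset):

```python
import pandas as pd

# Hypothetical sample of a car price dataset (columns are illustrative)
df = pd.DataFrame({
    "price":      [13495, 16500, 16500, None, 17450],
    "horsepower": [111, 111, 111, 154, 115],
    "fuel_type":  ["gas", "gas", "gas", "gas", "gas"],
})

# 1. Detect null values per column
null_counts = df.isnull().sum()

# 2. Identify duplicate rows and remove them
df = df.drop_duplicates()

# 3. Flag outliers with the IQR rule (one common statistical technique)
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]

# 4. Eliminate columns with no variance (a single unique value)
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]
df = df.drop(columns=constant_cols)
```

In this toy sample, one duplicate row is dropped and `fuel_type` is removed because it holds a single constant value.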

Feature Engineering:

Feature engineering is a process of creating new input variables from the available data. In general, you can think of data cleaning as a process of subtraction and feature engineering as a process of addition.

Engineering new features is highly specific to your data and data types, and it is one of the most valuable tasks a data scientist can do to improve model performance, because it helps to isolate and highlight key information, which helps the algorithms “focus” on what’s important.

Some common feature engineering techniques are:

  1. Binning
  2. Log transformation
  3. Feature split
  4. Combining sparse classes

Feature engineering is quite similar to data transformation, but because it requires subject-matter expertise (i.e. understanding the domain), it is defined as a separate preparation technique.

Feature engineering steps performed on the dataset are:
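The original snippet is not shown; here is a minimal sketch of the four techniques listed above, using made-up car columns (`price`, `car_name`, `brand`) rather than the actual dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical car data (columns are illustrative)
df = pd.DataFrame({
    "price":    [5118, 13495, 16500, 35550],
    "car_name": ["toyota corolla", "toyota camry", "bmw x1", "porsche cayenne"],
    "brand":    ["toyota", "toyota", "bmw", "porsche"],
})

# 1. Binning: bucket a continuous variable into ordered categories
df["price_band"] = pd.cut(df["price"], bins=[0, 10000, 20000, np.inf],
                          labels=["budget", "mid", "luxury"])

# 2. Log transformation: compress a right-skewed distribution
df["log_price"] = np.log1p(df["price"])

# 3. Feature split: derive new columns from a composite string
df[["make", "model"]] = df["car_name"].str.split(" ", n=1, expand=True)

# 4. Combining sparse classes: group brands with few rows into "other"
counts = df["brand"].value_counts()
rare = list(counts[counts < 2].index)
df["brand_grouped"] = df["brand"].replace(rare, "other")
```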

Data Transformation:

It is rare that collected data can be used as-is to make predictions. The data you have may not be in the right format or may require transformations to make it more useful. Data transforms are used to change the distribution or representation of data variables.

Attributes in the data store information as numerical or categorical values, but a limitation of machine learning algorithms is that they can process only numerical values, not strings or characters. So you have to encode each categorical variable as integers by creating dummy variables or performing one-hot encoding.

After representing the categorical attributes numerically, your task is to verify the scale of all features and align them to one scale, because for a computer there is dramatically more resolution in the range 0–1 than in the broader range of the data type. Scaling the data can be achieved by applying normalization (rescaling each feature into the range 0 to 1) or standardization (rescaling each feature to zero mean and unit variance).

Data transformation steps performed on the data are:
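As a sketch of these transformations (again on hypothetical columns rather than the real dataset), one-hot encoding and the two scaling schemes look like this with pandas and scikit-learn:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical car data (columns are illustrative)
df = pd.DataFrame({
    "fuel_type":  ["gas", "diesel", "gas", "diesel"],
    "horsepower": [111, 154, 102, 115],
})

# Encode the categorical column as 0/1 dummy variables (one-hot encoding);
# drop_first avoids a redundant column
df = pd.get_dummies(df, columns=["fuel_type"], drop_first=True)

# Normalization: rescale the numeric feature into the range 0 to 1
normalized = MinMaxScaler().fit_transform(df[["horsepower"]])

# Standardization: rescale to zero mean and unit variance
standardized = StandardScaler().fit_transform(df[["horsepower"]])
```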

Please note that we have split the data into two sets: one for training the algorithm, and another for evaluation purposes. For further analysis, we will concentrate on the training data.
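A minimal sketch of such a split using scikit-learn's `train_test_split` (the toy values below are illustrative, not from the actual dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical features and target (price)
df = pd.DataFrame({
    "horsepower": [111, 154, 102, 115, 140, 98],
    "price":      [13495, 16500, 9995, 12500, 15250, 8995],
})
X, y = df[["horsepower"]], df["price"]

# Hold out 20% of the rows for evaluation; the rest trains the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```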

Feature Extraction:

Feature extraction is the process of selecting a subset of the existing feature list, or of reducing the dimensionality of the dataset by applying various dimensionality-reduction algorithms. The number of input features of a dataset is considered the dimensionality of the data.

The problem is that the more dimensions there are (i.e. the more input variables), the more likely the dataset represents a very sparse and unrepresentative sampling of that space. This is referred to as the curse of dimensionality.

This can be done in two major ways:

  • Feature selection: aims to rank the importance of the existing features in the dataset and discard the less important ones
  • Feature extraction: creates a projection of the data into a lower-dimensional space that still preserves the most important properties of the original data

The benefits of feature extraction/selection techniques are:

  1. Reduces overfitting/underfitting
  2. Reduces training time
  3. Improves accuracy
  4. Improves data visualization

Feature extraction technique performed on the data:
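The original snippet is not reproduced; the sketch below demonstrates both routes on synthetic regression data standing in for the car-price features, using `SelectKBest` for feature selection and `PCA` for feature extraction:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data standing in for the car-price features
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       random_state=0)

# Feature selection: keep the 3 features ranked most relevant to the target
X_selected = SelectKBest(f_regression, k=3).fit_transform(X, y)

# Feature extraction: project the data onto 3 principal components instead
X_projected = PCA(n_components=3).fit_transform(X)
```

Both paths reduce the 10 input columns to 3, but selection keeps original features while PCA builds new combined ones.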

After performing feature selection on the training dataset, you can train the ML model; here, Linear Regression is used without performing any advanced tuning of the algorithm. The resulting model is evaluated using a set of metrics; for a regression problem, performance is computed on the basis of Mean Squared Error and Mean Absolute Error.
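A sketch of this final step, training a Linear Regression model on synthetic data (not the actual car-price dataset) and evaluating it with MSE and MAE:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the prepared features
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the model on the training split only
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Evaluate on the held-out split
mse = mean_squared_error(y_test, pred)
mae = mean_absolute_error(y_test, pred)
```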

Conclusion:

These are some of the basic steps to consider while solving an ML problem, because while building an ML model, the most important and hardest part is cleaning and pre-processing the data. Ironically, applying the algorithm and predicting the output takes just a few lines of code, the easiest job in the whole build.

Let us know if you like the blog; please comment with any queries or suggestions, and follow us on LinkedIn and Instagram. Your love and support inspire us to post our learnings in a much better way!

Translated from: https://medium.com/swlh/data-preparation-techniques-and-its-importance-in-machine-learning-4a9df5d258c0
