Comprehensive Guide to Machine Learning, Part 1 of 3

In this comprehensive guide, I’ll explore the different gears and pinions that make a machine learning model tick. If you google the definition of “machine learning”, most sources will show the statement below:

“Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed”

I like to think of machine learning as raising a newborn baby. At first it is innocent of the ways of the world: it doesn’t know what to do and needs help at every step. But as it slowly acquires data and knowledge, it evolves and learns to make decisions of its own.

In order to give machines “the ability to learn”, we need to jump through several hoops, or, as I like to call them, the “Ten Commandments of Machine Learning”. These are listed below:

  • Acquiring data
  • Data Cleansing
  • Exploratory Data Analysis
  • Feature Engineering
  • Feature Selection
  • Train/Validation/Test split
  • Baseline model building
  • Hyper-parameters Tuning
  • Model validation
  • Making Predictions

Now let’s walk through these commandments one by one; I hope that by the end you’ll have acquired enough knowledge to build a machine learning model of your own.

1) Acquiring data

There are plenty of free and open-source datasets widely available on Kaggle. I’d strongly recommend exploring the link below to browse the wide variety of machine learning datasets:

On the other hand, if you’d like to create your own dataset for building a machine learning model, you can perform web-scraping on multiple websites and data sources. That is out of scope for this post, but you can check out the link below to explore web-scraping further.

For this post, I’ll be using the “Pet Adoption” dataset hosted on HackerEarth. You can check it out at the link below:

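As a quick sketch, loading the training data with Pandas takes only a couple of lines. The filename and column names below are assumptions (a tiny inline sample stands in for the real file); adjust them to match your actual download:

```python
import io

import pandas as pd

# A tiny inline sample standing in for the real training file
# (column names are assumptions based on the fields discussed in this post).
sample_csv = io.StringIO(
    "pet_id,condition,length,height,pet_size\n"
    "p1,1.0,0.80,7.78,Medium\n"
    "p2,,0.72,14.19,Small\n"
    "p3,2.0,0.15,40.90,Large\n"
)

# In practice this would be: train = pd.read_csv("train.csv")
train = pd.read_csv(sample_csv)
print(train.head())  # top records of the training data
```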

2) Data Cleansing

Let’s start by exploring the training dataset. The image below shows the top five records from the training data.

[Image: top five records of the training data]

We can also get a verbose description of the dataset using built-in Pandas functionality, as shown below.

[Image: verbose description of the dataset from Pandas]
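The verbose description comes from Pandas’ built-in “info” method. A minimal sketch, with a toy DataFrame standing in for the real 18,834-row training set:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; in the real set the "condition"
# column has 17357 non-null entries out of 18834 rows.
train = pd.DataFrame({
    "condition": [1.0, np.nan, 2.0, 0.0, np.nan],
    "length": [0.50, 0.72, 0.15, 0.62, 0.80],
})

train.info()                 # dtypes and non-null counts per column
print(train.isnull().sum())  # number of null records per column
```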

From the image we can see that the “condition” field has only 17357 “non-null” entries, which means there are 1477 “null” records in this field. Most machine learning models cannot deal with null records, so we need to handle such entries manually before feeding the data into the model.

Let’s start with a deep dive into the values in the “condition” column. The image below shows the Pandas command and its result.

[Image: value counts of the “condition” column]
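The command in question is Pandas’ “value_counts”; passing dropna=False makes the null entries show up as a row of their own (toy data below):

```python
import numpy as np
import pandas as pd

# Toy "condition" column with the same three categories plus some nulls
condition = pd.Series([0.0, 1.0, 1.0, 2.0, np.nan, 2.0, np.nan])

# dropna=False includes the NaN entries as their own row in the output
print(condition.value_counts(dropna=False))
```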

As we can see, there are 3 categories in the “condition” field. So for the null records we can create a 4th category with the value “3.0”, to keep them distinct from the other records. We can use the Pandas “fillna” functionality to replace the null entries.

[Image: replacing the null entries with “fillna”]
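A minimal sketch of the replacement step (again with toy data):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"condition": [0.0, 1.0, np.nan, 2.0, np.nan]})

# Replace nulls with a new fourth category, 3.0, distinct from 0.0/1.0/2.0
train["condition"] = train["condition"].fillna(3.0)

print(train["condition"].value_counts())
```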

You can check out the post below for a further understanding of data-cleansing techniques.

3) Exploratory Data Analysis

This is one of the most crucial steps in developing a machine learning model. In this step, we will take a deep dive into the training data and get accustomed to the different data relations and dependencies.

The Matplotlib and Seaborn Python libraries are our best friends here for getting insight into the training data. I’ll show a few of the basic EDA steps in this post.

  • Continuous Variables Analysis

For continuous numerical fields, we can get the distribution of the data using the “distplot” functionality of the Seaborn library.

[Images: distribution plots of the “length” and “height” fields]

As we can see from the above images, the data in the “length” and “height” fields is more or less uniformly distributed.

  • Categorical Variables Analysis

Typically, categorical data attributes represent discrete values which belong to a specific finite set of categories or classes. These discrete values can be text or numeric in nature. Examples: movie, music and video game genres, country names, food and cuisine types, etc.

We can use “countplots” to understand the data distribution of categorical fields. This functionality is readily available in the Seaborn library.

In the images below, we can see the data distribution of the “pet_size” and “condition” fields.

[Images: count plots of the “pet_size” and “condition” fields]

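A sketch of a count plot, using a toy “pet_size” column (the real data of course has many more rows):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

train = pd.DataFrame(
    {"pet_size": ["Small", "Medium", "Medium", "Large", "Medium", "Small"]}
)

sns.countplot(x="pet_size", data=train)  # one bar per category
plt.savefig("pet_size_counts.png")
```
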
  • Outliers Analysis

An outlier is an observation that deviates significantly from the rest of the data. Outliers can have many causes, such as:

  1. Measurement or input error
  2. Data corruption
  3. True outlier observation (e.g. Michael Jordan in basketball)

We can use “boxplots” to detect outliers in data. You can refer to the link below to understand more about boxplots and their functionality.

As we can see from the image below, the dots towards the right of the plot indicate outliers in the data.

[Image: boxplot with outlier dots to the right]
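A boxplot like this can be drawn with Seaborn’s “boxplot”. In the sketch below, two artificial outliers are injected into synthetic data so that dots appear beyond the whisker:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
heights = np.append(rng.normal(10.0, 2.0, 200), [40.0, 55.0])  # injected outliers
train = pd.DataFrame({"height": heights})

sns.boxplot(x=train["height"])  # outliers show up as isolated dots
plt.savefig("height_box.png")
```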

There are several methods to detect and treat outliers, which are explained in great detail in the post below. I’d recommend going through it to get a better grasp of outlier handling.

For our purposes, we can use the “cube-root” method of the Numpy library to reduce the number of outliers, as shown in the image below.

[Image: cube-root transform applied to reduce outliers]

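The transform itself is a one-liner with Numpy’s “cbrt”. It compresses large values much more than small ones, pulling extreme observations back toward the bulk of the data (toy numbers below):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"height": [8.0, 27.0, 64.0, 1000.0]})  # 1000 is an outlier

# Cube-root transform: maps 8 -> 2, 27 -> 3, 64 -> 4, and 1000 -> 10,
# shrinking the gap between the outlier and the rest of the data
train["height"] = np.cbrt(train["height"])
print(train["height"].tolist())
```
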
  • Data Correlation Analysis

In layman’s terms, correlation is a measure of how strongly one variable depends on another.

Consider a hypothetical dataset containing information about IT professionals. We might expect a strong relationship between age and salary, since senior project managers will tend to be paid better than young pup engineers. On the other hand, there is probably a very weak, if any, relationship between shoe size and salary.

Correlations can be positive or negative. Our age and salary example is a case of positive correlation. Individuals with a higher age would also tend to have a higher salary. An example of negative correlation might be age compared to outstanding student loan debt: typically older people will have more of their student loans paid off.

We can use the “heatmap” functionality of the Seaborn library to display the correlation of the different input features in the training data.

[Image: correlation heatmap of the input features]

As we can see in the image, there is a high “positive” correlation between the “X1” and “X2” fields. There is also a high “negative” correlation between the “condition” and “breed category” fields.
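A heatmap of the feature correlations can be sketched as follows. Here the synthetic “X2” is deliberately constructed from “X1”, so the two show a strong positive correlation; “condition” is unrelated noise:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
train = pd.DataFrame({
    "X1": x1,
    "X2": x1 + rng.normal(scale=0.1, size=300),  # nearly a copy of X1
    "condition": rng.normal(size=300),           # unrelated noise
})

corr = train.corr()  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.savefig("corr_heatmap.png")
```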

I’d recommend going through the post below for a better understanding of correlation analysis and its impact on machine learning models.

Concluding Remarks

This concludes the first part of the comprehensive machine learning guide. In the next post, I’ll cover feature engineering tips and techniques, feature selection methods, and the train/validation/test dataset split.

You can find the codebase for this post at the link below. I’d highly recommend getting your own dataset (either from Kaggle or via web-scraping) and trying out the different data-cleansing and EDA methods detailed in this post.

Please visit my blog (link below) to explore more on Machine Learning and Linux Computing.

Translated from: https://medium.com/analytics-vidhya/comprehensive-guide-to-machine-learning-part-1-of-3-bbe058222278
