Mercedes-Benz Greener Manufacturing

Author’s Note: This is the completion report for my appliedaicourse[1] Capstone Project. All work is original; feel free to use, expand upon, or disseminate it. [Numbers in brackets are citations to the sources listed in the References section.]

Bird’s-eye view of the project:

  • Intuition for the given business problem, and real-world use cases of this solution.
  • Use of ML/DL to solve the problem, and downloading/scraping/extracting data from the source.
  • Data description and improvements over the existing approaches.
  • Exploratory data analysis with observations, plots, and feature engineering, which consists of 10 steps as briefed below:

(a, b, c): Loading the data, converting categorical features to numerical ones, and missing-value analysis.

(d, e): Data visualization and analysis of the data.

(f): Reducing the dimensionality of the data substantially by detecting multicollinearity with the Variance Inflation Factor (VIF), which lowers model complexity and computational cost.

(g): Implementing the Gavish-Donoho method to find the optimal value of ‘k’, and plotting the singular-value curves to visualize the concept in practice.

(h): Finding the top important features with the RFECV and RFE methods.

(i): Adding new features using dimensionality-reduction techniques.

(j): Generating new features from the top features using two-way and three-way feature interactions.

  • Tuning various models to find the best hyperparameters, fitting models with the best hyperparameters, analyzing how well the feature engineering worked, and comparing the final results of all the models.

1. Explanation of the Business Problem:

Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include, for example, the passenger safety cell with crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Daimler’s Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of each and every unique car configuration before it hits the road, Daimler’s engineers have developed a robust testing system. But optimizing the speed of this testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach. As one of the world’s biggest manufacturers of premium cars, Daimler treats safety and efficiency as paramount on its production lines.

Daimler is challenging Kagglers to tackle the curse of dimensionality and reduce the time that cars spend on the test bench. Competitors will work with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing. Winning algorithms will contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.

The motivation behind the problem is that an accurate model will be able to reduce the total time spent testing vehicles by allowing cars with similar testing configurations to be run successively on different paths of the vehicle testing layout, as shown in the figure below.

[Figure: Vehicle Testing Layout]

Examples of custom features: 4WD, added air suspension, a head-up display, etc.

2. Use of ML/DL:

This problem is an example of a Machine-Learning / Deep-Learning regression task: predicting a continuous target variable (the duration of the test).

3. Source of Data:

Data is downloaded from this link, the Mercedes-Benz Greener Manufacturing Kaggle competition[2], and unzipped.

Thankfully this is not a big dataset, so it is added to Google Drive and unzipped directly in Google Colab as below.

But if the dataset is too big, it is better to use CurlWget (a Chrome extension) to import the data.
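A minimal sketch of this step, assuming the archive sits in Google Drive (the drive path and archive name below are placeholders, not the originals):

# Colab cell: mount Google Drive and unzip the competition archive.
# The path and file name are placeholders; adjust them to your own layout.
from google.colab import drive
drive.mount('/content/drive')

!unzip -o "/content/drive/My Drive/mercedes-benz-greener-manufacturing.zip" -d "/content/data"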

4. Data Description

This dataset contains an anonymized set of variables, each representing a custom feature in a Mercedes car. For example, a variable could be 4WD, added air suspension, or a head-up display.

The ground truth is labeled ‘y’ and represents the time (in seconds) that the car took to pass testing for each variable.

Remarkable improvements to the existing approaches:

  1. Detecting multicollinearity using the VIF (Variance Inflation Factor).
  2. Finding the optimal value of ‘k’ in TSVD, with reference[3] to the paper published by Gavish and Donoho[4], “The Optimal Hard Threshold for Singular Values is 4/√3”.

5. Exploratory Data Analysis (EDA) and Feature Engineering:

(a). Loading the dataset:

  • The dataset is loaded into a pandas DataFrame.
  • The train dataset has shape (4209, 378); the test dataset has shape (4209, 377).
  • Of these columns, 8 are categorical features, 1 is the ID, and 368 are binary.
  • The one extra column in the training dataset, named ‘y’, is the class label.

[Figure: pandas DataFrame loaded with the training dataset]
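A minimal sketch of the loading step (assuming the unzipped competition files are named train.csv and test.csv):

import pandas as pd

# Load the competition files into pandas DataFrames.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print(train.shape, test.shape)  # (4209, 378) (4209, 377)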

(b). Categorical-to-numerical feature conversion and aligning the train and test data frames:

  • Categorical features of the train and test data are converted into numerical ones using the pandas function get_dummies.
  • A column may have a different number of unique categories in the train and test data frames,
  • which is why the shapes of the two data frames differ after encoding.

output: {‘X0_aa’, ‘X0_ab’, ‘X0_ac’, ‘X0_q’, ‘X2_aa’, ‘X2_ar’, ‘X2_c’, ‘X2_l’, ‘X2_o’, ‘X5_u’, ‘y’}

  • In the code above, we identify the non-common features.

output: (4209, 554) (4209, 554)

  • We align the data frames by taking an inner join of the two, as sketched below.
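A minimal sketch of the encoding and alignment described above (variable names are illustrative):

import pandas as pd

# One-hot encode the categorical columns of both frames.
train_ohe = pd.get_dummies(train)
test_ohe = pd.get_dummies(test)

# Features present in only one of the frames (rare categories and 'y').
non_common = set(train_ohe.columns).symmetric_difference(test_ohe.columns)
print(non_common)

# Save the target, then keep only the shared columns via an inner join on the columns.
y_target = train_ohe['y']
train_aligned, test_aligned = train_ohe.align(test_ohe, join='inner', axis=1)
print(train_aligned.shape, test_aligned.shape)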

(c). Null/Missing value analysis:

  • If missing values are handled improperly, the results obtained will differ from those obtained when no values are missing.
  • Rows with missing data can be deleted, or filled using the data imputation techniques mentioned in this link.[5]
  • In multivariate analysis, if a case has a large number of missing values, it can be better to drop those cases (rather than impute and replace them).
  • On the other hand, in univariate analysis, imputation can decrease the amount of bias in the data if the values are missing at random.[6]
  • Our dataset, however, does not have any missing values.
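A quick check, sketched below, confirms this:

# Count missing entries; both sums are zero for this dataset.
print(train.isnull().sum().sum(), test.isnull().sum().sum())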

(d). Data Visualization:

First, let’s take only the class variable and plot it on the y-axis against its reset indices on the x-axis.

  • Because IDs are not continuous units.
  • Overplotting is one of the most common problems in DataViz. When a dataset is big, the dots of a scatterplot tend to overlap, hence we reduced the size of the dots to accommodate more of them per unit area.

[Figure: class label (y, time) plotted against reset index]

  • From the above diagram, we can see that the class label (y, time) looks like a line, apart from a small portion of points at the ends that are not on the line.
  • There is also only one point whose time is above 250, which is an outlier.
  • Because not all the class labels lie on a line, the R² metric will not take large values; it is very sensitive to outliers, since SS_res increases.

R² = 1 − SS_res / SS_tot

  • The best possible R² value is 1.0.

Plotting the PDF, CDF, and BoxPlot of the class variable:

[Figure: PDF of the class variable]
[Figure: CDF of the class variable]

From the PDF and CDF, we can see that:

  • almost all data points have a class variable below 140,
  • so the points with a class label above 140 can be considered outliers.

[Figure: box plot of the class variable]

  • The box plot drawn for the class label shows very nicely the distribution of the data via the five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”), and here we can safely treat the values larger than the “maximum” (upper whisker) as outliers.
  • The outlier data points are dropped, as sketched below.
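A minimal sketch of the plots and the outlier removal (variable names continue from the earlier sketches; the 140-second cut-off follows the observations above, and the plotting calls are illustrative):

import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of the target against its reset index, with small dots to limit
# overplotting; sorting makes the trend read as a line.
plt.scatter(range(len(y_target)), y_target.sort_values().values, s=2)
plt.xlabel('reset index'); plt.ylabel('y (test time in seconds)')
plt.show()

# PDF, CDF and box plot of the target.
sns.histplot(y_target, kde=True, stat='density'); plt.show()  # PDF
sns.ecdfplot(y_target); plt.show()                            # CDF
sns.boxplot(x=y_target); plt.show()                           # box plot

# Drop the outliers identified above (class label above 140).
mask = y_target <= 140
train_aligned = train_aligned[mask].reset_index(drop=True)
y_target = y_target[mask].reset_index(drop=True)
print(train_aligned.shape)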

(e). Dropping columns with a single unique value:

  • Features with a single unique value do not contribute any valuable information; instead, they increase the number of dimensions, hence we drop such features.
  • [‘X11’, ‘X93’, ‘X107’, ‘X233’, ‘X235’, ‘X268’, ‘X289’, ‘X290’, ‘X293’, ‘X297’, ‘X330’, ‘X347’]: these features contain only zeros, hence they are dropped (see the sketch below).
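A minimal sketch, continuing the variable names from the earlier sketches:

# Drop columns that take a single unique value (here, all zeros) from both frames.
constant_cols = [c for c in train_aligned.columns if train_aligned[c].nunique() == 1]
print(constant_cols)
train_aligned = train_aligned.drop(columns=constant_cols)
test_aligned = test_aligned.drop(columns=constant_cols)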

(f). Detecting Multicollinearity using the Variance Inflation Factor (VIF):

  • Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model.[7][8]
  • This means that one independent variable can be predicted from another independent variable in a regression model.
  • This is a problem in a regression model because we would not be able to distinguish the individual effects of the independent variables on the dependent variable.
  • Multicollinearity may not affect the accuracy of the model much, but we might lose reliability in determining the effects of individual features on the model, and that can be a problem for interpretability.
  • This is a bivariate analysis.
  • VIF measures the strength of the correlation between the independent variables. It is computed by taking a variable and regressing it against every other variable; in other words, the VIF score of an independent variable represents how well that variable is explained by the other independent variables.
  • VIF rule of thumb: 1 = no multicollinearity, 4–5 = moderate, 10 or greater = severe.

[Figure: VIF scores of the features]

  • Generally, a VIF value greater than 10 is considered severe, whereas in our dataset we even have features with an infinite VIF and features with three-digit VIF values.
  • We dropped all the features that have a VIF score of infinity, excluding the top_20_features (details of the top_20_features are explained in subsection (h) below). A minimal VIF computation is sketched below.
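A minimal sketch of the VIF computation with statsmodels (top_20_features is the list found later in subsection (h); frame names continue from the earlier sketches):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF of every feature: regress each column on all the others.
X_vif = train_aligned.astype(float)
vif = pd.DataFrame({
    'feature': X_vif.columns,
    'VIF': [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
})
print(vif.sort_values('VIF', ascending=False).head(20))

# Drop the features with an infinite VIF, keeping the top_20_features.
to_drop = [f for f, v in zip(vif['feature'], vif['VIF'])
           if np.isinf(v) and f not in top_20_features]
train_aligned = train_aligned.drop(columns=to_drop)
test_aligned = test_aligned.drop(columns=to_drop)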

(g). Finding the optimal value of ‘k’ in Truncated SVD with the Gavish-Donoho method[3][4][9]

Why the Gavish-Donoho method? What does it explain? What are the proven conclusions derived from this paper?

[Figure: Truncated SVD factorization of a matrix]

  • Truncated SVD is a matrix factorization technique that factors a matrix W into three matrices U, S, and Vᵀ. Typically it is used to find the principal components of a matrix.
  • Truncated SVD is different from the regular SVD. Given an n×n matrix, SVD will produce matrices with n columns, whereas Truncated SVD will produce matrices with a specified number of columns.
  • We need to truncate the SVD because we want the matrix with the optimal number of columns ‘k’ that accommodates the maximum information. For example, if we take the rank ‘k’ = n, the largest it can be, then the complete information is preserved (noise as well) and accuracy may or may not improve, but the complexity of the model will be high; if we take the rank ‘k’ very low, then information may be lost and the model may be less accurate, although it will not be too complex.
  • So we need to find the sweet spot, the optimal ‘k’ at which we keep most of the information in W without overfitting to noise or to small features we do not care about.
  • This can be done in many ways by analyzing the singular values and finding the elbow or knee, but those approaches do not work unless there is a sharp drop-off in the singular values. Hence, the Gavish-Donoho method is the best one to find the optimal rank ‘k’, given some assumptions on the data.
[Equation: X = X_true + γ · X_noise]

  • As written in the above equation, our data X can be written as the sum of the true low-rank signal X_true and the noise X_noise, which is assumed to be normally distributed with zero mean and unit variance (Gaussian noise); the noise can be large or small depending on the magnitude of gamma (γ).

[Figure: singular-value curves of the data matrix (green) and of a Gaussian noise matrix (orange)]

  • The orange curve corresponds to the Gaussian noise matrix, and the green curve corresponds to our actual high-dimensional data.
  • Gavish and Donoho realized that when the singular values from the SVD of high-dimensional data are plotted, the curve (the green one) follows the curve of the singular values from the SVD of the best-fit Gaussian noise matrix, and at some point it deviates, as shown in the figure above; that level is called the noise floor.
  • This noise floor separates the signal from the noise.
  • The first singular value that is larger than the biggest singular value of the noise matrix gives the threshold, and the values below it are truncated.
  • The application of this method is explained below for the two possible cases.

Case 1: X is a square matrix and gamma is known.

  • Truncate all the singular values below the threshold (tau), which in this case is

τ = (4/√3) · √n · γ

  • where n = dimension of the square matrix X, and gamma (γ) = the known amount of noise.

Case 2: X is a rectangular matrix and gamma is unknown.

  • In this case, all we have are measurements of the singular-value distribution.
  • Based on the median singular value and the aspect ratio of the rectangular matrix, we can infer the best-fit noise distribution, which gives the threshold

τ = ω(β) · y_med,  with  ω(β) ≈ 0.56β³ − 0.95β² + 1.82β + 1.43

  • where y_med is the median singular value of X and β = m/n (m ≤ n) is the aspect ratio of the matrix.

Conclusion:

  • We have data with structure plus noise; even if we do not know how much noise was added, we can estimate it from the median singular value, then infer the optimal tau (threshold) and truncate the singular values below tau to obtain the optimal rank ‘r’.

Code:

  • After applying all the preprocessing steps above, the data is stored in a pandas data frame named x_filtered.
  • From the code (sketched at the end of this subsection) we obtain the singular values of the data matrix and of a Gaussian noise matrix.

[Figure: singular-value curves with the horizontal threshold line at y = tau]

  • On plotting, we get the above curves and the horizontal line at y = tau.
  • Hence, we decided to take the value of ‘k’ as 2 for the truncation.
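A minimal sketch of this computation, assuming the preprocessed frame is named x_filtered as above; the median-based threshold uses the Gavish-Donoho approximation from [3][9]:

import numpy as np
import matplotlib.pyplot as plt

X_mat = x_filtered.values.astype(float)
m, n = X_mat.shape

# Singular values of the data matrix and of a same-sized Gaussian noise matrix.
s_data = np.linalg.svd(X_mat, compute_uv=False)
s_noise = np.linalg.svd(np.random.randn(m, n), compute_uv=False)

# Unknown-noise threshold: tau = omega(beta) * median singular value.
beta = min(m, n) / max(m, n)
omega = 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43
tau = omega * np.median(s_data)

plt.semilogy(s_data, label='data')
plt.semilogy(s_noise, label='Gaussian noise')
plt.axhline(tau, color='r', label='tau')
plt.legend(); plt.show()

k = int((s_data > tau).sum())  # number of singular values kept
print(tau, k)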

(h). Finding the top_20_features:

  • Here we find the important features using Recursive Feature Elimination.
  • sklearn’s RFECV automatically selects an optimal number of important features, while RFE returns the top n features we ask for.

output: Index([‘X314’], dtype=’object’)

  • RFECV using a RandomForestRegressor with the best parameters obtained by tuning it on the dataset.

output: Index([‘X29’, ‘X314’, ‘X315’], dtype=’object’)

  • RFECV using the default XGBRegressor.

output: Index([‘X314’], dtype=’object’)

  • RFECV using a DecisionTreeRegressor with the best max_depth, found by tuning the model on the dataset.
  • From the outputs of the three cells above, we learn that X314, X315, and X29 are the most important features, and that X314 is more important than X315 and X29.
  • Using Recursive Feature Elimination we will find the top 20 important features and perform bivariate analysis on them.

output: Index([‘ID’, ‘X29’, ‘X48’, ‘X54’, ‘X64’, ‘X76’, ‘X118’, ‘X119’, ‘X127’, ‘X136’, ‘X189’, ‘X232’, ‘X263’, ‘X279’, ‘X311’, ‘X314’, ‘X315’, ‘X1_aa’, ‘X6_g’, ‘X6_j’], dtype=’object’)

  • RFE using a RandomForestRegressor to output the top_20_features (a minimal sketch follows below).
  • This set of top_20_features is a superset of the important features obtained by RFECV.
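A minimal sketch of the selection step with sklearn (the estimator hyperparameters are placeholders, not the tuned values; train_aligned and y_target continue from the earlier sketches):

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, RFECV

X_feat = train_aligned  # preprocessed training features
estimator = RandomForestRegressor(n_estimators=100, random_state=0)

# RFECV picks the number of features automatically via cross-validation.
rfecv = RFECV(estimator, step=1, cv=5, scoring='r2')
rfecv.fit(X_feat, y_target)
print(X_feat.columns[rfecv.support_])

# RFE returns exactly the 20 features we ask for.
rfe = RFE(estimator, n_features_to_select=20)
rfe.fit(X_feat, y_target)
top_20_features = list(X_feat.columns[rfe.support_])
print(top_20_features)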

(i). Adding new features using dimensionality-reduction techniques:

TSVD:

  • As found with the Gavish-Donoho method, we use 2 components of Truncated SVD.

output: (4194, 2)

  • We also generate 2 features each with PCA and ICA, the other feature-reduction techniques available in sklearn.decomposition, and see whether they are useful or not.

PCA:

output: (4194, 2)

ICA:

output: (4194, 2)

  • All the new features generated through the dimensionality-reduction techniques are added to the data frames (see the sketch below).

output: (4194, 127) (4209, 127)
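A minimal sketch of generating and appending the decomposition features (n_components follows the text; random_state is illustrative, and frame names continue from the earlier sketches):

from sklearn.decomposition import TruncatedSVD, PCA, FastICA

n_comp = 2
base_train = train_aligned.copy()
base_test = test_aligned.copy()

for name, model in [('tsvd', TruncatedSVD(n_components=n_comp, random_state=42)),
                    ('pca', PCA(n_components=n_comp, random_state=42)),
                    ('ica', FastICA(n_components=n_comp, random_state=42))]:
    train_comp = model.fit_transform(base_train)  # shape (n_rows, 2)
    test_comp = model.transform(base_test)
    for i in range(n_comp):
        train_aligned[f'{name}_{i + 1}'] = train_comp[:, i]
        test_aligned[f'{name}_{i + 1}'] = test_comp[:, i]

print(train_aligned.shape, test_aligned.shape)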

(j). Generating new features using two-way and three-way feature interactions of the top_20_important features
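The code for this step is in the GitHub repository; a minimal sketch of the idea uses sums of the (mostly binary) top features. The pair naming matches the ‘X314+X315’ feature that shows up in the importance plots below; the example triple is illustrative.

from itertools import combinations

# Two-way interactions of the top_20_features, encoded as sums.
for a, b in combinations(top_20_features, 2):
    train_aligned[f'{a}+{b}'] = train_aligned[a] + train_aligned[b]
    test_aligned[f'{a}+{b}'] = test_aligned[a] + test_aligned[b]

# Three-way interactions are built the same way, e.g.:
train_aligned['X314+X315+X29'] = train_aligned['X314'] + train_aligned['X315'] + train_aligned['X29']
test_aligned['X314+X315+X29'] = test_aligned['X314'] + test_aligned['X315'] + test_aligned['X29']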

6. Modeling:

(a). RandomForestRegressor.

  • Let’s perform hyperparameter tuning:
  • initialize all the parameters,
  • fit the RandomizedSearchCV model,

output: {‘bootstrap’: True, ‘max_depth’: 70, ‘max_features’: ‘auto’, ‘min_samples_leaf’: 40, ‘min_samples_split’: 110, ‘n_estimators’: 500}

  • print the best parameters,
  • initialize a model with the best hyperparameters and fit it to the dataset,
  • and plot bar charts of the relative importance of this model’s features in predicting the class label.

[Figure: RandomForestRegressor feature importances]

  • As we can see, the feature ‘X314+X315’, generated by the two-way feature interaction, played an important role in predicting the class label.
  • For the other models below, the same procedure is carried out as for the RandomForestRegressor; check out the code in my GitHub. A minimal sketch of the tuning procedure follows below.
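A minimal sketch of the tuning procedure (the parameter grid is illustrative, not the original one; X_feat and y_target continue from the earlier sketches):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    'n_estimators': [100, 300, 500, 800],
    'max_depth': [10, 30, 50, 70, None],
    'max_features': ['sqrt', 'log2', None],
    'min_samples_leaf': [1, 10, 40, 100],
    'min_samples_split': [2, 10, 50, 110],
    'bootstrap': [True, False],
}

# Randomized search over the grid, scored by R^2 with 3-fold cross-validation.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_grid,
    n_iter=50, cv=3, scoring='r2', n_jobs=-1, random_state=0,
)
search.fit(X_feat, y_target)
print(search.best_params_)

# Refit with the best hyperparameters and plot the feature importances.
best_rf = RandomForestRegressor(**search.best_params_, random_state=0).fit(X_feat, y_target)
pd.Series(best_rf.feature_importances_, index=X_feat.columns).nlargest(15).plot(kind='barh')
plt.show()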

(b). XGBoostRegressor.

  • The code is similar for all the models, hence it is not presented in this blog; check out my GitHub repository for the code.

[Figure: XGBoostRegressor feature importances]

  • In this model too, the ‘X314+X315’ feature has the highest relative importance compared to the other features, but its relative importance is lower than in the RandomForestRegressor.

(c). DecisionTreeRegressor.

[Figure: DecisionTreeRegressor feature importances]

  • In this model too, the ‘X314+X315’ feature has the highest relative importance compared to the other features.

The results of all the models are summarized in the table below; for the code, refer to GitHub.

7. Comparison of all the models:

[Table: comparison of the final results of all the models]

  • Out of all the models, the RandomForestRegressor got the highest public score.

[Figure: screenshot of the Kaggle submission]

8. Future Work

  • New important features should be generated to improve the performance of the model.
  • Deep Learning models should be applied and tuned to improve results.

9. References

  1. https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course
  2. https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/data
  3. https://arxiv.org/pdf/1305.5870.pdf
  4. https://ieeexplore.ieee.org/document/6846297
  5. https://medium.com/r/?url=https%3A%2F%2Ftowardsdatascience.com%2F6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779
  6. https://medium.com/r/?url=https%3A%2F%2Fweb.stanford.edu%2Fclass%2Fstats202%2Fcontent%2Flec25-cond.pdf
  7. https://medium.com/r/?url=https%3A%2F%2Fwww.sigmamagic.com%2Fblogs%2Fwhat-is-variance-inflation-factor%2F%23%3A~%3Atext%3DIf%2520there%2520is%2520perfect%2520correlation%2Cto%2520the%2520presence%2520of%2520multicollinearity
  8. https://medium.com/r/?url=https%3A%2F%2Fwww.analyticsvidhya.com%2Fblog%2F2020%2F03%2Fwhat-is-multicollinearity%2F
  9. https://medium.com/r/?url=http%3A%2F%2Fwww.pyrunner.com%2Fweblog%2F2016%2F08%2F01%2Foptimal-svht%2F

10. GitHub and LinkedIn

Originally published at: https://medium.com/@nvinay65/mercedes-benz-greener-manufacturing-2181015ee378
