How I Consistently Improve My Machine Learning Models From 80% to Over 90% Accuracy

Introduction

If you’ve completed a few data science projects of your own, you’ve probably realized by now that achieving an accuracy of 80% isn’t too bad! But in the real world, 80% won’t cut it. In fact, most companies that I’ve worked for expect a minimum accuracy (or whatever metric they’re looking at) of at least 90%.

Therefore, I’m going to talk about 5 things that you can do to significantly improve your accuracy. I highly recommend that you read all five points thoroughly because there are a lot of details that I’ve included that most beginners don’t know.

By the end of this, you should understand that many more variables than you might think play a role in dictating how well your machine learning model performs.

With that said, here are 5 things that you can do to improve your machine learning models!

1. Handling Missing Values

One of the biggest mistakes I see is how people handle missing values, and it’s not necessarily their fault. A lot of material on the web says that you typically handle missing values through mean imputation (replacing null values with the mean of the given feature), but this usually isn’t the best method.

For example, imagine we have a table showing age and fitness score, and imagine that an eighty-year-old has a missing fitness score. If we took the average fitness score across an age range of 15 to 80, then the eighty-year-old will appear to have a much higher fitness score than he actually should.

Therefore, the first question you want to ask yourself is why the data is missing to begin with.

Next, consider other methods in handling missing data, aside from mean/median imputation:

  • Feature Prediction Modeling: Referring back to my example regarding age and fitness scores, we can model the relationship between age and fitness scores and then use the model to find the expected fitness score for a given age. This can be done via several techniques including regression, ANOVA, and more.

  • K Nearest Neighbour Imputation: Using KNN imputation, the missing data is filled with a value from another similar sample, and for those who don’t know, the similarity in KNN is determined using a distance function (e.g. Euclidean distance).

  • Deleting the row: Lastly, you can delete the row. This is not usually recommended, but it is acceptable when you have an immense amount of data to start with.

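To make this concrete, here’s a quick sketch of mean imputation versus KNN imputation using scikit-learn’s KNNImputer. The age and fitness-score numbers are made up purely for illustration:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy age / fitness-score data; the 80-year-old's score is missing
X = np.array([
    [15, 90.0],
    [20, 85.0],
    [35, 70.0],
    [60, 55.0],
    [80, np.nan],  # missing fitness score
])

# Naive mean imputation would fill in the column mean,
# which is far too high for an 80-year-old
mean_fill = np.nanmean(X[:, 1])
print(mean_fill)  # 75.0

# KNN imputation instead borrows from the most similar rows (by age),
# so the fill value comes from the two oldest people in the table
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[-1, 1])  # much lower than the naive mean
```

Notice how the KNN-imputed value reflects the downward trend of fitness with age, whereas the column mean ignores it entirely.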

2. Feature Engineering

The second way you can significantly improve your machine learning model is through feature engineering. Feature engineering is the process of transforming raw data into features that better represent the underlying problem that one is trying to solve. There’s no specific way to go about this step, which is what makes data science as much of an art as it is a science. That being said, here are some things that you can consider:

  • Converting a DateTime variable to extract just the day of the week, the month of the year, etc…

  • Creating bins or buckets for a variable. (eg. for a height variable, can have 100–149cm, 150–199cm, 200–249cm, etc.)

  • Combining multiple features and/or values to create a new one. For example, one of the most accurate models for the Titanic challenge engineered a new variable called “Is_women_or_child”, which was True if the person was a woman or a child and False otherwise.

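All three of the ideas above can be sketched in a few lines of pandas. The column names and values here are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-04", "2021-06-15", "2021-12-25"]),
    "height_cm": [148, 172, 201],
    "sex": ["female", "male", "female"],
    "age": [25, 8, 40],
})

# 1. Extract parts of a DateTime variable
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["month"] = df["timestamp"].dt.month

# 2. Bin a continuous variable into buckets
df["height_bucket"] = pd.cut(
    df["height_cm"],
    bins=[100, 150, 200, 250],
    labels=["100-149cm", "150-199cm", "200-249cm"],
)

# 3. Combine multiple features into a new one (Titanic-style example)
df["is_woman_or_child"] = (df["sex"] == "female") | (df["age"] < 16)

print(df[["day_of_week", "month", "height_bucket", "is_woman_or_child"]])
```

Each of these new columns can then be fed to a model just like any original feature.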

3. Feature Selection

The third area where you can vastly improve the accuracy of your model is feature selection, which is choosing the most relevant/valuable features of your dataset. Too many features can cause your algorithm to overfit, and too few features can cause your algorithm to underfit.

There are two main methods that I like to use to help with selecting your features:

  • Feature importance: some algorithms like random forests or XGBoost allow you to determine which features were the most “important” in predicting the target variable’s value. By quickly creating one of these models and computing its feature importances, you’ll get an understanding of which variables are more useful than others.

  • Dimensionality reduction: One of the most common dimensionality reduction techniques, Principal Component Analysis (PCA) takes a large number of features and uses linear algebra to reduce them to fewer features.

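A quick sketch of both ideas with scikit-learn, using the built-in breast cancer dataset (30 features) as a stand-in for your own data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target  # 569 samples, 30 features

# 1. Feature importance: quickly fit a random forest and rank the features
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranked = sorted(
    zip(data.feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[:5])  # the five most "important" features

# 2. Dimensionality reduction: PCA compresses the 30 features into 5
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # far fewer columns to feed the final model
```

From here you might keep only the top-ranked features, or train on the PCA components instead of the raw columns.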

4. Ensemble Learning Algorithms

One of the easiest ways to improve your machine learning model is to simply choose a better machine learning algorithm. If you don’t already know what ensemble learning algorithms are, now is the time to learn them!

Ensemble learning is a method where multiple learning algorithms are used in conjunction. The purpose of doing so is that it allows you to achieve higher predictive performance than if you were to use an individual algorithm by itself.

Popular ensemble learning algorithms include random forests, XGBoost, gradient boosting, and AdaBoost. To explain why ensemble learning algorithms are so powerful, I’ll give an example with random forests:

Random forests involve creating multiple decision trees using bootstrapped datasets of the original data. The model then selects the mode (the majority) of all of the predictions of each decision tree. What’s the point of this? By relying on a “majority wins” model, it reduces the risk of error from an individual tree.

[Figure: four decision trees making predictions on the same sample; the third tree predicts 0 while the other three predict 1, so the majority vote is 1]

For example, if we had relied on a single decision tree, say the third one, it would have predicted 0. But by taking the mode of all 4 decision trees, the predicted value is 1. This is the power of ensemble learning!

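You can see the individual trees disagreeing inside a real forest with a quick scikit-learn sketch (synthetic data for illustration; note that scikit-learn’s RandomForestClassifier actually averages the trees’ class probabilities rather than taking a strict hard vote, but the effect is the same majority-style aggregation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# A tiny forest of 4 trees, each trained on a bootstrapped sample
forest = RandomForestClassifier(n_estimators=4, random_state=0).fit(X, y)

sample = X[:1]
tree_preds = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
forest_pred = int(forest.predict(sample)[0])

print(tree_preds)   # the individual trees may disagree with each other
print(forest_pred)  # the forest aggregates them into one prediction
```

In practice you’d use many more trees (the default is 100), which makes the aggregate prediction far more stable than any single tree’s.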

5. Adjusting Hyperparameters

Lastly, something that is not often talked about, but is still very important, is adjusting the hyperparameters of your model. This is where it’s essential that you clearly understand the ML model that you’re working with; otherwise, it can be difficult to understand what each of the hyperparameters is.

Take a look at all of the hyperparameters for Random Forests:

class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

For example, it would probably be a good idea to understand what min_impurity_decrease is, so that when you want your machine learning model to be more forgiving, you can adjust this parameter! ;)

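A common way to search over a few of these hyperparameters is a grid search with cross-validation. Here’s a small sketch using scikit-learn’s GridSearchCV on the built-in breast cancer dataset; the particular grid values are arbitrary choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# A small grid over a few of the hyperparameters shown above
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_impurity_decrease": [0.0, 0.01],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)            # the best combination found
print(round(search.best_score_, 3))   # its mean cross-validated accuracy
```

For larger grids, RandomizedSearchCV trades exhaustiveness for speed by sampling combinations instead of trying them all.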

Thanks for Reading!

By reading this article, you should now have a few more ideas when it comes to improving the accuracy of your model from 80% to 90%+. This information will also make your future data science projects go much more smoothly. I wish you the best of luck in your data science endeavors.

Terence Shin

Original article: https://towardsdatascience.com/how-i-consistently-improve-my-machine-learning-models-from-80-to-over-90-accuracy-6097063e1c9a
