by Parminder Singh
What I’ve learned from competing in machine learning contests on Kaggle
Recently I decided to get more serious about my data science skills. I wanted to practice them, which led me to Kaggle.
The experience has been very positive.
When I arrived at Kaggle, I was confused about what to do and how everything works. This article will help you to overcome the confusion that I experienced.
I joined the Redefining Cancer Treatment contest because it was for a noble cause. Also, the data was more manageable because it was text-based.
Where to code
What makes Kaggle great is that you don’t need a cloud server that creates results for you. Kaggle has a feature where you can run scripts and notebooks inside Kaggle for free, as long as they finish executing within an hour. I used Kaggle’s notebooks for many of my submissions, and experimented with many variables.
Overall it was a great experience.
For the contests, you need to use images or have a large corpus of text. And you will need a fast personal computer (PC) or a cloud container. My PC is crappy, so I used Amazon Web Services’ (AWS) c4.2xlarge instance. It was powerful enough for the text data and cost only $0.40 per hour. I also had a free $150 credit from the GitHub student developer pack, so I didn’t need to worry about the cost.
Later when I took part in the Dog Breed Identification playground contest, I worked a lot with images, so I had to upgrade my instance to g2.2xlarge. It cost $0.65 per hour, but it had graphics processing unit (GPU) power, so it could process thousands of images in just a few minutes.
The instance g2.2xlarge was still not large enough to hold all of the data I worked with, so I cached the intermediate data as files and deleted the data from RAM. I did this by using del <variable name> to avoid ResourceExhaustionError or MemoryError. Both were equally disheartening.
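In practice the pattern looked roughly like this. It is only a minimal sketch with a made-up features array, not my actual contest code:

```python
import gc
import numpy as np

# Hypothetical intermediate result that is too big to keep in RAM
features = np.random.rand(10_000, 1_000)

# Cache it to disk so later steps (or a fresh run) can skip this stage
np.save("features.npy", features)

# Free the memory before moving on, like the del <variable name> trick above
del features
gc.collect()

# Reload only when the next step actually needs it
features = np.load("features.npy")
```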
How to get started with Kaggle competitions
It’s not as scary as it sounds. The Discussion and Kernel tabs for every contest are a marvellous way to get started. A few days after the start of a contest, you will see several starter kernels appear in the Kernel tab. You can use these to get started.
The starter kernels handle loading the data and creating the submission, so you can focus on manipulating the data. I prefer the XGBoost starter kernels. Their code is always short and ranks high on the leaderboards.
Extreme Gradient Boosting (XGBoost) is based on the decision tree model. It is very fast and amazingly accurate, even with default parameters. For large data I prefer to use Light Gradient Boosting Machine (LightGBM). It is similar in concept to XGBoost, but approaches the problem a bit differently. There is a catch: it is not as accurate. So you can experiment using LightGBM, and when you know it is working well, switch to XGBoost (they have a similar API).
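To show how similar the two APIs are, here is a rough sketch on toy data (not contest code). The commented-out line is the only change needed to switch from LightGBM to XGBoost:

```python
from sklearn.datasets import make_classification
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Toy data standing in for the contest features
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Both libraries expose a very similar scikit-learn style API
model = LGBMClassifier(n_estimators=200, learning_rate=0.05)   # fast, good for experimenting
# model = XGBClassifier(n_estimators=200, learning_rate=0.05)  # switch here for the final runs

model.fit(X, y)
probabilities = model.predict_proba(X)
```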
Check the discussions every few days to see if someone has found a new approach. If someone does, use it in your script and test to see if you benefit from it.
How to go up in the leaderboard
So you have your starter code cooked and want to rise higher? There are many possible approaches:
Cross validation (CV): Always split the training data into 80% and 20%. That way when you train on 80% of the data, you can manually cross-check with the other 20% to see if you have a good model. To quote the discussion board on Kaggle, “Always trust your CV more than the leaderboard.” The public leaderboard is scored on only 50% to 70% of the actual test set, so you cannot be sure about the quality of your solution from it alone. Sometimes your model might be great overall, but bad on the particular data in the public test set.
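Here is a minimal sketch of that 80/20 cross-check, using toy data and LightGBM as a stand-in model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from lightgbm import LGBMClassifier

# Stand-in for the real training features and labels
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 80% for training, 20% held back for your own cross-check
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LGBMClassifier(n_estimators=200)
model.fit(X_train, y_train)

# Score the held-out 20% yourself instead of trusting the public leaderboard
print("validation log loss:", log_loss(y_val, model.predict_proba(X_val)))
```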
Cache your intermediate data: You will do less work next time by doing this. Focus on a specific step rather than running everything from the start. Almost all Python objects can be pickled, but for efficiency, always use the .save() and .load() functions of the library you are using for your code.
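As an illustration, here is a small sketch of both routes, using pickle and XGBoost’s native save_model/load_model (available in recent xgboost versions); the exact calls depend on the library you are using:

```python
import pickle
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Toy model standing in for a cached intermediate result
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = XGBClassifier(n_estimators=50).fit(X, y)

# Generic route: almost any Python object can be pickled
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Library-native route, usually faster and more portable
model.save_model("model.json")
restored = XGBClassifier()
restored.load_model("model.json")
```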
Use GridSearchCV: It is a great module that allows you to provide a set of variable values. It will try all possible combinations until it finds the optimal set of values. This is great automation for optimization. A finely tuned XGBoost can beat a generic neural network in many problems.
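A short sketch of what that looks like with XGBoost on toy data (the parameter grid is just an example, not a recommended set of values):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Grid of candidate values; GridSearchCV tries every combination
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 300],
}

search = GridSearchCV(XGBClassifier(), param_grid, cv=3, scoring="neg_log_loss")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```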
Use the model appropriate to the problem: Using a knife in a gunfight is not a good idea. I have a simple approach: for text data, use XGBoost or a Keras LSTM. For image data, use a pre-trained Keras model (I use Inception most of the time) with some custom bottleneck layers.
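Here is roughly what that pattern looks like in Keras. The layer sizes and the 120-class output (the number of dog breeds) are illustrative, not my exact contest architecture:

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Pre-trained convolutional base without its original classification head
base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False  # keep the ImageNet weights frozen

# Custom bottleneck layers on top, sized for the contest (120 dog breeds here)
x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation="relu")(x)
x = Dropout(0.5)(x)
outputs = Dense(120, activation="softmax")(x)

model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```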
Combine models: Using a kitchen knife for everything is not enough. You need a Swiss army knife. Try combining various models to get even more accurate predictions. For example, Inception plus the Xception model work great for image data. Combined models take a lot of RAM, which g2.2xlarge might not provide. So avoid them unless you really want that accuracy boost.
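One way to combine them is to concatenate the features from both pre-trained backbones and classify on top, as in this sketch (illustrative only, and exactly the kind of model that eats RAM):

```python
from tensorflow.keras.applications import InceptionV3, Xception
from tensorflow.keras.layers import Concatenate, Dense, GlobalAveragePooling2D, Input
from tensorflow.keras.models import Model

# One shared image input feeding two pre-trained feature extractors
inputs = Input(shape=(299, 299, 3))
inception_features = GlobalAveragePooling2D()(
    InceptionV3(weights="imagenet", include_top=False)(inputs)
)
xception_features = GlobalAveragePooling2D()(
    Xception(weights="imagenet", include_top=False)(inputs)
)

# Concatenate both feature vectors and classify on top of them
merged = Concatenate()([inception_features, xception_features])
outputs = Dense(120, activation="softmax")(merged)

model = Model(inputs, outputs)
```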
Feature extraction: Make the work easier for the model by extracting multiple simpler features from one feature, or by combining several features into one. For example, you can extract the country and area code from a phone number. Models are not very intelligent; they are just algorithms that fit data. So make sure that the data is appropriate for an optimal fit.
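For example, a sketch with made-up phone numbers in a simple “+country-area-number” format:

```python
import pandas as pd

# Toy data: made-up phone numbers, not from any real dataset
df = pd.DataFrame({"phone": ["+1-212-5550144", "+44-20-79460958", "+1-415-5552671"]})

# Split one opaque string into simpler features the model can actually use
parts = df["phone"].str.split("-", expand=True)
df["country_code"] = parts[0].str.lstrip("+")
df["area_code"] = parts[1]

print(df)
```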
What else to do on Kaggle
Other than being a competition platform for data science, Kaggle is also a platform for exploring datasets and creating kernels that dig into them for insights.
So you can choose any dataset from the top five that appear on the Datasets page, and just go with it. The data might be weird, and you might experience difficulty as a beginner. What matters is that you analyze the data and build visualizations from it, which contributes to your learning.
Which libraries to use for analysis
For visualizations, explore the seaborn and matplotlib libraries.
For data manipulation, explore NumPy and pandas.
For data preprocessing, explore the sklearn.preprocessing module.
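A tiny example that touches a few of these together (the dataset and column names are made up for illustration):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Toy data standing in for whatever dataset you picked from the Datasets page
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "cores": rng.integers(2, 16, size=100),
    "base_clock_ghz": rng.uniform(1.5, 4.5, size=100),
})

# sklearn.preprocessing: put the features on a comparable scale
scaled = StandardScaler().fit_transform(df)

# seaborn/matplotlib: a quick look at how the two columns relate
sns.scatterplot(data=df, x="cores", y="base_clock_ghz")
plt.show()
```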
Pandas’ library has some basic plot functions too, and they are extremely convenient.

intel_sorted["Instruction_Set"].value_counts().plot(kind='pie')
That single line of code made a pie chart of the “Instruction_Set” column. And the best thing is that it still looks pretty.
Why do all this
Machine learning is a beautiful field with lots of development going on. Participating in these contests will help you to learn a lot about algorithms and the various approaches to data. I myself learned a lot of these things from Kaggle.
Also, to be able to say, “My AI is in the top 15% for <insert contest name here>” is pretty dope.
Some extras from my journey
The graph below represents my kernel’s exploration of the Intel CPU dataset on Kaggle:
My solution for the Redefining Cancer Treatment contest:
That’s all folks.
Thanks for reading. I hope I made you feel more confident about participating in Kaggle’s contests.
See you on the leaderboards.
Translated from: https://www.freecodecamp.org/news/what-i-learned-from-kaggle-contests-d3123e17a36b/