Kaggle competition data
This article was originally written by Shahul ES and posted on the Neptune blog.
In this article, I will discuss some great tips and tricks to improve the performance of your binary classification model on structured data. These tricks are drawn from the solutions to some of Kaggle’s top tabular data competitions. Without further ado, let’s begin.
These are the five competitions that I have gone through to create this article:
Dealing with larger datasets
One issue you might face in any machine learning competition is the size of your dataset. If your data is large (3 GB or more, for Kaggle kernels and more basic laptops), you may find it difficult to load and process with limited resources. Here are links to some of the articles and kernels that I have found useful in such situations.
Faster data loading with pandas.
Data compression techniques to reduce the size of data by 70%.
Optimize the memory by reducing the size of some attributes.
Use open-source libraries such as Dask to read and manipulate the data. Dask performs parallel computing and saves memory.
Use cuDF.
Convert data to Parquet format.
Convert data to Feather format.
Reduce memory usage to optimize RAM.
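Several of the tips above (shrinking attribute sizes, lowering RAM usage) come down to storing each column in a narrower dtype. The following is a minimal pandas sketch of that idea, not the code from any of the linked kernels. The function name `reduce_memory`, the `max_unique_ratio` threshold, and the synthetic `demo` frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from pandas.api import types as ptypes


def reduce_memory(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy of df whose columns use narrower dtypes where possible.

    max_unique_ratio is a hypothetical threshold: a text column is stored as
    'category' only if its distinct values are at most this share of the rows.
    """
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if ptypes.is_bool_dtype(series):
            continue  # booleans already take one byte per value
        if ptypes.is_integer_dtype(series):
            # Pick the smallest integer type that holds every value.
            out[col] = pd.to_numeric(series, downcast="integer")
        elif ptypes.is_float_dtype(series):
            # pandas downcasts floats no further than float32.
            out[col] = pd.to_numeric(series, downcast="float")
        elif ptypes.is_string_dtype(series):
            # Low-cardinality text is cheaper as a categorical.
            if series.nunique() <= max_unique_ratio * len(series):
                out[col] = series.astype("category")
    return out


# Synthetic demo data (made up for illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "user_id": np.arange(1000, dtype=np.int64),
        "clicks": rng.integers(0, 100, size=1000),
        "ratio": rng.integers(0, 8, size=1000) / 8.0,
        "country": rng.choice(["US", "IN", "DE"], size=1000),
    }
)

compact = reduce_memory(demo)
before = demo.memory_usage(deep=True).sum()
after = compact.memory_usage(deep=True).sum()
print(f"memory: {before} bytes -> {after} bytes")
```

How much memory this saves depends entirely on the value ranges and cardinalities in your own data. Keep in mind that float32 carries less precision than float64, so check that the narrower type is acceptable for your features before applying it everywhere.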
Data exploration
Data exploration always helps you understand the data better and gain insights from it. Before starting to develop machine learning models, top competitors always read or carry out a lot of exploratory data analysis (EDA) on the data. This helps with feature engineering and data cleaning.
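As a rough illustration of a first EDA pass on a binary classification table, the sketch below reports each column's dtype, share of missing values, and number of distinct values, plus the class balance of the target. The helper `summarize_columns` and the `toy` data are hypothetical and are not taken from the kernels listed here.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column overview: dtype, fraction of missing values, distinct values."""
    summary = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing_frac": df.isna().mean(),
            "n_unique": df.nunique(),  # NaN is not counted as a value
        }
    )
    # Columns with the most missing data come first.
    return summary.sort_values("missing_frac", ascending=False)


# Toy data, made up for illustration only.
toy = pd.DataFrame(
    {
        "age": [25, None, 40, 31],
        "plan": ["free", "pro", "free", None],
        "churned": [0, 1, 0, 0],
    }
)

overview = summarize_columns(toy)
class_balance = toy["churned"].value_counts(normalize=True)
print(overview)
print(class_balance)
```

Columns with many missing values or very high cardinality are natural candidates for cleaning or encoding decisions, and a skewed class balance is worth knowing about before choosing a validation scheme and metric.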
EDA for Microsoft malware detection.
Time Series EDA for malware detection.
Complete