Getting More Out of Your Kaggle Notebooks

I joined Kaggle about five years ago. I learned about the site from a MOOC I was taking at the time. From 2015 to 2019, I used Kaggle only to download datasets. I did attempt the immensely popular Titanic competition to change my status from green to blue, i.e. from Novice to Contributor, but other than that I wasn’t very active on the platform. It was only in late 2019 that I started actively contributing and writing notebooks on Kaggle. From analysis to exploratory data analysis, I experimented with a lot of ideas. I studied other people’s work, took inspiration and learnt a lot. Finally, after months of Kaggling, I became a Kaggle Notebooks Grandmaster in June 2020.

This article is a compilation of what I have learnt about writing effective notebooks. It isn’t a guide, just my experience over the course of several months. Let’s begin by understanding what a Kaggle Notebook is and how it is used.

What are Kaggle Notebooks?

A Notebook is a storytelling format for sharing code and analyses. It is a cloud computing environment that enables reproducible and collaborative work. Anyone can create a Notebook right in Kaggle and embed charts directly into it. Kaggle Notebooks are of two kinds:

  • Scripts — files that execute everything as code sequentially

  • Notebooks — Jupyter notebooks consisting of a sequence of cells

In the Notebooks IDE, you have access to an interactive session running in a Docker container with pre-installed packages, the ability to mount versioned data sources, customizable compute resources like GPUs, and more.

Tips to get ‘More’ out of your Kaggle Notebooks

Let’s now quickly jump into some of the tips that I keep in mind before attempting to create a new notebook.

1. Tell compelling stories through Notebooks

“The purpose of a storyteller is not to tell you how to think, but to give you questions to think upon.”

Brandon Sanderson, The Way of Kings

Notebooks are an excellent tool to get your ideas across. They allow you to interactively explore data, create visualizations and then share the results with the world. In a way, you can combine both code and a writeup in the same environment. Use crisp visualizations to create a compelling storyline. Pick a unique problem and try to work through it. Define the purpose of the notebook at the very beginning and wrap it up with a fitting conclusion to create an impactful story.

In the notebook titled Geek Girls Rising: Myth or Reality!, I analysed the 2019 Kaggle ML and DS Survey data for women’s representation in machine learning and data science. The problem statement I chose to work on was whether women’s participation on Kaggle was improving and how it compared with previous years. I created a report by analysing the data across various attributes like gender, country and age group. Finally, I concluded the analysis with key takeaways and some recommendations.

Image: Geek Girls Rising: Myth or Reality!

2. Collaborate with others

“Many ideas grow better when transplanted into another mind than the one where they sprang up.”

Oliver Wendell Holmes

Collaboration is an integral part of Data Science, be it in research or in open source. The importance of teaming up in Kaggle competitions cannot be emphasized enough. Kaggle notebooks themselves also have an extremely powerful collaboration feature: multiple users can co-own and edit a Notebook. This can be helpful in a couple of scenarios.

  • When you are taking part in competitions, you can work on your code collaboratively with your teammates in the notebook.
  • When analyzing a dataset, you can collaborate with other people and create impactful project reports.

“Creating, Reading & Writing Data”, a Notebook from the Advanced Pandas Kaggle Learn track, is one example of a great collaborative notebook.

Image: Collaborating through Kaggle Notebooks

3. Contributing to Competitions via starter notebooks

Contribution is the key

Do you want to contribute to competitions but aren’t ready to compete yet? Start writing starter notebooks when a new competition launches. Such notebooks are typically of two kinds:

  • Notebooks which perform a basic or advanced exploratory data analysis (EDA). These notebooks help others quickly understand the nature and patterns of the data, thereby saving them a lot of time. An excellent EDA notebook is highly appreciated by others.
  • Notebooks containing quick baselines. Such notebooks act as a stepping stone for people looking to compete. They can build their own analysis on top of the baseline.
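
To make the idea of a quick baseline concrete, here is a minimal sketch of about the simplest one possible: always predict the most frequent training label. The label values and the helper names `fit_majority_baseline` and `accuracy` are hypothetical and chosen only for illustration; a real starter notebook would load the competition's own training data and use the competition's own metric.

```python
from collections import Counter


def fit_majority_baseline(labels):
    """Return the most frequent label seen in the training labels."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical toy labels, standing in for a competition's train/validation split.
train_labels = ["benign", "benign", "malignant", "benign"]
valid_labels = ["benign", "malignant", "benign", "benign"]

majority = fit_majority_baseline(train_labels)
predictions = [majority] * len(valid_labels)

print(majority, accuracy(valid_labels, predictions))  # benign 0.75
```

A score like this gives readers a floor: any model they build on top of the starter notebook should beat it.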

Here is a glance at some of the EDA and starter notebooks for the SIIM-ISIC Melanoma Classification competition, which is about identifying melanoma in lesion images.

Image: SIIM-ISIC Melanoma Classification competition Notebooks

4. Teach something new

The Best Way to Learn Something is to Teach it to Someone Else

Try teaching a new library or some new functions. This is especially helpful for beginners, who sometimes have difficulty following the official documentation. However, make sure you use new datasets to showcase how the libraries or functions work. Duplicating the entire documentation as-is is not a good idea.

In the notebook Useful Python libraries for Data Science, I compiled some of the useful but lesser-known Python libraries which can really come in handy for the Data Analysis and Machine learning tasks.

Image: Libraries covered in the Kaggle Notebook Useful Python libraries for Data Science

Similarly, in the notebook Advanced Pyspark for Exploratory Data Analysis, Tien Tran showcases how to use PySpark and its advantages over Pandas for handling big data.

Image: Advanced Pyspark for Exploratory Data Analysis Kaggle Notebook

5. Beware of the Data Visualization Pitfalls

“The purpose of visualisation is insight and not pictures.”

Ben Shneiderman

Image: The top ten dataviz caveats collected by data-to-viz.com

A picture is definitely worth a thousand words, but too many pictures defeat the purpose of clarity. There are a few points that you should consider if you want to make visualizations that stand out and also help in the storyline.

  • Keep the visualizations simple and to the point.
  • Explain each chart concisely. Do not leave the charts for the reader to interpret. Make others aware of your point of view.
  • Make clear and readable charts. Be mindful of colour-blind readers and make charts that everyone can interpret.
  • Do not overdo animations. Use them only if they fit the storyline.
  • Make the axes and labels clearly visible. Use an appropriate font and font size. Display the legend and title clearly.

6. Make your notebooks reproducible

“Why should you care about reproducibility?

Because the person most likely to need to reproduce your work… is you.”

Dr Rachael Tatman, Reproducible Machine Learning

A reproducible example allows someone else to recreate your analysis using the same data. This makes a lot of sense, since you are putting your work out in public for others to use. That purpose is defeated if others cannot reproduce your work on or off Kaggle. Rachael Tatman has published a wonderful kernel on reproducible research best practices, which lists some of the best practices for doing reproducible work. Here are some of the tips from that kernel:

  • Put all your imports (import x or library(x)) at the top of your notebook.
  • Break up long lines at logical places.
  • Make your variable names sensible and human-readable.
  • Comment your code!
  • Make sure to set seeds for all the random number generators (RNGs).
  • Ensure that both your code and data are logically organised.
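
A couple of these tips can be sketched in a few lines of Python. The snippet below is only an illustration, not code from the kernel: the helper name `seed_everything` is hypothetical, and NumPy is treated as optional so that the sketch runs with the standard library alone. Other libraries you use may have their own seeding calls that would need to be added.

```python
# All imports sit together at the top of the notebook.
import random

try:
    import numpy as np  # optional: only seeded when it is installed
except ImportError:
    np = None


def seed_everything(seed):
    """Seed every random number generator this notebook relies on."""
    random.seed(seed)
    if np is not None:
        np.random.seed(seed)


# Re-seeding with the same value reproduces the same "random" draws.
seed_everything(2020)
first_draw = [random.random() for _ in range(3)]

seed_everything(2020)
second_draw = [random.random() for _ in range(3)]

print(first_draw == second_draw)  # True
```

Calling the seeding helper once, right after the imports, means a reader who reruns the notebook sees the same splits and samples that you did.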

The following example (again taken from Rachael’s slides) clearly emphasises the power of writing clean, modular and reproducible code.

Image: Source

7. Keep an eye on the Errors

“One man’s crappy software is another man’s full-time job.”

Jessica Gaston

Run your entire notebook before publishing. A notebook containing errors or graphs that do not render isn’t something you would want to share with the world, let alone expect upvotes for. Also, make sure the dataset connected to the notebook is error-free.
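
One way to double-check a notebook after a full run is to scan the saved file for error outputs. The sketch below assumes the standard Jupyter notebook JSON layout, in which a code cell that raised an exception stores an output whose `output_type` is `"error"`. The function name `find_error_cells` and the tiny example notebook are hypothetical, and the check only inspects saved outputs; it does not rerun anything.

```python
import json


def find_error_cells(notebook):
    """Return the indexes of code cells whose saved outputs include an error."""
    failed = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        outputs = cell.get("outputs", [])
        if any(out.get("output_type") == "error" for out in outputs):
            failed.append(index)
    return failed


# Hypothetical three-cell notebook, trimmed to the fields inspected here.
raw_notebook = """
{
  "cells": [
    {"cell_type": "markdown", "source": ["# Title"]},
    {"cell_type": "code", "source": ["1 + 1"],
     "outputs": [{"output_type": "execute_result"}]},
    {"cell_type": "code", "source": ["1 / 0"],
     "outputs": [{"output_type": "error", "ename": "ZeroDivisionError"}]}
  ]
}
"""

failed_cells = find_error_cells(json.loads(raw_notebook))
print(failed_cells)  # [2]
```

For a real notebook, the same function could be applied to the result of `json.load` on the downloaded `.ipynb` file.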

8. Upvote before you Fork

The term “Forking” comes from version control. Forking a notebook means making a copy of it as it currently is. People commonly fork good notebooks or baselines to build their own code on top of them. However, some people will fork a notebook or use others’ work without showing appreciation by upvoting the original. If you have found somebody’s code so useful that you ended up using it, why not show the author some gratitude?

The notebook highlighted below has more forks than upvotes. Strange!

[inference, PyTorch] Birdcall ResNet Baseline by Hidehisa Arai

9. Follow Notebook Etiquette

Getting Inspired is Human, but Plagiarising is evil

  • Do not lift content directly from other notebooks. If you want to reuse a chunk of code, give clear attribution.
  • Refrain from spamming notebooks with requests for votes. A good notebook will catch the eye of fellow Kagglers. You can also share your content on social media such as LinkedIn or Twitter to tell the world that you have created a new notebook. But do not keep asking for votes, especially in return for a vote that you have given.

10. Show Appreciation for good work

Showing appreciation is one of the simplest yet most powerful things humans can do for each other

Finally, do not shy away from appreciating great notebooks by upvoting them and giving them a shoutout. Here’s a great example of appreciating others’ work, by Kaggle GM Heads or Tails. He periodically showcases Kaggle Notebooks that he feels haven’t gotten their due.

Image: Hidden Gems Notebook post as presented by Heads or Tails

Conclusion

Kaggle Notebooks are a great tool to get your thoughts across. Search or curate some cool datasets and use notebooks to create some outstanding analysis. In the end, do not forget to enjoy the process. There is so much to learn from the fantastic Kaggle community out there. But the most important thing is to attempt — for the secret of getting ahead is getting started.

This story was originally published here.

Translated from: https://towardsdatascience.com/getting-more-out-of-your-kaggle-notebooks-fb2530ece942
