Three mistakes students make when starting data science

I recently had the pleasure of sitting as a panelist at Northeastern University’s COVID-19 UNCOVER datathon. The aim was simple: produce actionable insights and/or predictive models from open source COVID-19 data to help the community make better decisions.


I want to write down some of the common mistakes I noticed students making, along with resources for further help where possible.

Ignoring the value of a good visualization

I strongly believe that your analysis and modeling efforts lose much of their value without meaningful visualization. A visualization need not be novel, but it must be thoughtful and complete. Some tips to improve visualization:

  1. Adding axis labels and a plot title: Say no to “x” and “y” axis labels and df$variable titles! This is like going to a party in your sweatpants: there is nothing wrong with it, but you may want to rethink it.

  2. Rainbow colors are a big no! Colors can make a plot visually stimulating, but too much is never good. The downfall of the rainbow palette has been documented several times, and countless posts cover it in more detail; a simple Google search will uncover numerous articles. My rule of thumb: use color sparingly, to draw the viewer's attention to one specific aspect of the visualization.

Although the rainbow palette is richer in color, the black-and-white version is much more valuable. The rainbow palette is confusing and forces the viewer to keep looking back and forth between the legend and the plot. Image by Author.

The post below goes into some detail on easy tools for picking better colors. One website I often use is colorbrewer2.org.

  3. Choosing the right visualization: This may sound like a “duh, genius!” point, but it is worthwhile to think of a plot as conveying a message rather than just an observation from the data. The post below is a good starting point. Tip: Try to think of less ideal visualizations for the examples presented.

Finally, if you are a Northeastern student, I would highly recommend taking a visualization course such as DS5500.
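
The rule of thumb on color from tip 2 can be sketched in code. The snippet below is a minimal Python sketch, not a method taken from the original post: the helper `highlight_palette`, the two hex colors, and the region names are all hypothetical choices made for illustration. It assigns an accent color to the one category you want the viewer to notice and a neutral grey to everything else.

```python
# Hypothetical colors: one accent for the category of interest,
# one neutral grey for every other category.
ACCENT = "#d95f02"  # orange
MUTED = "#bdbdbd"   # light grey


def highlight_palette(categories, focus, accent=ACCENT, muted=MUTED):
    """Map each category to a color, accenting only the `focus` category."""
    if focus not in categories:
        raise ValueError(f"{focus!r} is not one of the categories")
    return {name: (accent if name == focus else muted) for name in categories}


# Made-up category names, used only to demonstrate the helper.
regions = ["Northeast", "Midwest", "South", "West"]
palette = highlight_palette(regions, focus="South")

for region, color in palette.items():
    print(f"{region:<10} {color}")
```

The resulting dictionary can be handed to whichever plotting library you use, wherever it accepts one color per category. Only two colors appear in the plot, so the viewer does not need the legend to find the series that matters.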

PowerPoints are boring, stories are interesting

Storytelling is an incredibly powerful way to help the audience understand your perspective on the data and the problem you are trying to solve. I like to take my audience on a journey through the data, showing them where I ran into trouble and backing my decisions with solid data visualizations. A simple structure to start with:

  1. Aim of the analysis: Clearly state what you are trying to achieve, with a brief note on why this would be useful to someone.
  2. Data: A small snippet (the head) of the data is useful in aligning the audience with the problem at hand. When using large datasets, summary statistics of the important variables can be helpful.

  3. Analysis, methods, and results: Walk the audience through some of the analysis you performed and the modeling methods you used, supplementing with visualizations where necessary. Show outliers, inconsistencies, and missingness in the variables, and explain how you dealt with them.
  4. Conclusion: This is much simpler than you imagine: reiterate the results and why they are meaningful. I like to think of this section as a recap of my presentation in a slide or two.
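
The data step in the structure above (showing the head and, for larger datasets, summary statistics including missingness) can be sketched as follows. This is a minimal, standard-library-only Python illustration under stated assumptions: the rows are made up, `None` is assumed to mark a missing value, and `head` and `summarize` are hypothetical helper names. In practice a dataframe library would usually do this for you.

```python
import statistics

# Made-up rows standing in for a real dataset; None marks a missing value.
rows = [
    {"county": "A", "cases": 120, "tests": 900},
    {"county": "B", "cases": 95, "tests": None},
    {"county": "C", "cases": 430, "tests": 2100},
    {"county": "D", "cases": None, "tests": 1500},
    {"county": "E", "cases": 88, "tests": 700},
]


def head(rows, n=3):
    """Return the first n rows, i.e. the 'head' of the data."""
    return rows[:n]


def summarize(rows, column):
    """Summary statistics for one numeric column, counting missing values."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


# A small snippet of the data for the audience.
for row in head(rows):
    print(row)

# Summary statistics for the important variables.
for column in ("cases", "tests"):
    print(column, summarize(rows, column))
```

Reporting the missing count next to the mean and median makes it easy to tell the audience, on the same slide, how much data each number is based on.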

Practice makes perfect, and I have found YouTube to be a great resource. Watch how presenters walk through their analysis, and think about whether there is another way you would convey the same message.

Final thought: The presenter is like the shepherd guiding the audience (the sheep) through the data to a conclusion. You don’t want the sheep to be lost when you arrive at the conclusion.


Jumping to modeling too fast

When I started with data science, I too was eager to fit models and report the highest accuracy on the test set. But over time I have realized that a model is only as good as the data used to build it. Exploratory data analysis is not only a way for you to better understand the data but also a way to better communicate to the audience why certain decisions were made. 70% of my time as a data scientist is spent cleaning up data and spotting inconsistencies.

A few things to think about before modeling:


  1. Multicollinearity and confounding variables: Data is often filled with noise and confounders. An easy way to see which variables you can eliminate is to look at a correlation heatmap and reason about keeping or removing the variables that are clearly correlated. This, in my opinion, gets 80% of the job done for you and your audience.
  2. Data is messy! Clean up outliers, and think about how you will handle missing values. Real-world data is never what you expect it to be, which can easily throw off your analysis and lead to incorrect conclusions.
This is NOT data science. Source: https://xkcd.com/1838/
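
The two pre-modeling checks listed above can be sketched with the Python standard library alone. This is an illustration under assumptions, not the author's workflow: instead of drawing a correlation heatmap, it lists the column pairs whose Pearson correlation exceeds a threshold, and it flags outliers with the common 1.5 × IQR rule, which the original post does not prescribe. The function names, the 0.9 threshold, and the data are all hypothetical.

```python
import math
import statistics
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """List column pairs whose absolute correlation meets the threshold."""
    pairs = []
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Made-up columns: height_in is height_cm converted to inches,
# so the two carry the same information.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [40, 37, 43, 39, 42],
}

print(correlated_pairs(columns))
print(iqr_outliers([12, 14, 15, 13, 16, 14, 95]))
```

A flagged pair or value is a prompt for reasoning, not an instruction to delete: deciding which of two correlated variables to keep, or whether an extreme value is an error or a real observation, is still your job, and it is exactly the kind of decision worth showing to the audience.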

In conclusion, focus on your audience and present a story from the data. Use the right visualization to aid your story, and validate the variables you pick to create models. Keep these tips in mind and, with practice, you will become a better data scientist.

Thanks for reading!


Translated from: https://towardsdatascience.com/3-mistakes-students-make-when-starting-data-science-aa6081733a54
