Aspiring data scientist? Master these fundamentals.

cumi6497

Original article: https://www.freecodecamp.org/news/aspiring-data-scientist-master-these-fundamentals-be7c54350868/

Data science is an exciting, fast-moving field to become involved in. There’s no shortage of demand for talented, analytically-minded individuals. Companies of all sizes are hiring data scientists, and the role provides real value across a wide range of industries and applications.

Often, people’s first encounters with the field come through reading sci-fi headlines generated by major research organizations. Recent progress has raised the prospect of machine learning transforming the world as we know it within a generation.

However, outside of academia and research, data science is about much more than headline topics such as deep learning and NLP.

Much of the commercial value of a data scientist comes from providing the clarity and insights that vast quantities of data can bring. The role can encompass everything from data engineering, to data analysis and reporting — with maybe some machine learning thrown in for good measure.

This is especially the case at a startup firm. Early and mid-stage companies’ data needs are typically far removed from the realm of neural networks and computer vision. (Unless, of course, these are core features of their product/service).

Rather, they need accurate analysis, reliable processes, and the ability to scale fast.

Therefore, the skills required for many advertised data science roles are broad and varied. Like any pursuit in life, much of the value comes from mastering the basics. The fabled 80:20 rule applies — approximately 80% of the value comes from 20% of the skillset.

Here’s an overview of some of the fundamental skills that any aspiring data scientist should master.

Start with statistics

The main attribute a data scientist brings to their company is the ability to distill insight from complexity. Key to achieving this is understanding how to uncover meaning from noisy data.

Statistical analysis is therefore an important skill to master. Stats lets you:

Describe data, to provide a detailed picture to stakeholders
Compare data and test hypotheses, to inform business decisions
Identify trends and relationships that provide real predictive value

Statistics provides a powerful set of tools for making sense of commercial and operational data.

But be wary! The one thing worse than limited insight is misleading insight. This is why it is vital to understand the fundamentals of statistical analysis.

Fortunately, there are a few guiding principles you can follow.

Assess your assumptions

It’s very important to be aware of assumptions you make about your data.

Always be critical of provenance, and skeptical of results. Could there be an ‘uninteresting’ explanation for any observed trends in your data? How valid is your chosen stats test or methodology? Does your data meet all the underlying assumptions?

Knowing which findings are ‘interesting’ and worth reporting also depends upon your assumptions. An elementary case in point is judging whether it is more appropriate to report the mean or the median of a data set.
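As a rough illustration of the mean-versus-median judgment, here is a minimal sketch using only Python's standard library. The order values are hypothetical, chosen so that a single large outlier dominates the mean.

```python
from statistics import mean, median

# Hypothetical order values: most are small, one is very large.
order_values = [20, 22, 25, 27, 30, 31, 35, 1000]

avg = mean(order_values)    # pulled upwards by the single outlier
mid = median(order_values)  # describes the "typical" order

print(f"mean   = {avg:.2f}")
print(f"median = {mid:.2f}")
```

With skewed data like this, the median is arguably the more honest summary of a typical order, while the mean is still the right figure if the question is about total revenue.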

Knowing which approach not to take is often more important than knowing which one to take. There are usually several ways to analyze a given set of data, but make sure to avoid common pitfalls.

For instance, multiple comparisons should always be corrected for. Under no circumstances should you seek to confirm a hypothesis using the same data used to generate it! You’d be surprised how easily this is done.
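One common way to correct for multiple comparisons is the Holm-Bonferroni procedure. The sketch below is a plain-Python illustration rather than a recommendation of a specific method; the p-values are made up, and the function name `holm_bonferroni` is my own.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Visit the tests from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank, starting at 0) is compared
        # with alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return reject


p_values = [0.001, 0.04, 0.03, 0.20]          # hypothetical results of four tests
naive = [p <= 0.05 for p in p_values]         # no correction
corrected = holm_bonferroni(p_values)         # family-wise error controlled

print("naive:    ", naive)
print("corrected:", corrected)
```

In this made-up case, three results look significant without correction, but only one survives it.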

Distribution > Location

Whenever I talk about introductory statistics, I always make sure to emphasize a particular point: the distribution of a variable is usually at least as interesting/informative as its location. In fact, it is often more so.

This is because the distribution of a variable usually contains information about the underlying generative (or sampling) processes.

For example, count data often follows a Poisson distribution, whereas a system exhibiting positive feedback (“reinforcement”) will tend to surface a power law distribution. Never rely on data being normally distributed without first checking carefully.
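A quick sanity check for count data is the index of dispersion (variance divided by mean), which equals 1 for a Poisson distribution. The sketch below uses two small hypothetical samples; it is a rough screening heuristic, not a formal goodness-of-fit test.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean (equals 1 for a Poisson variable)."""
    return variance(counts) / mean(counts)


# Hypothetical counts of events per hour.
steady = [0, 1, 2, 2, 3, 4, 1, 2, 4, 1]   # spread is comparable to the mean
bursty = [0, 0, 0, 10, 0, 0, 0, 10]       # rare large bursts

print(f"steady: {dispersion_index(steady):.2f}")  # near 1: Poisson is plausible
print(f"bursty: {dispersion_index(bursty):.2f}")  # far above 1: overdispersed
```

A value far from 1 suggests that a Poisson model is a poor fit and that some other process generated the data.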

Secondly, understanding the distribution of the data is essential for knowing how to work with it! Many statistical tests and methods rely upon assumptions about how your data are distributed.

As a contrived example, always be sure to treat unimodal and bimodal data differently. They may have the same mean, but you’d lose a whole ton of important information if you disregard their distributions.
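To make the contrived example concrete, the sketch below builds two hypothetical samples with the same mean and prints a crude text histogram of each. The helper `text_histogram` is invented for this illustration.

```python
from collections import Counter
from statistics import mean

# Two hypothetical samples with the same mean but very different shapes.
unimodal = [3, 4, 4, 5, 5, 5, 5, 6, 6, 7]
bimodal = [1, 1, 2, 2, 2, 8, 8, 8, 9, 9]


def text_histogram(values):
    """Render one row of '#' marks per integer value in the observed range."""
    counts = Counter(values)
    rows = []
    for v in range(min(values), max(values) + 1):
        rows.append(f"{v:>2} | {'#' * counts[v]}")
    return "\n".join(rows)


for name, sample in [("unimodal", unimodal), ("bimodal", bimodal)]:
    print(f"{name}: mean = {mean(sample)}")
    print(text_histogram(sample))
```

Both means are 5, yet no observation in the bimodal sample is actually equal to 5.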

For a more interesting example that illustrates why you should always check your data before reporting summary statistics, take a look at Anscombe’s quartet:

Each graph looks very distinctive, right? Yet each has identical summary statistics — including their means, variance and correlation coefficients. Plotting some of the distributions reveals them to be rather different.
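The quartet is easy to check numerically. The sketch below hard-codes the four published data sets and computes the mean and variance of y plus the Pearson correlation for each; `pearson_r` is a helper written for this example.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet: data sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = (mean(ys), variance(ys), pearson_r(xs, ys))
    m, v, r = summary[name]
    print(f"{name:>3}: mean(y) = {m:.2f}  var(y) = {v:.2f}  r = {r:.2f}")
```

The summary rows come out nearly identical, which is exactly why the data should be plotted before any of these numbers are reported.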

Finally, the distribution of a variable determines the certainty you have about its true value. A ‘narrow’ distribution allows higher certainty, whereas a ‘wide’ distribution allows for less.

The variance about a mean is crucial to provide context. All too often, means with very wide confidence intervals are reported alongside means with very narrow confidence intervals. This can be misleading.
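As a sketch of why the interval matters, the example below reports two hypothetical samples that share the same mean. It uses a normal approximation (z = 1.96) for the 95% interval, which is an assumption made for brevity; with samples this small a t-based interval would be wider and more appropriate.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_ci(sample, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    m = mean(sample)
    half_width = z * stdev(sample) / sqrt(len(sample))
    return m, m - half_width, m + half_width


# Hypothetical measurements: same mean, very different spread.
tight = [49, 50, 51, 50, 49, 51, 50, 50]
spread = [10, 90, 30, 70, 20, 80, 45, 55]

for name, sample in [("tight", tight), ("spread", spread)]:
    m, lo, hi = mean_with_ci(sample)
    print(f"{name:>6}: mean = {m:.1f}, 95% CI ~ [{lo:.1f}, {hi:.1f}]")
```

Reporting only "mean = 50" for both samples would hide the fact that one estimate is far less certain than the other.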

Suitable sampling

The reality is that sampling can be a pain point for commercially oriented data scientists, especially for those with a background in research or engineering.

In a research setting, you can fine-tune precisely designed experiments with many different factors and levels and control treatments. However, ‘live’ commercial conditions are often suboptimal from a data collection perspective. Every decision must be carefully weighed up against the risk of interrupting ‘business-as-usual’.

This requires data scientists to be inventive, yet realistic, with their approach to problem-solving.

A/B testing is a canonical example of an approach that illustrates how products and platforms can be optimized at a granular level without causing major disturbance to business-as-usual.
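One simple way to evaluate an A/B test on conversion rates is a two-proportion z-test. The sketch below implements it with the standard library only; the visitor and conversion counts are hypothetical, and the function name is my own.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical experiment: 5,000 visitors per variant.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The test assumes independent visitors and a sample size fixed in advance; repeatedly peeking at results and stopping early inflates the false-positive rate.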

Bayesian methods may be useful for working with smaller data sets, if you have a reasonably informative set of priors to work from.
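A minimal sketch of the idea is the Beta-Binomial model, where a Beta prior on a conversion rate is updated with a small number of observations. The prior parameters and the data below are hypothetical.

```python
def beta_binomial_update(prior_alpha, prior_beta, successes, failures):
    """Conjugate update: a Beta prior plus binomial data gives a Beta posterior."""
    return prior_alpha + successes, prior_beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior: roughly a 4% rate, weighted like 100 earlier observations.
prior = (4, 96)

# A small new sample: 3 conversions out of 20 visitors.
posterior = beta_binomial_update(*prior, successes=3, failures=17)

print(f"raw rate       = {3 / 20:.3f}")
print(f"prior mean     = {beta_mean(*prior):.3f}")
print(f"posterior mean = {beta_mean(*posterior):.3f}")
```

The posterior sits between the prior and the noisy raw rate, which is the behavior you want from a small sample, provided the prior really is informative.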

With any data you do collect, be sure to recognize its limitations.

Survey data is prone to sampling bias (often it is respondents with the strongest opinions who take the time to complete the survey). Time series and spatial data can be affected by autocorrelation. And last but not least, always watch out for multicollinearity when analyzing data from related sources.
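For time series, a first check is the lag-1 autocorrelation: how strongly each observation correlates with the one before it. The sketch below computes it from scratch on two toy series; the `autocorrelation` helper is written for this illustration.

```python
from statistics import mean


def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    numer = sum(
        (series[t] - m) * (series[t - lag] - m)
        for t in range(lag, len(series))
    )
    return numer / denom


trend = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]            # each value follows the last
alternating = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]  # each value flips sign

print(f"trend:       {autocorrelation(trend):+.2f}")
print(f"alternating: {autocorrelation(alternating):+.2f}")
```

Values far from zero mean the observations are not independent, so tests that assume independence will tend to overstate significance.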

Data Engineering

It’s something of a data science cliché, but the reality is that much of the data workflow is spent sourcing, cleaning and storing the raw data required for the more insightful downstream analysis.

Comparatively little time is actually spent implementing algorithms from scratch. Indeed, most statistical tools come with their inner workings wrapped up in neat R packages and Python modules.

The ‘extract-transform-load’ (ETL) process is critical to the success of any data science team. Larger organizations will have dedicated data engineers to meet their complex data infrastructure requirements, but younger companies will often depend upon their data scientists to possess strong, all-round data engineering skills of their own.
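To show the shape of an ETL job, here is a toy sketch that uses only the standard library: it extracts rows from CSV text, transforms them by dropping bad dates and normalizing a column, and loads the result into an in-memory SQLite table. The data, table name and column names are all hypothetical, and a real pipeline would add logging, retries and incremental loads.

```python
import csv
import io
import sqlite3
from datetime import date

# Hypothetical raw export with a missing plan and a malformed date.
RAW_CSV = """user_id,signup_date,plan
1,2024-01-05,Pro
2,2024-01-06,
3,not-a-date,free
4,2024-01-09,FREE
"""


def extract(text):
    """Extract: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with invalid dates and normalize the plan name."""
    clean = []
    for row in rows:
        try:
            date.fromisoformat(row["signup_date"])
        except ValueError:
            continue  # skip rows whose date cannot be parsed
        plan = (row["plan"] or "unknown").lower()
        clean.append((int(row["user_id"]), row["signup_date"], plan))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS signups "
        "(user_id INTEGER PRIMARY KEY, signup_date TEXT, plan TEXT)"
    )
    conn.executemany("INSERT INTO signups VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT user_id, plan FROM signups ORDER BY user_id"
).fetchall()
print(loaded)
```

Keeping the three stages as separate functions makes each one easy to test and to swap out, for example replacing the CSV reader with an API call.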

Programming in practice

Data science is highly inter-disciplinary. As well as advanced analytical skills and domain-specific knowledge, the role also necessitates solid programming skills.

There is no perfect answer to which programming languages an aspiring data scientist should learn to use. That said, at least one of Python or R will serve you very well.

Whichever language you opt for, aim to become familiar with all its features and the surrounding ecosystem. Browse the various packages and modules available to you, and set up your perfect IDE. Learn the APIs you’ll need to use for accessing your company’s core platforms and services.

Databases are an integral piece in the jigsaw of any data workflow. Be sure to master some dialect of SQL. The exact choice isn’t too important, because switching between them is a manageable process when necessary.
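Much of day-to-day SQL is filtering and aggregating. The sketch below runs a typical GROUP BY query against an in-memory SQLite database through Python's built-in `sqlite3` module; the `orders` table and its contents are hypothetical, and the query uses only constructs found in most SQL dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("alice", 30.0),
        ("bob", 12.5),
        ("alice", 20.0),
        ("carol", 7.5),
        ("bob", 2.5),
    ],
)

# Total spend per customer, keeping only customers who spent at least 15.
query = """
SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
FROM orders
GROUP BY customer
HAVING SUM(amount) >= 15
ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for customer, n_orders, total in rows:
    print(f"{customer}: {n_orders} orders, total {total:.2f}")
```

SQLite is used here only because it ships with Python; the same query should carry over to PostgreSQL or MySQL with little or no change.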

NoSQL databases (such as MongoDB) may also be worth learning about, if your company uses them.

Becoming a confident command line user will go a long way to boosting your day-to-day productivity. Even passing familiarity with simple bash scripting will get you off to a strong start when it comes to automating repetitive tasks.

Effective coding

A very important skill for aspiring data scientists to master is coding effectively. Reusability is key. It is worth taking the time (when it is available) to write code at a level of abstraction that enables it to be used more than once.

However, there is a balance to be struck between short and long-term priorities.

There’s no point taking twice as long to write an ad hoc script to be reusable if there’s no chance it’ll ever be relevant again. Yet every minute spent refactoring old code to be rerun is a minute that could have been saved previously.

Software engineering best practices are worth developing in order to write truly performant production code.

Version management tools such as Git make deploying and maintaining code much more streamlined. Task schedulers allow you to automate routine processes. Regular code reviews and agreed documentation standards will make life much easier for your team’s future selves.

In any line of tech specialization, there’s usually no need to reinvent the wheel. Data engineering is no exception. Frameworks such as Airflow make scheduling and monitoring ETL processes easier and more robust. For distributed data storage and processing, there are Apache Spark and Hadoop.

It isn’t essential for a beginner to learn these in great depth. Yet, having an awareness of the surrounding ecosystem and available tools is always an advantage.

Communicate clearly

Data science is a full stack discipline, with an important stakeholder-facing front end: the reporting layer.

The fact of the matter is simple — effective communication brings with it significant commercial value. With data science, there are four aspects to effective reporting.

Accuracy

This is crucial, for obvious reasons. The skill here is knowing how to interpret your results, while being clear about any limitations or caveats that may apply. It’s important not to overstate or understate the relevance of any particular result.

Precision

This matters, because any ambiguity in your report could lead to misinterpretation of the findings. This may have negative consequences further down the line.

Concision

Keep your report as short as possible, but no shorter. A good format might provide some context for the main question, include a brief description of the data available, and give an overview of the ‘headline’ results and graphics. Extra detail can (and should) be included in an appendix.

Accessibility

There’s a constant need to balance the technical accuracy of a report with the reality that most of its readers will be experts in their own respective fields, and not necessarily data science. There’s no easy, one-size-fits-all answer here. Frequent communication and feedback will help establish an appropriate equilibrium.

The Graphics Game

Powerful data visualizations will help you communicate complex results to stakeholders effectively. A well-designed graph or chart can reveal in a glance what several paragraphs of text would be required to explain.

There’s a wide range of free and paid-for visualization and dashboard building tools out there, including Plotly, Tableau, Chartio, d3.js and many others.

For quick mock-ups, sometimes you can’t beat good old-fashioned spreadsheet software such as Excel or Google Sheets. These will do the job as required, although they lack the functionality of purpose-built visualization software.

When building dashboards and graphics, there are a number of guiding principles to consider. The underlying challenge is to maximize the information value of the visualization, without sacrificing ‘readability’.

An effective visualization reveals a high-level overview at a quick glance. More complex graphics may take a little longer for the viewer to digest, and should accordingly offer much greater information content.

If you only ever read one book about data visualization, then Edward Tufte’s classic The Visual Display of Quantitative Information is the outstanding choice.

Tufte single-handedly popularized and invented much of the field of data visualization. Widely used terms such as ‘chartjunk’ and ‘data density’ owe their origins to Tufte’s work. His concept of the ‘data-ink ratio’ remains influential over thirty years on.

The use of color, layout and interactivity will often make the difference between a good visualization and a high-quality, professional one.

Ultimately, creating a great data visualization touches upon skills more often associated with UX and graphic design than data science. Reading around these subjects in your free time is a great way to develop an awareness for what works and what doesn’t.

Be sure to check out sites such as bl.ocks.org for inspiration!

Data science requires a diverse skillset

There are four core skill areas that you, as an aspiring data scientist, should focus on developing. They are:

Statistics, including both the underlying theory and real-world application
Programming, in at least one of Python or R, as well as SQL and the command line
Data engineering best practices
Communicating your work effectively

Bonus! Learn constantly

If you have read this far and feel at all discouraged — rest assured. The main skill in such a fast-moving field is learning how to learn and relearn. No doubt new frameworks, tools and methods will emerge in coming years.

The exact skillset you learn now may need to be entirely updated within five to ten years. Expect this. By doing so, and being prepared, you can stay ahead of the game through continuous relearning.

You can never know everything, and the truth is — no one ever does. But, if you master the fundamentals, you’ll be in a position to pick up anything else on a need-to-know basis.

And that is arguably the key to success in any fast developing discipline.
