Data Cleaning Is Finally Being Automated

Editor’s note: The Towards Data Science podcast’s “Climbing the Data Science Ladder” series is hosted by Jeremie Harris. Jeremie helps run a data science mentorship startup called SharpestMinds. You can listen to the podcast below:

It’s cliché to say that data cleaning accounts for 80% of a data scientist’s job, but it’s directionally true.

That’s too bad, because fun things like data exploration, visualization and modelling are the reason most people get into data science. So it’s a good thing that there’s a major push underway in industry to automate data cleaning as much as possible.

One of the leaders of that effort is Ihab Ilyas, a professor at the University of Waterloo and founder of two companies, Tamr and Inductiv, both of which are focused on the early stages of the data science lifecycle: data cleaning and data integration. Ihab knows an awful lot about data cleaning and data engineering, and has some really great insights to share about the future direction of the space — including what work is left for data scientists, once you automate away data cleaning.

Here were some of my biggest takeaways from the conversation:

  • Data cleaning involves a lot of things, one of which is dealing with missing values. Historically, missing values have often been filled in manually by subject matter experts who can make educated guesses about the data, but automated techniques can work well (and usually do better) at scale.

  • These automated strategies can range from fairly naive approaches (e.g. replacing a value with the median or average value of other points in the dataset), to more sophisticated techniques (e.g. using a predictive model to guess at missing values).

  • The distinctions between different parts of the data science lifecycle are often arbitrary, but clearly defining the boundaries between data cleaning, data exploration and modelling is nonetheless essential to ensure that problems can be solved in a contained and modular fashion. This idea is one part of the data science best practices that make up DataOps, a topic we’ve discussed on the podcast before.

  • It’s clear that data cleaning, like modelling, is not immune to automation. As a result, it’s likely that data scientists will find themselves leaning more and more into their subject matter expertise, communication and engineering skills in the future, rather than spending their time on dealing with missing values, hyperparameter optimization or model selection.

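To make the range of automated strategies for missing values concrete, here is a minimal sketch of the two ends mentioned in the takeaways: filling a gap with the median, and filling it with a prediction from a simple model. The toy data, the function names `impute_median` and `impute_linear`, and the choice of a one-variable least-squares line as the "predictive model" are all assumptions made for illustration; nothing here comes from the podcast or from Tamr or Inductiv tooling.

```python
from statistics import median


def impute_median(values):
    """Naive approach: replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def impute_linear(xs, ys):
    """Model-based approach: fit a least-squares line on rows where y is known,
    then use it to predict y wherever it is missing (None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]


# Hypothetical toy data: years of experience and salary (in thousands),
# with the last salary missing.
years = [1, 2, 3, 4, 5]
salary = [40, 50, 60, 70, None]

median_filled = impute_median(salary)
model_filled = impute_linear(years, salary)

print(median_filled)
print(model_filled)
```

On this toy data the median fill ignores the upward trend and lands in the middle of the observed salaries, while the fitted line extends the trend. Real datasets are rarely this tidy, so the model-based guess is only as good as the relationship it manages to learn.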

You can follow Ihab on Twitter here and you can follow me on Twitter here.

Translated from: https://towardsdatascience.com/data-cleaning-is-finally-being-automated-8cc964ea2e12
