Cataloging Tools for Data Teams

weixin_26750481

Original article: https://towardsdatascience.com/cataloging-tools-for-data-teams-8d62d7a4cd95

Data Science

With an explosion in the variety of data storage and retrieval systems over the last decade, data teams have had to deal with a large number of data sources, each serving specific use cases. This gave rise to ETL solutions that integrate highly heterogeneous data sources, along with more flexible, highly scalable data warehouses and data lakes. With so many sources, and with data being copied, transformed and integrated into a number of targets, a data system gets quite complex. That said, there are tools like Airflow to manage workflow orchestration, tools to monitor jobs, and so on.

Data discovery is essential for data scientists, data analysts and business teams

If you think that integrating all the data into a single source of truth, or even several, gets the job done, nothing could be further from the truth. According to one report, data scientists spend at least 50% of their time finding, cleaning and preparing data. Data scientists, data analysts and business teams often struggle to find out what the data means and where it comes from; both problems are common in data teams. The former can be solved by implementing a data catalog or a data dictionary, and the latter by implementing a data lineage solution.
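
To make the distinction concrete, here is a minimal Python sketch of the two ideas. It is not taken from any of the tools discussed in this post: the `ColumnDoc` and `TableDoc` classes, the `trace_lineage` helper and every table name are hypothetical, invented purely for illustration. The catalog part records what a table and its columns mean, and the lineage part records which tables it was derived from.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnDoc:
    """One data-dictionary entry: what a column means."""
    name: str
    dtype: str
    description: str


@dataclass
class TableDoc:
    """Catalog entry for a table, plus the tables it is directly derived from."""
    name: str
    owner: str
    columns: list[ColumnDoc] = field(default_factory=list)
    upstream: list[str] = field(default_factory=list)  # direct lineage parents


def trace_lineage(catalog: dict[str, TableDoc], table: str) -> list[str]:
    """Return every upstream table of `table`, nearest ancestors first."""
    seen: list[str] = []
    queue = list(catalog[table].upstream)
    while queue:
        parent = queue.pop(0)  # breadth-first walk over the lineage graph
        if parent in seen:
            continue
        seen.append(parent)
        if parent in catalog:
            queue.extend(catalog[parent].upstream)
    return seen


# Hypothetical tables, invented for illustration only.
catalog = {
    "raw.orders": TableDoc("raw.orders", owner="ingest-team"),
    "raw.customers": TableDoc("raw.customers", owner="ingest-team"),
    "staging.orders_clean": TableDoc(
        "staging.orders_clean",
        owner="data-eng",
        columns=[ColumnDoc("order_total", "numeric", "Order value in USD, tax included")],
        upstream=["raw.orders"],
    ),
    "marts.revenue_by_customer": TableDoc(
        "marts.revenue_by_customer",
        owner="analytics",
        upstream=["staging.orders_clean", "raw.customers"],
    ),
}

print(trace_lineage(catalog, "marts.revenue_by_customer"))
# -> ['staging.orders_clean', 'raw.customers', 'raw.orders']
```

The column description answers "what does this data mean?", while the lineage walk answers "where did it come from?". Real tools persist this metadata in a database and collect it automatically rather than by hand.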

Both data lineage and data cataloging fall under the broad umbrella of metadata management. In this article, we'll talk about the most popular and effective data cataloging tools available on the market. We'll explore open-source projects, proprietary software and cloud-based solutions that address data discovery, cataloging, lineage and metadata management in general.

Data systems are highly heterogeneous today: they're multi-cloud, multi-source and even multi-target.

Open-Source Data Catalogs

There are not many contenders, but the active ones are doing a good job. One example is Magda, which was initiated by a government organization and has since been adopted by many others. With backend storage in Postgres and search powered by Elasticsearch, Magda provides a search-engine-like interface for searching through your data. The project also offers a live demo with real data sets.
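
To give a feel for what "search over metadata" means, here is a toy inverted index in Python. It is emphatically not how Magda or Elasticsearch is implemented, and the dataset ids, titles and descriptions are made up for the example. It only illustrates the basic idea: index the words in each dataset's title and description, then return the datasets that match every term in a query.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(datasets):
    """Map each term to the set of dataset ids whose metadata contains it."""
    index = defaultdict(set)
    for ds_id, meta in datasets.items():
        for token in tokenize(meta["title"] + " " + meta["description"]):
            index[token].add(ds_id)
    return index


def search(index, query):
    """Return the ids of datasets whose metadata contains every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


# Hypothetical dataset metadata, invented for illustration only.
datasets = {
    "ds-1": {"title": "Road traffic counts",
             "description": "Hourly vehicle counts per road segment"},
    "ds-2": {"title": "Public transport timetable",
             "description": "Bus and train schedules"},
    "ds-3": {"title": "Road closures",
             "description": "Planned road works and closures"},
}

index = build_index(datasets)
print(search(index, "road"))         # -> ['ds-1', 'ds-3']
print(search(index, "road counts"))  # -> ['ds-1']
```

A production search engine adds ranking, stemming, typo tolerance and much more, which is exactly why catalogs tend to delegate search to something like Elasticsearch instead of building it themselves.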

Figure: Magda, an open-source data catalog

Before Magda came into existence, CKAN was the major open-source data catalog. In fact, Magda also uses parts of CKAN under the hood. The governments of Canada and the United States use CKAN as one of their metadata management systems.

Open-sourced by Lyft about a year and a half ago, Amundsen is another good contender in this area. It has been adopted by a number of very big companies, such as Workday and Asana. Unlike Magda, Amundsen uses Neo4j as the backend database to store the metadata, but like Magda it relies on Elasticsearch for search. A Medium post by Tao Feng covers Amundsen's architecture in more detail.
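
Why would a catalog store metadata in a graph database? A rough way to see it is to model metadata as nodes and relationships. The sketch below is a tiny in-memory property graph in plain Python. It does not use Neo4j and does not reproduce Amundsen's actual schema; the `MetadataGraph` class, the node ids and the relation names `OWNS` and `READS` are all assumptions made for the example.

```python
from collections import defaultdict


class MetadataGraph:
    """A tiny in-memory property graph: labeled nodes and typed edges."""

    def __init__(self):
        self.nodes = {}                 # node id -> {"label": ..., other properties}
        self.edges = defaultdict(list)  # source id -> [(relation, target id), ...]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, source, relation, target):
        self.edges[source].append((relation, target))

    def incoming(self, target, relation):
        """Return ids of all nodes that point at `target` via `relation`."""
        return sorted(
            source
            for source, links in self.edges.items()
            for rel, dest in links
            if rel == relation and dest == target
        )


# Hypothetical entities, invented for illustration only.
graph = MetadataGraph()
graph.add_node("table:orders", "Table", description="One row per customer order")
graph.add_node("user:alice", "User")
graph.add_node("user:bob", "User")
graph.add_node("dash:revenue", "Dashboard")
graph.add_edge("user:alice", "OWNS", "table:orders")
graph.add_edge("user:bob", "READS", "table:orders")
graph.add_edge("dash:revenue", "READS", "table:orders")

print(graph.incoming("table:orders", "OWNS"))   # -> ['user:alice']
print(graph.incoming("table:orders", "READS"))  # -> ['dash:revenue', 'user:bob']
```

Questions like "who owns this table?" or "which dashboards read from it?" become simple relationship lookups, which is the kind of query a graph database is designed to answer efficiently.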

Video: a talk about Lyft's Amundsen at DataCouncil.

There are a number of other open-source tools, such as LinkedIn's DataHub, Airbnb's Dataportal, Netflix's Metacat and WeWork's Marquez. The original post links to an article with good resources on these tools.

Honorable mention: Spotify hasn't open-sourced Lexikon yet, but there is an interesting write-up of how it solved data discovery issues for Spotify's data scientists.

Cloud Platform-Specific Data Catalogs

All major cloud platforms now offer a substantial number of services. With cloud-based orchestration services, data pipelines and ETL solutions came the need for a basic data cataloging component. Most of these solutions, such as the AWS Glue Data Catalog and Google Cloud Data Catalog, are built around or interoperate with the Hive Metastore. Microsoft has its own implementation in Azure Data Catalog.

Needless to say, these tools work really well with the tightly coupled services on their own cloud platforms, but all of them have limited features. They're not meant for metadata management as such; their job is to hold enough metadata to support ETL operations, orchestration pipelines and so on. Think of them as being more like your database's or data warehouse's system catalog views and tables, with some additional information. That's about it.
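
If you have never looked at a system catalog directly, the following Python sketch shows what one looks like in SQLite, using only the standard-library `sqlite3` module. The `orders` and `customers` tables and the helper names `list_tables` and `describe_table` are made up for the example; `sqlite_master` and `PRAGMA table_info` are SQLite's own catalog facilities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL, total REAL)"
)
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def list_tables(conn):
    """List table names by querying SQLite's built-in catalog table."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def describe_table(conn, table):
    """Return (column name, declared type) pairs for one table."""
    # PRAGMA arguments cannot be bound as parameters, so only accept known tables.
    if table not in list_tables(conn):
        raise ValueError(f"unknown table: {table}")
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [(name, col_type) for _cid, name, col_type, _notnull, _default, _pk in rows]


print(list_tables(conn))
# -> ['customers', 'orders']
print(describe_table(conn, "orders"))
# -> [('id', 'INTEGER'), ('customer_id', 'INTEGER'), ('total', 'REAL')]
```

This tells you which tables and columns exist and what their types are, but nothing about what the data means, who owns it or where it came from. That gap is what dedicated metadata management tools try to fill.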

Proprietary Software

With commercial goals in mind, many companies have built fantastic, full-fledged products for metadata management. Atlan, Ataccama and Alation are some of the major players in this market. There are many traditional players too, with Informatica being the most popular among them, so much so that it was named a Leader in the 2019 Gartner Magic Quadrant for Metadata Management Solutions.

Proprietary software has an edge over open-source software in being ready for enterprise use, offering strong technical support and delivering product enhancements driven by client needs. These products usually have a very engaging UI and a fairly bug-free codebase. All of that obviously comes at a price 😄

Conclusion

In the end, it is worth reiterating that metadata management is one of the more neglected problems in the data engineering world. Using proper tools for metadata management and data discovery can make life easier and work more efficient for the data scientists, data analysts and business users in your company. Make sure you choose the ones that suit your needs!
