数据工程工具

最新推荐文章于 2024-06-19 09:48:59 发布

weixin_26711425

最新推荐文章于 2024-06-19 09:48:59 发布

阅读量354

点赞数

文章标签： python java 人工智能大数据

原文链接：https://medium.com/dwelo-r-d/data-engineering-tools-dd69ba9f62b8

版权

Do you like asking questions? Then welcome to the world of analytics.

你喜欢问问题吗？ 然后欢迎来到分析世界。

Before joining Dwelo as an intern, I had only known one famous BI tool — Tableau. But to become a data engineer or a data analyst one must be proficient with new tools as well. So, thanks to my team that they let me experiment with all the new visualization tools that are Periscope, Sigma Computing, and introducing to DBT (Data Built Tool) Models. And, I would say I’m quite fortunate to have worked with the stack proposed by Fishtown Analytics.

在作为实习生加入Dwelo之前，我只知道一个著名的BI工具-Tableau。但是，要成为一名数据工程师或数据分析师，还必须精通新工具。因此，多亏了我的团队，他们让我尝试了所有新的可视化工具，包括Periscope，Sigma Computing和DBT(数据构建工具)模型。而且，我很幸运能够使用Fishtown Analytics提出的堆栈。

What is Periscope?

什么是潜望镜？

Periscope is software for envisioning data utilizing SQL, Python, and R in one spot. It allows business clients to make choices on organization measurements. It was fascinating to analyze all of Dwelo’s organization metrics as an IoT company. They favored utilizing SQL for visualizing information. To utilize this software, one must know SQL at an intermediate level. It also allows you to create a View of the data and visualize it by simply dragging and dropping the fields.

Periscope是一种用于在一处使用SQL，Python和R来预想数据的软件。它使业务客户可以选择组织度量标准。作为物联网公司，分析Dwelo的所有组织指标都非常有趣。他们赞成利用SQL来可视化信息。要使用此软件，必须了解中级SQL。它还允许您创建数据的视图并通过简单地拖放字段将其可视化。

More details: https://dtdocs.sisense.com/topic/sql

更多详细信息： https : //dtdocs.sisense.com/topic/sql

Yet, one question why Periscope when there are so numerous different analytical tools?

然而，有一个问题为什么为什么Periscope拥有如此众多的不同分析工具？

Well, the appropriate response is each leader can make their own dashboard and convey the profound experiences that drive the business forward. During my introductory days, I got an opportunity to explore and make my own dashboard with the existing data models which contained some tables like Communities, Devices, and Units. I could analyze some interesting questions immediately such as…

好吧，适当的回应是每个领导者都可以制作自己的仪表板并传达推动业务发展的深刻经验。在介绍性的几天里，我有机会使用现有的数据模型来探索和制作自己的仪表板，其中包含一些表，例如“社区”，“设备”和“单位”。我可以立即分析一些有趣的问题，例如…

What are the battery level readings per device?

每个设备的电池电量读数是多少？

So, I just wrote a simple join query with aggregate functions like

所以，我只是用聚合函数编写了一个简单的联接查询

and following query provided me the insights of devices such as, there are around 22k thermostats that have a battery level of 100. This means that 73% of thermostats are running with full power.

以下查询为我提供了有关设备的见解，例如，大约22k的恒温器的电池电量为100。这意味着73％的恒温器在全功率下运行。

On the off chance that you’re curious about Dwelo, the central goal is to make living and working in rental lodging more magical, effective, and automated. They use Z-Wave devices in apartments, additional thermostats that can communicate with Google cloud, and the perimeter doors can be controlled by an access system like pdk.io. Every apartment has a hub and device installation where the tenants can control their lights and there are property managers as well who need to monitor units and know things like if the painters left the air conditioner running in a vacant unit.

如果您对Dwelo感到好奇，那么我们的主要目标是使出租房的生活和工作更加神奇，有效和自动化。他们在公寓中使用Z-Wave设备，可以与Google Cloud通信的其他恒温器，并且外围门可以由pdk.io之类的访问系统控制。每个公寓都有一个集线器和设备安装，租户可以在那里控制他们的灯光，还有物业经理也需要监视单位并了解诸如油漆工是否让空调在空置的房间中运行等信息。

For more details on our sensor architecture check out: https://medium.com/dwelo-r-d/what-are-you-syncing-about-16c9151e9ff

有关我们的传感器架构的更多详细信息，请查看： https : //medium.com/dwelo-rd/what-are-you-syncing-about-16c9151e9ff

This allowed me to take some interesting insights where I could analyze questions like:

这使我获得了一些有趣的见解，可以分析以下问题：

What are the total number of active hubs in the last month?

上个月活动的集线器总数是多少？

What are the average number of users created monthly?

每月创建的平均用户数是多少？

What are the average number of hubs created monthly?

每月平均创建多少个集线器？

Periscope certainly centers around the comprehension of SQL, information base, and the questions you have to examine. But let us imagine a scenario where someone is not that open to working with SQL but might want to analyze the information in a simpler manner? One solution could be trying to improve your SQL skills to continue in the same tool. Or you can always experiment with another tool such as Sigma Computing, a worksheet-based analytics tool inspired by Excel.

Periscope当然围绕着SQL的理解，信息库以及您必须检查的问题。但是，让我们想象一下这样的场景：某人不愿意使用SQL，但可能想以一种更简单的方式分析信息？一种解决方案可能是尝试提高您SQL技能以继续使用同一工具。或者，您始终可以尝试使用另一种工具，例如Sigma Computing，这是一种受Excel启发的基于工作表的分析工具。

What is Sigma Computing?

什么是Sigma Computing？

Sigma provides business intelligence and analytics administration by permitting the data teams at each level to analyze information, without code, and settle on knowledge-driven choices rapidly.

Sigma通过允许每个级别的数据团队分析信息而无需编写代码，并Swift解决知识驱动的选择，从而提供了商务智能和分析管理。

We explored this whole tool while migrating the existing dashboard from the Periscope. We first replicated a chart about Community usage which consisted of columns such as the unique ID of the community, hub metadata, occupancy information, and hub connectivity.

我们在从Periscope迁移现有仪表板的同时探索了整个工具。我们首先复制了有关社区使用情况的图表，该图表由诸如社区的唯一ID，集线器元数据，占用信息和集线器连接性之类的列组成。

To migrate this chart to Sigma Computing, we attempted to comprehend the SQL rationale behind this, so we broke down a long query powering a chart part by part and added the required tables in Sigma like Community, Units, and Hubs.

为了将此图表迁移到Sigma Computing，我们试图理解其背后SQL基础，因此我们分解了一个冗长的查询，以部分方式支持图表，并在Sigma中添加了必需的表，例如社区，单位和集线器。

In Periscope, to join two tables we will need to compose a basic join query — yet, Sigma gives you a simpler alternative by simply clicking the link option and that will interface the two tables.

在Periscope中，要联接两个表，我们将需要组成一个基本的联接查询-但是，Sigma通过单击链接选项为您提供了一个更简单的替代方法，它将连接这两个表。

It makes it easier to understand the concept of Primary and Foreign keys. The image below is an example of how to tables are joined in the UI:

它使您更容易理解主键和外键的概念。下图是如何在UI中联接表的示例：

How do WHERE, GROUP BY, and aggregate function work in Sigma?

WHERE，GROUP BY和聚合函数在Sigma中如何工作？

Sigma has some built-in operations like group by and aggregate functions like Count, Sum, Min, and Max which are easy to activate through a mouse click, while in Periscope the user must know how to write the group by clause and when to use it. Sigma also has a filter feature which behaves like a WHERE clause, allowing you to include and exclude values through a UI.

Sigma具有一些内置操作，例如分组依据和合计函数(如Count，Sum，Min和Max)，可通过单击鼠标轻松激活，而在Periscope中，用户必须知道如何编写分组依据子句以及何时使用它。 Sigma还具有类似于WHERE子句的过滤器功能，允许您通过UI包含和排除值。

What if Sigma detects a JSON column?

如果Sigma检测到JSON列怎么办？

Well, that is the best aspect of the Sigma! It makes so straightforward by including an ‘extract’ feature to extricate semi-structured information. At the point that Sigma recognizes JSON or Variant segment types, ‘Extract Columns’ turns into an alternative in the section menu. On the off chance that your information is semi-organized and you do not see the ‘Extract Columns’ alternative, you can utilize the sort work JSON or Variant to change how Sigma deciphers the segment of information.

好吧，那是Sigma最好的方面！通过包括“提取”功能来提取半结构化信息，它变得如此简单。在Sigma识别JSON或Variant段类型的时候，“提取列”在部分菜单中变为备用。如果您的信息是半组织的，并且看不到“提取列”的替代方法，则可以使用排序工作JSON或Variant来更改Sigma解密信息段的方式。

How is the Sigma dashboard is different from Periscope?

Sigma仪表板与Periscope有何不同？

Sigma permits you to effortlessly share the analysis you have made in your worksheets through a custom dashboard.

Sigma允许您通过自定义仪表板轻松共享在工作表中所做的分析。

Challenges faced?

面临挑战？

A complex and large amount of data can slow the worksheet.
复杂而大量的数据会使工作表变慢。
If I compare the tool with Tableau, then there is an extremely limited number of charts.
如果将工具与Tableau进行比较，则图表数量非常有限。

Why did we ultimately go with Sigma Computing?

为什么我们最终选择了Sigma Computing？

We chose this tool so that each and every employee who is not technically strong can visualize and effectively utilize the data. Also, it is easy to use with drag and drop features. The spreadsheet-style interface and formulas are an easy transition for business users who are not used to working with complex BI platforms. They can play with any data source that the Sigma administrator grants them permission to access and create their own data models. It is now great to see how other employees in Dwelo are able to analyze their data by just having some preexisting Excel knowledge.

我们选择此工具是为了使每个技术上不熟练的员工都可以可视化并有效地利用数据。而且，它易于使用拖放功能。对于不习惯使用复杂BI平台的业务用户，电子表格样式的界面和公式是一种轻松的过渡。他们可以使用Sigma管理员授予他们访问和创建自己的数据模型权限的任何数据源。现在，很高兴看到Dwelo中的其他员工如何仅通过具备一些预先存在的Excel知识就能分析其数据。

The complete tutorial could be found on: https://help.sigmacomputing.com/hc/en-us/categories/360003351314-Tutorials

完整的教程可以在以下网站找到： https : //help.sigmacomputing.com/hc/en-us/categories/360003351314-Tutorials

Our next project was to analyze and visualize the data on user behavior in Sigma but while exploring we realized we will need some additional data models to be built. It was a good opportunity for us to learn more about the DBT model.

我们的下一个项目是分析和可视化Sigma中有关用户行为的数据，但是在探索时，我们意识到我们将需要构建一些其他数据模型。对于我们来说，这是一个学习DBT模型的好机会。

What is the DBT model?

什么是DBT模型？

DBT is a command line prompt that allows data engineers to transform the data in their warehouses very efficiently by simply writing a single select statement. It performs T in ETL — the transformation of the data. Dwelo uses traditional data warehouse tools like Redshift, BigQuery, and Stitch, and hence it was best to fit to use the DBT model.

DBT是一种命令行提示符，它使数据工程师只需编写一条select语句，即可非常高效地转换仓库中的数据。它在ETL中执行T-数据转换。 Dwelo使用传统的数据仓库工具，例如Redshift，BigQuery和Stitch，因此最好适合使用DBT模型。

The only task of DBT is to take code, compile it in SQL and then run against the database. We understood this concept while creating one new model for user behavior.

DBT的唯一任务是获取代码，用SQL对其进行编译，然后针对数据库运行。我们在为用户行为创建一个新模型时理解了这一概念。

Why only one select statement?

为什么只有一个选择语句？

DBT creates only one .sql extension file at a time hence it will create one model or one select statement. Models are materialized either as view, table, or incremental models. For our model, we preferred creating a table as they process faster queries and slower transformation. The objective of the staging layer is to make organizing models. They take crude information, and clean and set them up for further analysis. So we essentially made staging tables for central information.

DBT一次仅创建一个.sql扩展文件，因此它将创建一个模型或一个select语句。模型以视图，表或增量模型的形式实现。对于我们的模型，我们更喜欢创建一个表，因为它们可以处理更快的查询和更慢的转换。登台层的目的是建立组织模型。他们获取原始信息，并清理并设置它们以进行进一步分析。因此，我们实质上是为中央信息制作了临时表。

dim_table is the short table where each line is a person, place, or thing. This is a different to afct_table they are small and mutable. I created dim_table for our data using the ref function, which created a dependency graph between the models

dim_table是简短表格，其中每一行都是一个人，一个地方或事物。这与afct_table不同，它们很小且易变。我使用ref函数为我们的数据创建了dim_table，该函数在模型之间创建了依赖图

How do we run our dbt commands?

我们如何运行dbt命令？

We preferred using Git to cast out our changes using docker and ubuntu. For every new change in DBT, we simply pass the dbt run model name command

我们更喜欢使用Git使用docker和ubuntu取消更改。对于DBT中的每个新更改，我们只需传递dbt run model name命令

Once we created the tables it was important for us to pass certain data types as unique and not null in order to avoid duplicity and redundancy. For that purpose, we created a test case where DBT checks whether the test failed or pass. I simply passed dbt test model name. This helped me create a successful model.

一旦创建了表，对于我们来说，将某些数据类型作为唯一的而不是非null传递是很重要的，以避免重复和冗余。为此，我们创建了一个测试用例，其中DBT检查测试是否失败。我只是通过了dbt测试模型名称。这帮助我创建了成功的模型。

For more information on dbt you can refer to: https://docs.getdbt.com/docs/building-a-dbt-project/tests

有关dbt的更多信息，请参见： https : //docs.getdbt.com/docs/building-a-dbt-project/tests

To become a Data Engineer or Scientist one must love information and have curiosity to learn about the data. As Peter Sondergaard says “Information is the oil of the 21st century, and analytics is the combustion engine.” So keep learning!

要成为一名数据工程师或科学家，您必须热爱信息并且好奇地了解数据。正如Peter Sondergaard所说：“信息是21世纪的石油，分析是内燃机。” 所以继续学习吧！

https://blog.getdbt.com/track-data-changes-with-dbt-snapshots/

https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355

https://blog.getdbt.com/what--exactly--is-dbt-/

https://blog.getdbt.com/确切是-is-dbt- /

https://www.datawarehouse4u.info/reviews/business-intelligence/periscope-data

翻译自: https://medium.com/dwelo-r-d/data-engineering-tools-dd69ba9f62b8

weixin_26711425

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
数据工程工具

Do you like asking questions? Then welcome to the world of analytics. 你喜欢问问题吗？然后欢迎来到分析世界。 Before joining Dwelo as an intern, I had only known one famous BI tool — Tableau. But to become a data engine...
复制链接

扫一扫