Building a climate dashboard with Apache Pinot and Superset

In this blog post, I’d like to show you how Apache Pinot can be used to easily ingest, query, and visualize millions of climate events sourced from the NOAA storm database.

Bootstrap your climate dashboard

I’ve created an open source example which will fully bootstrap a climate data dashboard with Apache Pinot as the backend and Superset as the frontend. In three simple commands, you’ll be up and running and ready to analyze millions of storm events.

Repository: https://github.com/kbastani/climate-change-analysis

Running the dashboard

Superset is an open source, web-based business intelligence dashboard. You can think of it as a kind of “Google Analytics” for anything you want to analyze.

After cloning the GitHub repository for the example, go ahead and run the following commands.

$ docker network create PinotNetwork
$ docker-compose up -d
$ docker-compose logs -f --tail=100

After the containers have started and are running, you’ll need to bootstrap the cluster with the NOAA storm data. Make sure you give the cluster enough time and memory to start the different components before proceeding. When things look good in the logs, go ahead and run the next command to bootstrap the cluster.
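
One way to avoid bootstrapping too early is to poll a health check instead of watching the logs by hand. The sketch below is a generic retry helper, not something taken from the example repository. The controller address in the usage comment (`localhost:9000`, Pinot's usual default) and the `/health` path are assumptions; check the ports in the repository's docker-compose file before relying on them.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds or ATTEMPTS run out.
# Returns 0 on the first success and 1 if every attempt fails.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=1
  while [ "$retry_i" -le "$retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only if another attempt is coming.
    if [ "$retry_i" -lt "$retry_attempts" ]; then
      sleep "$retry_delay"
    fi
    retry_i=$((retry_i + 1))
  done
  return 1
}

# Example usage (assumed controller address, not taken from the repository):
#   retry 30 10 curl -fsS http://localhost:9000/health && sh ./bootstrap.sh
```

With the usage line above, the bootstrap script only starts once the health check returns successfully, and the loop gives up after 30 attempts spaced 10 seconds apart.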

$ sh ./bootstrap.sh

This script does all the heavy lifting of downloading the NOAA storm events database and importing the climate data into Pinot. After the bootstrap script runs to completion, a new browser window will appear asking you to sign in to Superset. Type in the very secure credentials admin/admin to log in and access the climate dashboards.

Analyzing climate data

For this blog post, I wanted to make it as easy as possible to bootstrap a dashboard so that you can start exploring the climate data. Under the hood of this example there are some interesting things going on. We basically have a Ferrari supercar in the form of a real-time OLAP datastore called Apache Pinot doing the heavy lifting.

Pinot is used at LinkedIn as an analytics backend, serving 700 million users in a variety of different features, such as the news feed. The next blog post in this series will focus just on the technical implementation and architecture.

Source data

The data I’ve decided to use for this dashboard is sourced from NOAA’s National Centers for Environmental Information (NCEI). While there are many different kinds of datasets one might want to use in a dashboard for analyzing climate data, the one I’ve chosen to focus on is storm events.

A comprehensive, detailed guide to the source data and columns is available in PDF format. After running the bootstrap, you can use Apache Pinot’s query console to quickly search through the data, which gives you a pretty good idea of what it contains.
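
The same kind of query can also be sent from the command line, since the Pinot broker accepts SQL as a JSON body on its `/query/sql` endpoint. The following is a minimal sketch under stated assumptions: the table name `storm_events` and the column `EVENT_TYPE` are placeholders modelled on the NOAA CSV headers, and the broker address `localhost:8099` is Pinot's usual default rather than something confirmed from the example repository.

```shell
#!/bin/sh
# Wrap a SQL string in the JSON body that the broker's /query/sql endpoint accepts.
# Note: this naive version does not escape double quotes or backslashes in the SQL.
pinot_sql_payload() {
  printf '{"sql":"%s"}' "$1"
}

# Placeholder table and column names; check the schema the bootstrap script creates.
TOP_EVENT_TYPES="SELECT EVENT_TYPE, COUNT(*) FROM storm_events GROUP BY EVENT_TYPE ORDER BY COUNT(*) DESC LIMIT 10"

# Print the request body so it can be inspected before sending.
pinot_sql_payload "$TOP_EVENT_TYPES"
echo

# To send it (assumed broker address):
#   curl -s -X POST -H 'Content-Type: application/json' \
#     -d "$(pinot_sql_payload "$TOP_EVENT_TYPES")" http://localhost:8099/query/sql
```

Printing the payload first is a deliberate choice: it lets you confirm the query text before pointing it at a running cluster.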

According to the NCEI website, the Storm Events Database is used to generate the official NOAA Storm Data publication, documenting:

  1. The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce;
  2. Rare, unusual, weather phenomena that generate media attention, such as snow flurries in South Florida or the San Diego coastal area; and
  3. Other significant meteorological events, such as record maximum or minimum temperatures or precipitation that occur in connection with another event.

The database contains millions of storm events recorded from January 1950 to May 2020, as entered by NOAA’s National Weather Service (NWS).

Climate change analysis with Superset

With Superset, you can create your own dashboards using Apache Pinot as the datasource. When creating the dashboards included in this example, I could have spent months on creating cool interactive charts, but to start out I decided to create just a few.

Since the source data contains geolocation coordinates for each storm event, the first thing I thought of visualizing was a map of the US showing all storms since 1950. That was a tad ambitious since there are over 1.6 million storm events.

I decided to implement some yearly filters as well as storm event types. As I played around more with the charting tools in Superset, I figured out how to visualize how many people were injured as a result of each storm event. Below we can see a tornado that injured 30 people, surrounded by many other different types of storms.

[Image: storm event map]

As a part of this dashboard, you can now see how many people were injured in any storm event by geographic location within a time period. The storm map also sizes the points on the map and color codes them based on the magnitude of injuries and the type of storm event. In the screenshot above, we have pink circles representing tornado injuries.
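
A map like this boils down to a filtered query: pick a year and an event type, keep the events with injuries, and return coordinates plus the injury count for sizing the points. The sketch below only builds such a query as text. The table name `storm_events` and the columns `EVENT_YEAR`, `EVENT_TYPE`, `BEGIN_LAT`, `BEGIN_LON`, and `INJURIES_DIRECT` are hypothetical names modelled on the NOAA CSV headers; the actual schema and the query Superset generates may differ.

```shell
#!/bin/sh
# Print a SQL query for injury-causing events of one type in one year.
# All table and column names are hypothetical placeholders.
injury_map_query() {
  imq_year=$1
  imq_type=$2
  # Reject non-numeric years so the value can be spliced into the SQL safely.
  case "$imq_year" in
    ''|*[!0-9]*) echo "year must be numeric" >&2; return 1 ;;
  esac
  # The event type is inserted as-is, so it must not contain single quotes.
  printf "SELECT BEGIN_LAT, BEGIN_LON, EVENT_TYPE, INJURIES_DIRECT FROM storm_events WHERE EVENT_YEAR = %s AND EVENT_TYPE = '%s' AND INJURIES_DIRECT > 0 ORDER BY INJURIES_DIRECT DESC LIMIT 500\n" "$imq_year" "$imq_type"
}

injury_map_query 2011 Tornado
```

Ordering by the injury count and capping the result keeps the point count manageable, which matters given how many events the full dataset holds.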

Hail and thunderstorm analysis

If anyone was wondering if data science is actual science, the answer is probably no. I spend time creating open source examples and recipes so others can analyze the data without bothering with all the boring infrastructure and software things. Sometimes during this process of creating examples, it feels good to point at some chart and say something exciting about what I find. I encourage more people to do that, whether or not it is scientific to make such claims. There is so much climate data, and there are so many ways to visualize how it is changing, that I think making interesting discoveries is a responsibility shared by all of society.

Here is one example where I discovered an interesting anomaly in the periodicity and intensity of thunderstorm and hail storm seasons.

[Image: chart of hail and thunderstorm events since 1950]

What we are looking at here is over twenty-seven thousand hail and thunderstorm events since 1950. Naturally, the count appears to increase over time because the NWS has found better ways to collect events. I spent some time analyzing this chart to understand the implications of what I was seeing. Where hail storms and thunderstorms diverge significantly over the seventy years charted here, it’s possible that the divergence correlates with damaging events such as tornadoes, wildfires, droughts, and heat waves. I’m glad I was able to find this visualization, because it certainly raises questions that a climate scientist might be able to answer.

Storm frequency and seasonal variability

The next visualization I came up with shows how storm events vary from season to season over a period of years.

[Image: chart of storm events by season and year]

This chart is far more palatable than the last one I showed. If anything, it looks super pretty, while also being quite useful. Here we can quickly see anomalies year to year in the volume of certain types of events. One such example is evidence of increased floods in 2018 and 2019. We also see that both extreme cold and excessive heat have been far more prevalent in the last three years. Overall, when analyzing this chart, if things aren’t lining up nicely in equal proportions, that could potentially be a sign of climate change.

Climate heat map

The last chart I came up with for this blog post was the most interesting for both its visual aesthetic and interpretability.

[Image: heat map of yearly climate events by US state and region]

Above, we have the yearly climate events as a heat map that I’ve grouped by US state and region. The very first thing I noticed is that everything is indeed bigger in Texas, even the storms! The next thing I noticed was that California has started to look similar to Texas in the last six years. Another interesting area worth further exploration is the years 2008 and 2011. Both years show an abnormal increase in storm events that affected every US state and region. There is clearly a reason for that, but finding it will take more exploration using other kinds of analysis. It would be hard to conclude anything about the cause just by looking at this chart.

Heat maps like this are great for identifying things to investigate, rather than for drawing conclusions.

Conclusion

As a part of this project, I wanted to take the opportunity to craft an example for folks while also teaching myself more about climate change. I’ve found that there is so much to this subject.

Often, I see folks on Twitter toss around the terms climate change and global warming as if these things were as easy to understand as watching one or two documentaries on Netflix. Creating this dashboard gave me an opportunity to understand the hard work that goes into creating both the science and infrastructure necessary to analyze climate data.

Climate change is a broad topic, and global warming is just one part of it. The climate is actually always changing, and it always has been. Some of the world’s hottest and most arid deserts in Africa used to be lakes. In fact, the world may never have been as hospitable to our lifestyles as it is today. What climate scientists spend their time on is understanding the history of climate change so that they can predict future damage to the many different ecosystems hosting biological life around our world.

Recurring extreme weather events, like hurricanes and tornadoes, happen more or less frequently in a given area depending on climate events. When a sudden and unpredicted climate event happens, it may cost billions of dollars and result in many injuries and deaths.

Next steps

Thanks for reading! Stay tuned for the next blog post that dives deep into the technical bowels of this example to understand how OLAP datastores like Apache Pinot work.

Please share this blog post on social media to get the word out about climate science and climate change. Also, if you’re a scientist and want to work on doing some innovative climate research using Apache Pinot, please reach out to me. I’d love to help.

Source: https://medium.com/apache-pinot-developer-blog/building-a-climate-dashboard-with-apache-pinot-and-superset-d3ee8cb7941d
