A Big Data Analysis of Meetup Events Using Spark NLP, Kafka and Vegas Visualization

We started out as a working group from bigdata.ro. The team was made up of Valentina Crisan, Ovidiu Podariu, Maria Catana, Cristian Stanciulescu, Edwin Brinza and me, Andrei Deusteanu. Our main purpose was to learn about and practice Spark Structured Streaming, Machine Learning and Kafka. We designed the entire use case and then built the architecture from scratch.

Since Meetup.com provides data through a real-time API, we used it as our main data source. We did not use the data for commercial purposes, just for testing.

This is a learning story. We did not really know from the beginning what would or would not be possible. Looking back, some of the steps could have been done better. But, hey, that’s how life works in general.

The problems we tried to solve:

  • Allow meetup organizers to identify trending topics related to their meetup. We computed Trending Topics based on the descriptions of the events matching the tags of interest to us. We did this using the John Snow Labs Spark NLP library to extract entities.
  • Determine which Meetup events attract the most responses within our region. To do this, we monitored the RSVPs for meetups based on certain tags related to our domain of interest, Big Data.

For this we developed 2 sets of visualizations:

  • Trending Keywords
  • RSVPs Distribution

Architecture

The first 2 elements are common to both sets of visualizations. This is the part that reads data from the Meetup.com API and saves it into 2 Kafka topics.

  1. The Stream Reader script fetches data on Yes RSVPs, filtered by certain tags, from the Meetup Stream API. It then selects the relevant columns that we need and saves this data into the rsvps_filtered_stream Kafka topic.
  2. For each RSVP, the Stream Reader script then fetches the corresponding event data, but only if the event_id does not yet exist in the events.idx file. This way we make sure that we read event data only once. The setup for the Stream Reader script can be found here: Install Kafka and fetch RSVPs. A minimal sketch of such a producer follows this list.
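
The original Stream Reader code is only linked above, not shown. Purely as an illustration of the shape of such a producer, here is a minimal Scala sketch; the local broker address, the tag list and the substring-based filtering are assumptions, and a real implementation would parse the RSVP JSON properly.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.io.Source

object StreamReaderSketch {
  def main(args: Array[String]): Unit = {
    // Plain Kafka producer pointed at a local broker (assumption).
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Hypothetical tags of interest; the real script filters on Big Data related tags.
    val tags = Seq("big-data", "apache-spark", "apache-kafka")

    // The Meetup RSVP stream emits one JSON document per line.
    // Keep only Yes RSVPs that mention one of our tags and forward them to Kafka.
    for (line <- Source.fromURL("http://stream.meetup.com/2/rsvps").getLines()) {
      val isYes      = line.contains("\"response\":\"yes\"")
      val matchesTag = tags.exists(t => line.contains(t)) // crude check; use a JSON parser in practice
      if (isYes && matchesTag)
        producer.send(new ProducerRecord[String, String]("rsvps_filtered_stream", line))
    }
    producer.close()
  }
}
```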

Trending Keywords

3. The Spark ML — NER Annotator reads data from the Kafka topic events and then applies a Named Entity Recognition pipeline with Spark NLP. Finally, it saves the annotated data in the Kafka topic TOPIC_KEYWORDS. The Notebook with the code can be found here.
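
The notebook itself is linked above. As a hedged sketch of how this step can begin, reading the events topic into a Spark DataFrame might look like the following; the broker address and the simplified JSON schema are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("ner-annotator").getOrCreate()

// Simplified event schema; the real topic carries more fields.
val eventSchema = new StructType()
  .add("event_id", StringType)
  .add("event_name", StringType)
  .add("description", StringType)   // the free text fed to the NER pipeline

val events = spark.read
  .format("kafka")                                  // requires the spark-sql-kafka package
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "earliest")
  .load()
  .select(from_json(col("value").cast("string"), eventSchema).as("e"))
  .select(col("e.event_id"), col("e.description").as("text"))
```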

4. Using KSQL, we create 2 consecutive streams to transform the data and finally 1 table that will be used by Spark for the visualization. In Big Data architectures, SQL engines only build logical objects that assign metadata to the physical-layer objects; in our case, these were the streams we built on top of the topics. We link data from TOPIC_KEYWORDS to a new stream, called KEYWORDS, via KSQL. Then, using a CREATE STREAM ... AS SELECT, we create a new stream, EXPLODED_KEYWORDS, that explodes the data, since all of the keywords were in an array. Now we have 1 row for each keyword. Next, we count the occurrences of each keyword and save them into a table, KEYWORDS_COUNTED. The steps to set up the streams and the tables with the KSQL code can be found here: Kafka — Detailed Architecture.
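
The project performs this step in KSQL, and the exact statements are in the linked document. Purely to illustrate the same shape of transformation, exploding and counting the keywords would look roughly like this in Spark, assuming a DataFrame with an array column of extracted keywords (the column names here are assumptions):

```scala
import org.apache.spark.sql.functions.{col, explode}

// annotated: a DataFrame with an array<string> column "keywords" holding the extracted entities.
val explodedKeywords = annotated
  .select(explode(col("keywords")).as("keyword"))   // one row per keyword, like EXPLODED_KEYWORDS

val keywordsCounted = explodedKeywords
  .groupBy("keyword")
  .count()                                          // occurrences per keyword, like KEYWORDS_COUNTED
```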

5. Finally, we use the Vegas library to produce the visualizations on Trending Keywords. The Notebook describing all the steps can be found here.
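
The linked notebook holds the actual plots. A minimal Vegas sketch of a keyword bar chart over a DataFrame like keywordsCounted from the previous snippet could look like this; the column names and the top-20 cut-off are assumptions:

```scala
import vegas._
import vegas.sparkExt._
import org.apache.spark.sql.functions.col

// Keep the 20 most frequent keywords for a readable bar chart.
val topKeywords = keywordsCounted.orderBy(col("count").desc).limit(20)

Vegas("Trending Keywords")
  .withDataFrame(topKeywords)
  .encodeX("keyword", Nom)
  .encodeY("count", Quant)
  .mark(Bar)
  .show
```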

Detailed Explanation of the NER Pipeline

In order to annotate the data, we need to transform it into a certain format: from text to numbers, and then back to text. The steps are listed below, and a sketch of the full pipeline follows the list.

  1. We first use a DocumentAssembler to turn the text into a Document type.
  2. Then, we break the document into sentences using a SentenceDetector.
  3. After this, we separate the text into smaller units by finding the boundaries of words using a Tokenizer.
  4. Next, we remove HTML tags and numerical tokens from the text using a Normalizer.
  5. After the preparation and cleaning of the text, we need to transform it into a numerical format: vectors. We use an English pre-trained WordEmbeddingsModel.
  6. Next comes the actual keyword extraction using an English NerDLModel annotator. NerDL stands for Named Entity Recognition Deep Learning.
  7. Further on, we need to transform the numbers back into a human-readable format: text. For this we use a NerConverter and save the results in a new column called entities.
  8. Before applying the model to our data, we need to run an empty training step. We use the fit method on an empty DataFrame because the model is pretrained.
  9. Then we apply the pipeline to our data and select only the fields that we are interested in.
  10. Finally, we write the data to the Kafka topic TOPIC_KEYWORDS.
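
The linked notebook holds the project’s exact pipeline. The following is a condensed sketch of the steps above using Spark NLP pretrained English models; the model names (glove_100d, ner_dl), the events DataFrame with its text column, and the broker address are assumptions carried over from the earlier snippets.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._   // spark is the SparkSession from the earlier snippet

// 1. Text -> Document type
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// 2. Document -> sentences
val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

// 3. Sentences -> tokens
val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

// 4. Drop unwanted tokens (HTML tags, numbers)
val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")

// 5. Tokens -> vectors with a pretrained English embeddings model
val embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")
  .setInputCols("sentence", "normalized")
  .setOutputCol("embeddings")

// 6. Named entity recognition with a pretrained English NerDL model
val nerModel = NerDLModel.pretrained("ner_dl", "en")
  .setInputCols("sentence", "normalized", "embeddings")
  .setOutputCol("ner")

// 7. NER tags -> human-readable entities
val nerConverter = new NerConverter()
  .setInputCols("sentence", "normalized", "ner")
  .setOutputCol("entities")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler, sentenceDetector, tokenizer, normalizer,
  embeddings, nerModel, nerConverter))

// 8. The models are pretrained, so fitting on an empty DataFrame only wires the pipeline together.
val emptyDf = Seq.empty[String].toDF("text")
val model   = pipeline.fit(emptyDf)

// 9. Annotate the events and keep only the fields we care about.
val annotated = model.transform(events)
  .selectExpr("event_id", "entities.result AS keywords")

// 10. Write the result to the TOPIC_KEYWORDS Kafka topic as JSON.
annotated
  .selectExpr("CAST(event_id AS STRING) AS key", "to_json(struct(*)) AS value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "TOPIC_KEYWORDS")
  .save()
```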

RSVPs Distribution

RSVPs Distribution Architecture

3. Using KSQL, we aggregate and join data from the 2 topics to create 1 stream, RSVPS_JOINED_DATA, and subsequently 1 table, RSVPS_FINAL_TABLE, containing all the RSVP counts. The KSQL operations and their code can be found here: Kafka — Detailed Architecture.

4. Finally, we use the Vegas library to produce visualizations of the distribution of RSVPs around the world and in Romania. The Zeppelin notebook can be found here.

Infrastructure

We used a machine from Hetzner Cloud with the following specs: CPU: Intel Xeon E3-1275v5 (4 cores / 8 threads), Storage: 2×480 GB SSD (RAID 0), RAM: 64 GB.

Visualizations

RSVPs Distribution

These visualizations are based on data collected between the 8th of May, 22:15 UTC, and the 4th of June, 11:23 UTC.

Worldwide — Top Countries by Number of RSVPs

Worldwide — Top Cities by Number of RSVPs

As you can see, most of the RSVPs occur in the United States, but the city with the highest number of RSVPs is London.

Worldwide — Top Events by Number of RSVPs

Romania — Top Cities by Number of RSVPs
As you can see, most of the RSVPs are in the largest cities of the country. This is probably because companies tend to establish their offices in these cities and therefore attract talent there.

Romania — Top Meetup Events

Romania — RSVPs Distribution

* This was produced with Grafana using RSVP data processed in Spark and saved locally.

Europe — RSVPs Distribution

* This was produced with Grafana using RSVP data processed in Spark and saved locally.

Trending Keywords

Worldwide

This visualization is based on data from July.

Romania

This visualization is based on almost one week of data from the start of August. The reason for this is detailed in point 5 of the Issues discovered along the way section.

Issues discovered along the way

All of these are mentioned in the published Notebooks as well.

  1. Visualizing data with the Helium Zeppelin add-on and the Vegas library directly from the stream did not work. We had to spill the data to disk, then build DataFrames on top of the files, and finally do the visualizations. A sketch of this workaround follows the list.
  2. Spark NLP did not work for us in a Spark standalone local cluster installation (with a local file system). A standalone local cluster means that the Spark Cluster Manager and Workers run on the same physical machine; such a setup does not need distributed storage such as HDFS. The workaround for us was to configure Zeppelin to use local Spark, local(*), meaning the non-distributed single-JVM deployment mode available in Zeppelin.
  3. The Vegas plug-in could not be enabled initially. Running the line recommended on GitHub, %dep z.load("org.vegas-viz:vegas_2.11:{vegas-version}"), always raised an error. The workaround was to add all the dependencies manually in /opt/spark/jars. These dependencies can be found when deploying the Spark shell with the Vegas library: /opt/spark/bin/spark-shell --packages org.vegas-viz:vegas-spark_2.11:0.3.11
  4. The Helium Zeppelin add-on did not work and could not be enabled; it raised an error when we tried to enable it from the Zeppelin GUI in our configuration. We did not manage to solve this issue, which is why we used only Vegas, even though it does not support map visualizations. In the end we got a bit creative: we exported the data and loaded it into Grafana for the map visualizations.
  5. The default retention policy for Kafka is 7 days, which means that data older than 1 week is deleted. For some of the topics we changed this setting, but for others we forgot to, and therefore we lost the data. This affected our visualization of the Trending Keywords in Romania.
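
For the first issue, the workaround amounts to persisting the streaming data and visualizing from files. A hedged sketch of that pattern (the stream name, paths and trigger interval are placeholders, not the project’s code):

```scala
import org.apache.spark.sql.streaming.Trigger

// Spill the streaming DataFrame to disk instead of visualizing it directly from the stream.
val query = keywordsStream.writeStream             // keywordsStream: a streaming DataFrame (placeholder name)
  .format("parquet")
  .option("path", "/tmp/keywords")                 // placeholder output path
  .option("checkpointLocation", "/tmp/keywords_chk")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

// Later, in the visualization notebook, build a batch DataFrame on top of the files
// and feed that to Vegas.
val keywordsBatch = spark.read.parquet("/tmp/keywords")
```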

Conclusions & Learning Points

  • In the world of Big Data, you need clarity around the questions you are trying to answer before building the data architecture, and then you need to follow through on the plan to make sure you are still working according to those questions. Otherwise, you might end up with something that cannot do what you actually need. It sounds like a pretty general and obvious statement. Yet once we saw the visualizations, we realized that we had not created the Kafka objects according to our initial per-country keyword distribution visualization; for example, we created the count aggregation across all countries in the KEYWORDS_COUNTED table. Combine this with the mistake of forgetting to change the Kafka retention period from the default 7 days, and by the time we realized the mistake we had lost the historical data as well. Major learning point.

  • Data should be filtered before the ML/NLP process: we should have removed some keywords that do not really make sense, such as "de" and "da". To get more relevant insights, several rounds of cleaning the data and extracting the keywords might be needed. A possible post-hoc filter is sketched after this list.

  • After seeing the final visualizations, we realized we should probably have filtered out a few more of the obvious words. For example, Zoom was of course the highest-scoring keyword, since by June everybody was running online-only meetups, mainly on Zoom.
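
As a small illustration of those last two points, a stop list applied to the exploded keywords could look like the following; the list itself and the column names are just examples:

```scala
import org.apache.spark.sql.functions.{col, lower}

// Example stop list; in practice this would be curated over several cleaning rounds.
val stopKeywords = Seq("de", "da", "zoom")

// explodedKeywords: the one-row-per-keyword DataFrame from the earlier sketch (placeholder name).
val cleanedKeywordCounts = explodedKeywords
  .filter(!lower(col("keyword")).isin(stopKeywords: _*))   // drop noise words before counting
  .groupBy("keyword")
  .count()
```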

This study group was a great way for us to learn about an end-to-end solution that uses Kafka to ingest streaming data, Spark to process it and Zeppelin for visualizations. We recommend this experience for anyone interested in learning Big Data technologies together with other passionate people, in a casual and friendly environment.

This article originally appeared on https://bigdata.ro/2020/08/09/spark-working-group/

Translated from: https://towardsdatascience.com/a-big-data-analysis-of-meetup-events-using-spark-nlp-kafka-and-vegas-visualization-af12c0efed92
