A Big Data Analysis of Meetup Events Using Spark NLP, Kafka and Vegas Visualization

We started out as a working group from bigdata.ro. The team was made up of Valentina Crisan, Ovidiu Podariu, Maria Catana, Cristian Stanciulescu, Edwin Brinza and me, Andrei Deusteanu. Our main purpose was to learn about and practice Spark Structured Streaming, Machine Learning and Kafka. We designed the entire use case and then built the architecture from scratch.

Since Meetup.com provides data through a real-time API, we used it as our main data source. We did not use the data for commercial purposes, just for testing.

This is a learning story. We did not really know from the beginning what would or would not be possible. Looking back, some of the steps could have been done better. But, hey, that’s how life works in general.

The problems we tried to solve:

  • Allow meetup organizers to identify trending topics related to their meetup. We computed Trending Topics based on the descriptions of the events matching the tags of interest to us. We did this using the John Snow Labs Spark NLP library to extract entities.
  • Determine which Meetup events attract the most responses within our region. To do this, we monitored the RSVPs for meetups based on certain tags related to our domain of interest, Big Data.

For this we developed 2 sets of visualizations:

  • Trending Keywords
  • RSVPs Distribution

Architecture

The first 2 elements are common to both sets of visualizations. This is the part that reads data from the Meetup.com API and saves it into 2 Kafka topics.

  1. The Stream Reader script fetches data on Yes RSVPs, filtered by certain tags, from the Meetup Stream API. It then selects the relevant columns that we need and saves this data into the rsvps_filtered_stream Kafka topic.
  2. For each RSVP, the Stream Reader script then fetches the corresponding event data, but only if the event_id does not yet exist in the events.idx file. This way we make sure that we read event data only once. The setup for the Stream Reader script can be found here: Install Kafka and fetch RSVPs. A minimal sketch of such a producer follows this list.
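
The original Stream Reader code is only linked above, not shown. Purely as an illustration of the shape of such a producer, here is a minimal Scala sketch; the local broker address, the tag list and the substring-based filtering are assumptions, and a real implementation would parse the RSVP JSON properly.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.io.Source

object StreamReaderSketch {
  def main(args: Array[String]): Unit = {
    // Plain Kafka producer pointed at a local broker (assumption).
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Hypothetical tags of interest; the real script filters on Big Data related tags.
    val tags = Seq("big-data", "apache-spark", "apache-kafka")

    // The Meetup RSVP stream emits one JSON document per line.
    // Keep only Yes RSVPs that mention one of our tags and forward them to Kafka.
    for (line <- Source.fromURL("http://stream.meetup.com/2/rsvps").getLines()) {
      val isYes      = line.contains("\"response\":\"yes\"")
      val matchesTag = tags.exists(t => line.contains(t)) // crude check; use a JSON parser in practice
      if (isYes && matchesTag)
        producer.send(new ProducerRecord[String, String]("rsvps_filtered_stream", line))
    }
    producer.close()
  }
}
```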

Trending Keywords

3. The Spark ML — NER Annotator reads data from the Kafka topic events and then applies a Named Entity Recognition pipeline with Spark NLP. Finally, it saves the annotated data in the Kafka topic TOPIC_KEYWORDS. The Notebook with the code can be found here.
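
The notebook itself is linked above. As a hedged sketch of how this step can begin, reading the events topic into a Spark DataFrame might look like the following; the broker address and the simplified JSON schema are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("ner-annotator").getOrCreate()

// Simplified event schema; the real topic carries more fields.
val eventSchema = new StructType()
  .add("event_id", StringType)
  .add("event_name", StringType)
  .add("description", StringType)   // the free text fed to the NER pipeline

val events = spark.read
  .format("kafka")                                  // requires the spark-sql-kafka package
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "earliest")
  .load()
  .select(from_json(col("value").cast("string"), eventSchema).as("e"))
  .select(col("e.event_id"), col("e.description").as("text"))
```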

4. Using KSQL, we create 2 consecutive streams to transform the data and finally 1 table that will be used by Spark for the visualization. In Big Data architectures, SQL engines only build logical objects that assign metadata to the physical-layer objects; in our case, these were the streams we built on top of the topics. We link data from TOPIC_KEYWORDS to a new stream, called KEYWORDS, via KSQL. Then, using a CREATE STREAM ... AS SELECT, we create a new stream, EXPLODED_KEYWORDS, that explodes the data, since all of the keywords were in an array. Now we have 1 row for each keyword. Next, we count the occurrences of each keyword and save them into a table, KEYWORDS_COUNTED. The steps to set up the streams and the tables with the KSQL code can be found here: Kafka — Detailed Architecture.
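
The project performs this step in KSQL, and the exact statements are in the linked document. Purely to illustrate the same shape of transformation, exploding and counting the keywords would look roughly like this in Spark, assuming a DataFrame with an array column of extracted keywords (the column names here are assumptions):

```scala
import org.apache.spark.sql.functions.{col, explode}

// annotated: a DataFrame with an array<string> column "keywords" holding the extracted entities.
val explodedKeywords = annotated
  .select(explode(col("keywords")).as("keyword"))   // one row per keyword, like EXPLODED_KEYWORDS

val keywordsCounted = explodedKeywords
  .groupBy("keyword")
  .count()                                          // occurrences per keyword, like KEYWORDS_COUNTED
```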

5. Finally, we use the Vegas library to produce the visualizations on Trending Keywords. The Notebook describing all the steps can be found here.
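
The linked notebook holds the actual plots. A minimal Vegas sketch of a keyword bar chart over a DataFrame like keywordsCounted from the previous snippet could look like this; the column names and the top-20 cut-off are assumptions:

```scala
import vegas._
import vegas.sparkExt._
import org.apache.spark.sql.functions.col

// Keep the 20 most frequent keywords for a readable bar chart.
val topKeywords = keywordsCounted.orderBy(col("count").desc).limit(20)

Vegas("Trending Keywords")
  .withDataFrame(topKeywords)
  .encodeX("keyword", Nom)
  .encodeY("count", Quant)
  .mark(Bar)
  .show
```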

Detailed Explanation of the NER Pipeline

In order to annotate the data, we need to transform it into a certain format: from text to numbers, and then back to text. The steps are listed below, and a sketch of the full pipeline follows the list.

  1. We first use a DocumentAssembler to turn the text into a Document type.
  2. Then, we break the document into sentences using a SentenceDetector.
  3. After this, we separate the text into smaller units by finding the boundaries of words using a Tokenizer.
  4. Next, we remove HTML tags and numerical tokens from the text using a Normalizer.
  5. After the preparation and cleaning of the text, we need to transform it into a numerical format: vectors. We use an English pre-trained WordEmbeddingsModel.
  6. Next comes the actual keyword extraction using an English NerDLModel annotator. NerDL stands for Named Entity Recognition Deep Learning.
  7. Further on, we need to transform the numbers back into a human-readable format: text. For this we use a NerConverter and save the results in a new column called entities.
  8. Before applying the model to our data, we need to run an empty training step. We use the fit method on an empty DataFrame because the model is pretrained.
  9. Then we apply the pipeline to our data and select only the fields that we are interested in.
  10. Finally, we write the data to the Kafka topic TOPIC_KEYWORDS.
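
The linked notebook holds the project’s exact pipeline. The following is a condensed sketch of the steps above using Spark NLP pretrained English models; the model names (glove_100d, ner_dl), the events DataFrame with its text column, and the broker address are assumptions carried over from the earlier snippets.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._   // spark is the SparkSession from the earlier snippet

// 1. Text -> Document type
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// 2. Document -> sentences
val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

// 3. Sentences -> tokens
val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

// 4. Drop unwanted tokens (HTML tags, numbers)
val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")

// 5. Tokens -> vectors with a pretrained English embeddings model
val embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")
  .setInputCols("sentence", "normalized")
  .setOutputCol("embeddings")

// 6. Named entity recognition with a pretrained English NerDL model
val nerModel = NerDLModel.pretrained("ner_dl", "en")
  .setInputCols("sentence", "normalized", "embeddings")
  .setOutputCol("ner")

// 7. NER tags -> human-readable entities
val nerConverter = new NerConverter()
  .setInputCols("sentence", "normalized", "ner")
  .setOutputCol("entities")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler, sentenceDetector, tokenizer, normalizer,
  embeddings, nerModel, nerConverter))

// 8. The models are pretrained, so fitting on an empty DataFrame only wires the pipeline together.
val emptyDf = Seq.empty[String].toDF("text")
val model   = pipeline.fit(emptyDf)

// 9. Annotate the events and keep only the fields we care about.
val annotated = model.transform(events)
  .selectExpr("event_id", "entities.result AS keywords")

// 10. Write the result to the TOPIC_KEYWORDS Kafka topic as JSON.
annotated
  .selectExpr("CAST(event_id AS STRING) AS key", "to_json(struct(*)) AS value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "TOPIC_KEYWORDS")
  .save()
```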

RSVPs Distribution

RSVPs Distribution Architecture

3. Using KSQL, we aggregate and join data from the 2 topics to create 1 stream, RSVPS_JOINED_DATA, and subsequently 1 table, RSVPS_FINAL_TABLE, containing all the RSVP counts. The KSQL operations and their code can be found here: Kafka — Detailed Architecture.

4. Finally, we use the Vegas library to produce visualizations of the distribution of RSVPs around the world and in Romania. The Zeppelin notebook can be found here.

Infrastructure

We used a machine from Hetzner Cloud with the following specs: CPU: Intel Xeon E3-1275v5 (4 cores / 8 threads), Storage: 2×480 GB SSD (RAID 0), RAM: 64 GB.

Visualizations

RSVPs Distribution

These visualizations are based on data collected between the 8th of May, 22:15 UTC, and the 4th of June, 11:23 UTC.

Worldwide — Top Countries by Number of RSVPs

Worldwide — Top Cities by Number of RSVPs

As you can see, most of the RSVPs occur in the United States, but the city with the highest number of RSVPs is London.

Worldwide — Top Events by Number of RSVPs

Romania — Top Cities by Number of RSVPs
As you can see, most of the RSVPs are in the largest cities of the country. This is probably because companies tend to establish their offices in these cities and therefore attract talent there.

Romania — Top Meetup Events

Romania — RSVPs Distribution

* This was produced with Grafana using RSVP data processed in Spark and saved locally.

Europe — RSVPs Distribution

* This was produced with Grafana using RSVP data processed in Spark and saved locally.

Trending Keywords

Worldwide

This visualization is based on data from July.

Romania

This visualization is based on almost one week of data from the start of August. The reason for this is detailed in point 5 of the Issues discovered along the way section.

Issues discovered along the way

All of these are mentioned in the published Notebooks as well.

  1. Visualizing data with the Helium Zeppelin add-on and the Vegas library directly from the stream did not work. We had to spill the data to disk, then build DataFrames on top of the files, and finally do the visualizations. A sketch of this workaround follows the list.
  2. Spark NLP did not work for us in a Spark standalone local cluster installation (with a local file system). A standalone local cluster means that the Spark Cluster Manager and Workers run on the same physical machine; such a setup does not need distributed storage such as HDFS. The workaround for us was to configure Zeppelin to use local Spark, local(*), meaning the non-distributed single-JVM deployment mode available in Zeppelin.
  3. The Vegas plug-in could not be enabled initially. Running the line recommended on GitHub, %dep z.load("org.vegas-viz:vegas_2.11:{vegas-version}"), always raised an error. The workaround was to add all the dependencies manually in /opt/spark/jars. These dependencies can be found when deploying the Spark shell with the Vegas library: /opt/spark/bin/spark-shell --packages org.vegas-viz:vegas-spark_2.11:0.3.11
  4. The Helium Zeppelin add-on did not work and could not be enabled; it raised an error when we tried to enable it from the Zeppelin GUI in our configuration. We did not manage to solve this issue, which is why we used only Vegas, even though it does not support map visualizations. In the end we got a bit creative: we exported the data and loaded it into Grafana for the map visualizations.
  5. The default retention policy for Kafka is 7 days, which means that data older than 1 week is deleted. For some of the topics we changed this setting, but for others we forgot to, and therefore we lost the data. This affected our visualization of the Trending Keywords in Romania.
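
For the first issue, the workaround amounts to persisting the streaming data and visualizing from files. A hedged sketch of that pattern (the stream name, paths and trigger interval are placeholders, not the project’s code):

```scala
import org.apache.spark.sql.streaming.Trigger

// Spill the streaming DataFrame to disk instead of visualizing it directly from the stream.
val query = keywordsStream.writeStream             // keywordsStream: a streaming DataFrame (placeholder name)
  .format("parquet")
  .option("path", "/tmp/keywords")                 // placeholder output path
  .option("checkpointLocation", "/tmp/keywords_chk")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

// Later, in the visualization notebook, build a batch DataFrame on top of the files
// and feed that to Vegas.
val keywordsBatch = spark.read.parquet("/tmp/keywords")
```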

Conclusions & Learning Points

  • In the world of Big Data, you need clarity around the questions you are trying to answer before building the data architecture, and then you need to follow through on the plan to make sure you are still working according to those questions. Otherwise, you might end up with something that cannot do what you actually need. It sounds like a pretty general and obvious statement. Yet once we saw the visualizations, we realized that we had not created the Kafka objects according to our initial per-country keyword distribution visualization; for example, we created the count aggregation across all countries in the KEYWORDS_COUNTED table. Combine this with the mistake of forgetting to change the Kafka retention period from the default 7 days, and by the time we realized the mistake we had lost the historical data as well. Major learning point.

  • Data should be filtered before the ML/NLP process: we should have removed some keywords that do not really make sense, such as "de" and "da". To get more relevant insights, several rounds of cleaning the data and extracting the keywords might be needed. A possible post-hoc filter is sketched after this list.

  • After seeing the final visualizations, we realized we should probably have filtered out a few more of the obvious words. For example, Zoom was of course the highest-scoring keyword, since by June everybody was running online-only meetups, mainly on Zoom.
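
As a small illustration of those last two points, a stop list applied to the exploded keywords could look like the following; the list itself and the column names are just examples:

```scala
import org.apache.spark.sql.functions.{col, lower}

// Example stop list; in practice this would be curated over several cleaning rounds.
val stopKeywords = Seq("de", "da", "zoom")

// explodedKeywords: the one-row-per-keyword DataFrame from the earlier sketch (placeholder name).
val cleanedKeywordCounts = explodedKeywords
  .filter(!lower(col("keyword")).isin(stopKeywords: _*))   // drop noise words before counting
  .groupBy("keyword")
  .count()
```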

This study group was a great way for us to learn about an end-to-end solution that uses Kafka to ingest streaming data, Spark to process it and Zeppelin for visualizations. We recommend this experience for anyone interested in learning Big Data technologies together with other passionate people, in a casual and friendly environment.

This article originally appeared on https://bigdata.ro/2020/08/09/spark-working-group/

Translated from: https://towardsdatascience.com/a-big-data-analysis-of-meetup-events-using-spark-nlp-kafka-and-vegas-visualization-af12c0efed92
