ML Model Prediction on Streaming Data Using Kafka

In one of my previous posts, I took you through the steps I performed to preprocess the Criteo dataset, which is used to predict the click-through rate on ads. I also trained ML models to predict labels on the test dataset. You can find the post here.

In this post, I will take you through the steps I performed to simulate ML models predicting labels on streaming data.

When you go through that post, you will find that I used PySpark on Databricks notebooks to preprocess the Criteo data. I split the preprocessed data into a training set and a test set. In order to export the test data to my local machine as a single Parquet file, I first saved the test set in the FileStore as one file in a single partition using dataFrameName.coalesce(1).write.

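For reference, a minimal PySpark sketch of that export step (the DataFrame name and FileStore path here are hypothetical):

```python
# Collapse the test set into a single partition so that the write
# produces exactly one Parquet file, then save it to the FileStore.
test_df.coalesce(1).write.mode("overwrite").parquet(
    "dbfs:/FileStore/criteo/test_set"
)
```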

I used MLeap to export my trained models as zip files. In order to use MLeap, I had to install mleap-spark from Maven, mleap from PyPI, and mlflow. Then, I copied the models to the FileStore so I could download them to my local machine.

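A hedged sketch of what that export can look like with MLeap's PySpark support (the model, DataFrame, and path names are assumptions, not the exact code from the original screenshots):

```python
import mleap.pyspark  # noqa: F401  (adds serializeToBundle to Spark models)
from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401

# Serialize the fitted model to an MLeap bundle (a zip file), passing a
# transformed sample DataFrame so MLeap can infer the schema.
fitted_model.serializeToBundle(
    "jar:file:/tmp/lr_model.zip",
    fitted_model.transform(sample_df),
)

# Copy the bundle into the FileStore so it can be downloaded locally.
dbutils.fs.cp("file:/tmp/lr_model.zip", "dbfs:/FileStore/lr_model.zip")
```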

Make sure that the version of mleap installed from PyPI matches the mleap Maven version. To find the version of mleap installed from PyPI on Databricks, you can do the following:

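One way to do this from a notebook cell (a sketch; the original screenshot is not reproduced here):

```python
import pkg_resources

# Print the version of the mleap package installed from PyPI.
print(pkg_resources.get_distribution("mleap").version)
```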

You can find the mleap Maven version by looking at the coordinates, e.g. ml.combust.mleap:mleap-spark_2.11:0.16.0.

After downloading my models and the test dataset to my local machine, I had a Docker Compose stack up and running with Kafka, Zookeeper, Logstash, Elasticsearch, and Kibana.

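A rough sketch of such a Compose file (the image tags, ports, and settings here are assumptions; the actual file is on my GitHub):

```yaml
version: "3"
services:
  zookeeper:
    image: wurstmeister/zookeeper
    ports: ["2181:2181"]
  kafka:
    image: wurstmeister/kafka
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: "zookeeper:2181"
      KAFKA_ADVERTISED_LISTENERS: "PLAINTEXT://localhost:9092"
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.9.0
    environment:
      discovery.type: single-node
    ports: ["9200:9200"]
  logstash:
    image: docker.elastic.co/logstash/logstash:7.9.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
  kibana:
    image: docker.elastic.co/kibana/kibana:7.9.0
    ports: ["5601:5601"]
```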

I developed a producer that uses the pyarrow library to read the Parquet file containing the test dataset. The producer then sends the label (class decision) and the features column to Kafka in a streaming fashion.

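A minimal sketch of that producer, assuming kafka-python, a hypothetical topic name, and that the features were stored as Spark ML sparse vectors in the Parquet file:

```python
import json

import pyarrow.parquet as pq
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Read the single Parquet file that holds the test dataset.
test_df = pq.read_table("test_set.parquet").to_pandas()

for _, row in test_df.iterrows():
    # A Spark ML sparse vector is stored as a struct of size/indices/values.
    f = row["features"]
    producer.send(
        "test-data",
        {
            "label": int(row["label"]),
            "size": int(f["size"]),
            "indices": [int(i) for i in f["indices"]],
            "values": [float(v) for v in f["values"]],
        },
    )

producer.flush()
```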

[Image: output sample showing the test data label and features column]

I developed a consumer/producer script, where the consumer part:

consumes the label (class decision) and the features column from Kafka, deserializes the logistic regression model and the SVM model, converts the features column from a sparse vector to a dense vector, and uses the two models to predict a class label from the input features column.

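A condensed sketch of the consumer part, reusing the message layout from the producer sketch above; deserializeFromBundle comes from MLeap's PySpark support, and the bundle paths and topic name are assumptions:

```python
import json

import mleap.pyspark  # noqa: F401  (adds deserializeFromBundle)
from kafka import KafkaConsumer
from pyspark.ml import PipelineModel
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-scoring").getOrCreate()

# Deserialize the two exported MLeap bundles.
lr_model = PipelineModel.deserializeFromBundle("jar:file:/tmp/lr_model.zip")
svm_model = PipelineModel.deserializeFromBundle("jar:file:/tmp/svm_model.zip")

consumer = KafkaConsumer(
    "test-data",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    rec = message.value
    # Rebuild the sparse vector, then convert it to a dense vector.
    dense = Vectors.dense(
        Vectors.sparse(rec["size"], rec["indices"], rec["values"]).toArray()
    )
    df = spark.createDataFrame([(dense,)], ["features"])
    # Use each model to predict a class label from the features.
    lr_pred = lr_model.transform(df).first()["prediction"]
    svm_pred = svm_model.transform(df).first()["prediction"]
```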

[Image: output sample showing the expected label, prediction, and correct value for each model]

The producer part:

writes the prediction, along with the original label and whether the output was correct or not, to Kafka. A value of 1 for correct indicates that the model's prediction and the original label match; 0 indicates that they did not match.

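The producer side of that script could look roughly like this (the topic name and field layout are assumptions):

```python
import json

from kafka import KafkaProducer

result_producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_result(model_name, label, prediction):
    # "correct" is 1 when the model's prediction matches the original
    # label, and 0 otherwise.
    result_producer.send(
        "predictions",
        {
            "model": model_name,
            "label": label,
            "prediction": prediction,
            "correct": 1 if prediction == label else 0,
        },
    )
```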

I developed a Logstash configuration file that reads the prediction, the original label, and the correct value from Kafka, parses the values as JSON data, and writes them to Elasticsearch.

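A hedged sketch of such a pipeline configuration (the topic, index, and host names are assumptions; the real file is on my GitHub):

```
input {
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["predictions"]
    codec => "json"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "predictions"
  }
}
```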

I connected to Kibana at "localhost:5601/app/kibana#/discover". I used "Visualize" to create two pie charts, one for each model (the logistic regression and SVM models), showing the percentage of the count of correct predictions versus incorrect predictions. I also created a markdown panel with a title that describes the two pie charts. I then created a dashboard that shows the two pie charts side by side, with the markdown as the title of the dashboard. The pie charts get updated according to the streaming data.

[Image: Kibana dashboard showing the accuracy counts for the ML models on streaming data]

I hope you enjoyed my post and found it useful. For the full code of the producer, the consumer/producer, the Logstash configuration file, and the Docker Compose file, visit my GitHub by clicking here.

Translated from: https://medium.com/@amany.m.abdelhalim/ml-model-prediction-on-streaming-data-using-kafka-ae7e46d2bf10
