Kafka Model
In one of my previous posts, I took you through the steps that I performed to preprocess the Criteo dataset, which is used to predict the click-through rate on ads. I also trained ML models to predict labels on the test dataset. You can find the post here.
In this post, I will be taking you through the steps that I performed to simulate the process of ML models predicting labels on streaming data.
When you go through that post, you will find that I used PySpark on Databricks notebooks to preprocess the Criteo data. I split the preprocessed data into a training set and a test set. In order to export the test data to my local machine as a single parquet file, I first saved the test set in the FileStore as a single file in one partition using dataFrameName.coalesce(1).write.
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/a71cf3cba7cdeb94ca465b430efc2a86.png)
I used MLeap to export my trained models as zip files. In order to use MLeap, I had to install mleap-spark from Maven, mleap from PyPI, and mlflow. Then, I copied the models to the FileStore so I could download them to my local machine.
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/66f8025349492b13b785d8a90b88383f.png)
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/a138e58f3097824c0e479cd1a20ebcd8.png)
Make sure that the version of the mleap PyPI package matches the mleap Maven package. To find the version of the mleap PyPI package installed on Databricks, you can do the following:
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/5bd61029cf5616b44afb760b0fe3c004.png)
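Outside a notebook, a small helper using the standard library's `importlib.metadata` is one way to check an installed PyPI package's version (a sketch; the screenshot above may use a different method, such as `pip show mleap`):

```python
from importlib.metadata import version, PackageNotFoundError

def pkg_version(name):
    """Return the installed version of a PyPI package, or None if it is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None

# On the Databricks cluster this would report the installed mleap version:
# pkg_version("mleap")
```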
You can find the version of the mleap Maven package by looking at its coordinates, e.g. ml.combust.mleap:mleap-spark_2.11:0.16.0.
After downloading my models and the test dataset to my local machine, I had a Docker Compose setup running Kafka, Zookeeper, Logstash, Elasticsearch, and Kibana.
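My full compose file is in the repo, but a minimal sketch of such a stack could look like the following (the image names, versions, and ports are assumptions for illustration, not necessarily the exact ones I used):

```yaml
version: "3"
services:
  zookeeper:
    image: bitnami/zookeeper
    ports: ["2181:2181"]
  kafka:
    image: bitnami/kafka
    depends_on: [zookeeper]
    ports: ["9092:9092"]
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.9.2
    ports: ["9200:9200"]
  logstash:
    image: docker.elastic.co/logstash/logstash:7.9.2
    depends_on: [kafka, elasticsearch]
  kibana:
    image: docker.elastic.co/kibana/kibana:7.9.2
    depends_on: [elasticsearch]
    ports: ["5601:5601"]
```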
I developed a producer that uses the "pyarrow" library to read the parquet file containing the test dataset. The producer then sends the label (class decision) and the features column to Kafka in a streaming fashion.
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/b07ef63e78d7020c15975b54c0ae4331.png)
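The core of such a producer is reading the parquet rows and serializing each record for Kafka. A minimal sketch, assuming the kafka-python client; the file path, topic name, and column names are illustrative, not necessarily the ones in my code:

```python
import json

def encode_record(label, features):
    """Serialize one test-set row (label + features column) as JSON bytes for Kafka."""
    return json.dumps({"label": label, "features": features}).encode("utf-8")

def stream_parquet_to_kafka(parquet_path, topic, bootstrap="localhost:9092"):
    """Read the test-set parquet file with pyarrow and stream each row to Kafka."""
    # Imports kept local so the sketch parses without pyarrow / kafka-python installed.
    import pyarrow.parquet as pq
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=bootstrap)
    table = pq.read_table(parquet_path)
    for row in table.to_pylist():
        producer.send(topic, encode_record(row["label"], row["features"]))
    producer.flush()
```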
I developed a consumer/producer where the consumer part:
- consumes the label (class decision) and the features column from Kafka, and deserializes the logistic regression model and the SVM model.
- converts the features column from a sparse vector to a dense vector.
- uses the two models to predict a class label from the input features column.
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/685db139f03fde793fb77f5c2e451a2a.png)
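Two pieces of the consumer are worth sketching: expanding a sparse features vector into a dense one, and loading the MLeap bundles back as Spark models (`deserializeFromBundle` is added to `PipelineModel` by MLeap's PySpark support; the paths here are illustrative):

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse vector (size, indices, values) into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

def load_models(lr_bundle_path, svm_bundle_path):
    """Deserialize the two MLeap zip bundles back into Spark PipelineModels."""
    # Imports kept local so the sketch parses without Spark / MLeap installed.
    from pyspark.ml import PipelineModel
    from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401

    lr_model = PipelineModel.deserializeFromBundle("jar:file:" + lr_bundle_path)
    svm_model = PipelineModel.deserializeFromBundle("jar:file:" + svm_bundle_path)
    return lr_model, svm_model
```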
The producer part:
Writes the prediction, the original label, and whether the output was correct to Kafka. A value of 1 for correct indicates that the model's prediction and the original label match; 0 indicates that they didn't match.
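The record written back to Kafka can be sketched as follows (the field names are my assumption of a reasonable schema, not necessarily the exact ones in my code):

```python
import json

def make_output(model_name, prediction, label):
    """Build the result message: prediction, original label, and a 1/0 correctness flag."""
    return json.dumps({
        "model": model_name,
        "prediction": prediction,
        "label": label,
        "correct": 1 if prediction == label else 0,
    }).encode("utf-8")
```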
I developed a Logstash configuration file that reads the prediction along with the original label and the correct value from Kafka, converts the values to JSON, and writes them to Elasticsearch.
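A minimal Logstash pipeline of this shape could look like the following sketch (the topic, index, and host names are assumptions; the real config is in the repo):

```
input {
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["predictions"]
  }
}
filter {
  # Parse the JSON payload produced by the consumer/producer.
  json { source => "message" }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "predictions"
  }
}
```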
I connected to Kibana at "localhost:5601/app/kibana#/discover". I used "Visualize" to create two pie charts, one for each model (logistic regression and SVM), showing the percentage of correct versus incorrect predictions. I also created a markdown element whose title describes the two pie charts. I then created a dashboard that shows the two pie charts side by side, with the markdown as the dashboard's title. The pie charts get updated as the streaming data arrives.
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/3806b12a2028e7ef58d87ba99288ebca.png)
I hope you enjoyed my post and found it useful. For the full code of the producer, the consumer/producer, the Logstash configuration file, and the Docker Compose file, visit my GitHub by clicking here.