Project Architecture and Data Flow
Technology stack choices:
- Real-time processing: Spark Structured Streaming (low latency, exactly-once semantics)
- Feature store: HBase (fast random reads/writes; stores real-time user/item features)
- Model training: Spark MLlib (distributed machine learning, well suited to periodic retraining)
- Message queue: Kafka (the behavior-data pipeline; decouples producers from consumers)
- Serving layer: Spring Boot or Flink (optional; exposes the recommendation API)
Data pipeline:

```text
User Behavior Data (Clicks, Views, Purchases)
        |
        ↓ (HTTP / SDK)
Apache Kafka (Topic: user-behaviors)
        |
        ↓ (Spark Streaming)
Spark Structured Streaming
        | (Feature Engineering)
        |----------------→ HBase (User/Item Features)
        ↓
Spark MLlib (Periodic Training)
        |
        ↓ (Model Storage)
HDFS (Trained Model)
        |
        ↓ (Load Model)
Real-Time Recommendation Service
        |
        ↓ (HTTP API)
Client App
```
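One design detail the diagram glosses over is the HBase row-key layout for the feature tables. A common pattern is to salt the key with a hash bucket so that sequential user ids do not hotspot a single region server. The bucket count (16) and the `bucket|userId` layout below are illustrative assumptions, not something fixed by the architecture above:

```scala
// Sketch of a salted HBase row key for the user-feature table.
// Salting spreads writes for monotonically growing ids across regions.
object FeatureRowKey {
  private val NumBuckets = 16

  // Prefix the user id with a zero-padded hash bucket, e.g. "u123" -> "13|u123".
  def userRowKey(userId: String): String = {
    val bucket = math.abs(userId.hashCode % NumBuckets)
    f"$bucket%02d|$userId"
  }
}
```

Reads must then fan out across all buckets (or recompute the bucket from the id, as here, when looking up a single user), which is the usual trade-off of salting.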
Phase 1: Real-Time Feature Engineering and Storage (Spark Streaming → HBase)
1. User behavior data schema
Assume the user behavior messages in Kafka are JSON:
```json
{
  "user_id": "u123",
  "item_id": "i456",
  "behavior_type": "click",   // click, view, purchase, cart
  "timestamp": 1715328000000, // event time in epoch milliseconds
  "duration": 15.2,           // dwell time in seconds
  "category": "electronics"
}
```
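On the JVM side the same schema can be mirrored by a small case class; the field names and allowed behavior types below follow the JSON above, while the validation helper is an illustrative addition rather than part of the original design:

```scala
// Mirror of the Kafka message schema above.
final case class UserBehavior(
    userId: String,
    itemId: String,
    behaviorType: String, // one of: click, view, purchase, cart
    timestamp: Long,      // epoch milliseconds
    duration: Double,     // dwell time in seconds
    category: String
)

object UserBehavior {
  private val AllowedTypes = Set("click", "view", "purchase", "cart")

  // Basic sanity check before an event enters the pipeline (illustrative).
  def isValid(b: UserBehavior): Boolean =
    AllowedTypes.contains(b.behaviorType) && b.timestamp > 0 && b.duration >= 0
}
```

Rejecting malformed events at the producer keeps bad records out of Kafka entirely, which is cheaper than filtering them in every downstream consumer.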
2. Spark Structured Streaming processing logic
Scala example (RealTimeFeatureEngineering.scala):
```scala
import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object RealTimeFeatureEngineering {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Real-Time Feature Engineering")
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()
    import spark.implicits._

    // 1. Read user behavior events from Kafka
    val behaviorDF = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-broker:9092")
      .option("subscribe", "user-behaviors")
      .option("startingOffsets", "latest")
      .load()
      .selectExpr("CAST(value AS STRING) as json_value")

    // 2. Parse the JSON payload (fields match the message schema above)
    val behaviorSchema = StructType(Seq(
      StructField("user_id", StringType),
      StructField("item_id", StringType),
      StructField("behavior_type", StringType),
      StructField("timestamp", LongType),
      StructField("duration", DoubleType),
      StructField("category", StringType)
    ))

    val parsedDF = behaviorDF
      .select(from_json($"json_value", behaviorSchema).as("behavior"))
      .select("behavior.*")
      .withColumn("event_time", (col("timestamp") / 1000).cast(TimestampType))

    // Downstream steps (windowed feature aggregation, HBase sink) follow here.
  }
}
```
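The feature-engineering step typically turns raw events into numeric preference scores before they are written to HBase. One common recipe (an assumption here, not something the article prescribes) weights each behavior type and applies exponential time decay, so a day-old purchase counts less than a fresh one:

```scala
// Illustrative behavior scoring: per-type weight × exponential time decay.
object BehaviorScoring {
  // Assumed weights; a purchase signals stronger intent than a view.
  private val TypeWeights = Map(
    "view" -> 1.0, "click" -> 2.0, "cart" -> 3.0, "purchase" -> 5.0
  )

  // An event that is exactly halfLifeMs old contributes half the weight
  // of a fresh one; unknown behavior types score zero.
  def score(behaviorType: String, eventTs: Long, nowTs: Long,
            halfLifeMs: Long = 24L * 3600 * 1000): Double = {
    val w = TypeWeights.getOrElse(behaviorType, 0.0)
    val ageMs = math.max(0L, nowTs - eventTs)
    w * math.pow(0.5, ageMs.toDouble / halfLifeMs)
  }
}
```

Inside the streaming job this would run as a UDF or a typed `map` over `parsedDF`, with the half-life tuned to how quickly user interest goes stale in the target domain.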
