1. Overview
This article covers the fourth stage of my Spark study notes. The goal is an end-to-end demo of Spark Streaming + Flume + log4j + MongoDB, building on the demo introduced in <Spark 阶段总结 3> (Spark Stage Summary 3). The companion code is on GitHub: https://github.com/riverlight/spark-study-1.
2. MongoDB Installation and Usage
Installation guide: http://www.runoob.com/mongodb/mongodb-linux-install.html
Operating MongoDB from Scala: http://blog.csdn.net/yaoyasong/article/details/39698339
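As a quick sanity check that the Casbah driver and the MongoDB instance work together before wiring in Spark, here is a minimal connect-insert-query sketch. The host/port and the database/collection names (sca/mysca) match the demo below; adjust them to your setup.

package com.leon

import com.mongodb.casbah.MongoClient
import com.mongodb.casbah.commons.MongoDBObject

object MongoSmokeTest {
  def main(args: Array[String]): Unit = {
    // connect to the standalone mongod used by the demo
    val client = MongoClient("192.168.227.132", 27017)
    val coll = client("sca")("mysca")
    // insert a document and read it back
    coll.insert(MongoDBObject("hello" -> "world"))
    coll.find(MongoDBObject("hello" -> "world")).foreach(println)
    client.close()
  }
}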
3. mfs
3.1 Code: mfs.scala
package com.leon

import com.mongodb.casbah.MongoClient
import com.mongodb.casbah.commons.MongoDBObject
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume._
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Created by leon on 2016/1/21.
 */
object mfs {
  def main(args: Array[String]): Unit = {
    println("Hi, this is a mongodb+flume+spark demo program")
    if (args.length < 2) {
      println("usage: mfs <host> <port>")
      System.exit(1)
    }

    // connect to MongoDB; "sca" is the database, "mysca" the collection
    val mongoClient = MongoClient("192.168.227.132", 27017)
    val db = mongoClient("sca")
    val mysca = db("mysca")

    val sparkConf = new SparkConf().setAppName("FlumeEventCount")
    val ssc = new StreamingContext(sparkConf, Seconds(20))

    val hostname = args(0)
    val port = args(1).toInt
    val storageLevel = StorageLevel.MEMORY_ONLY
    println(hostname + " " + port)

    // pull-based receiver: Spark polls a SparkSink running inside the Flume agent
    val flumeStream = FlumeUtils.createPollingStream(ssc, hostname, port, storageLevel)

    // the insert must happen inside foreachRDD, once per 20-second batch;
    // rdd.count() is an action that returns a concrete Long on the driver
    flumeStream.foreachRDD { rdd =>
      val cnt = rdd.count()
      println("Received " + cnt + " flume events.")
      mysca.insert(MongoDBObject("count" -> cnt))
    }

    // first attempt -- fails, see 3.2:
    // flumeStream.count().map(cnt => "Received " + cnt + " flume events.").print()
    // val count1 = MongoDBObject("count" -> flumeStream.count())
    // mysca.insert(count1)

    ssc.start()
    ssc.awaitTermination()  // block until the streaming job is terminated
  }
}
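For createPollingStream to receive anything, the Flume agent must run Spark's custom sink, with the spark-streaming-flume-sink jar (and its Scala dependencies) on the agent's classpath. Below is a sketch of such an agent configuration, assuming an avro source fed by log4j's Log4jAppender on the application side; the agent name (agent1), channel names, and ports are placeholders, and the sink's hostname/port must match the arguments passed to mfs.

agent1.sources = logsrc
agent1.channels = mem
agent1.sinks = spark

# avro source that the log4j Log4jAppender writes to
agent1.sources.logsrc.type = avro
agent1.sources.logsrc.bind = 0.0.0.0
agent1.sources.logsrc.port = 41414
agent1.sources.logsrc.channels = mem

agent1.channels.mem.type = memory
agent1.channels.mem.capacity = 10000

# Spark's custom sink; mfs polls this host/port via createPollingStream
agent1.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent1.sinks.spark.hostname = 0.0.0.0
agent1.sinks.spark.port = 42424
agent1.sinks.spark.channel = mem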
3.2 Notes
Note the difference between the foreachRDD call and the commented-out code below it. My first attempt used the commented code to write to the database, and it failed: flumeStream.count() returns a DStream[Long], not a concrete number, and that code runs only once on the driver, when the streaming graph is built. Moving the insert into foreachRDD solves it, because its body runs once per batch and rdd.count() is an action that returns an actual Long.
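One further subtlety: the insert above works because foreachRDD's body (and the rdd.count() action) execute on the driver, where mongoClient lives. To write the event payloads themselves, the per-record work runs on the executors, so the Mongo connection has to be created inside the task. A hedged sketch of that pattern (not in the repo; the "line" field name is made up for illustration):

flumeStream.foreachRDD { rdd =>
  rdd.foreachPartition { events =>
    // the client must be created here, inside the task, because
    // MongoClient is not serializable and cannot be shipped from the driver
    val client = MongoClient("192.168.227.132", 27017)
    val coll = client("sca")("mysca")
    events.foreach { e =>
      // SparkFlumeEvent wraps an AvroFlumeEvent; its body is the raw log line
      val body = new String(e.event.getBody.array())
      coll.insert(MongoDBObject("line" -> body))
    }
    client.close()
  }
}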