Source code on GitHub:
https://github.com/mongodb/mongo-spark is the Spark connector library released by MongoDB; it makes it easy to read and write MongoDB data from Spark.
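To run the examples that follow, the connector package and the MongoDB URIs need to be available on the Spark context. A minimal sketch for spark-shell (the collection name and connector version 2.2.0 for Scala 2.11 are assumptions; adjust to your environment):

spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 \
  --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection" \
  --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection"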
1. Writing an RDD to MongoDB
There are two ways to write a generated RDD to MongoDB. Example code:
val sc = getSparkContext(args)
import com.mongodb.spark._
import org.bson.Document

// Way 1: save with the default WriteConfig derived from the SparkContext
val documents = sc.parallelize((1 to 10).map(i => Document.parse(s"{test: $i}")))
MongoSpark.save(documents)

// Way 2: save with an explicit WriteConfig (collection and write concern overridden)
import com.mongodb.spark.config._
val writeConfig = WriteConfig(Map("collection" -> "spark", "writeConcern.w" -> "majority"), Some(WriteConfig(sc)))
val sparkDocuments = sc.parallelize((1 to 10).map(i => Document.parse(s"{spark: $i}")))
MongoSpark.save(sparkDocuments, writeConfig)
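Note that getSparkContext(args) above is a helper from the connector's tutorial code, not part of the library. A hypothetical stand-in, assuming the URIs configured earlier, could look like this; WriteConfig(sc) later reads the spark.mongodb.output.* settings from this conf:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical replacement for the tutorial's getSparkContext helper.
def getSparkContext(args: Array[String]): SparkContext = {
  val conf = new SparkConf()
    .setAppName("mongo-spark-example")
    .setIfMissing("spark.master", "local[*]")
    .set("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.spark")
  new SparkContext(conf)
}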
The functions invoked are:
MongoSpark.save(rdd)
MongoSpark.save(rdd, writeConfig)
Looking at the definition of MongoSpark.save: def save[D: ClassTag](rdd: RDD[D]): Unit = save(rdd, WriteConfig(rdd.sparkContext)), both forms ultimately end up in MongoSpark.save(rdd, writeConfig). Its implementation:
def save[D: ClassTag](rdd: RDD[D], writeConfig: WriteConfig): Unit = {
  val mongoConnector = MongoConnector(writeConfig.asOptions)
  rdd.foreachPartition(iter => if (iter.nonEmpty) {
    // Remainder reconstructed from the mongo-spark 2.x source: borrow a collection
    // handle from the connector and bulk-insert the partition's documents in batches.
    mongoConnector.withCollectionDo(writeConfig, { collection: MongoCollection[D] =>
      iter.grouped(DefaultMaxBatchSize).foreach(batch => collection.insertMany(batch.toList.asJava))
    })
  })
}
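So each partition obtains a MongoCollection handle through the shared MongoConnector (which caches MongoClients per executor JVM), groups its documents into batches, and bulk-writes them with insertMany under the write concern carried by the WriteConfig. The same connector/withCollectionDo pattern can be used directly for custom writes; a sketch under the same assumptions (URI, database, and collection names are illustrative):

import com.mongodb.client.MongoCollection
import com.mongodb.spark.MongoConnector
import com.mongodb.spark.config.WriteConfig
import org.bson.Document

val writeConf = WriteConfig(Map(
  "uri" -> "mongodb://127.0.0.1/",
  "database" -> "test",
  "collection" -> "spark"))
// Borrow a collection handle from the connector's cached client and run code against it.
MongoConnector(writeConf.asOptions).withCollectionDo(writeConf, { coll: MongoCollection[Document] =>
  coll.insertOne(Document.parse("""{status: "ok"}"""))
})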