I. The Official Spark Connector
MongoDB officially provides a Spark connector, and it is actually quite convenient to use. The catch: my leader had always worked with Flink's DataSet API, which can push a filter down to MongoDB so only the matching documents are read. My first attempt at reading MongoDB took about ten minutes, and the leader figured the slowness came from not filtering down to just the one day of data we needed. As far as I could tell, though, the official connector did not support filtering before the read; it pulled all the data over and filtered afterwards. So I had to find my own way to filter first and only then fetch the data.
I won't write up a usage example for the official connector here; a quick search turns up plenty.
Official docs:
Spark Connector
II. Reading MongoDB Data via the Hadoop Input Format
This approach is borrowed from the way Flink's DataSet API reads MongoDB; a sketch of that Flink pattern follows below.
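For reference, the Flink side looks roughly like this. It is my own reconstruction, not code from this project; it assumes flink-scala and flink-hadoop-compatibility on the classpath and uses a made-up db.collection URI. The point is that mongo.input.query rides along in the Hadoop configuration, so MongoDB filters before anything is shipped to Flink:

import com.mongodb.hadoop.MongoInputFormat
import org.apache.flink.api.scala._
import org.apache.flink.api.scala.hadoop.mapreduce.HadoopInputFormat
import org.apache.hadoop.mapreduce.Job
import org.bson.BSONObject

object FlinkMongoSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val job = Job.getInstance()
    // The query travels in the Hadoop config, so the filter is applied on the MongoDB side
    job.getConfiguration.set("mongo.input.uri", "mongodb://127.0.0.1/db.collection")
    job.getConfiguration.set("mongo.input.query", "{\"launch_time\": {\"$gte\": 0}}")
    val ds = env.createInput(
      new HadoopInputFormat[Object, BSONObject](new MongoInputFormat, classOf[Object], classOf[BSONObject], job))
    ds.first(10).print()
  }
}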
1. Implementation
Dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.3.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.mongodb/mongo-java-driver -->
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>mongo-java-driver</artifactId>
        <version>3.8.2</version>
    </dependency>
    <dependency>
        <groupId>org.mongodb.mongo-hadoop</groupId>
        <artifactId>mongo-hadoop-core</artifactId>
        <version>2.0.2</version>
    </dependency>
</dependencies>
Code:
import java.text.SimpleDateFormat

import com.google.gson.Gson
import com.mongodb.hadoop.MongoInputFormat
import org.apache.hadoop.conf.Configuration
import org.bson.{BSONObject, Document}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @Author: fseast
 * @Date: 2020/4/9 5:44 PM
 * @Description:
 */
object SparkMongoDBConnect {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkMongoDBConnect").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Bounds for the day we want: [2020-03-02 00:00, 2020-03-03 00:00)
    val sdf = new SimpleDateFormat("yyyy-MM-dd")
    val date = sdf.parse("2020-03-02")
    val date2 = sdf.parse("2020-03-03")
    println(date)

    val config = new Configuration()
    // For the available options see: https://github.com/mongodb/mongo-hadoop/wiki/Configuration-Reference
    config.set("mongo.input.split.create_input_splits", "false")
    config.set("mongo.input.uri", "mongodb://127.0.0.1/test-community.setup?authSource=admin")

    // Pre-filter: launch_time >= the March 2 midnight timestamp and < the March 3 midnight
    // timestamp (epoch seconds). Gson serializes the Document into the JSON query string,
    // roughly {"launch_time":{"$gte":<epoch seconds>,"$lt":<epoch seconds>}}
    val doc = new Document("launch_time", new Document()
      .append("$gte", date.getTime / 1000L)
      .append("$lt", date2.getTime / 1000L))
    config.set("mongo.input.query", new Gson().toJson(doc))

    // Watch the import: it must be com.mongodb.hadoop.MongoInputFormat (the new Hadoop API),
    // not com.mongodb.hadoop.mapred.MongoInputFormat, or this call will fail to compile
    val sourceRdd = sc.newAPIHadoopRDD(config, classOf[MongoInputFormat], classOf[Object], classOf[BSONObject])
    sourceRdd.foreach(println)
    println(sourceRdd.count())

    sc.stop()
  }
}
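The records come back as (Object, BSONObject) pairs; as far as I can tell from mongo-hadoop's record reader, the key is the document's _id and the value is the full document. Here is a minimal sketch of doing something more useful than printing, continuing from sourceRdd above (it assumes launch_time is stored as an integral number):

// Continues from sourceRdd above: pull launch_time out of each document.
// Assumes launch_time is an integral number; adjust the conversion otherwise.
val launchTimes = sourceRdd.map { case (_, bson) => bson.get("launch_time").toString.toLong }
println(launchTimes.take(10).mkString(", "))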
2. Errors Encountered
The key line is this one:
Failed to aggregate sample documents. Note that this Splitter implementation is incompatible with MongoDB versions prior to 3.2.
But my MongoDB is version 3.6.
The full stack trace:
Exception in thread "main" java.io.IOException: com.mongodb.hadoop.splitter.SplitFailedException: Failed to aggregate sample documents. Note that this Splitter implementation is incompatible with MongoDB versions prior to 3.2.
at com.mongodb.hadoop.MongoInputFormat.getSplits(MongoInputFormat.java:62)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:127)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:925)
at com.fseast.spark.SparkMongoDBConnect$.main(SparkMongoDBConnect.scala:48)
at com.fseast.spark.SparkMongoDBConnect.main(SparkMongoDBConnect.scala)
Caused by: com.mongodb.hadoop.splitter.SplitFailedException: Failed to aggregate sample documents. Note that this Splitter implementation is incompatible with MongoDB versions prior to 3.2.
at com.mongodb.hadoop.splitter.SampleSplitter.calculateSplits(SampleSplitter.java:84)
at com.mongodb.hadoop.MongoInputFormat.getSplits(MongoInputFormat.java:60)
... 14 more
Caused by: com.mongodb.MongoCommandException: Command failed with error 9: 'The 'cursor' option is required, except for aggregate with the explain argument' on server 127.0.0.1:27017. The full response is { "ok" : 0.0, "errmsg" : "The 'cursor' option is required, except for aggregate with the explain argument", "code" : 9, "codeName" : "FailedToParse" }
at com.mongodb.connection.ProtocolHelper.getCommandFailureException(ProtocolHelper.java:86)
at com.mongodb.connection.CommandProtocol.execute(CommandProtocol.java:120)
at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:159)
at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:286)
at com.mongodb.connection.DefaultServerConnection.command(DefaultServerConnection.java:173)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:215)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:206)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:112)
at com.mongodb.operation.AggregateOperation$1.call(AggregateOperation.java:227)
at com.mongodb.operation.AggregateOperation$1.call(AggregateOperation.java:223)
at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:239)
at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:212)
at com.mongodb.operation.AggregateOperation.execute(AggregateOperation.java:223)
at com.mongodb.operation.AggregateOperation.execute(AggregateOperation.java:65)
at com.mongodb.Mongo.execute(Mongo.java:772)
at com.mongodb.Mongo$2.execute(Mongo.java:759)
at com.mongodb.DBCollection.aggregate(DBCollection.java:1377)
at com.mongodb.DBCollection.aggregate(DBCollection.java:1308)
at com.mongodb.DBCollection.aggregate(DBCollection.java:1294)
at com.mongodb.hadoop.splitter.SampleSplitter.calculateSplits(SampleSplitter.java:82)
... 15 more
Solution:
At first I hadn't added the mongo-java-driver dependency myself, and since some of the databases read fine while others threw this error, I never suspected the driver. Later, while Googling, I saw someone mention it, so I added the dependency below and the error went away. As far as I understand it, the root cause is that MongoDB 3.6 rejects aggregate commands sent without the cursor option, which is exactly the form the older driver that mongo-hadoop pulls in on its own still issues; a newer driver always sends the cursor option.
<dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongo-java-driver</artifactId>
    <version>3.6.2</version>
</dependency>
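One thing that would have shortened my debugging: check which driver jar actually wins on the classpath (e.g. against what mvn dependency:tree reports). A small sketch, standard JVM reflection rather than anything mongo-specific:

// Prints the jar that com.mongodb.MongoClient was loaded from; the file name
// usually carries the mongo-java-driver version that is really in effect
println(classOf[com.mongodb.MongoClient].getProtectionDomain.getCodeSource.getLocation)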