Spark: Filtering MongoDB Data Before Reading It In

I. The Official Spark Connector

  MongoDB does provide an official Spark connector, and it is fairly convenient to use. The catch is that my leader had always used Flink's DataSet, which can apply a filter before pulling the data over from MongoDB. My first attempt at reading MongoDB took about ten minutes, and the leader felt it was slow precisely because we were not filtering down to the single day of data we needed. However, the official connector does not support filtering before reading: it pulls all the data over and only then filters. So I had to find my own way to filter first and then fetch the data.
  I won't walk through usage examples for the official connector here; a quick search turns up plenty. (A minimal sketch of the typical pattern follows below, for context.)
  Official docs:
    Spark Connector
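
  For context, a common pattern with the official connector looks roughly like the sketch below. This is only a sketch, assuming the mongo-spark-connector_2.11 artifact and a numeric launch_time field stored in seconds; note that the filter here runs inside Spark after the documents have already been read, which is exactly the behaviour described above.

import java.text.SimpleDateFormat

import com.mongodb.spark.MongoSpark
import org.apache.spark.{SparkConf, SparkContext}

object OfficialConnectorSketch {
  def main(args: Array[String]): Unit = {
    // The input URI is read from the Spark configuration
    val conf = new SparkConf()
      .setAppName("OfficialConnectorSketch")
      .setMaster("local[*]")
      .set("spark.mongodb.input.uri", "mongodb://127.0.0.1/test-community.setup")
    val sc = new SparkContext(conf)

    // Timestamp (seconds) for 2020-03-02 00:00 in the local time zone
    val start = new SimpleDateFormat("yyyy-MM-dd").parse("2020-03-02").getTime / 1000L

    // Load the whole collection as an RDD of org.bson.Document,
    // then filter on the Spark side
    val filtered = MongoSpark.load(sc).filter(doc => doc.getLong("launch_time") >= start)
    println(filtered.count())

    sc.stop()
  }
}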

II. Reading MongoDB Data with the Hadoop Input Format

  This approach is basically borrowed from the way Flink's DataSet reads MongoDB.

1. Implementation

  Dependencies:

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.3.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.3.4</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.mongodb/mongo-java-driver -->
        
        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongo-java-driver</artifactId>
            <version>3.8.2</version>
        </dependency>

        <dependency>
            <groupId>org.mongodb.mongo-hadoop</groupId>
            <artifactId>mongo-hadoop-core</artifactId>
            <version>2.0.2</version>
        </dependency>

    </dependencies>

  Code:

import java.text.SimpleDateFormat

import com.google.gson.Gson
import com.mongodb.hadoop.MongoInputFormat
import org.bson.{BSONObject, Document}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @Author: fseast
 * @Date: 2020/4/9 5:44 PM
 * @Description: Read MongoDB through the mongo-hadoop input format,
 *               pushing the filter down via mongo.input.query.
 */
object SparkMongoDBConnect {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkMongoDBConnect").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Build the date range for the filter: [2020-03-02 00:00, 2020-03-03 00:00)
    val sdf = new SimpleDateFormat("yyyy-MM-dd")
    val date = sdf.parse("2020-03-02")
    val date2 = sdf.parse("2020-03-03")
    println(date)

    // Configuration reference: https://github.com/mongodb/mongo-hadoop/wiki/Configuration-Reference
    val config = new Configuration()
    config.set("mongo.input.split.create_input_splits", "false")
    config.set("mongo.input.uri", "mongodb://127.0.0.1/test-community.setup?authSource=admin")

    // Only fetch documents whose launch_time (stored in seconds) is >= the
    // March 2 midnight timestamp and < the March 3 midnight timestamp
    val doc = new Document("launch_time", new Document()
      .append("$gte", date.getTime / 1000L)
      .append("$lt", date2.getTime / 1000L))
    config.set("mongo.input.query", new Gson().toJson(doc))

    // Be careful with the import: use com.mongodb.hadoop.MongoInputFormat,
    // not com.mongodb.hadoop.mapred.MongoInputFormat, or this call will fail
    val sourceRdd = sc.newAPIHadoopRDD(config, classOf[MongoInputFormat], classOf[Object], classOf[BSONObject])

    sourceRdd.foreach(println)
    println(sourceRdd.count())

    sc.stop()
  }
}
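
  The resulting RDD is keyed by the document _id, with the document itself as an org.bson.BSONObject value, so individual fields still have to be pulled out by hand. A minimal sketch, assuming launch_time is a numeric timestamp in seconds:

    // Continues from sourceRdd above: extract one field from each BSONObject
    val launchTimes = sourceRdd.map { case (_, bson) =>
      bson.get("launch_time").asInstanceOf[Number].longValue()
    }
    println(launchTimes.take(10).mkString(", "))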

2. Errors Encountered

The key line is:

Failed to aggregate sample documents. Note that this Splitter implementation is incompatible with MongoDB versions prior to 3.2.

But my MongoDB is version 3.6.
The full stack trace:

Exception in thread "main" java.io.IOException: com.mongodb.hadoop.splitter.SplitFailedException: Failed to aggregate sample documents. Note that this Splitter implementation is incompatible with MongoDB versions prior to 3.2.
	at com.mongodb.hadoop.MongoInputFormat.getSplits(MongoInputFormat.java:62)
	at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:127)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.foreach(RDD.scala:925)
	at com.fseast.spark.SparkMongoDBConnect$.main(SparkMongoDBConnect.scala:48)
	at com.fseast.spark.SparkMongoDBConnect.main(SparkMongoDBConnect.scala)
Caused by: com.mongodb.hadoop.splitter.SplitFailedException: Failed to aggregate sample documents. Note that this Splitter implementation is incompatible with MongoDB versions prior to 3.2.
	at com.mongodb.hadoop.splitter.SampleSplitter.calculateSplits(SampleSplitter.java:84)
	at com.mongodb.hadoop.MongoInputFormat.getSplits(MongoInputFormat.java:60)
	... 14 more
Caused by: com.mongodb.MongoCommandException: Command failed with error 9: 'The 'cursor' option is required, except for aggregate with the explain argument' on server 127.0.0.1:27017. The full response is { "ok" : 0.0, "errmsg" : "The 'cursor' option is required, except for aggregate with the explain argument", "code" : 9, "codeName" : "FailedToParse" }
	at com.mongodb.connection.ProtocolHelper.getCommandFailureException(ProtocolHelper.java:86)
	at com.mongodb.connection.CommandProtocol.execute(CommandProtocol.java:120)
	at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:159)
	at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:286)
	at com.mongodb.connection.DefaultServerConnection.command(DefaultServerConnection.java:173)
	at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:215)
	at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:206)
	at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:112)
	at com.mongodb.operation.AggregateOperation$1.call(AggregateOperation.java:227)
	at com.mongodb.operation.AggregateOperation$1.call(AggregateOperation.java:223)
	at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:239)
	at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:212)
	at com.mongodb.operation.AggregateOperation.execute(AggregateOperation.java:223)
	at com.mongodb.operation.AggregateOperation.execute(AggregateOperation.java:65)
	at com.mongodb.Mongo.execute(Mongo.java:772)
	at com.mongodb.Mongo$2.execute(Mongo.java:759)
	at com.mongodb.DBCollection.aggregate(DBCollection.java:1377)
	at com.mongodb.DBCollection.aggregate(DBCollection.java:1308)
	at com.mongodb.DBCollection.aggregate(DBCollection.java:1294)
	at com.mongodb.hadoop.splitter.SampleSplitter.calculateSplits(SampleSplitter.java:82)
	... 15 more

Solution:
  At first I had not added the MongoDB Java driver explicitly. Reading some of the databases worked fine while others threw this error, so I didn't suspect the driver at all. Later, while Googling, I saw someone mention it; I added the dependency below and the problem went away. (The likely cause: mongo-hadoop-core pulls in an older Java driver transitively, and that driver issues the aggregate command without the cursor option that MongoDB 3.6 requires, so explicitly declaring a newer driver fixes it.)

        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongo-java-driver</artifactId>
            <version>3.6.2</version>
        </dependency>
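
  If you run into the same error, it can be worth checking which driver version actually ends up on the classpath, since mongo-hadoop-core brings one in transitively. For example:

    mvn dependency:tree -Dincludes=org.mongodb:mongo-java-driver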