Connecting to MongoDB from Python (PySpark) and Scala
I've recently been using Spark to read data from MongoDB, process it, and write the results back to MongoDB. This post is a short summary of how to read and write.
Preparation
First, download the MongoDB Spark Connector jar from the Maven repository.
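As an alternative to downloading the jar by hand, Spark can fetch the connector from Maven at startup through the `spark.jars.packages` property. A minimal sketch, assuming the connector coordinate `org.mongodb.spark:mongo-spark-connector_2.11:2.4.1` (pick the artifact that matches your Spark and Scala versions):

```python
from pyspark.sql import SparkSession

# Sketch: have Spark resolve the MongoDB connector from Maven at startup.
# Assumption: the _2.11 / 2.4.1 coordinate matches your Scala/Spark versions.
spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1") \
    .getOrCreate()
```

With this in place, no manual jar management is needed on the driver or executors.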
Connecting to MongoDB from Python (PySpark)
Reading from MongoDB:
from urllib import parse

from pyspark.sql import SparkSession

# Create a SparkSession.
# spark.debug.maxToStringFields limits how many fields are included when a
# plan is rendered as a string; exceeding it triggers the warning
# "Truncated the string representation of a plan since it was too large".
my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config('spark.debug.maxToStringFields', '100') \
    .getOrCreate()

# Percent-encode the MongoDB password (it contains '@', which is reserved in URIs)
pwd = parse.quote_plus("Gouuse@spider")

# Read from MongoDB
data = my_spark.read.format("com.mongodb.spark.sql") \
    .option("spark.mongodb.input.uri",
            "mongodb://gouuse:{}@127.0.0.0:27017/testdb.myCollection".format(pwd)) \
    .load()
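Characters like `@` and `:` are reserved in MongoDB connection URIs (an unencoded `@` in the password would be read as the user/host separator), which is why the password goes through `parse.quote_plus` before being embedded. A standalone check of what the encoding produces:

```python
from urllib import parse

# '@' would otherwise be parsed as the end of the credentials section,
# so it must be percent-encoded before being placed in the URI.
pwd = parse.quote_plus("Gouuse@spider")
print(pwd)  # → Gouuse%40spider

uri = "mongodb://gouuse:{}@127.0.0.0:27017/testdb.myCollection".format(pwd)
print(uri)  # → mongodb://gouuse:Gouuse%40spider@127.0.0.0:27017/testdb.myCollection
```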
Writing to MongoDB:
# Write back to MongoDB; "overwrite" replaces the target collection,
# and batchsize controls how many documents go into each bulk insert
data.write.format("com.mongodb.spark.sql") \
    .option("spark.mongodb.output.uri",
            "mongodb://gouuse:{}@127.0.0.0:27017/testdb.mycollection".format(pwd)) \
    .mode("overwrite") \
    .option('batchsize', '1000') \
    .save()
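The `batchsize` option groups documents into bulk inserts rather than sending them one at a time. A conceptual, pure-Python sketch (not Spark code) of how a batch size of 1000 splits the writes:

```python
# Conceptual sketch only: illustrates the grouping that 'batchsize'
# applies on the write path, not the connector's actual implementation.
def chunked(docs, batch_size):
    """Yield successive batches of at most batch_size documents."""
    for i in range(0, len(docs), batch_size):
        yield docs[i:i + batch_size]

docs = [{"_id": n} for n in range(2500)]
batches = list(chunked(docs, 1000))
print([len(b) for b in batches])  # → [1000, 1000, 500]
```

Larger batches mean fewer round-trips to MongoDB, at the cost of more memory per bulk operation.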
Connecting to MongoDB from Scala
// Method 1: pass the input URI when reading
import org.apache.spark.sql.SparkSession

object test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MongoSparkConnectorIntro")
      .getOrCreate()
    // Read the collection through the connector's data source
    val data = spark.read.format("com.mongodb.spark.sql")
      .option("spark.mongodb.input.uri", "mongodb://localhost:27017/testdb.hero")
      .load()
    data.show(33, false)
  }
}
// Method 2: set the URIs on the SparkSession and use MongoSpark.load
import org.apache.spark.sql.SparkSession
import com.mongodb.spark._

object test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MongoSparkConnectorIntro")
      .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/testdb.ets_linkedin_v000020_weight_yp")
      .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.myCollection")
      .getOrCreate()
    // MongoSpark.load picks up spark.mongodb.input.uri from the session
    val rdd = MongoSpark.load(spark)
    rdd.toDF().show(33, false)
  }
}
To be continued…