Connecting Spark to MongoDB from Python

Mongo Spark Connector Python

Prerequisites

Have MongoDB up and running and Spark 2.2.x downloaded. See the introduction and the SQL guide
for more information on getting started.

You can run the interactive pyspark shell like so:

Shell
./bin/pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll?readPreference=primaryPreferred" \
              --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll" \
              --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.3

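Equivalently, when running a standalone script rather than the interactive shell, the same settings can be supplied through a SparkSession builder. A sketch, assuming the same local URIs and connector version as the shell command above:

```python
from pyspark.sql import SparkSession

# Same configuration as the pyspark shell flags above, expressed in a
# standalone script. The URIs and connector version are the ones used
# throughout this post; substitute your own deployment's values.
spark = SparkSession.builder \
    .appName("mongo-example") \
    .config("spark.mongodb.input.uri",
            "mongodb://127.0.0.1/test.coll?readPreference=primaryPreferred") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.11:2.2.3") \
    .getOrCreate()
```

`spark.jars.packages` makes Spark resolve the connector jar at startup, which replaces the `--packages` shell flag.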
The Python API Basics

The Python API works via DataFrames and delegates to the underlying Scala DataFrame implementation.

DataFrames and Datasets

Creating a DataFrame is easy: you can load the data via the connector's DefaultSource ("com.mongodb.spark.sql.DefaultSource").

First, in an empty collection we load the following data:

Python
charactersRdd = sc.parallelize([("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77),
                                ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)])
characters = sqlContext.createDataFrame(charactersRdd, ["name", "age"])
characters.write.format("com.mongodb.spark.sql.DefaultSource").mode("overwrite").save()

Then to load the characters into a DataFrame via the standard source method:

Python
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()

Will return:

root
 |-- _id: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)

Alternatively, you can specify the database and collection while reading the dataframe:

Python
df = spark.read.format("com.mongodb.spark.sql.DefaultSource")\
    .option("spark.mongodb.input.uri", "mongodb://<host>:<port>/<db>.<collection>").load()

And to write a dataframe to a collection:

Python
df.write.format("com.mongodb.spark.sql.DefaultSource")\
    .option("spark.mongodb.output.uri", "mongodb://<host>:<port>/<db>.<collection>").save()

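When the target collection already holds data, the save mode controls what happens to it. A minimal sketch, assuming the `df` and the placeholder URI from the example above:

```python
# "append" inserts the DataFrame's rows alongside existing documents;
# "overwrite" drops the collection and rewrites it (as in the first
# example). Assumes `df` is the DataFrame from above and that the
# placeholders point at a running MongoDB deployment.
df.write.format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("spark.mongodb.output.uri", "mongodb://<host>:<port>/<db>.<collection>") \
    .save()
```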
SQL

Just like the Scala examples, SQL can be used to filter data. In the following example we register a temp table and then filter and output
the characters whose age is 100 or greater:

Python
df.registerTempTable("characters")
centenarians = sqlContext.sql("SELECT name, age FROM characters WHERE age >= 100")
centenarians.show()

Outputs:

+-------+----+
|   name| age|
+-------+----+
|Gandalf|1000|
| Thorin| 195|
|  Balin| 178|
| Dwalin| 169|
|    Oin| 167|
|  Gloin| 158|
+-------+----+



Author: zeropython (WeChat official account) · QQ: 5868037 · Email: 5868037@qq.com