Mongo Spark Connector Python
Prerequisites
Have MongoDB up and running and Spark 2.2.x downloaded. See the introduction and the SQL
documentation for more information on getting started.
You can run the interactive pyspark shell like so:
    ./bin/pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll?readPreference=primaryPreferred" \
                  --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll" \
                  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.3
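The shell invocation has three parts: an input URI, an output URI, and the connector package coordinates. As a rough sketch of how those parts fit together, the following standard-library Python assembles the same argument list. The helper names `mongo_uri` and `pyspark_command` are hypothetical and are not part of the connector or of Spark; they only illustrate the shape of the settings.

```python
import shlex


def mongo_uri(host, database, collection, **options):
    """Build a mongodb://<host>/<db>.<collection> URI (hypothetical helper)."""
    uri = "mongodb://{}/{}.{}".format(host, database, collection)
    if options:
        # Append options such as readPreference as a query string.
        query = "&".join("{}={}".format(k, v) for k, v in sorted(options.items()))
        uri += "?" + query
    return uri


def pyspark_command(input_uri, output_uri, package):
    """Return the pyspark argument list used in the prerequisites (hypothetical helper)."""
    return [
        "./bin/pyspark",
        "--conf", "spark.mongodb.input.uri=" + input_uri,
        "--conf", "spark.mongodb.output.uri=" + output_uri,
        "--packages", package,
    ]


cmd = pyspark_command(
    mongo_uri("127.0.0.1", "test", "coll", readPreference="primaryPreferred"),
    mongo_uri("127.0.0.1", "test", "coll"),
    "org.mongodb.spark:mongo-spark-connector_2.11:2.2.3",
)

# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the list first and quoting it only for display keeps the `?` in the input URI from being interpreted by the shell.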
The Python API Basics
The Python API works via DataFrames and uses the underlying Scala DataFrame implementation.
DataFrames and Datasets
Creating a DataFrame is easy: you can load the data via the DefaultSource
("com.mongodb.spark.sql.DefaultSource").
First, in an empty collection we load the following data:
    charactersRdd = sc.parallelize([("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77),
        ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)])
    characters = sqlContext.createDataFrame(charactersRdd, ["name", "age"])
    characters.write.format("com.mongodb.spark.sql.DefaultSource").mode("overwrite").save()
Then to load the characters into a DataFrame via the standard source method:
    df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
    df.printSchema()
Will return:
    root
     |-- _id: string (nullable = true)
     |-- age: integer (nullable = true)
     |-- name: string (nullable = true)
Alternatively, you can specify the database and collection while reading the dataframe:
    df = spark.read.format("com.mongodb.spark.sql.DefaultSource")\
        .option("spark.mongodb.input.uri", "mongodb://<host>:<port>/<db>.<collection>").load()
And to write a dataframe to a collection:
    df.write.format("com.mongodb.spark.sql.DefaultSource")\
        .option("spark.mongodb.output.uri", "mongodb://<host>:<port>/<db>.<collection>").save()
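Both URIs follow the pattern mongodb://<host>:<port>/<db>.<collection>. To make that pattern concrete, here is a minimal sketch that splits such a URI into its parts using only the Python standard library. The function name `split_namespace` and the sample URI are assumptions made for illustration; the connector does its own parsing internally.

```python
from urllib.parse import urlsplit


def split_namespace(uri):
    """Split mongodb://<host>:<port>/<db>.<collection> into its parts (hypothetical helper)."""
    parts = urlsplit(uri)
    # The path looks like "/<db>.<collection>"; the database name ends at
    # the first dot, and the collection name may itself contain dots.
    database, _, collection = parts.path.lstrip("/").partition(".")
    return parts.hostname, parts.port, database, collection


host, port, database, collection = split_namespace("mongodb://localhost:27017/test.coll")
print(host, port, database, collection)
```

A quick check like this can help catch a URI that is missing the collection part before it is handed to the reader or writer.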
SQL
Just like the Scala examples, SQL can be used to filter data. In the following example we register a temp table and then filter and output
the characters with ages of 100 or more:
    df.registerTempTable("characters")
    centenarians = sqlContext.sql("SELECT name, age FROM characters WHERE age >= 100")
    centenarians.show()
Outputs:
    +-------+----+
    |   name| age|
    +-------+----+
    |Gandalf|1000|
    | Thorin| 195|
    |  Balin| 178|
    | Dwalin| 169|
    |    Oin| 167|
    |  Gloin| 158|
    +-------+----+
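Bombur, whose age is null, does not appear in the result: in SQL a comparison against NULL is never true, so the row is filtered out. The plain-Python sketch below mirrors the WHERE clause on the same sample data without needing Spark or MongoDB. It is only an illustration of the filter logic, not how Spark evaluates the query.

```python
# The same sample data that was written to the collection earlier.
characters = [
    ("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178),
    ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158),
    ("Fili", 82), ("Bombur", None),
]

# Equivalent of: SELECT name, age FROM characters WHERE age >= 100
# A None age stands in for SQL NULL, which never satisfies the comparison.
centenarians = [
    (name, age) for name, age in characters
    if age is not None and age >= 100
]

for name, age in centenarians:
    print(name, age)
```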