Connecting Spark to MongoDB from Python

Mongo Spark Connector Python

Prerequisites

Have MongoDB up and running and Spark 2.2.x downloaded. See the introduction and the SQL guide
for more information on getting started.

You can run the interactive pyspark shell like so:

Shell
./bin/pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll?readPreference=primaryPreferred" \
              --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll" \
              --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.3

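Equivalently, when running a standalone script rather than the interactive shell, the same settings can be supplied through a SparkSession builder. A sketch, assuming the same local URIs and connector version as the shell command above:

```python
from pyspark.sql import SparkSession

# Same configuration as the pyspark shell flags above, expressed in a
# standalone script. The URIs and connector version are the ones used
# throughout this post; substitute your own deployment's values.
spark = SparkSession.builder \
    .appName("mongo-example") \
    .config("spark.mongodb.input.uri",
            "mongodb://127.0.0.1/test.coll?readPreference=primaryPreferred") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.11:2.2.3") \
    .getOrCreate()
```

`spark.jars.packages` makes Spark resolve the connector jar at startup, which replaces the `--packages` shell flag.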
The Python API Basics

The Python API works via DataFrames and delegates to the underlying Scala DataFrame implementation.

DataFrames and Datasets

Creating a DataFrame is easy: you can load the data via the connector's DefaultSource ("com.mongodb.spark.sql.DefaultSource").

First, in an empty collection we load the following data:

Python
charactersRdd = sc.parallelize([("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77),
                                ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)])
characters = sqlContext.createDataFrame(charactersRdd, ["name", "age"])
characters.write.format("com.mongodb.spark.sql.DefaultSource").mode("overwrite").save()

Then to load the characters into a DataFrame via the standard source method:

Python
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()

Will return:

root
 |-- _id: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)

Alternatively, you can specify the database and collection while reading the dataframe:

Python
df = spark.read.format("com.mongodb.spark.sql.DefaultSource")\
    .option("spark.mongodb.input.uri", "mongodb://<host>:<port>/<db>.<collection>").load()

And to write a dataframe to a collection:

Python
df.write.format("com.mongodb.spark.sql.DefaultSource")\
    .option("spark.mongodb.output.uri", "mongodb://<host>:<port>/<db>.<collection>").save()

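When the target collection already holds data, the save mode controls what happens to it. A minimal sketch, assuming the `df` and the placeholder URI from the example above:

```python
# "append" inserts the DataFrame's rows alongside existing documents;
# "overwrite" drops the collection and rewrites it (as in the first
# example). Assumes `df` is the DataFrame from above and that the
# placeholders point at a running MongoDB deployment.
df.write.format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("spark.mongodb.output.uri", "mongodb://<host>:<port>/<db>.<collection>") \
    .save()
```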
SQL

Just like the Scala examples, SQL can be used to filter data. In the following example we register a temp table and then filter and output
the characters whose age is 100 or greater:

Python
df.registerTempTable("characters")
centenarians = sqlContext.sql("SELECT name, age FROM characters WHERE age >= 100")
centenarians.show()

Outputs:

+-------+----+
|   name| age|
+-------+----+
|Gandalf|1000|
| Thorin| 195|
|  Balin| 178|
| Dwalin| 169|
|    Oin| 167|
|  Gloin| 158|
+-------+----+



Author: zeropython (WeChat official account) · QQ: 5868037 · Email: 5868037@qq.com