MongoDB Connector for Hadoop (Part 1)

1. The MongoDB Connector

The MongoDB Connector for Hadoop is a library which allows MongoDB (or backup files in its data format, BSON) to be used as an input source, or output destination, for Hadoop MapReduce tasks. It is designed to allow greater flexibility and performance and make it easy to integrate data in MongoDB with other parts of the Hadoop ecosystem including the following:

  • Pig
  • Spark
  • MapReduce
  • Hadoop Streaming
  • Hive
  • Flume

2. Two ways to use the MongoDB Connector with Hive

The MongoDB Connector offers two ways for Hive to work with MongoDB data:
- Connecting to MongoDB - MongoStorageHandler
- Using BSON files - STORED AS (specified SerDe, INPUT and OUTPUT formats)

Option 1: connect directly and specify a column mapping, but mind the permission issue (the DROP TABLE pitfall!).

How it works: a mapping is established between a Hive table and a MongoDB collection. A query against the Hive table is parsed and pushed down to the MongoDB collection, and the results are returned.

Option 2: use BSON files.
Export a BSON dump from MongoDB with mongodump, then load the BSON file into Hive.

This article covers the first approach; for the second, see MongoDB Connector for Hadoop (Part 2).
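The BSON-file route (covered in Part 2) can be sketched as follows; the host, port, database, collection, and paths below are placeholders borrowed from the example later in this article:

```shell
# Export the collection to a .bson dump file
mongodump --host 192.168.1.231 --port 27100 \
    --db mzk_spiders --collection tencent.song.stable \
    --out /tmp/dump

# Copy the dump to HDFS so Hive can load it
hdfs dfs -mkdir -p /data/tencent_song
hdfs dfs -put /tmp/dump/mzk_spiders/tencent.song.stable.bson /data/tencent_song/
```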

3. Connecting to MongoDB - MongoStorageHandler

3.1 Installation

Download the following three JARs into Hive's lib directory, then restart Hive:

- mongo-hadoop-core-2.0.2.jar
- mongo-hadoop-hive-2.0.2.jar
- mongo-java-driver-3.4.2.jar
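
Installing them might look like the following (the Hive home path is an assumption; adjust for your environment):

```shell
# Copy the connector JARs into Hive's lib directory, then restart Hive
cp mongo-hadoop-core-2.0.2.jar \
   mongo-hadoop-hive-2.0.2.jar \
   mongo-java-driver-3.4.2.jar \
   "$HIVE_HOME/lib/"
```

Alternatively, running ADD JAR /path/to/each.jar inside a Hive session registers them for that session only.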

Or add the Maven dependency (version 2.0.2 matches the JARs listed above):

<dependency>
    <groupId>org.mongodb.mongo-hadoop</groupId>
    <artifactId>mongo-hadoop-core</artifactId>
    <version>2.0.2</version>
</dependency>

3.2 Quickstart Example

Choose which fields of the MongoDB collection to map into the Hive table, and specify the mapping.

The CREATE TABLE syntax is:

CREATE [EXTERNAL] TABLE <tablename>
(<schema>)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
[WITH SERDEPROPERTIES('mongo.columns.mapping'='<JSON mapping>')]
TBLPROPERTIES('mongo.uri'='<MongoURI>');

Example:

CREATE EXTERNAL TABLE tencent_song_stable (
    song_id BIGINT,
    ctime BIGINT,
    publishCompany STRING,
    trackNumber INT,
    song_name STRING,
    artists ARRAY<STRUCT<artist_mid:STRING,artist_name:STRING,artist_id:BIGINT>>,
    lang STRING,
    originalName STRING,
    genre STRING,
    utime BIGINT,
    publishTime STRING,
    sizeogg INT,
    size_flac INT,
    size_ape INT,
    size320 INT,
    size128 INT,
    offline INT,
    bpm INT,
    playNum INT
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'=
'{"song_id":"_id",
"publishCompany":"publishCompany",
"trackNumber":"trackNumber",
"song_name":"name",
"artists.artist_mid":"artists.mid",
"artists.artist_name":"artists.name",
"artists.artist_id":"artists._id",
"lang":"language",
"originalName":"originalName",
"publishTime":"publishTime",
"playNum":"playNum"
}') 
TBLPROPERTIES('mongo.uri'='mongodb://readonly:xxxx@192.168.1.231:27100/mzk_spiders.tencent.song.stable');

4. Mappings

4.1 A field need not be mapped if its Hive name matches the MongoDB field name

If a mapping isn’t specified for a particular field name in Hive, the BSONSerDe assumes that the corresponding field name in the underlying MongoDB collection specified is the same.

For example, several columns in the example above are not listed in the mapping.

4.2 Hive column names are case-insensitive and stored in lowercase, while MongoDB field names are case-sensitive; any camelCase MongoDB field must therefore be mapped explicitly

4.3 Both sides of a mapping must have the same depth!

E.g. neither "a.b.c":"a.b" nor "a.b":"a.b.c" is allowed; in both, the two sides differ in depth, and table creation will fail.
Only equal-depth mappings such as "a.b":"c.d" are valid.

Because the BSONSerDe mapper tries to infer upper level mappings from any multi-level mapping, the Hive struct field or column name has to be of the same depth as the MongoDB field being mapped to. So, we can't have a mapping like a.b.c : d.e.f.g because the upper level mappings a:d, a.b:d.e, a.b.c:d.e.f are created, but it's unclear what d.e.f.g should be mapped from. In the same vein, we can't have a mapping like a.b.c.d : e.f.g because the upper level mappings a:e, a.b:e.f, a.b.c:e.f.g are created, but it's unclear what a.b.c.d should be mapped to.
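The depth rule boils down to comparing dot-counts on both sides of a mapping entry; a small hypothetical checker (not part of the connector) makes it concrete:

```python
def same_depth(hive_field: str, mongo_field: str) -> bool:
    """A mapping entry is acceptable only when both sides have the
    same nesting depth, i.e. the same number of dots."""
    return hive_field.count(".") == mongo_field.count(".")

# Equal depth, e.g. "a.b":"c.d": accepted
print(same_depth("a.b", "c.d"))        # True
# Unequal depth: rejected at CREATE TABLE time
print(same_depth("a.b.c", "d.e.f.g"))  # False
print(same_depth("a.b", "a.b.c"))      # False
```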

5. Important caveats!

5.1 Dropping tables!

When creating a table in Hive, you may or may not specify EXTERNAL, i.e. you create either a Hive managed (internal) table or an EXTERNAL table.

EXTERNAL table
Dropping an EXTERNAL table in Hive removes only the table's metadata; the data in the MongoDB collection is untouched.

Managed (internal) table
If EXTERNAL was omitted at creation, dropping the table in Hive deletes the underlying MongoDB collection as well!
This is a serious pitfall, so take special care.

To be safe, lock this down with MongoDB permissions on the ops side.

readonly is a MongoDB user with read-only access:
eg. 'mongo.uri'='mongodb://readonly:xxxx@192.168.1.231:27100/mzk_spiders.tencent.song.stable'

If the table created is EXTERNAL, when the table is dropped only its metadata is deleted; the underlying MongoDB collection remains intact. On the other hand, if the table is not EXTERNAL, dropping the table deletes both the metadata associated with the table and the underlying MongoDB collection.
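
Creating such a read-only user could look like this in the mongo shell (database and user names follow the example above; the password is a placeholder):

```javascript
// Grants read-only access to the mzk_spiders database only
db.getSiblingDB("mzk_spiders").createUser({
    user: "readonly",
    pwd: "xxxx",  // placeholder
    roles: [ { role: "read", db: "mzk_spiders" } ]
})
```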

5.2 Limitations of MongoStorageHandler

In current versions, both INSERT INTO and INSERT OVERWRITE perform plain inserts; to actually overwrite a table, you must delete the existing data first.

INSERT INTO vs. INSERT OVERWRITE: As of now, there’s no way for a table created with any custom StorageHandler to distinguish between the INSERT INTO TABLE and INSERT OVERWRITE commands. So both commands do the same thing: insert certain rows of data into a MongoDB-based Hive table. So to INSERT OVERWRITE, you’d have to first drop the table and then insert into the table.
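So an overwrite has to be emulated by hand; a HiveQL sketch using the table from section 3.2 (the staging table is hypothetical):

```sql
-- INSERT OVERWRITE would behave exactly like INSERT INTO here, so instead:
DROP TABLE tencent_song_stable;
-- (for a non-EXTERNAL table this also empties the MongoDB collection)

-- ...re-run the CREATE TABLE statement from section 3.2, then:

INSERT INTO TABLE tencent_song_stable
SELECT * FROM tencent_song_staging;  -- hypothetical staging table
```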

6. See the original documentation for details

MongoDB Connector for Hadoop
