MongoDB Connector for Hadoop (Part 1)

1. The MongoDB Connector

The MongoDB Connector for Hadoop is a library which allows MongoDB (or backup files in its data format, BSON) to be used as an input source, or output destination, for Hadoop MapReduce tasks. It is designed to allow greater flexibility and performance and make it easy to integrate data in MongoDB with other parts of the Hadoop ecosystem including the following:

  • Pig
  • Spark
  • MapReduce
  • Hadoop Streaming
  • Hive
  • Flume

2. Two ways to use the MongoDB Connector with Hive

The MongoDB Connector offers two ways for Hive to work with MongoDB data:
- Connecting to MongoDB - MongoStorageHandler
- Using BSON files - STORED AS (specified SerDe, INPUT and OUTPUT formats)

Option 1: connect directly and specify a column mapping, but mind the permission issue (the DROP TABLE pitfall!).

How it works: a mapping is established between a Hive table and a MongoDB collection. A query against the Hive table is parsed and pushed down to the MongoDB collection, and the results are returned.

Option 2: use BSON files.
Export a BSON dump from MongoDB with mongodump, then load the BSON file into Hive.

This article covers the first approach; for the second, see MongoDB Connector for Hadoop (Part 2).
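The BSON-file route (covered in Part 2) can be sketched as follows; the host, port, database, collection, and paths below are placeholders borrowed from the example later in this article:

```shell
# Export the collection to a .bson dump file
mongodump --host 192.168.1.231 --port 27100 \
    --db mzk_spiders --collection tencent.song.stable \
    --out /tmp/dump

# Copy the dump to HDFS so Hive can load it
hdfs dfs -mkdir -p /data/tencent_song
hdfs dfs -put /tmp/dump/mzk_spiders/tencent.song.stable.bson /data/tencent_song/
```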

3. Connecting to MongoDB - MongoStorageHandler

3.1 Installation

Download the following three JARs into Hive's lib directory, then restart Hive:

- mongo-hadoop-core-2.0.2.jar
- mongo-hadoop-hive-2.0.2.jar
- mongo-java-driver-3.4.2.jar
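
Installing them might look like the following (the Hive home path is an assumption; adjust for your environment):

```shell
# Copy the connector JARs into Hive's lib directory, then restart Hive
cp mongo-hadoop-core-2.0.2.jar \
   mongo-hadoop-hive-2.0.2.jar \
   mongo-java-driver-3.4.2.jar \
   "$HIVE_HOME/lib/"
```

Alternatively, running ADD JAR /path/to/each.jar inside a Hive session registers them for that session only.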

Or add the Maven dependency (version 2.0.2 matches the JARs listed above):

<dependency>
    <groupId>org.mongodb.mongo-hadoop</groupId>
    <artifactId>mongo-hadoop-core</artifactId>
    <version>2.0.2</version>
</dependency>

3.2 Quickstart Example

Choose which fields of the MongoDB collection to map into the Hive table, and specify the mapping.

The CREATE TABLE syntax is:

CREATE [EXTERNAL] TABLE <tablename>
(<schema>)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
[WITH SERDEPROPERTIES('mongo.columns.mapping'='<JSON mapping>')]
TBLPROPERTIES('mongo.uri'='<MongoURI>');

Example:

CREATE EXTERNAL TABLE tencent_song_stable (
    song_id BIGINT,
    ctime BIGINT,
    publishCompany STRING,
    trackNumber INT,
    song_name STRING,
    artists ARRAY<STRUCT<artist_mid:STRING,artist_name:STRING,artist_id:BIGINT>>,
    lang STRING,
    originalName STRING,
    genre STRING,
    utime BIGINT,
    publishTime STRING,
    sizeogg INT,
    size_flac INT,
    size_ape INT,
    size320 INT,
    size128 INT,
    offline INT,
    bpm INT,
    playNum INT
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'=
'{"song_id":"_id",
"publishCompany":"publishCompany",
"trackNumber":"trackNumber",
"song_name":"name",
"artists.artist_mid":"artists.mid",
"artists.artist_name":"artists.name",
"artists.artist_id":"artists._id",
"lang":"language",
"originalName":"originalName",
"publishTime":"publishTime",
"playNum":"playNum"
}') 
TBLPROPERTIES('mongo.uri'='mongodb://readonly:xxxx@192.168.1.231:27100/mzk_spiders.tencent.song.stable');

4. Mappings

4.1 A field need not be mapped if its Hive name matches the MongoDB field name

If a mapping isn’t specified for a particular field name in Hive, the BSONSerDe assumes that the corresponding field name in the underlying MongoDB collection specified is the same.

For example, several columns in the example above are not listed in the mapping.

4.2 Hive column names are case-insensitive and stored in lowercase, while MongoDB field names are case-sensitive; any camelCase MongoDB field must therefore be mapped explicitly

4.3 Both sides of a mapping must have the same depth!

E.g. neither "a.b.c":"a.b" nor "a.b":"a.b.c" is allowed; in both, the two sides differ in depth, and table creation will fail.
Only equal-depth mappings such as "a.b":"c.d" are valid.

Because the BSONSerDe mapper tries to infer upper level mappings from any multi-level mapping, the Hive struct field or column name has to be of the same depth as the MongoDB field being mapped to. So, we can't have a mapping like a.b.c : d.e.f.g because the upper level mappings a:d, a.b:d.e, a.b.c:d.e.f are created, but it's unclear what d.e.f.g should be mapped from. In the same vein, we can't have a mapping like a.b.c.d : e.f.g because the upper level mappings a:e, a.b:e.f, a.b.c:e.f.g are created, but it's unclear what a.b.c.d should be mapped to.
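The depth rule boils down to comparing dot-counts on both sides of a mapping entry; a small hypothetical checker (not part of the connector) makes it concrete:

```python
def same_depth(hive_field: str, mongo_field: str) -> bool:
    """A mapping entry is acceptable only when both sides have the
    same nesting depth, i.e. the same number of dots."""
    return hive_field.count(".") == mongo_field.count(".")

# Equal depth, e.g. "a.b":"c.d": accepted
print(same_depth("a.b", "c.d"))        # True
# Unequal depth: rejected at CREATE TABLE time
print(same_depth("a.b.c", "d.e.f.g"))  # False
print(same_depth("a.b", "a.b.c"))      # False
```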

5. Important caveats!

5.1 Dropping tables!

When creating a table in Hive, you may or may not specify EXTERNAL, i.e. you create either a Hive managed (internal) table or an EXTERNAL table.

EXTERNAL table
Dropping an EXTERNAL table in Hive removes only the table's metadata; the data in the MongoDB collection is untouched.

Managed (internal) table
If EXTERNAL was omitted at creation, dropping the table in Hive deletes the underlying MongoDB collection as well!
This is a serious pitfall, so take special care.

To be safe, lock this down with MongoDB permissions on the ops side.

readonly is a MongoDB user with read-only access:
eg. 'mongo.uri'='mongodb://readonly:xxxx@192.168.1.231:27100/mzk_spiders.tencent.song.stable'

If the table created is EXTERNAL, when the table is dropped only its metadata is deleted; the underlying MongoDB collection remains intact. On the other hand, if the table is not EXTERNAL, dropping the table deletes both the metadata associated with the table and the underlying MongoDB collection.
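
Creating such a read-only user could look like this in the mongo shell (database and user names follow the example above; the password is a placeholder):

```javascript
// Grants read-only access to the mzk_spiders database only
db.getSiblingDB("mzk_spiders").createUser({
    user: "readonly",
    pwd: "xxxx",  // placeholder
    roles: [ { role: "read", db: "mzk_spiders" } ]
})
```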

5.2 Limitations of MongoStorageHandler

In current versions, both INSERT INTO and INSERT OVERWRITE perform plain inserts; to actually overwrite a table, you must delete the existing data first.

INSERT INTO vs. INSERT OVERWRITE: As of now, there’s no way for a table created with any custom StorageHandler to distinguish between the INSERT INTO TABLE and INSERT OVERWRITE commands. So both commands do the same thing: insert certain rows of data into a MongoDB-based Hive table. So to INSERT OVERWRITE, you’d have to first drop the table and then insert into the table.
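So an overwrite has to be emulated by hand; a HiveQL sketch using the table from section 3.2 (the staging table is hypothetical):

```sql
-- INSERT OVERWRITE would behave exactly like INSERT INTO here, so instead:
DROP TABLE tencent_song_stable;
-- (for a non-EXTERNAL table this also empties the MongoDB collection)

-- ...re-run the CREATE TABLE statement from section 3.2, then:

INSERT INTO TABLE tencent_song_stable
SELECT * FROM tencent_song_staging;  -- hypothetical staging table
```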

6. See the original documentation for details

MongoDB Connector for Hadoop
