MongoDB Connector for Hadoop (Part 2)

我家有个艳

Article link: https://blog.csdn.net/u010814849/article/details/68485330

I The problem
II The solution
Exporting the data
Creating the Hive table - Using BSON files - STORED AS (specified SerDe, INPUT and OUTPUT)
Pitfalls
- 1 Mapping fields with Chinese names

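I The problem

The previous article covered Connecting to MongoDB - MongoStorageHandler. That approach links (maps) a MongoDB collection to a Hive table, so the MongoDB data can be queried and manipulated from Hive with HiveQL.

But this raises two problems:
1. Dropping tables is dangerous: without proper permission management, dropping the table in Hive also drops the MongoDB collection.
2. When the MongoDB collection is large, queries and other operations run through the linked Hive table are very slow. How slow? Try it and see for yourself.

So how do we deal with these problems?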

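II The solution

Problem 1: have operations set up access permissions in MongoDB

Create two users in MongoDB:
- a "superuser" with read and write permissions;
- a "read-only user" that can only read data.

Use the read-only user when creating the link (mapping) with MongoStorageHandler. That way, dropping the table in Hive does not delete the data in the MongoDB collection.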

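Problem 2: there are two options

A. After creating the linked table with the read-only user, create a Hive managed (internal) table, copy the linked table's data into it with a query (insert into / insert overwrite), and run all subsequent queries and other operations against the managed table.
B. Export BSON files from MongoDB with mongodump, then use the MongoDB connector's second usage mode to create a Hive table over BSON files and load the files into it. Done! Try it and you will see the difference.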
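
Option A can be sketched in HiveQL roughly as follows. The table names mongo_song_link and song_local are hypothetical (the linked table is assumed to have been created with MongoStorageHandler as in the previous article), and ORC is just one reasonable storage format, not something this setup requires.

```sql
-- Hypothetical names: mongo_song_link is the MongoStorageHandler-linked table,
-- song_local is the Hive managed table that holds a local copy of its data.

-- First run: create the managed table and copy the data in one statement (CTAS).
CREATE TABLE song_local
STORED AS ORC
AS
SELECT * FROM mongo_song_link;

-- Later refreshes: replace the local copy with the current MongoDB contents.
INSERT OVERWRITE TABLE song_local
SELECT * FROM mongo_song_link;
```

Each copy still reads the whole collection through the connector once, but every query after that runs against local Hive data only.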

Now to the main topic: how to create a Hive table using the MongoDB connector's second usage mode:

1. Exporting the data

mongodump <options>

-h, --host=<hostname>                                     database host (IP or hostname)
    --port=<port>                                         port (can also be given as --host hostname:port)

-u, --username=<username>                                 username
-p, --password=<password>                                 password

-d, --db=<database-name>                                  database
-c, --collection=<collection-name>                        collection

-q, --query=                                              query filter (JSON string), e.g., '{x:{$gt:1}}'
    --queryFile=                                          path to a file containing the query filter

-o, --out=<directory-path>                                output directory

Example

[hadoop@host-231 ~]$ mongodump --host 192.168.1.231 --port 27100 --username readonly --password  xxx --db mzk_spiders --collection tencent.song.stable --out '/home/hadoop/BigData/data'
2017-03-30T15:44:01.171+0800    writing mzk_spiders.tencent.song.stable to
2017-03-30T15:44:04.163+0800    [........................]  mzk_spiders.tencent.song.stable  56362/17121994  (0.3%)
2017-03-30T15:44:07.163+0800    [........................]  mzk_spiders.tencent.song.stable  111217/17121994  (0.6%)

After a successful export, a new directory named after the database (here mzk_spiders) appears under /home/hadoop/BigData/data. It contains two files:
- tencent.song.stable.bson
- tencent.song.stable.metadata.json

2. Creating the Hive table - Using BSON files - STORED AS (specified SerDe, INPUT and OUTPUT)

For installation, see the previous article.

2.1 CREATE TABLE syntax:

CREATE [EXTERNAL] TABLE <tablename>
(<schema>)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
[WITH SERDEPROPERTIES('mongo.columns.mapping'='<JSON mapping>')]
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
[LOCATION '<path to existing directory>'];

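The LOCATION clause specifies where the table's data is stored (an HDFS path). If LOCATION is omitted, the data is stored by default in a directory named after the table under /user/hive/warehouse.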
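
As an illustration of the syntax with an explicit LOCATION, here is a minimal sketch. The table name, the two columns, and the HDFS path are all hypothetical; only the SerDe and input/output format class names come from the syntax above. EXTERNAL is used so that dropping the Hive table leaves the BSON files in place.

```sql
-- Hypothetical table name, columns, and HDFS path, for illustration only.
CREATE EXTERNAL TABLE song_bson_ext (
  title    STRING,
  duration INT
)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION '/data/bson/song_bson_ext';
```

No 'mongo.columns.mapping' is needed here because both column names are assumed to match the MongoDB field names exactly.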

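2.2 Serialization and deserialization

When reading data stored in MongoDB-based or BSON-based tables, the BSONSerDe must deserialize it into Hive objects; when writing, it serializes Hive objects back.

If a Hive column's type does not match the type of the corresponding MongoDB field, that column is null in Hive.

Primitive types convert straightforwardly; you only need to remember the following mappings for complex types: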

Hive Object    MongoDB Object
STRUCT         embedded document
MAP            embedded document
ARRAY          array
STRUCT         ObjectId
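
To make the table above concrete, the sketch below declares one column for each row. The table and column names are hypothetical. The STRUCT<oid:STRING, bsontype:INT> shape for an ObjectId is what I recall from the connector's documentation, so treat it as an assumption and verify it against the version you run.

```sql
-- Hypothetical table and column names; one column per row of the type table.
CREATE TABLE complex_types_demo (
  id      STRUCT<oid:STRING, bsontype:INT>,     -- ObjectId (struct shape is an assumption)
  extinfo STRUCT<city:STRING, label:STRING>,    -- embedded document with known fields
  attrs   MAP<STRING, STRING>,                  -- embedded document with arbitrary keys
  tags    ARRAY<STRING>                         -- array
)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id"}')
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat';
```

STRUCT suits an embedded document whose field names are fixed and known in advance; MAP is the looser choice when the keys vary from document to document.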

2.3 BSONSerDe Mappings

2.3.1 Fields whose Hive name matches the MongoDB field name need no explicit mapping

If a mapping isn’t specified for a particular field name in Hive, the BSONSerDe assumes that the corresponding field name in the underlying MongoDB collection specified is the same.

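2.3.2 Hive column names are case-insensitive, so camelCase field names become all lowercase in Hive while MongoDB keeps the original case. Such fields need an explicit mapping.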
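
A minimal sketch of such a mapping, assuming a collection with a top-level field artistName and an embedded field extInfo.birthPlace (the table name and field names are hypothetical). The Hive side is written in lowercase and the MongoDB side keeps its camelCase spelling:

```sql
-- Hypothetical table and field names.
-- Left side of each mapping entry: Hive column (lowercase).
-- Right side: MongoDB field, spelled exactly as stored in the collection.
CREATE TABLE artist_case_demo (
  artistname STRING,
  extinfo    STRUCT<birthplace:STRING>
)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{
"artistname":"artistName",
"extinfo.birthplace":"extInfo.birthPlace"
}')
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat';
```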

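2.3.3 When defining a mapping, both sides must have the same depth!

E.g., "a.b.c":"a.b" and "a.b":"a.b.c" are both invalid: the two sides differ in depth, and table creation fails with an error.
Only mappings of equal depth, such as "a.b":"c.d", are allowed.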

Because the BSONSerDe mapper tries to infer upper level mappings from any multi-level mapping, the Hive struct field or column name has to be of the same depth as the MongoDB field being mapped to. So, we can’t have a mapping like a.b.c : d.e.f.g because the upper level mappings a:d, a.b:d.e, a.b.c:d.e.f are created, but it’s unclear what d.e.f.g should be mapped from. In the same vein, we can’t have a mapping like a.b.c.d : e.f.g because the upper level mappings a:e, a.b:e.f, a.b.c:e.f.g are created, but it’s unclear what a.b.c.d should be mapped to.

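2.4 Loading the data

There are two ways to load a BSON file into the Hive table:

Use an HDFS command to put the BSON file into the table directory given by LOCATION in the CREATE TABLE statement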

hadoop fs -put /home/hadoop/BigData/data/mzk_spiders/tencent.song.stable.bson /user/hive/warehouse/tencent_song_stable_bson

Use Hive's load data command to load the BSON file into the Hive table (under the hood, load data also just copies the file into the table directory)

load data local inpath '/home/hadoop/BigData/data/mzk_spiders/tencent.song.stable.bson' into table tencent_song_stable_bson

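Either method loads the data into the Hive table. Query the table to check that the data is correct; if something looks wrong, review the CREATE TABLE statement.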
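
A simple way to run that check, using the tencent_song_stable_bson table from the commands above: compare the row count with the total that mongodump reported in its progress output (17121994 in the export shown earlier, assuming the collection did not change in between), then look at a few rows.

```sql
-- The row count should be close to the document total reported by mongodump.
SELECT COUNT(*) FROM tencent_song_stable_bson;

-- Spot-check a few rows; columns that are NULL everywhere usually point to
-- a type mismatch or a wrong entry in 'mongo.columns.mapping'.
SELECT * FROM tencent_song_stable_bson LIMIT 10;
```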

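3. Pitfalls

3.1 Mapping fields with Chinese names

When a MongoDB field name is in Chinese, the field came back entirely null after loading even though the mapping was specified correctly. Investigation showed that the Chinese characters in the field information stored in the Hive metastore database were garbled. The fix is to change the character set of the metastore table.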


CREATE TABLE mongodb_connector.mytest_artist_stable_bson2 (
artist_id bigint,
extInfo struct<birthPlace:string,recordcompany:string,birthday:string>,
name string,
nation string
)ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{
"artist_id":"_id",
"extInfo.birthPlace":"extInfo.出生地",
"extInfo.recordcompany":"extInfo.经纪公司",
"extInfo.birthday":"extInfo.出生日期"
}'
)
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat';

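Set the character set of the relevant column in the Hive metastore table to UTF-8 (the statement below uses MySQL syntax and is run in the metastore database, not in Hive):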

alter table TABLE_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;