Apache Hive integration with Elasticsearch

Latest recommended article published on 2022-08-03 14:39:24

BIGDATA08

Category: elasticsearch. Tags: hive, elasticsearch

Configuration

When using Hive, you can specify configuration properties through TBLPROPERTIES (as an alternative to the Hadoop Configuration object) when declaring the external table backed by Elasticsearch:

CREATE EXTERNAL TABLE artists (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.index.auto.create' = 'false');

Both entries in TBLPROPERTIES are elasticsearch-hadoop settings.

Mapping

By default, elasticsearch-hadoop uses the Hive table schema to map the data in Elasticsearch, using both the field names and types in the process. There are cases, however, when the names in Hive cannot be used with Elasticsearch (a field name can contain characters accepted by Elasticsearch but not by Hive). For such cases, use the es.mapping.names setting, which accepts a comma-separated list of name mappings in the following format: Hive field name:Elasticsearch field name

To wit:

CREATE EXTERNAL TABLE artists (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.mapping.names' = 'date:@timestamp, url:url_123');

This declares a name mapping for two fields, separated by a comma:

Hive column date is mapped in Elasticsearch to @timestamp

Hive column url is mapped in Elasticsearch to url_123
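
As a rough sketch of how the name mapping might look in a complete declaration, the snippet below declares both Hive columns and then reads them back through their Hive names. The table name artists_mapped and the column types are assumptions made for illustration; the original example elides the column list. The column date is quoted with backticks because DATE is a reserved word in recent Hive versions.

```sql
-- Hypothetical table: the name and column types are illustrative assumptions.
CREATE EXTERNAL TABLE artists_mapped (
    `date`  TIMESTAMP,
    url     STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.mapping.names' = 'date:@timestamp, url:url_123');

-- Queries use the Hive names; the mapping above points them at the
-- Elasticsearch fields @timestamp and url_123.
SELECT `date`, url
FROM artists_mapped
LIMIT 10;
```

On the Hive side only the Hive names appear; the Elasticsearch field names live solely in the table properties.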

Hive is case insensitive while Elasticsearch is not. The loss of case information can produce invalid queries (as the column name in Hive might not match the field name in Elasticsearch). To avoid this, elasticsearch-hadoop always converts Hive column names to lower-case. That said, it is recommended to follow the default Hive style: use upper-case only for Hive commands and avoid mixed-case names.

Hive represents missing values with the special value NULL. This means that when running an incorrect query (with incorrect or non-existing field names), the Hive table is populated with NULL instead of an exception being thrown. Make sure to validate your data and keep a close eye on your schema, since schema updates will otherwise go unnoticed due to this lenient behavior.
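
One possible way to spot this silent NULL behavior is to compare the total row count with the count of non-NULL values per column. The query below is only a sketch; it assumes a Hive table artists with columns id and name, and the output aliases are made up for illustration.

```sql
-- COUNT(*) counts all rows; COUNT(col) counts only rows where col is not NULL.
SELECT COUNT(*)    AS total_rows,
       COUNT(id)   AS non_null_id,
       COUNT(name) AS non_null_name
FROM artists;
```

If a column that should be populated reports zero non-NULL values, a misspelled or unmapped field name is a likely cause worth checking.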

Writing data to Elasticsearch

With elasticsearch-hadoop, Elasticsearch becomes just an external table into which data can be loaded or from which it can be read:

CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    name    STRING,
    links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists');

-- insert data to Elasticsearch from another table called 'source'
INSERT OVERWRITE TABLE artists
    SELECT NULL, s.name, named_struct('url', s.url, 'picture', s.picture)
    FROM source s;

In the example above, STORED BY names the Elasticsearch Hive StorageHandler, and es.resource gives the Elasticsearch resource (index and type) associated with the given storage.

For cases where the id (or other metadata fields like ttl or timestamp) of the document needs to be specified, set the appropriate mapping, namely es.mapping.id. Following the previous example, to tell Elasticsearch to use the field id as the document id, update the table properties:

CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    ...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.mapping.id' = 'id'...);
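
To make the effect of es.mapping.id more concrete, here is a sketch that combines it with the earlier write example. It assumes the source table also has an id column, and it uses the hypothetical table name artists_by_id so as not to clash with the earlier declaration; neither detail comes from the original text.

```sql
-- Hypothetical table name; the columns follow the earlier artists example.
CREATE EXTERNAL TABLE artists_by_id (
    id      BIGINT,
    name    STRING,
    links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.mapping.id' = 'id');

-- Assumes source has an id column; its value becomes the document id.
INSERT OVERWRITE TABLE artists_by_id
    SELECT s.id, s.name, named_struct('url', s.url, 'picture', s.picture)
    FROM source s;
```

Unlike the earlier insert, which selects NULL for id, this one selects a real value, since the field is now used as the document id.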

Writing existing JSON to Elasticsearch

For cases where the job input data is already in JSON, elasticsearch-hadoop allows direct indexing without applying any transformation; the data is taken as is and sent directly to Elasticsearch. In such cases, indicate the JSON input by setting the es.input.json parameter. elasticsearch-hadoop then expects the output table to contain only one field, whose content is used as the JSON document. The library recognizes specific textual types (such as string or binary) and otherwise simply calls toString on the value.
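
A minimal sketch of such a single-column table might look as follows. The table name artists_json, the source table json_source, and its column json_doc are hypothetical names chosen for illustration; only es.resource, es.input.json, and the storage handler class come from the text above.

```sql
-- One column only: its content is sent to Elasticsearch as the JSON document.
CREATE EXTERNAL TABLE artists_json (data STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.input.json' = 'yes');

-- Assumes json_source.json_doc already holds one JSON document per row.
INSERT OVERWRITE TABLE artists_json
    SELECT s.json_doc
    FROM json_source s;
```

Because the data is passed through untouched, each row is expected to already contain a well-formed JSON document.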
