HIVE数据导入ES并避免字段空值占用空间

最新推荐文章于 2023-03-22 09:06:04 发布

AA赵师傅

最新推荐文章于 2023-03-22 09:06:04 发布

阅读量1.6k

点赞数 1

分类专栏： ES hive 文章标签： hive elasticsearch 大数据

本文链接：https://blog.csdn.net/zcb_data/article/details/125901299

版权

hive 同时被 2 个专栏收录

13 篇文章 0 订阅

订阅专栏

2 篇文章 0 订阅

订阅专栏

方案一：hive正常表导入ES流程

1、自己下载 elasticsearch-hadoop-6.6.0.jar依赖包，根据自己es版本选择，下载地址：

https://www.elastic.co/pt/downloads/hadoop

2、文件上传HDFS指定位置，我这边放在

/user/hive/jars/elasticsearch-hadoop-6.6.0.jar;

3、hive加载jar包

ADD JAR hdfs://uat01/user/hive/jars/elasticsearch-hadoop-6.6.0.jar;
ADD JAR hdfs://uat01/user/hive/jars/commons-httpclient-3.1.jar;

--注意，部分不加载“commons-httpclient-3.1.jar”可能会报错error in initSerDe: java.lang.NoClassDefFoundError org/apache/commons/httpclient，可以直接把包放在hive的lib下一样效果

4、准备好数据，数据样例，其中email字段部分内容置为空模拟数据为空效果

5、ES创建索引mapping

PUT user_info
{
 "mappings": {
 "user":{
 "properties": {
 "id":{
 "type": "long"
 },
 "name":{
 "type": "text"
 },
 "birthday":{
 "type": "text"
 },
 "email":{
 "type": "text"
 }
 }
 }
 }
}

6、建外部表

drop table user_info_es;
create external table user_info_es(
id string,
name string,
birthday string,
gender string,
email string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.resource' = 'user_info/user', -- 索引名及type
'es.nodes'='192.168.139.101:9200', -- es地址
'es.mapping.id'='id' , -- 声明主键
'es.write.operation'='upsert', -- 声明更新
'es.index.refresh_interval' = '-1', 
'es.index.number_of_replicas' = '0',
'es.batch.write.retry.count' = '6',
'es.batch.write.retry.wait' = '60s',
'es.mapping.name' = 'id:id,name:name,birthday:birthday,gender:gender,email:email' -- 字段映射
);

7、往外部表导入数据

insert overwrite table user_info_es
select id,name,birthday, gender,email  from user_info_es_temp limit 100;

8、查看ES数据，null数据仍占用字段，当表中null值很多，会占用较大空间，优化方案如下

方案二：以JSON格式数据导入ES

步骤1~5同上

6、建外部表

drop table user_info_es;
create external table user_info_es(
jsonString string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.resource' = 'user_info/user', -- 索引名及type
'es.nodes'='192.168.139.101:9200', -- es地址
'es.mapping.id'='id' , -- 声明主键
'es.write.operation'='upsert', -- 声明更新
'es.index.refresh_interval' = '-1',
'es.index.number_of_replicas' = '0',
'es.batch.write.retry.count' = '6',
'es.batch.write.retry.wait' = '600s',
'es.input.json'  = 'yes' -- 必加 自动识别JSON
);

7、HIVE表数据加工成JSON

-- concat会把value为空的数据，concat置为空
-- 最后一个字段必须不含空值，或者使用concat_ws代替，否则会多个','逗号,导致JSON格式出现问题
select concat_ws('', '{', id, birthday, gender, email, name, '}') jsonString
from (select concat('\"id\":\"', id, '\",')             id,
             concat('\"birthday\":\"', birthday, '\",') birthday,
             concat('\"gender\":\"', gender, '\",')     gender,
             concat('\"email\":\"', email, '\",')       email,
             concat('\"name\":\"', name, '\"')          name
      from user_info_es_temp
     ) t;

8、样例数据

9、往外部表导入数据-导入前记得清空ES原数据，虽然是update，但email原本为null的更新不了

insert overwrite table user_info_es
-- concat会把value为空的数据，concat置为空
-- 最后一个字段必须不含空值，或者使用concat_ws代替，否则会多个','逗号,导致JSON格式出现问题
select concat_ws('', '{', id, birthday, gender, email, name, '}') jsonString
from (select concat('\"id\":\"', id, '\",')             id,
             concat('\"birthday\":\"', birthday, '\",') birthday,
             concat('\"gender\":\"', gender, '\",')     gender,
             concat('\"email\":\"', email, '\",')       email,
             concat('\"name\":\"', name, '\"')          name
      from user_info_es_temp
     ) t;