Table of Contents
- Hands-on project: importing Hive table data directly into Elasticsearch
This article needs no code at all, simple and crude, but relatively inflexible; the underlying compute framework is MapReduce, so the import is comparatively slow.
- Hands-on project: Spark writes Hive table data into Elasticsearch (Java version)
This article requires Java code and implements similar functionality, with Java handling everything in one go: reading from Hive with Spark, creating the ES index, loading the data, and using ES index aliases so data refreshes switch over seamlessly. The underlying compute framework is Spark, so the import is much faster than the approach in article 1.
- Hands-on project: DingTalk alerts to verify data quality between Elasticsearch and the Hive data warehouse (Java version)
This article picks key metrics and checks that the data in the source (Hive) and the target (ES) are consistent.
- Hands-on project: Spark writes Hive table data into an Elasticsearch cluster that requires username/password authentication (Java version)
This article explains how to write Hive data with Spark into an ES cluster protected by account/password authentication.
- Hands-on project (production deployment): parameterized Spark job writing Hive table data into an Elasticsearch cluster that requires username/password authentication (Java version)
This article covers the same authenticated write path, but the Spark and ES index-creation parameters are externalized into configuration: syncing one more table to ES only requires adding one XML config file. It is also the Java code the author runs in production, and it addresses the shortcomings many readers pointed out in approach 4.
Summary:
1. If your coding skills are limited but you need to load Hive data into Elasticsearch, consider article 1.
2. If you can code, I recommend the combination of articles 2 and 3 (the author's pick) as the architecture for loading offline or near-line data from the Hive data warehouse into Elasticsearch. The Java code shared there is the author's earliest version 1, written above all to be easy to understand and to get the job done; readers are welcome to rework it, so please don't complain that the code is rough.
3. If your Elasticsearch requires account/password authentication, for example a cloud product or a cluster where you enabled authentication yourself, then article 4 is the only option.
4. For production deployment, see article 5.
- Hive version used: 2.3.5
- ES version used: 7.7.1
- Spark version used: 2.3.3
Background
To support a new search feature for a certain product, the business decided to use Elasticsearch (hereafter ES) as the back-end search engine: matching results are retrieved from ES and returned to the front end for display. But all of my base data lives in Hive tables in the data warehouse, which raises the question of how to load Hive table data directly into ES.
Checking the official documentation confirmed the idea is feasible. The official example is fairly simple and does not cover complex data types or nested structures; building on it, working with ES's characteristics, and taking quite a few wrong turns, I finally got it working with nothing but SQL statements, without writing a single line of Java or Python. This article shares that experience.
- Official documentation on importing Hive data into ES: Apache Hive integration
- Hive version used: 2.3.5
- ES version used: 7.7.1
Business Scenario
I have many stores attached to buildings, with a one-to-many relationship between buildings and stores. The individual stores are hard for customers to locate when discussing cooperation, so a map search first shows the buildings that match the customer's needs and then lists the stores inside each matching building, so that operations staff can schedule resources to meet the demand. The Hive table structure and test data are as follows.
Note: the coordinates and building names are made up; they are test data only, used to illustrate the point.
-- Create the table
CREATE EXTERNAL TABLE `dm_rs_tbb_building_and_store`
(
`building_id` bigint comment'大厦id',
`building_name` string comment'大厦名称',
`building_amap_longitude` string comment'大厦经度',
`building_amap_latitude` string comment'大厦纬度',
`building_label` string comment'大厦标签',
`building_scope` string comment'大厦规模',
`store_id` bigint comment'门店id',
`store_name` string comment'门店名称',
`sotre_level` string comment'门店星级'
)
COMMENT 'dm层大厦门店关系表'
PARTITIONED BY (
`event_week` int,
`event_day` string,
`event_hour` string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'/hive/warehouse/dm/rs/dm_rs_tbb_building_and_store'
TBLPROPERTIES ('parquet.compression'='snappy')
;
-- Insert test data
insert overwrite table dm_rs_tbb_building_and_store partition(event_week='33',event_day='20200811',event_hour='00')
select
1 as `building_id`
,'唐琳大楼' as `building_name`
,'125.519831' as`building_amap_longitude`
,'31.176394' as `building_amap_latitude`
,'代表私营企业主,普通白领,商务楼'`building_label`
,'2万平以下' as `building_scope`
,1 as `store_id`
,'若隐_唐琳店' as `store_name`
,'AAA' as `sotre_level`
union all
select
1 as `building_id`
,'唐琳大楼' as `building_name`
,'125.519831' as`building_amap_longitude`
,'31.176394' as `building_amap_latitude`
,'代表私营企业主,普通白领,商务楼'`building_label`
,'2万平以下' as `building_scope`
,2 as `store_id`
,'若隐_唐琳五门店' as `store_name`
,'AAAA' as `sotre_level`
union all
select
1 as `building_id`
,'唐琳大楼' as `building_name`
,'125.519831' as`building_amap_longitude`
,'31.176394' as `building_amap_latitude`
,'代表私营企业主,普通白领,商务楼'`building_label`
,'2万平以下' as `building_scope`
,3 as `store_id`
,'若隐_唐琳旗舰店' as `store_name`
,'AAAA' as `sotre_level`
union all
select
2 as `building_id`
,'国际娱乐城大厦' as `building_name`
,'113.389831' as`building_amap_longitude`
,'23.212363' as `building_amap_latitude`
,'代表外籍人士及归国华侨,普通白领,娱乐场所'`building_label`
,'3万平以下' as `building_scope`
,4 as `store_id`
,'若隐_娱乐城店' as `store_name`
,'AAAAA' as `sotre_level`
union all
select
2 as `building_id`
,'国际娱乐城大厦' as `building_name`
,'113.389831' as`building_amap_longitude`
,'23.212363' as `building_amap_latitude`
,'代表外籍人士及归国华侨,普通白领,娱乐场所'`building_label`
,'3万平以下' as `building_scope`
,5 as `store_id`
,'若隐_娱乐城西门店' as `store_name`
,'AAA' as `sotre_level`
;
The data preview is shown in Figure 1.
However, the structure ES expects is not a straight copy. First, the coordinates should use ES's own geo_point data type (which Hive does not have); second, buildings and stores should be nested to reduce data redundancy. The desired result looks like this:
{
building_id: 2,
building_name: "国际娱乐城大厦",
building_map:
{
lat: "23.212363",
lon: "113.389831"
},
building_scope: "3万平米以下",
building_label: "代表外籍人士及归国华侨,普通白领,娱乐场所",
stores_info:
[
{
store_id: 4,
store_name: "若隐_娱乐城店",
sotre_level: "AAAAA"
},
{
store_id: 5,
store_name: "若隐_娱乐城西门店",
sotre_level: "AAA"
}
]
}
Hands-on Steps
Create the ES index and its mapping
The index is created in Kibana.
The statement that creates the index and its mapping schema is shown below. Regarding the data type of each ES field, only a few special types are explained here; for the rest, refer to the official documentation on ES field data types.
"number_of_replicas": 1, the number of replica copies of the index data
"number_of_shards": 5, the number of primary shards of the index
"max_result_window": 1000000, raised per the front end's requirements; the maximum number of rows a search can page through (from + size), similar to SQL's LIMIT
"type": "text", a string type that supports analysis (tokenization)
"type": "keyword", a string type without analysis; it can only be matched as a whole value
"type": "geo_point", an ES-specific type designed for latitude/longitude coordinates
"type": "nested", marks the field as holding nested documents
PUT /app_es_rs_tbb_building_and_store
{
"settings":
{
"number_of_replicas": 1
,"number_of_shards": 5
,"max_result_window" : 1000000
}
,"mappings":
{
"properties" :
{
"building_id" :
{
"type": "long"
}
,"building_name" :
{
"type": "text" , "analyzer": "ik_smart"
}
,"building_map":
{
"type": "geo_point"
}
,"building_label":
{
"type": "text" , "analyzer": "ik_smart"
}
,"building_scope":
{
"type":"keyword"
}
,"load_time":
{
"type": "date"
, "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
,"stores_info":
{
"type": "nested" ,
"properties":
{
"store_id":
{
"type": "long"
}
,"store_name":
{
"type": "text" , "analyzer": "ik_smart"
}
,"sotre_level":
{
"type": "keyword"
}
}
}
}
}
}
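After the PUT succeeds, a quick way to confirm that the mapping took effect and that the ik_smart analyzer behaves as expected is to run the following in Kibana. This is only a sanity check and assumes the IK analysis plugin is installed on the cluster (it is required by the mapping above); the sample text comes from the test data:
GET /app_es_rs_tbb_building_and_store/_mapping
POST /app_es_rs_tbb_building_and_store/_analyze
{
  "analyzer": "ik_smart",
  "text": "国际娱乐城大厦"
}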
Download the jar that Hive needs to write to ES
Choose the dependency jar that matches your ES version, as shown in Figure 2; the official download link is below.
Link: jar required for importing Hive data into ES
Click DOWNLOAD and you will be taken to the matching version; click the ZIP file, and after downloading and unpacking it you get the jar we need: elasticsearch-hadoop-7.7.1.jar.
Upload the jar either to a local node of your Hive cluster or to HDFS; both work. This article uses a local node as the example, e.g. the path /home/liuxiaowei/elasticsearch-hadoop-7.7.1.jar.
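For reference, a minimal sketch of registering the jar in a Hive session. The local path is the one used later in this article; the HDFS path is hypothetical (Hive can also ADD JAR from an hdfs:// location):
hive> add jar file:///home/liuxiaowei/elasticsearch-hadoop-7.7.1.jar;
hive> add jar hdfs:///user/liuxiaowei/jars/elasticsearch-hadoop-7.7.1.jar;
To avoid re-running ADD JAR in every session, the jar can also be registered permanently through the hive.aux.jars.path property in hive-site.xml.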
Create the Hive table that maps to the ES mapping
We need to create one more table, app_es_rs_tbb_building_and_store, whose structure represents what will be stored in ES. It has to stay consistent with the ES mapping, or at least be convertible into it. Here a Hive struct is used to map to ES's geo_point, and a Hive array whose elements are structs maps to ES's nested structure, which is quite neat. For more type conversions, see Type conversion in the official documentation, as shown in Figure 3.
Note: the geo_point struct must be written with latitude first and longitude second, otherwise it tends to fail. With the struct form, the geo_point data that lands in ES looks like this:
building_map:
{
lat: "36.679025",
lon: "117.061612"
}
Alternatively, building_map can be declared as a string when creating the Hive table and populated with concat(coalesce(p.building_amap_latitude,0.0),',',coalesce(p.building_amap_longitude,0.0)), again latitude first and longitude second. That also loads without errors, but the value displayed in ES changes to the form below, which is less convenient to use than the key/value form above.
building_map:"36.679025,117.061612"
ES's date type is very picky; the constraint looks like this:
"load_time":
{
"type": "date"
,"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
So the date format is strictly constrained. Don't try fancy formats such as yyyy-M-d, yyyy-MM-dd HH:mm:sss, or dd-MM-yyyy; they will only come back to bite you.
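For reference, a few Hive expressions that produce values matching the three accepted formats above (a small sketch using standard Hive built-ins; load_time is the field defined in the mapping earlier):
select date_format(current_timestamp(), 'yyyy-MM-dd HH:mm:ss') as load_time_ts      -- yyyy-MM-dd HH:mm:ss
      ,to_date(current_timestamp())                            as load_time_day     -- yyyy-MM-dd
      ,unix_timestamp() * 1000                                 as load_time_millis  -- epoch_millis
;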
The statement executed in the Hive CLI to create the ES-mapped table is as follows:
hive> add jar file:///home/liuxiaowei/elasticsearch-hadoop-7.7.1.jar;
Added [file:///home/liuxiaowei/elasticsearch-hadoop-7.7.1.jar] to class path
Added resources: [file:///home/liuxiaowei/elasticsearch-hadoop-7.7.1.jar]
hive>
> drop table if exists app_es_rs_tbb_building_and_store;
OK
Time taken: 0.508 seconds
hive> CREATE EXTERNAL TABLE `app_es_rs_tbb_building_and_store`(
> `building_id` bigint,
> `building_name` string,
> building_map struct<lat:string,lon:string>,
> `building_label` string,
> `building_scope` string,
> `load_time` string,
> stores_info array<
> struct<
> `store_id`:bigint,
> `store_name`:string,
> `sotre_level`:string
> > >
> )
> COMMENT 'dm层大厦门店数据表'
> STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
> TBLPROPERTIES(
> 'es.nodes' = '10.217.110.123:9200, 10.217.10.131:9200,10.217.19.1352:9200,10.217.10.102:9200,10.217.10.100:9200,10.217.10.138:9200,10.217.9.140:9200',
> 'es.index.auto.create' = 'false',
> 'es.resource' = 'app_es_rs_tbb_building_and_store',
    > 'es.mapping.names'='building_id:building_id,building_name:building_name,building_map:building_map,building_label:building_label,building_scope:building_scope,load_time:load_time,stores_info:stores_info'
> );
OK
Time taken: 0.088 seconds
es.nodes: the IPs and ports of the whole ES cluster; the port defaults to 9200 unless it was changed manually.
es.index.auto.create: because we created the ES index ourselves, this is set to false. If set to true, the index is created automatically and the ES mapping is generated during the insert, but what gets generated automatically is usually not what you want, so false is recommended.
es.resource: the name of the ES index.
es.mapping.names: maps Hive column names (before the colon) to the field names of the ES index (after the colon). If the columns of your Hive mapping table are named exactly like the ES fields, as they are here, this property can be omitted; specifying it does no harm either.
'es.mapping.id' = 'id' is not specified here. If it is specified, a unique Hive column (such as an id field) is used as the _id of the documents in the ES index; if it is not, the document ids are generated automatically.
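As an illustration, a hedged sketch of the extra properties one could append inside the TBLPROPERTIES block above so that building_id becomes the document _id and re-running the load updates existing documents instead of adding new ones (es.write.operation is another elasticsearch-hadoop option; its default is index):
'es.mapping.id'      = 'building_id',
'es.write.operation' = 'upsert'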
Write the data into the ES-mapped Hive table (which also writes it into the ES index)
Not much to explain here: run the insert into the target table in the Hive CLI; one INSERT statement does the job, as follows.
hive> add jar file:///home/liuxiaowei/elasticsearch-hadoop-7.7.1.jar;
Added [file:///home/liuxiaowei/elasticsearch-hadoop-7.7.1.jar] to class path
Added resources: [file:///home/liuxiaowei/elasticsearch-hadoop-7.7.1.jar]
hive> insert OVERWRITE table app_es_rs_tbb_building_and_store
> select
> p.building_id
> ,p.building_name
> ,named_struct('lat',coalesce(p.building_amap_latitude,0.0),'lon',coalesce(p.building_amap_longitude,0.0)) as building_map
> ,p.building_label
> ,p.building_scope
> ,substr(CURRENT_DATE(),1,10) as load_time
> ,collect_set(named_struct
> (
> 'store_id',p.store_id
> ,'store_name',p.store_name
> ,'sotre_level',p.sotre_level
    >     )) as stores_info
> from dw.dm_rs_tbb_building_and_store p
> where p.event_day='20200811'
> group by
> p.building_id
> ,p.building_name
> ,named_struct('lat',coalesce(p.building_amap_latitude,0.0),'lon',coalesce(p.building_amap_longitude,0.0))
> ,p.building_label
> ,p.building_scope
> ,substr(CURRENT_DATE(),1,10)
> ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = liuxiaowei_20200818193521_a6dc7622-0ac9-4a01-b2bb-f21763d1f6b7
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1587046987432_84127, Tracking URL = http://shucang-10.szanba.ren:8088/proxy/application_1587046987432_84127/
Kill Command = /data/tools/hadoop/current//bin/hadoop job -kill job_1587046987432_84127
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2020-08-18 19:35:43,737 Stage-3 map = 0%, reduce = 0%
2020-08-18 19:35:59,787 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 8.32 sec
2020-08-18 19:36:13,317 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 13.12 sec
MapReduce Total cumulative CPU time: 13 seconds 120 msec
Ended Job = job_1587046987432_84127
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 13.12 sec HDFS Read: 18867 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 13 seconds 120 msec
OK
Time taken: 53.297 seconds
Common errors:
- The column count or column types returned by the SELECT do not match the ES-mapped target Hive table. This is easy to understand, since a relational database would complain in the same way. The error looks roughly like this:
FAILED: SemanticException [Error 10044]: Line 1:24 Cannot insert into target table because column number/types are different 'app_es_rs_tbb_building_and_store':
- The selected values fit the Hive target table's column types but violate ES's constraints; for example, NULL latitude/longitude values are not accepted by geo_point. In short, the Hive-to-ES type conversion has to be reasonable. The error looks like this:
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: mapper_parsing_exception: failed to parse field [building_map] of type [geo_point];org.elasticsearch.hadoop.rest.EsHadoopRemoteException: parse_exception: latitude must be a number
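To guard against the second error, a quick pre-load sanity check can be run against the source table. This is only a sketch; it relies on the fact that in Hive, CAST to DOUBLE returns NULL for non-numeric strings, so it counts rows whose coordinates would fail geo_point parsing:
select count(*) as bad_coordinate_rows
from dw.dm_rs_tbb_building_and_store p
where p.event_day = '20200811'
  and ( p.building_amap_latitude  is null
     or p.building_amap_longitude is null
     or cast(p.building_amap_latitude  as double) is null
     or cast(p.building_amap_longitude as double) is null );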
Preview the data in ES
Once everything is debugged, the imported data can be inspected in Kibana, in a browser, or with Postman, as follows.
In Kibana, run the following command, as shown in Figure 4:
GET /app_es_rs_tbb_building_and_store/_search
The result is as follows:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "app_es_rs_tbb_building_and_store",
"_type" : "_doc",
"_id" : "gzxRAXQBhhmjlyEDRJ6r",
"_score" : 1.0,
"_source" : {
"building_id" : 1,
"building_name" : "唐琳大楼",
"building_map" : {
"lat" : "31.176394",
"lon" : "125.519831"
},
"building_label" : "代表私营企业主,普通白领,商务楼",
"building_scope" : "2万平以下",
"load_time" : "2020-08-18",
"stores_info" : [
{
"store_id" : 1,
"store_name" : "若隐_唐琳店",
"sotre_level" : "AAA"
},
{
"store_id" : 2,
"store_name" : "若隐_唐琳五门店",
"sotre_level" : "AAAA"
},
{
"store_id" : 3,
"store_name" : "若隐_唐琳旗舰店",
"sotre_level" : "AAAA"
}
]
}
},
{
"_index" : "app_es_rs_tbb_building_and_store",
"_type" : "_doc",
"_id" : "hDxRAXQBhhmjlyEDRJ6r",
"_score" : 1.0,
"_source" : {
"building_id" : 2,
"building_name" : "国际娱乐城大厦",
"building_map" : {
"lat" : "23.212363",
"lon" : "113.389831"
},
"building_label" : "代表外籍人士及归国华侨,普通白领,娱乐场所",
"building_scope" : "3万平以下",
"load_time" : "2020-08-18",
"stores_info" : [
{
"store_id" : 4,
"store_name" : "若隐_娱乐城店",
"sotre_level" : "AAAAA"
},
{
"store_id" : 5,
"store_name" : "若隐_娱乐城西门店",
"sotre_level" : "AAA"
}
]
}
}
]
}
}
The browser and Postman preview the data with an HTTP GET request to the following URL:
http://10.217.9.140:9200/app_es_rs_tbb_building_and_store/_search
The result is shown in Figure 5.
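Beyond a plain _search, the geo_point and nested fields are what the front end actually queries. As an illustration only (the search term, distance, and center coordinates below are made up for the test data), a nested store-name match combined with a geo_distance filter looks like this in Kibana:
GET /app_es_rs_tbb_building_and_store/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "stores_info",
            "query": { "match": { "stores_info.store_name": "旗舰店" } }
          }
        }
      ],
      "filter": {
        "geo_distance": {
          "distance": "5km",
          "building_map": { "lat": 31.176394, "lon": 125.519831 }
        }
      }
    }
  }
}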
That wraps up importing Hive tables directly into ES. The pros and cons of this approach are:
- Pros: it is simple to operate and connects Hive to ES without any code at all.
- Cons: the data still lives in Hive and the engine underneath is MapReduce, so it cannot be truly real-time. It only works at batch granularity, which suits T+1 or h+1 batch loads, not real-time processing (real-time data would not land in Hive anyway).