Syncing Hive Data to ES

0一只小菜鸡0

Published 2023-11-14 15:31:08

Tags: hive elasticsearch hadoop big data

Article link: https://blog.csdn.net/m0_51066566/article/details/134397190

1. First, Hadoop, Hive, and ES must already be installed. These are the versions I installed:

Hadoop 3.1.3
hive-3.1.2
elasticsearch-7.10.0

2. Download the ES-Hadoop component

Download the release that matches your Elasticsearch version. Since I run 7.10.0, I downloaded elasticsearch-hadoop-7.10.0.zip.

Then extract it into a separate directory:

unzip elasticsearch-hadoop-7.10.0.zip -d /opt/module/es_hdp/

If unzip is not installed, install it first:

sudo yum install -y unzip zip

After extracting, go into the directory and list the files. You will find a jar named elasticsearch-hadoop-7.10.0.jar.
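Since the connector version has to line up with the Elasticsearch version, a quick check on the jar name can catch a mismatch early. The following is a minimal sketch; the helper names `jar_version` and `versions_match` are hypothetical (they are not part of ES-Hadoop) and only parse the file name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the version from an es-hadoop jar file name,
# e.g. elasticsearch-hadoop-7.10.0.jar -> 7.10.0
jar_version() {
  local name
  name=$(basename "$1")
  name=${name#elasticsearch-hadoop-}   # drop the artifact prefix
  echo "${name%.jar}"                  # drop the .jar suffix
}

# Hypothetical helper: succeed only when the jar version equals the ES version.
versions_match() {
  [ "$(jar_version "$1")" = "$2" ]
}
```

For example, `versions_match elasticsearch-hadoop-7.10.0.jar 7.10.0` succeeds, while passing 7.10.2 as the second argument fails.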

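At this point I noticed that some people put the jar on HDFS and others put it in Hive's lib directory. Either location works; I simply copied it into Hive's lib directory: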

cp elasticsearch-hadoop-7.10.0.jar /opt/module/hive/lib/

Then add the jar in Hive:

add jar /opt/module/hive/lib/elasticsearch-hadoop-7.10.0.jar;

With the dependency added, the next step is to create test data.

Create any table in Hive and insert a few rows of test data:

CREATE TABLE `student`(
  `id` int, 
  `name` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hadoop102:9820/user/hive/warehouse/student'


hive (default)> select * from student;
OK
student.id	student.name
1	zhangsan
2	lisi
3	lisi
4	lisi
5	wangwu

Next, create the Hive-to-ES mapping table:

CREATE EXTERNAL TABLE `hive2es`(
  `id` string COMMENT 'from deserializer', 
  `col1` string COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.elasticsearch.hadoop.hive.EsSerDe' 
STORED BY 
  'org.elasticsearch.hadoop.hive.EsStorageHandler' 
WITH SERDEPROPERTIES ( 
  'serialization.format'='1')
LOCATION
  'hdfs://hadoop102:9820/user/hive/warehouse/hive2es'
TBLPROPERTIES (
  'bucketing_version'='2', 
  'es.batch.write.retry.count'='6', 
  'es.batch.write.retry.wait'='60s', 
  'es.index.auto.create'='TRUE', 
  'es.index.number_of_replicas'='0', 
  'es.index.refresh_interval'='-1', 
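  'es.mapping.names'='id:id,col1:col1', 
  'es.nodes'='hadoop102:9200,hadoop103:9200,hadoop104:9200', 
  'es.resource'='hivemappinges/_doc', 
  'last_modified_by'='atguigu', 
  'last_modified_time'='1699933995', 
  'transient_lastDdlTime'='1699933995')

Parameter	Value	Description
bucketing_version	2
es.batch.write.retry.count	6
es.batch.write.retry.wait	60s
es.index.auto.create	TRUE	Whether a missing index is created automatically when data is written to the Elasticsearch cluster through the Hadoop component. true: create it; false: do not create it.
es.index.number_of_replicas	0
es.index.refresh_interval	-1	Refresh interval. -1 disables refresh, which suits migrations with a large data volume; set a real interval again once the migration is finished.
es.mapping.names	id:id,col1:col1	Field mapping between the Hive table and the ES index, written as hive_column:es_field pairs.
es.nodes	hadoop102:9200,hadoop103:9200,hadoop104:9200	Addresses of the Elasticsearch instances. An internal network address is recommended.
es.nodes.wan.only	TRUE	For an Elasticsearch cluster reached through a virtual IP in the cloud. true: connect only through the addresses declared in es.nodes, without node sniffing; false: do not use this mode.
es.resource	hivemappinges/_doc	Name of the target index in the ES cluster.
es.nodes.discovery	TRUE	Whether to discover the other nodes in the cluster. true: discover them; false: use only the nodes listed in es.nodes.
es.input.use.sliced.partitions	TRUE	Whether to use sliced partitions. true: use them; this can make the index pre-read phase noticeably longer, sometimes far longer than the query itself, so false is recommended for better query efficiency. false: do not use them.
es.read.metadata	FALSE	Enable this property when working with Elasticsearch internal fields such as _id.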
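If the index was loaded with refresh disabled (-1), the interval has to be restored once the migration is done. The following is a minimal sketch, assuming the index name hivemappinges and the host hadoop102:9200 used in this article, and an assumed target interval of 1s (the Elasticsearch default). `refresh_settings_body` is a hypothetical helper that only builds the request body for the index `_settings` API.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the JSON body that sets index.refresh_interval.
refresh_settings_body() {
  printf '{"index":{"refresh_interval":"%s"}}\n' "$1"
}

# Usage (not executed here; assumes ES is reachable at hadoop102:9200):
#   curl -X PUT "http://hadoop102:9200/hivemappinges/_settings" \
#        -H 'Content-Type: application/json' \
#        -d "$(refresh_settings_body 1s)"
```

Keeping the body in a small function makes it easy to reuse the same call with a different interval for other indices.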

Then create the corresponding index in ES:

PUT hivemappinges
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "col1": {
        "type": "keyword"
      }
    }
  }
}

Then run the test:

insert overwrite table hive2es select id,name from student;

It failed with an error saying that a class in the org.apache.commons.httpclient package could not be found:

Error: java.lang.ClassNotFoundException: org.apache.commons.httpclient.Credentials
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
...

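So the jar that provides this class has to be downloaded.

Paste the class name into the search box and search, open the matching result, and find its GAV (groupId, artifactId, version) coordinates.

Using those Maven coordinates, look up the jar in the Aliyun mirror repository service and download it.

Once the jar is downloaded, copy it into Hive's lib directory just like the es-hadoop jar, then add it in Hive:

add jar /opt/module/hive/lib/commons-httpclient-3.1.jar;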

Then run the test again:

insert overwrite table hive2es select id,name from student;

The MapReduce job succeeds.

Checking ES shows that the data has been synced.
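To confirm the sync from the command line, the document count of the index can be compared with the five rows in student. The following is a minimal sketch, assuming the `_count` API returns its usual `{"count":N,...}` response; `extract_count` is a hypothetical helper that pulls the number out of that response.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a _count response on stdin and print the count.
extract_count() {
  sed -n 's/.*"count":\([0-9][0-9]*\).*/\1/p'
}

# Usage (not executed here; assumes ES is reachable at hadoop102:9200):
#   curl -s "http://hadoop102:9200/hivemappinges/_count" | extract_count
```

If refresh is disabled on the index, the count may not reflect the new documents until the index has been refreshed.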

The two jars added to Hive here are:

elasticsearch-hadoop-7.10.0.jar

commons-httpclient-3.1.jar

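We added them with add jar ..., which only applies to the current session: after Hive is closed and restarted, the jars have to be added again. To avoid repeating this, they can be registered permanently.

Edit Hive's configuration file hive-site.xml and add the hive.aux.jars.path property. The general form comes first, followed by the entry used for this setup: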

<property>
<name>hive.aux.jars.path</name>
<value>file:///jarpath/all_new1.jar,file:///jarpath/all_new2.jar</value>
</property>

<!-- Dependencies required for syncing Hive to ES -->
<property>
    <name>hive.aux.jars.path</name>
    <value>file:///opt/module/hive/lib/commons-httpclient-3.1.jar,file:///opt/module/hive/lib/elasticsearch-hadoop-7.10.0.jar</value>
</property>
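The value of hive.aux.jars.path is a comma-separated list of file:// URIs, which is easy to mistype by hand. The following is a minimal sketch of building it from absolute jar paths; `aux_jars_value` is a hypothetical helper, not a Hive command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join absolute jar paths into a hive.aux.jars.path value.
aux_jars_value() {
  local out="" jar
  for jar in "$@"; do
    # Prefix each path with file:// and separate entries with commas.
    out="${out:+$out,}file://$jar"
  done
  echo "$out"
}

# Usage:
#   aux_jars_value /opt/module/hive/lib/commons-httpclient-3.1.jar \
#                  /opt/module/hive/lib/elasticsearch-hadoop-7.10.0.jar
```

The printed string can be pasted directly into the <value> element of the property.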

Restart Hive and test again.

The sync succeeds!