Syncing Hive Data to ES

Latest recommended article published on 2024-08-08 13:06:49

子毅168

Category: Big Data. Tags: syncing hive data to es, spark on yarn, syncing data to es with spark, using elasticsearch-hadoop

Article link: https://blog.csdn.net/weixin_42529806/article/details/102593746

Hive2Es

Requirements

Sync the user tag data stored in Hive to ElasticSearch
Create one index per day
Use user_id as the document id
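
The last two requirements can be sketched in plain Java. This is a minimal sketch, assuming the daily index follows the user_info_yyyyMMdd pattern that appears in the verification step at the end of this article. DailyIndex, indexName, and writeOptions are hypothetical names and are not taken from the author's project; es.mapping.id is the elasticsearch-hadoop setting that names the field to use as the document id.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the author's repository.
final class DailyIndex {

    // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd.
    private static final DateTimeFormatter COMPACT = DateTimeFormatter.BASIC_ISO_DATE;

    /** Builds "<prefix>_<yyyyMMdd>" from a Hive partition value such as "2019-10-16". */
    static String indexName(String prefix, String dt) {
        // LocalDate.parse expects ISO-8601 (yyyy-MM-dd) and throws on malformed input.
        LocalDate day = LocalDate.parse(dt);
        return prefix + "_" + day.format(COMPACT);
    }

    /** Write options telling elasticsearch-hadoop to use the user_id field as the document id. */
    static Map<String, String> writeOptions() {
        Map<String, String> cfg = new HashMap<>();
        cfg.put("es.mapping.id", "user_id");
        return cfg;
    }

    public static void main(String[] args) {
        System.out.println(indexName("user_info", "2019-10-16"));
        System.out.println(writeOptions());
    }
}
```

In a Spark job, the index name and the options map would typically be handed to the elasticsearch-hadoop write call (for example JavaEsSparkSQL.saveToEs) together with the Dataset read from Hive.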

Preparation: clusters

Hadoop, Hive, and Yarn clusters (CDH is used)
Spark cluster (CDH is used)
ElasticSearch cluster (deployed separately)

Preparation: data

Hive

Hive table

Create a partitioned Hive table (partitioned by the current date)

hive shell # enter the Hive shell command line

CREATE TABLE IF NOT EXISTS test_db.user_info ( 
user_id INT COMMENT 'user id',
age INT COMMENT 'user age',
name STRING COMMENT 'user name'
) COMMENT 'service info tag table' PARTITIONED BY ( dt string );

Insert two rows

insert into test_db.user_info partition(dt="2019-10-16") values(1,18,'Tom');
insert into test_db.user_info partition(dt="2019-10-16") values(2,20,'Bob');

Check that the rows were inserted

1. Query the data for the latest date
select * from test_db.user_info where dt in (select dt from test_db.user_info order by dt desc limit 1);

Result:
OK
1	18	Tom	2019-10-16
2	20	Bob	2019-10-16
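
When the job already knows which partition it should load, filtering on dt directly avoids the subquery. The following is a minimal sketch under that assumption; PartitionQuery and forDay are hypothetical names, and the resulting string is the kind of statement that would be passed to SparkSession.sql.

```java
import java.time.LocalDate;

// Hypothetical helper for illustration; not part of the author's repository.
final class PartitionQuery {

    /** Builds a query that reads a single dt partition of the given table. */
    static String forDay(String table, String dt) {
        // Parsing validates the value, so only a well-formed yyyy-MM-dd date reaches the SQL text.
        LocalDate day = LocalDate.parse(dt);
        return "SELECT user_id, age, name FROM " + table + " WHERE dt = '" + day + "'";
    }

    public static void main(String[] args) {
        System.out.println(forDay("test_db.user_info", "2019-10-16"));
    }
}
```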

Code

1. Git repository: github.com/yangxifi/spark-study

Notes

SparkSession initialization (local mode is only for debugging the flow locally; remove it when deploying to the server)

import org.apache.spark.sql.SparkSession;

// AppConfig and MyKryoRegistrator are classes from this project.
public class SparkEnvUtil {

    public static SparkSession getSparkEnvWithHiveAndEsSup(AppConfig appConfig) {
        SparkSession spark = SparkSession.builder()
                // Comment this line out once local testing passes; otherwise startup on the server
                // fails with java.lang.IllegalStateException: User did not initialize spark context!
//                .master("local[*]")
                .appName("hive2esDemo")
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .config("spark.kryo.registrator", MyKryoRegistrator.class.getName())
                .config("hive.metastore.uris", appConfig.getHivemetastoreUris()) // address of the Hive metastore
                .config("spark.sql.warehouse.dir", appConfig.getHiveWarehouseDir()) // Hive warehouse directory
                .config("es.nodes", appConfig.getEsNodes()) // ES nodes
                .config("es.index.auto.create", "true") // let ES create the index automatically
                .enableHiveSupport()
                .getOrCreate();
        return spark;
    }

}
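
One way to avoid editing the source between local runs and cluster runs is to set the master only when a flag is present. The sketch below assumes a hypothetical JVM system property named hive2es.local; MasterSelector and localMaster are illustrative names, not part of the author's project. On the cluster the flag is absent, so the master is left to spark-submit.

```java
import java.util.Optional;

// Hypothetical helper for illustration; not part of the author's repository.
final class MasterSelector {

    /** Returns "local[*]" only when the flag is "true" (case-insensitive); otherwise empty. */
    static Optional<String> localMaster(String flag) {
        return "true".equalsIgnoreCase(flag) ? Optional.of("local[*]") : Optional.empty();
    }

    public static void main(String[] args) {
        // The property name hive2es.local is an assumption made for this sketch.
        Optional<String> master = localMaster(System.getProperty("hive2es.local"));

        // With Spark on the classpath, the builder could be configured like this:
        //   SparkSession.Builder builder = SparkSession.builder().appName("hive2esDemo");
        //   master.ifPresent(builder::master);
        System.out.println(master.map(m -> "master=" + m).orElse("master left to spark-submit"));
    }
}
```

A local run would then be started with -Dhive2es.local=true, while the spark-submit script stays unchanged.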

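Packaging (be sure to exclude the Spark and Hadoop jars)
The server already has these jars, so exclude them to avoid conflicts; the resulting jar is also much smaller
Only the jars specific to this project (those not available on the server) need to be packaged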

            <!-- Maven packaging plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <artifactSet>
                                <!-- Exclude these artifacts to avoid conflicts with the jars on the server -->
                                <excludes>
                                    <exclude>org.apache.hadoop:*</exclude>
                                    <exclude>org.apache.spark:*</exclude>
                                    <exclude>log4j:log4j:jar:</exclude>
                                    <exclude>org.slf4j:*</exclude>
                                </excludes>
                                <!-- Package the artifacts specific to this project (those not on the server) -->
                                <includes>
                                    <include>org.elasticsearch:*:*</include>
                                    <include>org.projectlombok:*:*</include>
                                    <include>commons-httpclient:*:*</include>
                                </includes>
                            </artifactSet>
                        </configuration>
                    </execution>
                </executions>
            </plugin>

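Server deployment

Build with mvn clean package and copy the jar to the /home/hive2EsDemo directory on the server
Start it in yarn-cluster mode

vim start-yarn-cluster.sh
The script contents are as follows: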
#!/bin/sh
/usr/bin/spark-submit --class com.spark.study.EsLoader \
--master yarn \
--deploy-mode cluster \
/home/hive2EsDemo/Hive2Es-Demo.jar

sh start-yarn-cluster.sh
After it runs successfully

Command to list the ES indices: curl -X GET "devenv-bigdata-datanode1:9200/_cat/indices?v"
Returns
Command to view one of the documents: curl -X GET "devenv-bigdata-datanode1:9200/user_info_20191016/_doc/1?pretty"
Returns
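
The same two checks can also be run from Java. This is a minimal sketch using only the JDK's HttpURLConnection, assuming the ES node is reachable over plain HTTP without authentication at the host used in the curl commands. EsCheck, docUrl, and get are hypothetical names, not part of the author's project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of the author's repository.
final class EsCheck {

    /** Builds the URL of a single document, mirroring the second curl command. */
    static String docUrl(String hostPort, String index, String id) {
        return "http://" + hostPort + "/" + index + "/_doc/" + id + "?pretty";
    }

    /** Performs a GET request and returns the response body as text. */
    static String get(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining("\n"));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String host = "devenv-bigdata-datanode1:9200";
        // List the indices, then fetch the document whose id is user_id 1.
        System.out.println(get("http://" + host + "/_cat/indices?v"));
        System.out.println(get(docUrl(host, "user_info_20191016", "1")));
    }
}
```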