Hands-On Project: Writing Hive Table Data into ElasticSearch with Spark (Java Version)


Table of Contents

  1. Hands-On Project: Importing Hive Table Data Directly into ElasticSearch
      This article needs no code at all; it is quick and crude, but relatively inflexible. It runs on the MapReduce computing framework underneath, so the import is relatively slow.

  2. Hands-On Project: Writing Hive Table Data into ElasticSearch with Spark (Java Version)
      This article needs Java code. It is a one-stop Java solution: read the Hive data with Spark, create the ES index, load the data, and use the ES alias mechanism so the data update is seamless. It runs on the Spark computing framework underneath, so the import is much faster than the approach of article 1.

  3. Hands-On Project: Validating Data Quality Between ElasticSearch and the Hive Data Warehouse with DingTalk Alerts (Java Version)
      This article picks key metrics and checks that the data in the source Hive tables and in the target ES indexes are consistent.

  4. Hands-On Project: Writing Hive Table Data with Spark into an ElasticSearch That Requires Username/Password Authentication (Java Version)
      This article explains how to use Spark to write Hive data into an ElasticSearch cluster protected by account/password authentication.

  5. Hands-On Project (Production Deployment): Parameterized Spark Loading of Hive Table Data into an ElasticSearch That Requires Username/Password Authentication (Java Version)
      This article also writes Hive data into an account/password protected ElasticSearch with Spark, but the Spark and ES index-creation parameters are externalized into configuration: syncing one more table to ES only requires adding one XML configuration file. This is the Java code the author runs in production, and it addresses the shortcomings of approach 4 that many readers pointed out.

  Overview:
  1. If you feel your coding skills are limited but still want to import Hive data into ElasticSearch, consider article 1.
  2. If you can code, I suggest combining articles 2 and 3 (the author's recommendation) as the architecture for loading offline or near-line data from the Hive data warehouse into ElasticSearch. The Java code shared here is the author's earliest version 1; it aims to be easy to understand and to get the job done, so feel free to rework it, and please do not complain that the code is rough.
  3. If your ElasticSearch requires account/password authentication, for example a cloud product or a cluster you secured yourself, then article 4 is your only option.
  4. For production deployment, see article 5.

  • My Hive version: 2.3.5
  • My ES version: 7.7.1
  • My Spark version: 2.3.3

Project Tree

  The overall project tree is shown in Figure 1. IDE: IntelliJ IDEA 2019.3 x64, built with Maven.
  Project link: Hands-On Project: Writing Hive Table Data into ElasticSearch with Spark (Java Version)

  • feign: Java classes for connecting the ES and Spark clients;
  • utils: Java classes for operating on ES and Spark;
  • resources: log configuration files;
  • pom.xml: Maven configuration file;

Figure 1: Project tree

Maven Configuration File pom.xml

  The Maven dependencies used by this project are declared in pom.xml, shown below.

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.example</groupId>
  <artifactId>SparkOnHiveToEs_v1</artifactId>
  <version>1.0-SNAPSHOT</version>

  <name>SparkOnHiveToEs_v1</name>
  <!-- FIXME change it to the project's website -->
  <url>http://www.example.com</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch -->
    <!-- Core Elasticsearch dependency -->
    <dependency>
      <groupId>org.elasticsearch</groupId>
      <artifactId>elasticsearch</artifactId>
      <version>7.7.1</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.elasticsearch.client/elasticsearch-rest-high-level-client -->
    <!-- ES high-level REST client API, used to build the ES client and perform operations -->
    <dependency>
      <groupId>org.elasticsearch.client</groupId>
      <artifactId>elasticsearch-rest-high-level-client</artifactId>
      <version>7.7.1</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/junit/junit -->
    <!-- JUnit, used for tests -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>


    <!-- https://mvnrepository.com/artifact/org.projectlombok/lombok -->
    <!-- Lombok, used to auto-generate constructors, getters, setters, etc. -->
    <dependency>
      <groupId>org.projectlombok</groupId>
      <artifactId>lombok</artifactId>
      <version>1.18.12</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.testng</groupId>
      <artifactId>testng</artifactId>
      <version>RELEASE</version>
      <scope>compile</scope>
    </dependency>

    <!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-databind -->
    <!-- Jackson, used to build JSON -->
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.11.0</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-spark-20 -->
    <dependency>
      <groupId>org.elasticsearch</groupId>
      <artifactId>elasticsearch-spark-20_2.11</artifactId>
      <version>7.7.1</version>
    </dependency>


    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.3.3</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.3.3</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.11</artifactId>
      <version>2.3.3</version>
      <scope>compile</scope>
    </dependency>

    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>compile</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-core</artifactId>
      <version>2.9.1</version>
    </dependency>

    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-api</artifactId>
      <version>2.9.1</version>
    </dependency>
  </dependencies>


  <build>
  <plugins>
    <!-- When a Maven project contains both Java and Scala code, configure maven-scala-plugin so both are compiled and packaged together -->
    <plugin>
      <groupId>org.scala-tools</groupId>
      <artifactId>maven-scala-plugin</artifactId>
      <version>2.15.2</version>
      <executions>
        <execution>
          <goals>
            <goal>compile</goal>
            <goal>testCompile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>

    <!-- Plugin required for building the jar with Maven -->
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>2.4</version>
      <configuration>
        <!-- Setting this to false removes the "-jar-with-dependencies" suffix from MySpark-1.0-SNAPSHOT-jar-with-dependencies.jar -->
        <!--<appendAssemblyId>false</appendAssemblyId>-->
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
        <archive>
          <manifest>
            <mainClass>cn.focusmedia.esapp.App</mainClass>
          </manifest>
        </archive>
      </configuration>
      <executions>
        <execution>

          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>assembly</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
</project>

Log Configuration Files

  This job will ultimately be submitted through spark-submit, so the key information should go through the logging framework via log.info("key information") rather than being printed to the console with System.out.println. Two log configuration files are therefore provided; tune them to your own needs, for example choosing which log levels to emit and which to suppress. Their contents are as follows.
  log4j.properties:

log4j.rootLogger=INFO, stdout, R
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%5p - %m%n
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.File=firestorm.log
log4j.appender.R.MaxFileSize=100KB
log4j.appender.R.MaxBackupIndex=1
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%p %t %c - %m%n
log4j.logger.com.codefutures=INFO

  log4j2.xml:

<?xml version="1.0" encoding="UTF-8"?>

<Configuration status="warn">
    <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
            <PatternLayout pattern="%m%n" />
        </Console>
    </Appenders>
    <Loggers>
        <Root level="INFO">
            <AppenderRef ref="Console" />
        </Root>
    </Loggers>
</Configuration>

Spark Client

  To read Hive tables through Spark, you first need a client that connects to Spark; see SparkClient.java below.

package cn.focusmedia.esapp.feign;

import org.apache.spark.sql.SparkSession;

public class SparkClient
{
    public static SparkSession getSpark()
    {
        SparkSession spark=SparkSession.builder().appName("SparkToES").enableHiveSupport().getOrCreate();
        return spark;
    }

}

ElasticSearch Client

  To operate on ES, you first need a client that connects to ES; see EsClient.java below.
   Note: in a real project the ES cluster information should live in a configuration file and be read from there, so a cluster change only requires editing the configuration file. It is hard-coded here purely to keep the example easy to read and understand (a minimal sketch of externalizing it appears after the code below).

package cn.focusmedia.esapp.feign;

import org.apache.http.HttpHost;
import lombok.extern.slf4j.Slf4j;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;

@Slf4j
public class EsClient
{
    public static RestHighLevelClient getClient()
    {
        //Configure the cluster IPs and ports. A real project should read these from a configuration file; they are hard-coded here to keep the example simple and easy to follow.
        //Create the HttpHost objects
        HttpHost[] myHttpHost = new HttpHost[7];
        myHttpHost[0]=new HttpHost("10.121.10.1",9200);
        myHttpHost[1]=new HttpHost("10.121.10.2",9200);
        myHttpHost[2]=new HttpHost("10.121.10.3",9200);
        myHttpHost[3]=new HttpHost("10.121.10.4",9200);
        myHttpHost[4]=new HttpHost("10.121.10.5",9200);
        myHttpHost[5]=new HttpHost("10.121.10.6",9200);
        myHttpHost[6]=new HttpHost("10.121.10.7",9200);

        //Create the RestClientBuilder object
        RestClientBuilder myRestClientBuilder=RestClient.builder(myHttpHost);

        //Create the RestHighLevelClient object
        RestHighLevelClient myclient=new RestHighLevelClient(myRestClientBuilder);

        log.info("RestClientUtil intfo create rest high level client successful!");

        return myclient;
    }
}
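
  As noted above, in a production deployment the ES node list should not be hard-coded. The following is only a minimal sketch of how the host list could be read from a properties file on the classpath instead; the file name es-cluster.properties and the key es.hosts are assumptions made for illustration, not part of the original project.

package cn.focusmedia.esapp.feign;

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class EsClientFromConfig
{
    //Hypothetical properties file on the classpath, e.g.:
    //es.hosts=10.121.10.1:9200,10.121.10.2:9200,10.121.10.3:9200
    private static final String CONFIG_FILE = "es-cluster.properties";

    public static RestHighLevelClient getClient() throws IOException
    {
        //Load the host list from the configuration file
        Properties props = new Properties();
        try (InputStream in = EsClientFromConfig.class.getClassLoader().getResourceAsStream(CONFIG_FILE))
        {
            props.load(in);
        }

        //Split "host:port" pairs into HttpHost objects
        String[] nodes = props.getProperty("es.hosts").split(",");
        HttpHost[] hosts = new HttpHost[nodes.length];
        for (int i = 0; i < nodes.length; i++)
        {
            String[] parts = nodes[i].trim().split(":");
            hosts[i] = new HttpHost(parts[0], Integer.parseInt(parts[1]));
        }

        return new RestHighLevelClient(RestClient.builder(hosts));
    }
}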

Utility Class: Writing Hive Table Data into ElasticSearch with Spark

  The utility methods that write Hive table data into ElasticSearch all live in utils/EsUtils.java. I was a bit lazy and put every method into this one file; feel free to split it up as you see fit. The contents are as follows.

package cn.focusmedia.esapp.utils;
import cn.focusmedia.esapp.feign.EsClient;
import cn.focusmedia.esapp.feign.SparkClient;
import lombok.extern.slf4j.Slf4j;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.elasticsearch.action.admin.indices.alias.IndicesAliasesRequest;
import org.elasticsearch.action.admin.indices.alias.get.GetAliasesRequest;
import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
import org.elasticsearch.action.admin.indices.flush.FlushRequest;
import org.elasticsearch.action.admin.indices.flush.FlushResponse;
import org.elasticsearch.action.support.master.AcknowledgedResponse;
import org.elasticsearch.client.GetAliasesResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.CreateIndexResponse;
import org.elasticsearch.client.indices.DeleteAliasRequest;
import org.elasticsearch.client.indices.GetIndexRequest;
import org.elasticsearch.cluster.metadata.AliasMetaData;
import org.elasticsearch.common.collect.ImmutableOpenMap;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.spark.sql.EsSparkSQL;
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;
import org.junit.Test;
import java.io.IOException;
import java.util.List;

import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
import org.spark_project.guava.collect.ImmutableMap;
import scala.collection.mutable.Map;


@Slf4j
public class EsUtils
{
    static RestHighLevelClient myClient= EsClient.getClient();  //Get the client used to operate on ES

    //Check whether an ES index exists
    @Test
    public static boolean exsitsIndex(String index) throws IOException
    {
        //Prepare the request object
        GetIndexRequest myrequest=new GetIndexRequest(index);
        //Execute through the client
        boolean myresult = myClient.indices().exists(myrequest, RequestOptions.DEFAULT);
        //Log the result
        log.info("The index:"+index+" is exist? :"+myresult);
        return myresult;
    }

    //Create an ES index
    @Test
    public static CreateIndexResponse creatIndex(String index,String index_mapping) throws IOException
    {
        log.info("The  index name will be created : "+index);

        //Wrap the prepared settings and mappings into a request object
        CreateIndexRequest myrequest = new CreateIndexRequest(index).source(index_mapping, XContentType.JSON);

        //Create the index through the client
        CreateIndexResponse myCreateIndexResponse=myClient.indices().create(myrequest, RequestOptions.DEFAULT);

        //Log the result
        log.info("The index : "+index+" was created response is "+ myCreateIndexResponse.isAcknowledged());

        return myCreateIndexResponse;
    }

    //Delete an ES index
    @Test
    public static AcknowledgedResponse deleteIndex(String index) throws IOException {
        //Prepare the request object
        DeleteIndexRequest myDeleteIndexRequest = new DeleteIndexRequest();
        myDeleteIndexRequest.indices(index);

        //Execute through the client
        AcknowledgedResponse myAcknowledgedResponse = myClient.indices().delete(myDeleteIndexRequest,RequestOptions.DEFAULT);

        //Log the response
        log.info("The index :"+index+" delete response is "+myAcknowledgedResponse.isAcknowledged());
        return  myAcknowledgedResponse;
        //System.out.println(myAcknowledgedResponse.isAcknowledged());
    }

    //Write the data into ES
    public static void tableToEs(String index,String index_auto_create,String es_mapping_id,String table_name,String es_nodes)
    {
        SparkSession spark = SparkClient.getSpark();
        Dataset<Row> table = spark.table(table_name).repartition(60);
        JavaEsSparkSQL.saveToEs(table,index, ImmutableMap.of("es.index.auto.create", index_auto_create,"es.resource", index, "es.mapping.id" ,es_mapping_id,"es.nodes" ,es_nodes));
        log.info("Spark data from hive to ES index: "+index+" is over,go to alias index! ");
        spark.stop();
    }

    //Flush the newly written data of the ES index
    public static void flushIndex(String index) throws IOException
    {
        FlushRequest myFlushRequest =new FlushRequest(index);
        FlushResponse myFlushResponse=myClient.indices().flush(myFlushRequest,RequestOptions.DEFAULT);
        int totalShards =myFlushResponse.getTotalShards();
        log.info("index: "+index+" has"+ totalShards +"flush over! ");
    }

    //ES alias operations, for a seamless switch
    //Get the index name behind an ES alias
    public static String getAlias(String alias) throws IOException
    {
        GetAliasesRequest requestWithAlias = new GetAliasesRequest(alias);
        GetAliasesResponse response = myClient.indices().getAlias(requestWithAlias, RequestOptions.DEFAULT);
        String AliasesString = response.getAliases().toString();
        //Note: an initial version of the data must already exist with the alias media_all_rs_v0 assigned, otherwise the substring extraction below will fail
       //Kibana operation:
       //POST _aliases
       //{
       //     "actions" : [{"add" : {"index" : "index_name" , "alias" : "alias_name"}}]
       //}
       
        String alias_index_name = AliasesString.substring(AliasesString.indexOf("{") + 1, AliasesString.indexOf("="));
        return alias_index_name;
    }

    //Update the ES alias to point to the new index
    public static void indexUpdateAlias(String index,String index_alias) throws IOException
    {
        String old_index_name=EsUtils.getAlias(index_alias);
        log.info(index_alias+ " old index is "+old_index_name);

        //Remove the alias from the old index it currently points to
        DeleteAliasRequest myDeleteAliasRequest = new DeleteAliasRequest(old_index_name, index_alias);
        org.elasticsearch.client.core.AcknowledgedResponse myDeleteResponse=myClient.indices().deleteAlias(myDeleteAliasRequest, RequestOptions.DEFAULT);
        boolean deletealisaacknowledged = myDeleteResponse.isAcknowledged();
        log.info("delete index successfully? " + deletealisaacknowledged);

        //Point the alias at the new index
        IndicesAliasesRequest request = new IndicesAliasesRequest();
        IndicesAliasesRequest.AliasActions aliasAction = new IndicesAliasesRequest.AliasActions(IndicesAliasesRequest.AliasActions.Type.ADD).index(index).alias(index_alias);
        request.addAliasAction(aliasAction);
        org.elasticsearch.action.support.master.AcknowledgedResponse indicesAliasesResponse = myClient.indices().updateAliases(request, RequestOptions.DEFAULT);
        boolean createaliasacknowledged = indicesAliasesResponse.isAcknowledged();
        log.info("create index successfully? "+createaliasacknowledged);

        String now_index=EsUtils.getAlias(index_alias);
        log.info(index_alias+ " now index is "+now_index);

        if(now_index.equals(index))
        {
            log.info("index: "+index+ " alias update successfully!");
        }

    }

}

Main Function: Calling the Utility Class to Implement the Whole Flow

  The main function performs the following steps, in order:

  1. Create the index
  2. Import the data with Spark
  3. Flush the new index
  4. Get the index name currently behind the alias (this index is about to be retired)
  5. Switch the alias to the index holding the latest data
  6. After confirming the alias switch succeeded, delete the old index

  The concrete code is in App.java below:

   Note: for convenience I have defined the required variables directly inside the main class. In a real project it is better to pull these variables out into an XML file used as configuration and read that file in the main function; then extracting a new Hive table into ES only requires editing the corresponding XML configuration, which is very convenient. The focus here is on implementing the functionality in a readable way; the polish is up to you (a minimal sketch of reading such a configuration appears right after the code below).

package cn.focusmedia.esapp;

import cn.focusmedia.esapp.feign.EsClient;
import cn.focusmedia.esapp.utils.EsUtils;
import org.elasticsearch.client.RestHighLevelClient;

import java.io.IOException;

/**
 * Entry point: create the ES index, load the Hive data with Spark, and switch the alias.
 */

public class App 
{
    //Declare the name of the new ES index
    private static String index=null;
    static
    {
        index = "media_all_rs_v" + System.currentTimeMillis();
    }

    //ES alias
    static String index_alias="media_all_rs_v0";

    //Whether to let ES auto-create the index from the Hive table structure; usually false to avoid a distorted schema, since a well-formed index can be created from an explicit mapping instead
    static String index_auto_create="false";

    //Field used as the ES document id
    static String es_mapping_id ="entrance_key";

    //Hive table to read
    static String table_name="dw.app_rs_media_galaxy_entrance_key";

    //ES cluster node list
    static String es_nodes="10.121.10.1:9200,10.121.10.2:9200,10.121.10.3:9200,10.121.10.4:9200,10.121.10.5:9200,10.121.10.6:9200,10.121.10.7:9200";

    //Settings and mapping of the ES index
    static String index_mapping="\n" +
            "{\n" +
            "  \"settings\": \n" +
            "  {\n" +
            "      \"number_of_replicas\": 3\n" +
            "    , \"number_of_shards\": 1\n" +
            "    ,\"max_result_window\" : 1000000\n" +
            "  }\n" +
            "  \n" +
            "  , \"mappings\": \n" +
            "  {\n" +
            "    \"properties\" : \n" +
            "    {\n" +
            "     \"amap_province_code\":\n" +
            "    {\n" +
            "      \"type\": \"keyword\"\n" +
            "    }\n" +
            "    ,\"amap_province_name\":\n" +
            "    {\n" +
            "      \"type\":\"keyword\"\n" +
            "    }\n" +
            "    ,\"amap_city_code\":\n" +
            "    {\n" +
            "     \"type\": \"keyword\"\n" +
            "    }                 \n" +
            "    ,\"amap_city_name\":\n" +
            "    {\n" +
            "      \"type\":\"keyword\"\n" +
            "    }\n" +
            "    ,\"amap_district_code\":\n" +
            "    {\n" +
            "      \"type\": \"keyword\"\n" +
            "    }\n" +
            "    ,\"amap_district_name\" :\n" +
            "    {\n" +
            "      \"type\":\"keyword\"\n" +
            "    }\n" +
            "    ,\"group_name\":\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    }\n" +
            "    ,\"building_amap_address\":\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    }\n" +
            "     ,\"building_map\":\n" +
            "     {\n" +
            "       \"type\": \"geo_point\"\n" +
            "     } \t\n" +
            "\t   ,\"reside_rate\":   \n" +
            "    {\n" +
            "      \"type\": \"double\"\n" +
            "    }\n" +
            "    ,\"rooms\":\n" +
            "    {\n" +
            "      \"type\": \"long\"\n" +
            "    }\n" +
            "    ,\"reside_rooms\" :\n" +
            "    {\n" +
            "      \"type\": \"long\"\n" +
            "    }\n" +
            "    ,\"floor_num_min\" :\n" +
            "    {\n" +
            "      \"type\": \"long\"\n" +
            "    }                 \n" +
            "    ,\"floor_num_max\" :\n" +
            "    {\n" +
            "      \"type\": \"long\"\n" +
            "    }               \n" +
            "    ,\"building_scope_code\":\n" +
            "    {\n" +
            "       \"type\": \"keyword\"\n" +
            "    }\n" +
            "    ,\"building_scope_name\":\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    } \n" +
            "    ,\"audiences_code\" :\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    }                \n" +
            "    ,\"audiences_name\" :\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    }\n" +
            "    ,\"building_level_code\":\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    }            \n" +
            "    ,\"building_level_name\":\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    }        \n" +
            "    ,\"hire_price\" :   \n" +
            "    {\n" +
            "      \"type\": \"double\"\n" +
            "    }                     \n" +
            "    ,\"sell_price\":   \n" +
            "    {\n" +
            "      \"type\": \"double\"\n" +
            "    }                     \n" +
            "    ,\"open_quotation\":\n" +
            "    {\n" +
            "      \"type\": \"date\"\n" +
            "       ,  \"format\": \"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis\"\n" +
            "    }\n" +
            "\n" +
            "    ,\"building_type_1_code\":\n" +
            "    {\n" +
            "       \"type\": \"long\"  \n" +
            "    }     \n" +
            "    ,\"building_type_1_name\":\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    }                \n" +
            "    ,\"reside_time\":\n" +
            "    {\n" +
            "      \"type\": \"date\"\n" +
            "       ,\"format\": \"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis\"\n" +
            "    }\n" +
            "    ,\"under_carport_count\":\n" +
            "    {\n" +
            "      \"type\": \"long\"\n" +
            "    }\n" +
            "    ,\"over_carport_count\" :\n" +
            "    {\n" +
            "      \"type\": \"long\"\n" +
            "    }            \n" +
            "    ,\"surrounding_key\" :\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    }                               \n" +
            "    ,\"surroundings_code\" :\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    }                            \n" +
            "    ,\"surroundings\"  :\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    } \n" +
            "    ,\"building_info_old_frame_taboos_code\"  :\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    }  \n" +
            "    ,\"building_info_old_frame_taboos_name\"  :\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    }  \t\n" +
            "\t \n" +
            "      ,\"zone_key\" : \n" +
            "      {\n" +
            "         \"type\": \"keyword\"\n" +
            "      }         \n" +
            "      ,\"sub_building_id\":\n" +
            "      {\n" +
            "        \"type\": \"long\"\n" +
            "      }              \n" +
            "      ,\"sub_building_name\":\n" +
            "      {\n" +
            "         \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "      }\n" +
            "      ,\"emplacement_key\" : \n" +
            "      {\n" +
            "         \"type\": \"keyword\"\n" +
            "      }                                                                           \n" +
            "    ,\"site_id\" :\n" +
            "    {\n" +
            "      \"type\": \"long\"\n" +
            "    }                                   \n" +
            "    ,\"site_name\"  :\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    }                             \n" +
            "    ,\"entrance_key\"   :\n" +
            "    {\n" +
            "       \"type\": \"keyword\"\n" +
            "    }                     \n" +
            "    ,\"unit_id\"  :\n" +
            "    {\n" +
            "      \"type\": \"long\"\n" +
            "    }                                  \n" +
            "    ,\"unit_name\"   :\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    }                            \n" +
            "    ,\"install_name_id\"   :\n" +
            "    {\n" +
            "      \"type\": \"long\"\n" +
            "    }                         \n" +
            "    ,\"install_name\"   :\n" +
            "    {\n" +
            "       \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "    }                         \n" +
            "    ,\"lcd_amount\" :\n" +
            "    {\n" +
            "      \"type\": \"integer\"\n" +
            "    }            \n" +
            "    ,\"smart_amount\":\n" +
            "    {\n" +
            "      \"type\": \"integer\"\n" +
            "    }                               \n" +
            "    ,\"kids_amount\" :\n" +
            "    {\n" +
            "      \"type\": \"integer\"\n" +
            "    }                               \n" +
            "    ,\"juran_amount\"   :\n" +
            "    {\n" +
            "      \"type\": \"integer\"\n" +
            "    } \n" +
            "    ,\"frame_elevator_amount\"   :\n" +
            "    {\n" +
            "      \"type\": \"integer\"\n" +
            "    }  \n" +
            "    ,\"frame_in_elevator_amount\"   :\n" +
            "    {\n" +
            "      \"type\": \"integer\"\n" +
            "    }  \n" +
            "    ,\"frame_hall_amount\"   :\n" +
            "    {\n" +
            "      \"type\": \"integer\"\n" +
            "    }  \n" +
            "    ,\"frame_in_hall_amount\" :\n" +
            "    {\n" +
            "      \"type\": \"integer\"\n" +
            "    }  \t\n" +
            "    ,\"frame1_amount\":\n" +
            "    {\n" +
            "      \"type\": \"integer\"\n" +
            "    }            \n" +
            "    ,\"frame3_amount\"  :\n" +
            "    {\n" +
            "      \"type\": \"integer\"\n" +
            "    }                            \n" +
            "    ,\"frame_windows_amount\"  :\n" +
            "    {\n" +
            "      \"type\": \"integer\"\n" +
            "    }   \n" +
            "    ,\"frame_jimi_amount\"  :\n" +
            "    {\n" +
            "      \"type\": \"integer\"\n" +
            "    }   \n" +
            "    ,\"frame_amount\"  :\n" +
            "    {\n" +
            "      \"type\": \"integer\"\n" +
            "    } \n" +
            "    ,\"building_amount\"  :\n" +
            "    {\n" +
            "      \"type\": \"integer\"\n" +
            "    } \n" +
            "    ,\"event_day\": {\n" +
            "          \"type\": \"text\",\n" +
            "          \"fields\": {\n" +
            "            \"keyword\": {\n" +
            "              \"type\": \"keyword\",\n" +
            "              \"ignore_above\": 256\n" +
            "            }\n" +
            "          }\n" +
            "        },\n" +
            "        \"event_hour\": {\n" +
            "          \"type\": \"text\",\n" +
            "          \"fields\": {\n" +
            "            \"keyword\": {\n" +
            "              \"type\": \"keyword\",\n" +
            "              \"ignore_above\": 256\n" +
            "            }\n" +
            "          }\n" +
            "        },\n" +
            "        \"event_week\": {\n" +
            "          \"type\": \"long\"\n" +
            "        }\t\n" +
            "    ,\"locations_info\":                 \n" +
            "    {\n" +
            "       \"type\": \"nested\" ,\n" +
            "      \"properties\":\n" +
            "      {\n" +
            "\t      \"building_id\" :\n" +
            "          {\n" +
            "            \"type\": \"long\"\n" +
            "          }                    \n" +
            "          ,\"building_name\" :\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }\n" +
            "\t\t   ,\"stage_key\"      :\n" +
            "          {\n" +
            "             \"type\": \"keyword\"\n" +
            "          }    \n" +
            "          ,\"elevator_id\":\n" +
            "          {\n" +
            "            \"type\": \"long\"\n" +
            "          }\n" +
            "          ,\"elevator_name\"  :\n" +
            "          {\n" +
            "           \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }                          \n" +
            "          ,\"elevator_purpose_code\":\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }             \n" +
            "          ,\"elevator_purpose_name\":\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }   \n" +
            "          ,\"location_attribute_code\":\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }   \n" +
            "          ,\"location_attribute_name\":\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }             \n" +
            "            ,\"stage_id\"  :\n" +
            "            {\n" +
            "              \"type\": \"long\"\n" +
            "            }                 \n" +
            "            ,\"stage_name\" :\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }                                \n" +
            "            ,\"location_key\"    :\n" +
            "          {\n" +
            "             \"type\": \"keyword\"\n" +
            "          }                           \n" +
            "            ,\"location_id\"   :\n" +
            "            {\n" +
            "              \"type\": \"long\"\n" +
            "            }             \n" +
            "            ,\"device_cyber_code\" :\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }                         \n" +
            "            ,\"device_sn\"   :\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }                               \n" +
            "            ,\"main_suit_kind_code\"  :\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }                      \n" +
            "            ,\"main_suit_kind_name\":\n" +
            "          {\n" +
            "            \"type\": \"keyword\"\n" +
            "          }                        \n" +
            "            ,\"suit_kind_code\" :\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }                                   \n" +
            "            ,\"suit_kind_name\"  :\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }                                  \n" +
            "            ,\"display_type_code\"   :\n" +
            "          {\n" +
            "             \"type\": \"keyword\"  \n" +
            "          }                      \n" +
            "            \n" +
            "            ,\"display_type_name\"    :\n" +
            "          {\n" +
            "            \"type\": \"keyword\"\n" +
            "          }       \n" +
            "            ,\"install_area_kind\":\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }             \n" +
            "            ,\"install_area\" :\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }                 \n" +
            "            ,\"install_location_kind\"      :\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }   \n" +
            "            ,\"install_location\"     :\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }         \n" +
            "            ,\"location_detail\"    :\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }           \n" +
            "            ,\"location_desc\"   :\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }              \n" +
            "            ,\"device_style_id\"   :\n" +
            "          {\n" +
            "            \"type\": \"integer\"\n" +
            "          }                        \n" +
            "            ,\"device_style_name\"  :\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }           \n" +
            "            ,\"owner_company_code\"  :\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }                   \n" +
            "            ,\"owner_company_name\"   :\n" +
            "          {\n" +
            "             \"type\": \"text\"  , \"analyzer\": \"ik_smart\"\n" +
            "          }                  \n" +
            "            ,\"product_line_code\"  :\n" +
            "          {\n" +
            "            \"type\": \"keyword\"\n" +
            "          }\n" +
            "            ,\"install_status_code\"   :\n" +
            "          {\n" +
            "            \"type\": \"integer\"\n" +
            "          }  \n" +
            "            ,\"install_status_name\"   :\n" +
            "          {\n" +
            "            \"type\": \"keyword\"\n" +
            "          }  \n" +
            "            ,\"is_sale\"   :\n" +
            "          {\n" +
            "            \"type\": \"integer\"\n" +
            "          }  \n" +
            "            ,\"is_back\"   :\n" +
            "          {\n" +
            "            \"type\": \"integer\"\n" +
            "          }  \n" +
            "            ,\"is_available\"   :\n" +
            "          {\n" +
            "            \"type\": \"integer\"\n" +
            "          }  \t\t  \n" +
            "      }\n" +
            "   }    \n" +
            "      \n" +
            "    }\n" +
            "    \n" +
            "  }\n" +
            "}";

    public static void main( String[] args ) throws IOException
    {


        //Create the index
        if(EsUtils.exsitsIndex(index))
            EsUtils.deleteIndex(index);
        EsUtils.creatIndex(index,index_mapping);

        //Import the data with Spark
        //tableToEs(String index,String index_auto_create,String es_mapping_id,String table_name,String es_nodes)
        EsUtils.tableToEs(index,index_auto_create,es_mapping_id,table_name,es_nodes);


        //Flush the new index
        EsUtils.flushIndex(index);

        //Get the index name currently behind the alias; this index is about to be retired
        String old_index=EsUtils.getAlias(index_alias);

        //Switch the alias to the index holding the latest data
        EsUtils.indexUpdateAlias(index,index_alias);

        //After confirming the alias switch succeeded, delete the old index
        EsUtils.deleteIndex(old_index);
    }
}
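
  As mentioned before the App class, a production version would read these parameters from an XML configuration file instead of hard-coding them (article 5 of this series covers the author's real parameterized code). Below is only a minimal sketch of the idea using the standard java.util.Properties XML format; the file name hive_to_es_galaxy.xml and its keys are assumptions made for illustration, not part of the original project.

package cn.focusmedia.esapp;

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

/**
 * Minimal sketch: load the job parameters from an XML file on the classpath.
 * Assumed file format (java.util.Properties XML), e.g. hive_to_es_galaxy.xml:
 *
 *   <?xml version="1.0" encoding="UTF-8"?>
 *   <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
 *   <properties>
 *     <entry key="index_alias">media_all_rs_v0</entry>
 *     <entry key="es_mapping_id">entrance_key</entry>
 *     <entry key="table_name">dw.app_rs_media_galaxy_entrance_key</entry>
 *     <entry key="es_nodes">10.121.10.1:9200,10.121.10.2:9200</entry>
 *   </properties>
 */
public class JobConfig
{
    public static Properties load(String resourceName) throws IOException
    {
        Properties props = new Properties();
        try (InputStream in = JobConfig.class.getClassLoader().getResourceAsStream(resourceName))
        {
            //loadFromXML parses the java.util.Properties XML format shown above
            props.loadFromXML(in);
        }
        return props;
    }

    //Example usage inside main():
    //  Properties cfg = JobConfig.load("hive_to_es_galaxy.xml");
    //  String index_alias = cfg.getProperty("index_alias");
}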

Build the Jar and Deploy

  Package the debugged project into a jar. If you do not yet know how to build a jar, see the blog post on packaging code into a jar with IntelliJ IDEA. Here the jar is named ETL_hive_to_es_galaxy.jar.
  Upload ETL_hive_to_es_galaxy.jar to HDFS at /app/hive_to_es_galaxy/etl_jar/ETL_hive_to_es_galaxy.jar, then write a shell script spark_on_hive_and_es.sh to be invoked via spark-submit, as follows:

#!/bin/bash

cur_dir=`pwd`

spark-submit --master yarn --deploy-mode cluster --executor-memory 8G --executor-cores 5 --num-executors 4 --queue etl --conf spark.kryoserializer.buffer.max=256m --conf spark.kryoserializer.buffer=64m  --class cn.focusmedia.esapp.App  hdfs:///app/hive_to_es_galaxy/etl_jar/ETL_hive_to_es_galaxy.jar
job_flag=$?
if [ $job_flag -eq 0 ];then
    echo "spark hive to es job run succeed!"

else
    echo "spark hive to es job run failed!"
    ## The lines below trigger our DingTalk failure alert; replace them with your own alerting mechanism
    cd ${cur_dir}/../src/ding_talk_warning_report_py/main/
    python3 ding_talk_with_agency.py 215
    exit 3
fi


Schedule the Shell Script

  Finally, schedule the spark_on_hive_and_es.sh script, for example with Azkaban, at whatever frequency you need; a minimal example of a job definition follows.
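
  For illustration only, a minimal Azkaban job definition wrapping the script could look like the snippet below; the job file name is an assumption, and the job should run from the directory where the script lives so its relative paths resolve.

# spark_on_hive_and_es.job -- hypothetical Azkaban job definition
type=command
command=sh spark_on_hive_and_es.sh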

Summary

  Writing Hive table data into ElasticSearch with Spark is fast, and it can serve as the preferred reference solution for loading offline data from the Hive data warehouse into ElasticSearch: stable, seamless, and quick. As for the missing piece, verifying that the Hive data was written into ES correctly through Spark, see article 3 in the table of contents of this post.
