Table of Contents
- Hands-on: Importing Hive table data directly into ElasticSearch
  No code required — simple and blunt, but relatively inflexible; it runs on the MapReduce framework underneath, so the import is comparatively slow.
- Hands-on: Writing Hive table data into ElasticSearch with Spark (Java version)
  Requires Java code. A one-stop Java solution: pull the data out of Hive with Spark, create the ES index, load the data, and use the ES alias mechanism so the ES data is updated seamlessly. It runs on the Spark framework underneath, so the import is much faster than the approach in article 1.
- Hands-on: Validating data quality between ElasticSearch and the Hive data warehouse with DingTalk alerts (Java version)
  Selects key metrics and checks that the data in the source (Hive) and the target (ES) are consistent.
- Hands-on: Writing Hive table data with Spark into an ElasticSearch that requires username/password authentication (Java version)
  Covers how to write Hive data through Spark into an ElasticSearch protected by account/password authentication.
- Hands-on (production deployment): Parameterized Spark job writing Hive table data into an ElasticSearch that requires username/password authentication (Java version)
  Same as the previous article, but with the Spark and ES index-creation parameters externalized into configuration: syncing one more table to ES only requires adding one XML config file. This is the Java code the author actually runs in production, and it addresses the shortcomings readers pointed out in article 4.
Overview:
1. If you feel your coding skills are limited but still want to import Hive data into ElasticSearch, consider article 1.
2. If you can code, the author recommends the combination of articles 2 and 3 as the architecture for moving offline or near-line data from the Hive data warehouse into ElasticSearch. The Java code shared here is the author's earliest version 1 — written above all to be easy to understand and to get the job done — so feel free to rework it rather than complain about its polish.
3. If your ElasticSearch enforces username/password authentication (a cloud product, or one you secured yourself), there is no way around it: use article 4.
4. For production deployment, see article 5.
- Hive version: 2.3.5
- ES version: 7.7.1
- Spark version: 2.3.3
Project Tree
The overall project tree is shown in Figure 1. IDE: IntelliJ IDEA 2019.3 x64, built with Maven.
Project link: Hands-on — Writing Hive table data into ElasticSearch with Spark (Java version)
- feign: Java classes for the ES and Spark client connections;
- utils: Java classes for operating ES and Spark;
- resources: configuration for the log output;
- pom.xml: the Maven configuration file.
Maven configuration file pom.xml
The Maven dependencies this project uses are declared in pom.xml, as shown below.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>SparkOnHiveToEs_v1</artifactId>
<version>1.0-SNAPSHOT</version>
<name>SparkOnHiveToEs_v1</name>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
</properties>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch -->
<!-- the Elasticsearch core dependency -->
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch</artifactId>
<version>7.7.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.elasticsearch.client/elasticsearch-rest-high-level-client -->
<!-- ES high-level REST client API, used to build the ES Client and perform index operations -->
<dependency>
<groupId>org.elasticsearch.client</groupId>
<artifactId>elasticsearch-rest-high-level-client</artifactId>
<version>7.7.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.projectlombok/lombok -->
<!-- Lombok, auto-generates constructors, getters/setters, and the @Slf4j logger -->
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.18.12</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>6.14.3</version>
<scope>compile</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-databind -->
<!-- Jackson, used to assemble JSON -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.11.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-spark-20 -->
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-spark-20_2.11</artifactId>
<version>7.7.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.3.3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.3.3</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.9.1</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
<version>2.9.1</version>
</dependency>
</dependencies>
<build>
<plugins>
<!-- when a Maven project contains both Java and Scala code, the maven-scala-plugin packages both together -->
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- plugin required to build the fat jar -->
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.4</version>
<configuration>
<!-- setting appendAssemblyId to false drops the "-jar-with-dependencies" suffix from e.g. MySpark-1.0-SNAPSHOT-jar-with-dependencies.jar -->
<!--<appendAssemblyId>false</appendAssemblyId>-->
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass>cn.focusmedia.esapp.App</mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
Log configuration files
This job will ultimately be submitted through spark-submit, so key runtime information should be emitted via the logging framework — log.info("key information") — rather than printed to the console with System.out.println. Two log configuration files are therefore provided (log4j.properties for Log4j 1.x, which Spark uses, and log4j2.xml for the Log4j 2 dependencies in the pom); adjust them to your own needs, e.g. emit INFO but not DEBUG. Their contents are as follows.
log4j.properties is configured as follows:
log4j.rootLogger=INFO, stdout, R
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%5p - %m%n
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.File=firestorm.log
log4j.appender.R.MaxFileSize=100KB
log4j.appender.R.MaxBackupIndex=1
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%p %t %c - %m%n
log4j.logger.com.codefutures=INFO
log4j2.xml is configured as follows:
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="warn">
<Appenders>
<Console name="Console" target="SYSTEM_OUT">
<PatternLayout pattern="%m%n" />
</Console>
</Appenders>
<Loggers>
<Root level="INFO">
<AppenderRef ref="Console" />
</Root>
</Loggers>
</Configuration>
The Spark client
To read Hive tables through Spark, first configure the client that connects to Spark — the SparkClient.java file below:
package cn.focusmedia.esapp.feign;
import org.apache.spark.sql.SparkSession;
public class SparkClient
{
public static SparkSession getSpark()
{
SparkSession spark=SparkSession.builder().appName("SparkToES").enableHiveSupport().getOrCreate();
return spark;
}
}
The ElasticSearch client
To operate ES, first configure the client that connects to ES — the EsClient.java file below.
Note: in a production version the ES cluster information should live in a configuration file that the client reads, so that a cluster change only requires editing that file. It is written inline here purely to make the example easy to read and understand!
package cn.focusmedia.esapp.feign;
import org.apache.http.HttpHost;
import lombok.extern.slf4j.Slf4j;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;
@Slf4j
public class EsClient
{
public static RestHighLevelClient getClient()
{
//cluster IPs and ports; a real project should read these from a configuration file — they are hard-coded here only to keep the example simple
//create the HttpHost objects
HttpHost[] myHttpHost = new HttpHost[7];
myHttpHost[0]=new HttpHost("10.121.10.1",9200);
myHttpHost[1]=new HttpHost("10.121.10.2",9200);
myHttpHost[2]=new HttpHost("10.121.10.3",9200);
myHttpHost[3]=new HttpHost("10.121.10.4",9200);
myHttpHost[4]=new HttpHost("10.121.10.5",9200);
myHttpHost[5]=new HttpHost("10.121.10.6",9200);
myHttpHost[6]=new HttpHost("10.121.10.7",9200);
//create the RestClientBuilder object
RestClientBuilder myRestClientBuilder=RestClient.builder(myHttpHost);
//create the RestHighLevelClient object
RestHighLevelClient myclient=new RestHighLevelClient(myRestClientBuilder);
log.info("EsClient info: rest high level client created successfully!");
return myclient;
}
}
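As the note above suggests, the cluster addresses are better kept in a configuration file. Below is a minimal, JDK-only sketch of parsing such an entry into host/port pairs — the es.properties file name and the es.hosts key are illustrative assumptions, not part of the original project:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class EsHostConfig {
    // Parse a comma-separated "host:port,host:port" value into {host, port} pairs.
    public static List<String[]> parseHosts(String hostsValue) {
        List<String[]> hosts = new ArrayList<>();
        for (String entry : hostsValue.split(",")) {
            String[] parts = entry.trim().split(":");
            hosts.add(new String[]{parts[0], parts[1]});
        }
        return hosts;
    }

    public static void main(String[] args) throws IOException {
        // In a real project this string would come from es.properties on the classpath.
        String config = "es.hosts=10.121.10.1:9200,10.121.10.2:9200,10.121.10.3:9200";
        Properties props = new Properties();
        props.load(new StringReader(config));
        for (String[] host : parseHosts(props.getProperty("es.hosts"))) {
            System.out.println(host[0] + " -> " + host[1]);
        }
    }
}
```

Each parsed pair can then be fed to `new HttpHost(host, Integer.parseInt(port))` when building the RestClient, so a cluster change never requires a recompile.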
Utility class: writing the Hive table into ElasticSearch with Spark
The implementation lives in utils/EsUtils.java. For simplicity, all the methods sit in this single file — split them up as you see fit. Its contents are as follows.
package cn.focusmedia.esapp.utils;
import cn.focusmedia.esapp.feign.EsClient;
import cn.focusmedia.esapp.feign.SparkClient;
import lombok.extern.slf4j.Slf4j;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.elasticsearch.action.admin.indices.alias.IndicesAliasesRequest;
import org.elasticsearch.action.admin.indices.alias.get.GetAliasesRequest;
import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
import org.elasticsearch.action.admin.indices.flush.FlushRequest;
import org.elasticsearch.action.admin.indices.flush.FlushResponse;
import org.elasticsearch.action.support.master.AcknowledgedResponse;
import org.elasticsearch.client.GetAliasesResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.CreateIndexResponse;
import org.elasticsearch.client.indices.DeleteAliasRequest;
import org.elasticsearch.client.indices.GetIndexRequest;
import org.elasticsearch.cluster.metadata.AliasMetaData;
import org.elasticsearch.common.collect.ImmutableOpenMap;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.spark.sql.EsSparkSQL;
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;
import java.io.IOException;
import java.util.List;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
import org.spark_project.guava.collect.ImmutableMap;
import scala.collection.mutable.Map;
@Slf4j
public class EsUtils
{
static RestHighLevelClient myClient= EsClient.getClient(); //obtain the client that operates ES
//check whether an ES index exists
public static boolean exsitsIndex(String index) throws IOException
{
//prepare the request object
GetIndexRequest myrequest=new GetIndexRequest(index);
//execute it through the client
boolean myresult = myClient.indices().exists(myrequest, RequestOptions.DEFAULT);
//log the result
log.info("The index: "+index+" exists? : "+myresult);
return myresult;
}
//create an ES index
public static CreateIndexResponse creatIndex(String index,String index_mapping) throws IOException
{
log.info("The index name to be created: "+index);
//wrap the prepared settings and mappings into a request object
CreateIndexRequest myrequest = new CreateIndexRequest(index).source(index_mapping, XContentType.JSON);
//connect to ES through the client and create the index
CreateIndexResponse myCreateIndexResponse=myClient.indices().create(myrequest, RequestOptions.DEFAULT);
//log the result
log.info("The index: "+index+" create response is "+ myCreateIndexResponse.isAcknowledged());
return myCreateIndexResponse;
}
//delete an ES index
public static AcknowledgedResponse deleteIndex(String index) throws IOException {
//prepare the request object
DeleteIndexRequest myDeleteIndexRequest = new DeleteIndexRequest();
myDeleteIndexRequest.indices(index);
//execute it through the client
AcknowledgedResponse myAcknowledgedResponse = myClient.indices().delete(myDeleteIndexRequest,RequestOptions.DEFAULT);
//log the result
log.info("The index: "+index+" delete response is "+myAcknowledgedResponse.isAcknowledged());
return myAcknowledgedResponse;
}
//write the table data into ES
public static void tableToEs(String index,String index_auto_create,String es_mapping_id,String table_name,String es_nodes)
{
SparkSession spark = SparkClient.getSpark();
Dataset<Row> table = spark.table(table_name).repartition(60);
JavaEsSparkSQL.saveToEs(table,index, ImmutableMap.of("es.index.auto.create", index_auto_create,"es.resource", index, "es.mapping.id" ,es_mapping_id,"es.nodes" ,es_nodes));
log.info("Spark data from hive to ES index: "+index+" is over, go to alias the index!");
spark.stop();
}
//flush the newly written index
public static void flushIndex(String index) throws IOException
{
FlushRequest myFlushRequest =new FlushRequest(index);
FlushResponse myFlushResponse=myClient.indices().flush(myFlushRequest,RequestOptions.DEFAULT);
int totalShards =myFlushResponse.getTotalShards();
log.info("index: "+index+" has "+ totalShards +" shards flushed!");
}
//ES alias handling, for a seamless switch-over
//get the index currently behind an alias
public static String getAlias(String alias) throws IOException
{
GetAliasesRequest requestWithAlias = new GetAliasesRequest(alias);
GetAliasesResponse response = myClient.indices().getAlias(requestWithAlias, RequestOptions.DEFAULT);
String AliasesString = response.getAliases().toString();
//note: an initial index must already carry the alias (e.g. media_all_rs_v0), otherwise the substring extraction below fails
//Kibana bootstrap:
//POST _aliases
//{
// "actions" : [{"add" : {"index" : "<index name>" , "alias" : "<alias>"}}]
//}
String alias_index_name = AliasesString.substring(AliasesString.indexOf("{") + 1, AliasesString.indexOf("="));
return alias_index_name;
}
//switch the alias over to the new index
public static void indexUpdateAlias(String index,String index_alias) throws IOException
{
String old_index_name=EsUtils.getAlias(index_alias);
log.info(index_alias+ " old index is "+old_index_name);
//remove the alias from the old index
DeleteAliasRequest myDeleteAliasRequest = new DeleteAliasRequest(old_index_name, index_alias);
org.elasticsearch.client.core.AcknowledgedResponse myDeleteResponse=myClient.indices().deleteAlias(myDeleteAliasRequest, RequestOptions.DEFAULT);
boolean deletealisaacknowledged = myDeleteResponse.isAcknowledged();
log.info("delete alias successfully? " + deletealisaacknowledged);
//attach the alias to the new index
IndicesAliasesRequest request = new IndicesAliasesRequest();
IndicesAliasesRequest.AliasActions aliasAction = new IndicesAliasesRequest.AliasActions(IndicesAliasesRequest.AliasActions.Type.ADD).index(index).alias(index_alias);
request.addAliasAction(aliasAction);
org.elasticsearch.action.support.master.AcknowledgedResponse indicesAliasesResponse = myClient.indices().updateAliases(request, RequestOptions.DEFAULT);
boolean createaliasacknowledged = indicesAliasesResponse.isAcknowledged();
log.info("create alias successfully? "+createaliasacknowledged);
String now_index=EsUtils.getAlias(index_alias);
log.info(index_alias+ " now index is "+now_index);
if(now_index.equals(index))
{
log.info("index: "+index+ " alias updated successfully!");
}
}
}
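The getAlias method above recovers the index name by slicing the toString() output of the alias map, which only works when exactly one index carries the alias. A stdlib-only sketch of that same extraction — the sample response string below is illustrative, shaped like what the alias map prints:

```java
public class AliasParseDemo {
    // Mirrors the substring logic in getAlias: the alias map prints as
    // "{index_name=[alias_metadata...]}", so the index name sits between '{' and the first '='.
    public static String extractIndexName(String aliasesString) {
        return aliasesString.substring(aliasesString.indexOf("{") + 1, aliasesString.indexOf("="));
    }

    public static void main(String[] args) {
        String sample = "{media_all_rs_v1623398400000=[{alias=media_all_rs_v0}]}";
        System.out.println(extractIndexName(sample));
    }
}
```

Note the fragility: if several indices ever carry the alias, the map contains several entries and this slice silently returns only the first one — one more reason to keep exactly one index behind the alias at all times.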
The main class wires the utility methods together
The main function performs the following steps, in order:
- create the index
- import the data with Spark
- flush the new index
- look up the index name currently behind the alias (this index is about to be retired)
- switch the alias to the index holding the latest data
- once the alias switch is confirmed, delete the old index
The concrete code is in the following App.java file.
Note: for convenience, the required variables are defined directly in the main class. In a real project they are better extracted into an XML configuration file that the main function reads at startup — then syncing another Hive table to ES only requires editing the corresponding config file, which is far more convenient. The focus here is on a working, easy-to-read implementation; the polish is left to you.
package cn.focusmedia.esapp;
import cn.focusmedia.esapp.feign.EsClient;
import cn.focusmedia.esapp.utils.EsUtils;
import org.elasticsearch.client.RestHighLevelClient;
import java.io.IOException;
/**
 * Syncs a Hive table into a newly created ES index, then swaps the alias over to it.
 */
public class App
{
//declare the name of the new ES index
private static String index=null;
static
{
index = "media_all_rs_v" + System.currentTimeMillis();
}
//the ES alias
static String index_alias="media_all_rs_v0";
//whether to let ES auto-create the index from the Hive table schema; usually "false", to avoid schema drift — create a well-defined index from an explicit mapping instead
static String index_auto_create="false";
//the field to use as the ES document id
static String es_mapping_id ="entrance_key";
//the Hive table to export
static String table_name="dw.app_rs_media_galaxy_entrance_key";
//the ES cluster nodes
static String es_nodes="10.121.10.1:9200,10.121.10.2:9200,10.121.10.3:9200,10.121.10.4:9200,10.121.10.5:9200,10.121.10.6:9200,10.121.10.7:9200";
//the settings and mappings of the ES index
static String index_mapping="\n" +
"{\n" +
" \"settings\": \n" +
" {\n" +
" \"number_of_replicas\": 3\n" +
" , \"number_of_shards\": 1\n" +
" ,\"max_result_window\" : 1000000\n" +
" }\n" +
" \n" +
" , \"mappings\": \n" +
" {\n" +
" \"properties\" : \n" +
" {\n" +
" \"amap_province_code\":\n" +
" {\n" +
" \"type\": \"keyword\"\n" +
" }\n" +
" ,\"amap_province_name\":\n" +
" {\n" +
" \"type\":\"keyword\"\n" +
" }\n" +
" ,\"amap_city_code\":\n" +
" {\n" +
" \"type\": \"keyword\"\n" +
" } \n" +
" ,\"amap_city_name\":\n" +
" {\n" +
" \"type\":\"keyword\"\n" +
" }\n" +
" ,\"amap_district_code\":\n" +
" {\n" +
" \"type\": \"keyword\"\n" +
" }\n" +
" ,\"amap_district_name\" :\n" +
" {\n" +
" \"type\":\"keyword\"\n" +
" }\n" +
" ,\"group_name\":\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" }\n" +
" ,\"building_amap_address\":\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" }\n" +
" ,\"building_map\":\n" +
" {\n" +
" \"type\": \"geo_point\"\n" +
" } \t\n" +
"\t ,\"reside_rate\": \n" +
" {\n" +
" \"type\": \"double\"\n" +
" }\n" +
" ,\"rooms\":\n" +
" {\n" +
" \"type\": \"long\"\n" +
" }\n" +
" ,\"reside_rooms\" :\n" +
" {\n" +
" \"type\": \"long\"\n" +
" }\n" +
" ,\"floor_num_min\" :\n" +
" {\n" +
" \"type\": \"long\"\n" +
" } \n" +
" ,\"floor_num_max\" :\n" +
" {\n" +
" \"type\": \"long\"\n" +
" } \n" +
" ,\"building_scope_code\":\n" +
" {\n" +
" \"type\": \"keyword\"\n" +
" }\n" +
" ,\"building_scope_name\":\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"audiences_code\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"audiences_name\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" }\n" +
" ,\"building_level_code\":\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"building_level_name\":\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"hire_price\" : \n" +
" {\n" +
" \"type\": \"double\"\n" +
" } \n" +
" ,\"sell_price\": \n" +
" {\n" +
" \"type\": \"double\"\n" +
" } \n" +
" ,\"open_quotation\":\n" +
" {\n" +
" \"type\": \"date\"\n" +
" , \"format\": \"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis\"\n" +
" }\n" +
"\n" +
" ,\"building_type_1_code\":\n" +
" {\n" +
" \"type\": \"long\" \n" +
" } \n" +
" ,\"building_type_1_name\":\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"reside_time\":\n" +
" {\n" +
" \"type\": \"date\"\n" +
" ,\"format\": \"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis\"\n" +
" }\n" +
" ,\"under_carport_count\":\n" +
" {\n" +
" \"type\": \"long\"\n" +
" }\n" +
" ,\"over_carport_count\" :\n" +
" {\n" +
" \"type\": \"long\"\n" +
" } \n" +
" ,\"surrounding_key\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"surroundings_code\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"surroundings\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"building_info_old_frame_taboos_code\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"building_info_old_frame_taboos_name\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \t\n" +
"\t \n" +
" ,\"zone_key\" : \n" +
" {\n" +
" \"type\": \"keyword\"\n" +
" } \n" +
" ,\"sub_building_id\":\n" +
" {\n" +
" \"type\": \"long\"\n" +
" } \n" +
" ,\"sub_building_name\":\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" }\n" +
" ,\"emplacement_key\" : \n" +
" {\n" +
" \"type\": \"keyword\"\n" +
" } \n" +
" ,\"site_id\" :\n" +
" {\n" +
" \"type\": \"long\"\n" +
" } \n" +
" ,\"site_name\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"entrance_key\" :\n" +
" {\n" +
" \"type\": \"keyword\"\n" +
" } \n" +
" ,\"unit_id\" :\n" +
" {\n" +
" \"type\": \"long\"\n" +
" } \n" +
" ,\"unit_name\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"install_name_id\" :\n" +
" {\n" +
" \"type\": \"long\"\n" +
" } \n" +
" ,\"install_name\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"lcd_amount\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"smart_amount\":\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"kids_amount\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"juran_amount\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"frame_elevator_amount\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"frame_in_elevator_amount\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"frame_hall_amount\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"frame_in_hall_amount\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \t\n" +
" ,\"frame1_amount\":\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"frame3_amount\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"frame_windows_amount\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"frame_jimi_amount\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"frame_amount\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"building_amount\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"event_day\": {\n" +
" \"type\": \"text\",\n" +
" \"fields\": {\n" +
" \"keyword\": {\n" +
" \"type\": \"keyword\",\n" +
" \"ignore_above\": 256\n" +
" }\n" +
" }\n" +
" },\n" +
" \"event_hour\": {\n" +
" \"type\": \"text\",\n" +
" \"fields\": {\n" +
" \"keyword\": {\n" +
" \"type\": \"keyword\",\n" +
" \"ignore_above\": 256\n" +
" }\n" +
" }\n" +
" },\n" +
" \"event_week\": {\n" +
" \"type\": \"long\"\n" +
" }\t\n" +
" ,\"locations_info\": \n" +
" {\n" +
" \"type\": \"nested\" ,\n" +
" \"properties\":\n" +
" {\n" +
"\t \"building_id\" :\n" +
" {\n" +
" \"type\": \"long\"\n" +
" } \n" +
" ,\"building_name\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" }\n" +
"\t\t ,\"stage_key\" :\n" +
" {\n" +
" \"type\": \"keyword\"\n" +
" } \n" +
" ,\"elevator_id\":\n" +
" {\n" +
" \"type\": \"long\"\n" +
" }\n" +
" ,\"elevator_name\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"elevator_purpose_code\":\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"elevator_purpose_name\":\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"location_attribute_code\":\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"location_attribute_name\":\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"stage_id\" :\n" +
" {\n" +
" \"type\": \"long\"\n" +
" } \n" +
" ,\"stage_name\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"location_key\" :\n" +
" {\n" +
" \"type\": \"keyword\"\n" +
" } \n" +
" ,\"location_id\" :\n" +
" {\n" +
" \"type\": \"long\"\n" +
" } \n" +
" ,\"device_cyber_code\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"device_sn\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"main_suit_kind_code\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"main_suit_kind_name\":\n" +
" {\n" +
" \"type\": \"keyword\"\n" +
" } \n" +
" ,\"suit_kind_code\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"suit_kind_name\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"display_type_code\" :\n" +
" {\n" +
" \"type\": \"keyword\" \n" +
" } \n" +
" \n" +
" ,\"display_type_name\" :\n" +
" {\n" +
" \"type\": \"keyword\"\n" +
" } \n" +
" ,\"install_area_kind\":\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"install_area\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"install_location_kind\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"install_location\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"location_detail\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"location_desc\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"device_style_id\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"device_style_name\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"owner_company_code\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"owner_company_name\" :\n" +
" {\n" +
" \"type\": \"text\" , \"analyzer\": \"ik_smart\"\n" +
" } \n" +
" ,\"product_line_code\" :\n" +
" {\n" +
" \"type\": \"keyword\"\n" +
" }\n" +
" ,\"install_status_code\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"install_status_name\" :\n" +
" {\n" +
" \"type\": \"keyword\"\n" +
" } \n" +
" ,\"is_sale\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"is_back\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \n" +
" ,\"is_available\" :\n" +
" {\n" +
" \"type\": \"integer\"\n" +
" } \t\t \n" +
" }\n" +
" } \n" +
" \n" +
" }\n" +
" \n" +
" }\n" +
"}";
public static void main( String[] args ) throws IOException
{
//create the index (drop any leftover index with the same name first)
if(EsUtils.exsitsIndex(index))
EsUtils.deleteIndex(index);
EsUtils.creatIndex(index,index_mapping);
//import the data with Spark
//tableToEs(String index, String index_auto_create, String es_mapping_id, String table_name, String es_nodes)
EsUtils.tableToEs(index,index_auto_create,es_mapping_id,table_name,es_nodes);
//flush the new index
EsUtils.flushIndex(index);
//look up the index currently behind the alias — it is about to be retired
String old_index=EsUtils.getAlias(index_alias);
//switch the alias to the new index
EsUtils.indexUpdateAlias(index,index_alias);
//after the alias switch is confirmed, delete the old index
EsUtils.deleteIndex(old_index);
}
}
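As noted above, the job parameters (index alias, Hive table name, ES nodes, …) belong in an external XML file. One lightweight, JDK-only option is Java's XML properties format read via Properties.loadFromXML — the hive_to_es.xml file name and the key names below are illustrative assumptions, not from the original project:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class JobConfigDemo {
    public static Properties load(byte[] xmlBytes) throws IOException {
        Properties props = new Properties();
        // loadFromXML expects the java.util.Properties XML format (properties.dtd)
        props.loadFromXML(new ByteArrayInputStream(xmlBytes));
        return props;
    }

    public static void main(String[] args) throws IOException {
        // In a real project this would be: new FileInputStream("hive_to_es.xml")
        String xml =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
            "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n" +
            "<properties>\n" +
            "  <entry key=\"index_alias\">media_all_rs_v0</entry>\n" +
            "  <entry key=\"table_name\">dw.app_rs_media_galaxy_entrance_key</entry>\n" +
            "  <entry key=\"es_mapping_id\">entrance_key</entry>\n" +
            "</properties>\n";
        Properties props = load(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(props.getProperty("table_name"));
    }
}
```

With this in place, adding one more Hive-to-ES sync means adding one more XML file and passing its path to the job — the approach article 5 builds out fully.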
Package into a jar and deploy
Package the debugged project into a jar — if you have not built a jar before, see the blog post "IntelliJ IDEA将代码打成Jar包的方式". Here the jar is named ETL_hive_to_es_galaxy.jar.
Upload ETL_hive_to_es_galaxy.jar to HDFS under the path /app/hive_to_es_galaxy/etl_jar/ETL_hive_to_es_galaxy.jar, then write a shell script, spark_on_hive_and_es.sh, that invokes spark-submit, as follows:
#!/bin/bash
cur_dir=`pwd`
spark-submit --master yarn --deploy-mode cluster --executor-memory 8G --executor-cores 5 --num-executors 4 --queue etl --conf spark.kryoserializer.buffer.max=256m --conf spark.kryoserializer.buffer=64m --class cn.focusmedia.esapp.App hdfs:///app/hive_to_es_galaxy/etl_jar/ETL_hive_to_es_galaxy.jar
dq_check_flag=$?
if [ $dq_check_flag -eq 0 ];then
echo "spark hive-to-es job run succeeded!"
else
echo "spark hive-to-es job run failed!"
## the lines below trigger our DingTalk failure alert — replace them with your own alerting mechanism
cd ${cur_dir}/../src/ding_talk_warning_report_py/main/
python3 ding_talk_with_agency.py 215
exit 3
fi
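The `$?` capture in the script above is worth calling out: the exit status must be saved immediately after the command of interest, before any other command overwrites it. A minimal standalone illustration:

```shell
#!/bin/bash
# run a command that fails, then branch on its exit status
false
status=$?   # capture immediately — the next command would overwrite $?
if [ $status -eq 0 ]; then
    echo "job run succeeded!"
else
    echo "job run failed with code $status"
fi
```

Running this prints `job run failed with code 1`, since `false` always exits with status 1.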
Schedule the shell script
Finally, schedule spark_on_hive_and_es.sh with a scheduler such as Azkaban, at whatever frequency your use case requires.
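With Azkaban, the scheduling glue can be a single `.job` definition — a sketch only; the file name and settings below are illustrative, not from the original project:

```properties
# spark_on_hive_and_es.job — Azkaban "command"-type job
type=command
command=sh spark_on_hive_and_es.sh
# add retries/failure e-mails as needed; the schedule itself is set in the Azkaban web UI
```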
Summary
Writing Hive table data into ElasticSearch with Spark is fast, stable, and seamless, and can serve as the preferred reference architecture for loading offline data from the Hive data warehouse into ElasticSearch. As for the missing link — verifying that the Hive data really did arrive in ES intact via Spark — see article 3 in the table of contents.