This article is a follow-up to "Reading full Elasticsearch data with Java and exporting it to a JSON file" (https://blog.csdn.net/zxt880610/article/details/106584343).
Background: I needed to sync ES data to HDFS. The job passed local testing, but once deployed, production turned out to differ from the test environment in several important ways:
1. The test environment has no authentication, while production does. That ruled out Presto; I have not found a way for it to authenticate against ES (pointers welcome if anyone knows one).
2. The test data volume is small, while production data is huge. That ruled out the mapped-table approach, since a full query could bring down the ES cluster.
3. The test environment uses a single index, while production, because of the data volume, creates a new index every day with a date suffix.
Given these constraints, the only workable route was code: write a Java program, package it as a jar, upload it to the cluster, and run it there (to be continued as work allows...).
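To make the approach concrete, here is a minimal sketch of the core read logic, not the exact program from the earlier article. It assumes the Elasticsearch high-level REST client (6.x/7.x); the host es-host, the account es_user/es_password, the index pattern daily_index-*, and the output path /tmp/es_data.json are all placeholders:

import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.elasticsearch.action.search.*;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

import java.io.FileWriter;
import java.io.PrintWriter;

public class ScrollAllSketch {
    public static void main(String[] args) throws Exception {
        // Production requires basic auth, so attach a credentials provider.
        BasicCredentialsProvider creds = new BasicCredentialsProvider();
        creds.setCredentials(AuthScope.ANY,
                new UsernamePasswordCredentials("es_user", "es_password")); // placeholders

        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("es-host", 9200, "http"))
                        .setHttpClientConfigCallback(
                                b -> b.setDefaultCredentialsProvider(creds)));

        // One index per day with a date suffix, so hit them all with a wildcard,
        // and use scroll so the full data set is paged instead of queried at once.
        SearchRequest request = new SearchRequest("daily_index-*");
        request.scroll(TimeValue.timeValueMinutes(3L));
        request.source(new SearchSourceBuilder()
                .query(QueryBuilders.matchAllQuery())
                .size(1000)); // documents per scroll batch

        try (PrintWriter out = new PrintWriter(new FileWriter("/tmp/es_data.json"))) {
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            String scrollId = response.getScrollId();
            SearchHit[] hits = response.getHits().getHits();
            while (hits != null && hits.length > 0) {
                for (SearchHit hit : hits) {
                    out.println(hit.getSourceAsString()); // one JSON document per line
                }
                SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
                scrollRequest.scroll(TimeValue.timeValueMinutes(3L));
                response = client.scroll(scrollRequest, RequestOptions.DEFAULT);
                scrollId = response.getScrollId();
                hits = response.getHits().getHits();
            }
            // Release the scroll context on the server when done.
            ClearScrollRequest clear = new ClearScrollRequest();
            clear.addScrollId(scrollId);
            client.clearScroll(clear, RequestOptions.DEFAULT);
        }
        client.close();
    }
}

Writing one JSON document per line is deliberate: it matches the single-string-column temp table used on the Hive side later in this article.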
I had been packaging from IDEA, but it kept failing with two errors: "no main manifest attribute" and "Error: Invalid or corrupt jarfile". I gave up on IDEA's packaging and switched to Maven plugins, which worked. Add the following plugins to the pom file; together they cover skipping unit tests, pinning the JDK version and encoding, attaching a source jar, and assembling the runnable fat jar:
<build>
    <plugins>
        <!-- Skip unit tests during packaging -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-surefire-plugin</artifactId>
            <version>2.4.2</version>
            <configuration>
                <skip>true</skip>
            </configuration>
        </plugin>
        <!-- Pin the JDK version and source encoding -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <!-- Attach a source jar -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-source-plugin</artifactId>
            <version>2.4</version>
            <executions>
                <execution>
                    <id>attach-sources</id>
                    <goals>
                        <goal>jar</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <!-- Assemble a runnable jar bundled with all dependencies -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <goals>
                        <goal>single</goal>
                    </goals>
                    <phase>package</phase>
                </execution>
            </executions>
            <configuration>
                <attach>false</attach>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>com.afanti.ScrollAll</mainClass> <!-- set this to your program's main class -->
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>
After adding the plugins, build with mvn clean package; the assembly plugin writes the runnable jar (named *-jar-with-dependencies.jar) under target/. Upload it to the server and run it with the java -jar command from the jar's directory.
The error I hit at this point was the ES cluster going down; after restarting the ES cluster, rerun the java -jar command.
The exported ES data files are then written under the Linux tmp/ path.
Once this step is done, follow my earlier articles in order:
1. Read full Elasticsearch data with Java and export it to a JSON file.
2. Build the jar with local dependencies and run java -jar to read full Elasticsearch data and export the JSON file.
3. Sync MONGO data incrementally to a HIVE partitioned table.
These three steps together achieve incremental sync of large ES data volumes into the hive warehouse.
Sample line from the exported ES JSON file:
{"serialVersionUID":1,"topic":"mw_20_108_test2","database":"dataworks","table":"global_val_test","type":"insert","ts":1585639455,"xid":6497354229,"commit":true,"data":"{\"update_time\":\"2020-01-09 06:46:54\",\"create_time\":\"2020-01-09 06:46:54\",\"name\
":\"updateMysqlMetadate\",\"id\":50,\"value\":\"10\"}"}
Hive DDL, first a temp table holding the raw JSON lines, then the partitioned target table:
CREATE TABLE IF NOT EXISTS TMP_mongo_binloghash.TMP_binloghash
( json_data string )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001';
CREATE EXTERNAL TABLE IF NOT EXISTS mongo_binloghash.binloghash
( serialVersionUID string ,
topic string ,
database_str string ,
table_str string ,
type string ,
ts string ,
xid string ,
commit_str string ,
update_time string,
create_time string,
name string,
id string,
value string
) PARTITIONED BY (pk_year string, pk_month string, pk_day string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS RCFILE;
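Before parsing, the exported JSON file has to be loaded into the temp table. A minimal sketch, assuming the export landed at /tmp/es_data.json on the machine running the Hive client (the path is a placeholder):
LOAD DATA LOCAL INPATH '/tmp/es_data.json' INTO TABLE TMP_mongo_binloghash.TMP_binloghash;
Because the '\001' delimiter never occurs in the JSON text, each line lands intact in the single json_data column.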
Parse the JSON in the temp table and write it incrementally, by partition, into the target table:
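One note before running the INSERT below: the partition values pk_year/pk_month/pk_day are computed from the data rather than given as constants, so dynamic partitioning must be enabled in the Hive session first:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;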
INSERT OVERWRITE TABLE mongo_binloghash.binloghash PARTITION (pk_year, pk_month, pk_day)
SELECT get_json_object(a.json_data,'$.serialVersionUID') AS serialVersionUID,
get_json_object(a.json_data,'$.topic') AS topic,
get_json_object(a.json_data,'$.database') AS database_str,
get_json_object(a.json_data,'$.table') AS table_str,
get_json_object(a.json_data,'$.type') AS type,
get_json_object(a.json_data,'$.ts') AS ts,
get_json_object(a.json_data,'$.xid') AS xid,
get_json_object(a.json_data,'$.commit') AS commit_str,
get_json_object((get_json_object(a.json_data,'$.data')),'$.update_time') AS update_time, -- the data field in the JSON sample is a nested JSON string, so it is unwrapped with a second get_json_object
get_json_object((get_json_object(a.json_data,'$.data')),'$.create_time') AS create_time,
get_json_object((get_json_object(a.json_data,'$.data')),'$.name') AS name,
get_json_object((get_json_object(a.json_data,'$.data')),'$.id') AS id,
get_json_object((get_json_object(a.json_data,'$.data')),'$.value') AS value,
substring(get_json_object((get_json_object(a.json_data,'$.data')),'$.create_time'),1,4) AS pk_year, -- after unwrapping data, its create_time is truncated to build the partition columns
substring(get_json_object((get_json_object(a.json_data,'$.data')),'$.create_time'),1,7) AS pk_month,
substring(get_json_object((get_json_object(a.json_data,'$.data')),'$.create_time'),1,10) AS pk_day
FROM TMP_mongo_binloghash.TMP_binloghash a;