This article is a follow-up to "Reading full Elasticsearch data with Java and exporting it to a JSON file" (https://blog.csdn.net/zxt880610/article/details/106584343).
Background: I needed to sync ES data to HDFS. The job passed local testing, but once deployed, production turned out to differ from the test environment in several important ways:
1. The test environment has no authentication, while production does. That ruled out Presto; I have not found a way for it to authenticate against ES (pointers welcome if anyone knows one).
2. The test data volume is small, while production data is huge. That ruled out the mapped-table approach, since a full query could bring down the ES cluster.
3. The test environment uses a single index, while production, because of the data volume, creates a new index every day with a date suffix.
Given these constraints, the only workable route was code: write a Java program, package it as a jar, upload it to the cluster, and run it there (to be continued as work allows...).
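To make the approach concrete, here is a minimal sketch of the core read logic, not the exact program from the earlier article. It assumes the Elasticsearch high-level REST client (6.x/7.x); the host es-host, the account es_user/es_password, the index pattern daily_index-*, and the output path /tmp/es_data.json are all placeholders:

import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.elasticsearch.action.search.*;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

import java.io.FileWriter;
import java.io.PrintWriter;

public class ScrollAllSketch {
    public static void main(String[] args) throws Exception {
        // Production requires basic auth, so attach a credentials provider.
        BasicCredentialsProvider creds = new BasicCredentialsProvider();
        creds.setCredentials(AuthScope.ANY,
                new UsernamePasswordCredentials("es_user", "es_password")); // placeholders

        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("es-host", 9200, "http"))
                        .setHttpClientConfigCallback(
                                b -> b.setDefaultCredentialsProvider(creds)));

        // One index per day with a date suffix, so hit them all with a wildcard,
        // and use scroll so the full data set is paged instead of queried at once.
        SearchRequest request = new SearchRequest("daily_index-*");
        request.scroll(TimeValue.timeValueMinutes(3L));
        request.source(new SearchSourceBuilder()
                .query(QueryBuilders.matchAllQuery())
                .size(1000)); // documents per scroll batch

        try (PrintWriter out = new PrintWriter(new FileWriter("/tmp/es_data.json"))) {
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            String scrollId = response.getScrollId();
            SearchHit[] hits = response.getHits().getHits();
            while (hits != null && hits.length > 0) {
                for (SearchHit hit : hits) {
                    out.println(hit.getSourceAsString()); // one JSON document per line
                }
                SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
                scrollRequest.scroll(TimeValue.timeValueMinutes(3L));
                response = client.scroll(scrollRequest, RequestOptions.DEFAULT);
                scrollId = response.getScrollId();
                hits = response.getHits().getHits();
            }
            // Release the scroll context on the server when done.
            ClearScrollRequest clear = new ClearScrollRequest();
            clear.addScrollId(scrollId);
            client.clearScroll(clear, RequestOptions.DEFAULT);
        }
        client.close();
    }
}

Writing one JSON document per line is deliberate: it matches the single-string-column temp table used on the Hive side later in this article.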
I had been packaging from IDEA, but it kept failing with two errors: "no main manifest attribute" and "Error: Invalid or corrupt jarfile". I gave up on IDEA's packaging and switched to Maven plugins, which worked. Add the following plugins to the pom file; together they cover skipping unit tests, pinning the JDK version and encoding, attaching a source jar, and assembling the runnable fat jar:
<build>
    <plugins>
        <!-- Skip unit tests during packaging -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-surefire-plugin</artifactId>
            <version>2.4.2</version>
            <configuration>
                <skip>true</skip>
            </configuration>
        </plugin>
        <!-- Pin the JDK version and source encoding -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <!-- Attach a source jar -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-source-plugin</artifactId>
            <version>2.4</version>
            <executions>
                <execution>
                    <id>attach-sources</id>
                    <goals>
                        <goal>jar</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <!-- Assemble a runnable jar bundled with all dependencies -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <goals>
                        <goal>single</goal>
                    </goals>
                    <phase>package</phase>
                </execution>
            </executions>
            <configuration>
                <attach>false</attach>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>com.afanti.ScrollAll</mainClass> <!-- set this to your program's main class -->
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>
After adding the plugins, build with mvn clean package; the assembly plugin writes the runnable jar (named *-jar-with-dependencies.jar) under target/. Upload it to the server and run it with the java -jar command from the jar's directory.
The error I hit at this point was the ES cluster going down; after restarting the ES cluster, rerun the java -jar command.
The exported ES data files are then written under the Linux tmp/ path.
Once this step is done, follow my earlier articles in order:
1. Read full Elasticsearch data with Java and export it to a JSON file.
2. Build the jar with local dependencies and run java -jar to read full Elasticsearch data and export the JSON file.
3. Sync MONGO data incrementally to a HIVE partitioned table.
These three steps together achieve incremental sync of large ES data volumes into the hive warehouse.
Sample line from the exported ES JSON file:
{"serialVersionUID":1,"topic":"mw_20_108_test2","database":"dataworks","table":"global_val_test","type":"insert","ts":1585639455,"xid":6497354229,"commit":true,"data":"{\"update_time\":\"2020-01-09 06:46:54\",\"create_time\":\"2020-01-09 06:46:54\",\"name\
":\"updateMysqlMetadate\",\"id\":50,\"value\":\"10\"}"}
Hive DDL, first a temp table holding the raw JSON lines, then the partitioned target table:
CREATE TABLE IF NOT EXISTS TMP_mongo_binloghash.TMP_binloghash
( json_data string )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001';
CREATE EXTERNAL TABLE IF NOT EXISTS mongo_binloghash.binloghash
( serialVersionUID string ,
topic string ,
database_str string ,
table_str string ,
type string ,
ts string ,
xid string ,
commit_str string ,
update_time string,
create_time string,
name string,
id string,
value string
) PARTITIONED BY (pk_year string, pk_month string, pk_day string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS RCFILE;
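Before parsing, the exported JSON file has to be loaded into the temp table. A minimal sketch, assuming the export landed at /tmp/es_data.json on the machine running the Hive client (the path is a placeholder):
LOAD DATA LOCAL INPATH '/tmp/es_data.json' INTO TABLE TMP_mongo_binloghash.TMP_binloghash;
Because the '\001' delimiter never occurs in the JSON text, each line lands intact in the single json_data column.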
Parse the JSON in the temp table and write it incrementally, by partition, into the target table:
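One note before running the INSERT below: the partition values pk_year/pk_month/pk_day are computed from the data rather than given as constants, so dynamic partitioning must be enabled in the Hive session first:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;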
INSERT OVERWRITE TABLE mongo_binloghash.binloghash PARTITION (pk_year, pk_month, pk_day)
SELECT get_json_object(a.json_data,'$.serialVersionUID') AS serialVersionUID,
get_json_object(a.json_data,'$.topic') AS topic,
get_json_object(a.json_data,'$.database') AS database_str,
get_json_object(a.json_data,'$.table') AS table_str,
get_json_object(a.json_data,'$.type') AS type,
get_json_object(a.json_data,'$.ts') AS ts,
get_json_object(a.json_data,'$.xid') AS xid,
get_json_object(a.json_data,'$.commit') AS commit_str,
get_json_object((get_json_object(a.json_data,'$.data')),'$.update_time') AS update_time, -- the data field in the JSON sample is a nested JSON string, so it is unwrapped with a second get_json_object
get_json_object((get_json_object(a.json_data,'$.data')),'$.create_time') AS create_time,
get_json_object((get_json_object(a.json_data,'$.data')),'$.name') AS name,
get_json_object((get_json_object(a.json_data,'$.data')),'$.id') AS id,
get_json_object((get_json_object(a.json_data,'$.data')),'$.value') AS value,
substring(get_json_object((get_json_object(a.json_data,'$.data')),'$.create_time'),1,4) AS pk_year, -- after unwrapping data, its create_time is truncated to build the partition columns
substring(get_json_object((get_json_object(a.json_data,'$.data')),'$.create_time'),1,7) AS pk_month,
substring(get_json_object((get_json_object(a.json_data,'$.data')),'$.create_time'),1,10) AS pk_day
FROM TMP_mongo_binloghash.TMP_binloghash a;