0 Environment
Elasticsearch:6.4.2
Hadoop:2.7.6
Prepare the data in advance: create the corresponding index and type in ES, and index some documents.
1 Two approaches: hardcoding vs. configuration
- Approach 1: write code against the es-hadoop jar
Add elasticsearch-hadoop to the project.
- Add the jar dependencies:
```xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>es.hdfs</groupId>
  <artifactId>Es_Hdfs</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <!-- Match the versions of the target environment (Hadoop 2.7.6, ES 6.4.2) -->
    <hadoop.version>2.7.6</hadoop.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.elasticsearch</groupId>
      <artifactId>elasticsearch-hadoop</artifactId>
      <version>6.4.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-common</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-server-common</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-server-resourcemanager</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-server-nodemanager</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-server-applicationhistoryservice</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-shuffle</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
  </dependencies>
</project>
```
- Write the code
Create the Mapper class for the Elasticsearch-to-Hadoop migration:
```java
package package1;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.elasticsearch.hadoop.mr.LinkedMapWritable;

public class E2HMapper01 extends Mapper<Text, LinkedMapWritable, Text, LinkedMapWritable> {

    @Override
    protected void map(Text key, LinkedMapWritable value, Context context)
            throws IOException, InterruptedException {
        // key is the ES document _id; value holds the document's fields.
        // Pass both through unchanged.
        context.write(key, value);
    }
}
```
This Mapper is deliberately minimal: it does no processing of the data fetched from ES and simply writes it to the context. In the map method, the key is the value of the ES document's id, and the value is a LinkedMapWritable containing the content of that document. Here we output each document unchanged, but this is the place to transform or filter documents if you need to.
- Create the Job class for the Elasticsearch-to-Hadoop migration:
```java
package package1;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.elasticsearch.hadoop.mr.EsInputFormat;
import org.elasticsearch.hadoop.mr.LinkedMapWritable;

public class ES2HadoopJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Speculative execution would read and write the same data twice.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        // Where to find ES, and which index/type to read.
        conf.set("es.nodes", "192.168.115.73:9200");
        conf.set("es.resource", "tradeindex/tradeType");

        Job job = Job.getInstance(conf);
        job.setJarByClass(ES2HadoopJob.class);
        job.setInputFormatClass(EsInputFormat.class);
        job.setMapperClass(E2HMapper01.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LinkedMapWritable.class);
        // Map-only job: no reduce phase is needed.
        job.setNumReduceTasks(0);
        // Output path on HDFS, e.g. hdfs://192.168.115.73:9000/es2hdfs
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
1. There is no reducer: the data is passed straight through, so no reduce phase is needed.
2. The InputFormatClass is set to EsInputFormat; this is the class that converts the data read from ES into the mapper's input types (Text, LinkedMapWritable).
- Package and run
Package the code above into a jar, copy it to the Linux host, and launch the MapReduce job:
hadoop jar Es_Hdfs-0.0.1-SNAPSHOT.jar package1.ES2HadoopJob hdfs://xfraud:8020/estohdfs
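Once the job finishes, you can confirm that the documents landed on HDFS. A quick check (assuming the same output path passed to the job above):

```shell
# List the job output; a map-only job writes part-m-* files.
hadoop fs -ls hdfs://xfraud:8020/estohdfs
# Peek at the first few migrated documents.
hadoop fs -cat hdfs://xfraud:8020/estohdfs/part-m-00000 | head
```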
- Approach 2: the repository-hdfs plugin, via configuration
- Install the plugin
The plugin can be installed in two ways: online or offline.
- Online installation
In the elasticsearch directory, run: bin/elasticsearch-plugin install repository-hdfs
(to remove it: ./bin/elasticsearch-plugin remove repository-hdfs)
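To confirm the plugin was installed, you can list the installed plugins from the same directory:

```shell
# Prints one line per installed plugin; repository-hdfs should appear.
bin/elasticsearch-plugin list
```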
- Offline installation
The plugin also supports offline installation. Download it from [HDFS snapshot offline install](https://artifacts.elastic.co/downloads/elasticsearch-plugins/repository-hdfs/repository-hdfs-5.2.2.zip), picking the build that matches your Elasticsearch version.
Unzip it to get an elasticsearch folder, move that folder into the plugins directory under the ES installation, and rename it to repository-hdfs.
If the cluster has security enabled, adjust the plugin's security policy:
vim plugins/repository-hdfs/plugin-security.policy
and add the following:
```
permission java.lang.RuntimePermission "accessDeclaredMembers";
permission java.lang.RuntimePermission "getClassLoader";
permission java.lang.RuntimePermission "shutdownHooks";
permission java.lang.reflect.ReflectPermission "suppressAccessChecks";
permission javax.security.auth.AuthPermission "doAs";
permission javax.security.auth.AuthPermission "getSubject";
permission javax.security.auth.AuthPermission "modifyPrivateCredentials";
permission java.security.AllPermission;
permission java.util.PropertyPermission "*", "read,write";
permission javax.security.auth.PrivateCredentialPermission "org.apache.hadoop.security.Credentials * \"*\"", "read";
```
Then edit: vim conf/jvm.options
and add the following (note that the file: URL needs an absolute path):
-Djava.security.policy=file:///usr/local/elasticsearch-5.4.0/plugins/repository-hdfs/plugin-security.policy
Repeat the steps above on every node, then restart the cluster.
- Create the snapshot repository
curl -H "Content-Type: application/json" -XPUT "http://192.168.159.128:9200/_snapshot/trade_backup" -d '{"type":"hdfs","settings":{"path":"/user/ysl","uri":"hdfs://192.168.159.128:9000"}}'
Settings:
uri: the HDFS address, e.g. "hdfs://<host>:<port>/" (required)
path: the path inside HDFS where snapshot data is stored, e.g. "path/to/file" (required)
load_defaults: whether to load the default Hadoop configuration (enabled by default)
conf.<key>: inlined parameters added to the Hadoop configuration; only client-oriented settings from the Hadoop core and HDFS configuration files are recognized by the plugin
compress: whether to compress the snapshot metadata (disabled by default)
chunk_size: override the chunk size (disabled by default)
Note: to inspect the repository: curl -XGET http://192.168.159.128:9200/_snapshot/trade_backup?pretty
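Before taking snapshots, it is worth confirming that every node can actually write to the repository; the snapshot API provides a verify endpoint for this (same host and repository name as above):

```shell
# Asks each node to write a test file into the repository;
# the response lists the nodes for which verification succeeded.
curl -XPOST "http://192.168.159.128:9200/_snapshot/trade_backup/_verify?pretty"
```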
- Back up the indices
On top of the repository created above, snapshot the indices you need:
curl -H "Content-Type: application/json" -XPUT "http://192.168.159.128:9200/_snapshot/trade_backup/snapshot_2" -d "{\"indices\":\"myblog,tradeindex\"}"
Note: at this step you can already see the new backup files appear in HDFS.
Note: when you inspect a snapshot's indices, each shard reports one of the following states:
INITIALIZING: the shard is checking the cluster state to see whether it can be snapshotted. This is usually very fast.
STARTED: data is being transferred to the repository.
FINALIZING: the data transfer is complete; the shard is now sending the snapshot metadata.
DONE: the snapshot is complete.
FAILED: an error occurred during the snapshot; this shard/index/snapshot cannot complete. Check your logs for details.
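These per-shard states can be read from the snapshot status endpoint, using the snapshot created above:

```shell
# Per-shard progress and state for a running or finished snapshot.
curl -XGET "http://192.168.159.128:9200/_snapshot/trade_backup/snapshot_2/_status?pretty"
# Summary info (including the final state) for all snapshots in the repository.
curl -XGET "http://192.168.159.128:9200/_snapshot/trade_backup/_all?pretty"
```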
Scheduled backups
Use the crontab mechanism on Linux to run the backup automatically.
The backup script:
```shell
#!/bin/bash
current_time=$(date +%Y%m%d%H%M%S)
command_prefix="http://192.168.159.128:9200/_snapshot/trade_backup/snapshot_"
command=$command_prefix$current_time
curl -H "Content-Type: application/json" -XPUT "$command" -d "{\"indices\":\"myblog,tradeindex\"}"
```
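A slightly hardened sketch of the same script (same endpoint and indices assumed) that fails fast and records the HTTP status in the log, so that silent failures show up:

```shell
#!/bin/bash
set -euo pipefail

es_url="http://192.168.159.128:9200"
snapshot="snapshot_$(date +%Y%m%d%H%M%S)"   # e.g. snapshot_20240101120000

# -s silences the progress bar; -o discards the body; -w prints only the HTTP status.
status=$(curl -s -o /dev/null -w "%{http_code}" \
    -H "Content-Type: application/json" \
    -XPUT "${es_url}/_snapshot/trade_backup/${snapshot}" \
    -d '{"indices":"myblog,tradeindex"}')

echo "$(date '+%F %T') snapshot ${snapshot} -> HTTP ${status}"
```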
- crontab scheduling
Edit the crontab with: crontab -e
The entry below runs the script once a minute and appends its output to a log file (create the directory first if it does not exist, and mind the permissions):
*/1 * * * * sh /chunlai/snapshot_all_hdfs.sh >> /chunlai/snapshot_all_day.log 2>&1
With this in place, a snapshot is taken successfully every minute.
- Reference schedules
Every five minutes: */5 * * * *
Every hour: 0 * * * *
Every day: 0 0 * * *
Every week: 0 0 * * 0
Every month: 0 0 1 * *
Every year: 0 0 1 1 *
- Verify the backup
There are two ways:
- Run: hadoop fs -ls /user/ysl
- Browse the Hadoop web UI on port 50070
References:
https://blog.csdn.net/ysl1242157902/article/details/79219061
https://blog.csdn.net/u011937566/article/details/83652831
https://blog.csdn.net/m0_37895851/article/details/81206701