This is the third article in the series; for the remaining parts, please see my homepage.
This article requires Eclipse.
3. ETL Data Cleaning
(1) Connect Eclipse to Hadoop (the cluster itself is accessed via Xshell) and set up the Hadoop visualization view in Eclipse.
(2) Create an ETL MapReduce project in Eclipse.
(3) Write the two classes, NginxEtlMapper and NginxEtlDriver, in Eclipse.
NginxEtlMapper class code:
package ETL;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NginxEtlMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private Text outputKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the access-log line on spaces; in nginx's default
        // combined log format, field index 6 is the request path.
        String[] words = value.toString().split(" ");
        String path = words[6];
        outputKey.set(path);
        context.write(outputKey, NullWritable.get());
    }
}
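To see why index 6 yields the request path: with nginx's default combined log format, splitting a line on single spaces puts the path at index 6 (awk's 1-based $7). A quick check on an illustrative sample line (the line itself is made up for demonstration):

# Sample combined-format line; awk's $7 corresponds to Java's words[6]
echo '192.168.1.10 - - [13/Dec/2021:09:00:01 +0800] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0"' | awk '{print $7}'
# prints: /index.html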
NginxEtlDriver class code:
package ETL;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NginxEtlDriver {
    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("Usage: NginxEtlDriver <day> <hour>");
            return;
        }
        String day = args[0];
        String hour = args[1];
        Configuration conf = new Configuration();
        try {
            Job job = Job.getInstance(conf);
            job.setJobName("nginx-etl");
            job.setJarByClass(NginxEtlDriver.class);
            job.setMapperClass(NginxEtlMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(NullWritable.class);
            // Map-only job: no reducers are needed for this ETL step.
            job.setNumReduceTasks(0);
            Path inputPath = new Path("/web/log/" + day + "/" + hour);
            FileInputFormat.addInputPath(job, inputPath);
            Path outputPath = new Path("/web/log/etl/" + day + "/" + hour);
            // Delete the output directory if it already exists;
            // otherwise the job would fail on startup.
            FileSystem.get(conf).delete(outputPath, true);
            FileOutputFormat.setOutputPath(job, outputPath);
            job.waitForCompletion(true);
        } catch (IOException | InterruptedException | ClassNotFoundException e) {
            e.printStackTrace();
        }
    }
}
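After the job finishes, the cleaned paths land under /web/log/etl/<day>/<hour> on HDFS. As a quick sanity check (a sketch; the part-file name assumes the default naming for map-only output):

hdfs dfs -ls /web/log/etl/21-12-13/09
hdfs dfs -cat /web/log/etl/21-12-13/09/part-m-00000 | head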
(4) Package the project as a JAR and upload it to /home/hadoop on the Linux machine.
Packaging steps:
Right-click the package containing the two classes and choose Export.
Select "JAR file" (the entry whose icon looks like a jar) and click Next.
Click Browse to choose the output location, then click Finish.
Locate the exported JAR on disk and drag it into the Xshell window to upload it; once the transfer completes, the file appears under /home/hadoop.
(5) Run the JAR:
hadoop jar /home/hadoop/web_log/NginxETL.jar 21-12-13 09
(the first argument is the date in year-month-day form, the second is the hour)
(6) Write the web_log_etl.sh script; a sketch of its contents follows the command below.
Command:
vim web_log_etl.sh
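Since the figure showing the script contents is not reproduced here, the following is a minimal sketch of what web_log_etl.sh plausibly contains, based on the jar invocation in step (5). It assumes the JAR's manifest names the main class; if not, add ETL.NginxEtlDriver after the jar path:

#!/bin/bash
# web_log_etl.sh -- run the nginx ETL job for a given day and hour
# usage: sh web_log_etl.sh <day> <hour>   e.g. sh web_log_etl.sh 21-12-13 09
day=$1
hour=$2
hadoop jar /home/hadoop/web_log/NginxETL.jar "$day" "$hour"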
(7) Run the script with the following command:
sh web_log_etl.sh 21-12-13 09