MapReduce: Recruitment Data Cleaning (JSON Data)

Data and Requirements Analysis

Sample Data

  • JSON records
    (screenshot of a sample JSON record omitted)
    Field analysis: from left to right the fields are

id, company name, education requirement, employment type, job title, salary, publish date, end date, city code, company size, benefits, responsibilities, location, work experience

  • City data
    (screenshot of the city table omitted)

Each line holds a city id and the corresponding city name.

Requirements and Analysis

  1. Process the salary field and replace it with (max - min) / 2.
  2. Replace the city id with the city name.
  3. No field may be empty; if any field is empty, drop the whole record.
  • Analysis: use Fastjson to map each JSON record onto a Java object, and use a map-side join (caching the city table) to implement requirement 2; a small sketch of the salary transform follows this list.
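
For requirement 1, a minimal sketch of the salary transform, assuming the salary string looks like "10K-20K" (the format, class name, and values here are assumptions based on how the mapper code below slices the string):

public class SalarySketch {
    // Turns a salary range such as "10K-20K" into (max - min) / 2, dropping the unit suffix.
    public static int halfRange(String salary) {
        String[] bounds = salary.split("-");
        int min = Integer.parseInt(bounds[0].substring(0, bounds[0].length() - 1));
        int max = Integer.parseInt(bounds[1].substring(0, bounds[1].length() - 1));
        return (max - min) / 2;
    }

    public static void main(String[] args) {
        System.out.println(halfRange("10K-20K"));   // prints 5
    }
}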

Code Implementation

The Custom Object

  • Because the JSON records are mapped onto this object, the property names must match the JSON keys exactly; the class implements WritableComparable because Hadoop needs to serialize it (and use it as a key). A quick local check of the Fastjson mapping follows the class below.
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class Data implements WritableComparable<Data> {
    private int id;
    private String company_name;
    private String eduLevel_name;
    private String emplType;
    private String jobName;
    private String salary;
    private String createDate;
    private String endDate;
    private int city_code;
    private String companySize;
    private String welfare;
    private String responsibility;
    private String place;
    private String workingExp;
    private String city_name;

    @Override
    public int compareTo(Data o) {
        return 0;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(id);
        dataOutput.writeUTF(company_name);
        dataOutput.writeUTF(eduLevel_name);
        dataOutput.writeUTF(emplType);
        dataOutput.writeUTF(jobName);
        dataOutput.writeUTF(salary);
        dataOutput.writeUTF(createDate);
        dataOutput.writeUTF(endDate);
        dataOutput.writeInt(city_code);
        dataOutput.writeUTF(companySize);
        dataOutput.writeUTF(welfare);
        dataOutput.writeUTF(responsibility);
        dataOutput.writeUTF(place);
        dataOutput.writeUTF(workingExp);
        dataOutput.writeUTF(city_name);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        id = dataInput.readInt();
        company_name = dataInput.readUTF();
        eduLevel_name = dataInput.readUTF();
        emplType = dataInput.readUTF();
        jobName = dataInput.readUTF();
        salary = dataInput.readUTF();
        createDate = dataInput.readUTF();
        endDate = dataInput.readUTF();
        city_code = dataInput.readInt();
        companySize = dataInput.readUTF();
        welfare = dataInput.readUTF();
        responsibility = dataInput.readUTF();
        place = dataInput.readUTF();
        workingExp = dataInput.readUTF();
        city_name = dataInput.readUTF();
    }

    @Override
    public String toString() {
        return "Data{" +
                "id=" + id +
                ", company_name='" + company_name + '\'' +
                ", eduLevel_name='" + eduLevel_name + '\'' +
                ", emplType='" + emplType + '\'' +
                ", jobName='" + jobName + '\'' +
                ", salary='" + salary + '\'' +
                ", createDate='" + createDate + '\'' +
                ", endDate='" + endDate + '\'' +
                ", city_name='" + city_name + '\'' +
                ", companySize='" + companySize + '\'' +
                ", welfare='" + welfare + '\'' +
                ", responsibility='" + responsibility + '\'' +
                ", place='" + place + '\'' +
                ", workingExp='" + workingExp + '\'' +
                '}';
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getCompany_name() {
        return company_name;
    }

    public void setCompany_name(String company_name) {
        this.company_name = company_name;
    }

    public String getEduLevel_name() {
        return eduLevel_name;
    }

    public void setEduLevel_name(String eduLevel_name) {
        this.eduLevel_name = eduLevel_name;
    }

    public String getEmplType() {
        return emplType;
    }

    public void setEmplType(String emplType) {
        this.emplType = emplType;
    }

    public String getJobName() {
        return jobName;
    }

    public void setJobName(String jobName) {
        this.jobName = jobName;
    }

    public String getSalary() {
        return salary;
    }

    public void setSalary(String salary) {
        this.salary = salary;
    }

    public String getCreateDate() {
        return createDate;
    }

    public void setCreateDate(String createDate) {
        this.createDate = createDate;
    }

    public String getEndDate() {
        return endDate;
    }

    public void setEndDate(String endDate) {
        this.endDate = endDate;
    }

    public int getCity_code() {
        return city_code;
    }

    public void setCity_code(int city_code) {
        this.city_code = city_code;
    }

    public String getCompanySize() {
        return companySize;
    }

    public void setCompanySize(String companySize) {
        this.companySize = companySize;
    }

    public String getWelfare() {
        return welfare;
    }

    public void setWelfare(String welfare) {
        this.welfare = welfare;
    }

    public String getResponsibility() {
        return responsibility;
    }

    public void setResponsibility(String responsibility) {
        this.responsibility = responsibility;
    }

    public String getPlace() {
        return place;
    }

    public void setPlace(String place) {
        this.place = place;
    }

    public String getWorkingExp() {
        return workingExp;
    }

    public void setWorkingExp(String workingExp) {
        this.workingExp = workingExp;
    }

    public String getCity_name() {
        return city_name;
    }

    public void setCity_name(String city_name) {
        this.city_name = city_name;
    }
}
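
As a quick local check that the property names line up with the JSON keys, a record can be parsed with Fastjson; the record and class name below are hypothetical, and only a few of the fourteen keys are shown:

import com.alibaba.fastjson.JSON;

public class DataParseCheck {
    public static void main(String[] args) {
        // Hypothetical record: only the key names matter, and they must match the Data field names.
        String line = "{\"id\":1,\"company_name\":\"demo\",\"salary\":\"10K-20K\",\"city_code\":530}";
        Data data = JSON.parseObject(line, Data.class);
        System.out.println(data);   // toString() shows which fields were populated
    }
}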

Map Phase

  • In the Map phase, setup() reads the cached city table into memory so the city code can later be replaced with the city name. Each record is first loaded into a JSONObject so every field can be checked; only if no field is empty does processing continue. The salary string is then split and transformed as required.
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class MapTest extends Mapper<LongWritable, Text, Data, NullWritable> {
    private Data k = new Data();
    private Map<Integer, String> city = new HashMap<>();
    private int city_code;
    private int salary;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        URI[] uris = context.getCacheFiles();
        File file = new File(uris[0]);
        BufferedReader br = new BufferedReader(new FileReader(file));
        String line;
        while ((line = br.readLine()) != null) {
            city.put(Integer.parseInt(line.split(",")[0]), line.split(",")[1]);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Pull every field out as a string first so the empty-field check can run before full parsing.
        JSONObject jsonObject = JSONObject.parseObject(value.toString());
        String[] datas = new String[14];
        datas[0] = jsonObject.getString("id");
        datas[1] = jsonObject.getString("company_name");
        datas[2] = jsonObject.getString("eduLevel_name");
        datas[3] = jsonObject.getString("emplType");
        datas[4] = jsonObject.getString("jobName");
        datas[5] = jsonObject.getString("salary");
        datas[6] = jsonObject.getString("createDate");
        datas[7] = jsonObject.getString("endDate");
        datas[8] = jsonObject.getString("city_code");
        datas[9] = jsonObject.getString("companySize");
        datas[10] = jsonObject.getString("welfare");
        datas[11] = jsonObject.getString("responsibility");
        datas[12] = jsonObject.getString("place");
        datas[13] = jsonObject.getString("workingExp");
        // Requirement 3: drop the whole record if any field is missing or empty (check null before calling a method).
        for (String s : datas) {
            if (s == null || s.isEmpty()) {
                return;
            }
        }
        // Map the full record onto the Data object and replace the city code with the city name (requirement 2).
        k = JSON.parseObject(value.toString(), Data.class);
        city_code = k.getCity_code();
        k.setCity_name(city.get(city_code));
        // Requirement 1: a salary such as "10K-20K" (trailing unit assumed) becomes (max - min) / 2.
        String[] s = k.getSalary().split("-");
        int min = Integer.parseInt(s[0].substring(0, s[0].length() - 1));
        int max = Integer.parseInt(s[1].substring(0, s[1].length() - 1));
        salary = (max - min) / 2;
        k.setSalary(String.valueOf(salary));
        context.write(k,NullWritable.get());
    }
}

Reduce Phase

  • The reducer just loops over the grouped values and writes every record out. Since compareTo() in Data always returns 0, all records fall into a single reduce group, but the framework refreshes the key object for each value during iteration, so every record is still emitted.
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class RedTest extends Reducer<Data, NullWritable, Data, NullWritable> {
    @Override
    protected void reduce(Data key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        for (NullWritable v : values) {
            context.write(key, NullWritable.get());
        }
    }
}

Driver Phase

  • Note the path of the file to be cached, i.e. where the city table is located.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.File;
import java.net.URI;

public class DriTest {
    public static void main(String[] args) throws Exception {
        // Delete the output directory if it already exists, then run the job.
        File file = new File("D:\\MP\\招聘数据\\output");
        if (file.exists()) {
            delFile(file);
        }
        driver();
    }

    public static void delFile(File file) {
        File[] files = file.listFiles();
        if (files != null && files.length != 0) {
            for (int i = 0; i < files.length; i++) {
                delFile(files[i]);
            }
        }
        file.delete();
    }

    public static void driver() throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setMapperClass(MapTest.class);
        job.setJarByClass(DriTest.class);
        job.setReducerClass(RedTest.class);

        job.setMapOutputKeyClass(Data.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(Data.class);
        job.setOutputValueClass(NullWritable.class);

        job.addCacheFile(new URI("file:///D:/MP/招聘数据/input/com.txt"));

        FileInputFormat.setInputPaths(job, "D:\\MP\\招聘数据\\input\\data.json");
        FileOutputFormat.setOutputPath(job, new Path("D:\\MP\\招聘数据\\output"));
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}