Common InputFormat implementations in MapReduce

When we use MapReduce to process data, we run into many kinds of input, such as text files, binary data, and database tables. For the common formats, MapReduce ships a set of InputFormat implementation classes that handle reading each of them.

Pay attention to how each of these InputFormat implementations computes its input splits:

**TextInputFormat:** splits per file: a large file is divided into several splits, but even the smallest file still gets a split of its own.

**KeyValueTextInputFormat:** same splitting behavior as TextInputFormat.

**NLineInputFormat:** splits by line count: every N lines of input form one split.

**CombineTextInputFormat:** configure a maximum split size; small files are packed together into shared splits instead of each becoming its own split (see the sketch after this list).

**Custom InputFormat:** same splitting as TextInputFormat, only with user-defined key/value types.
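As a minimal driver-side sketch of the CombineTextInputFormat setup (the 4 MB threshold is an illustrative value, not from the original post):

```java
// Inside a driver, after Job job = Job.getInstance(conf):
// pack many small files into shared splits instead of one split per file
job.setInputFormatClass(CombineTextInputFormat.class);
// cap each combined split at 4 MB (illustrative value)
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);
```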

TextInputFormat

This is probably the most commonly used InputFormat; it is used to read plain text data.

It reads records line by line. The key is the byte offset of the line within the whole file, of type LongWritable. The value is the content of the line, not including any line terminator, of type Text.
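For instance, a two-line file produces the following (key, value) records, where the offsets assume one-byte line endings:

```
// input file
Rich learning form
Intelligent learning engine

// records produced by TextInputFormat
(0,  Rich learning form)
(19, Intelligent learning engine)
```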

KeyValueTextInputFormat

Each line is a single record, split by a separator into a key and a value. The separator can be customized in the driver class:

```java
conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, " ");
```

(Note: KEY_VALUE_SEPERATOR really is spelled this way in the Hadoop API.)

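The line is broken at the first occurrence of the separator. With the single-space separator above, for example:

```
banzhang ni hao   ->   key = "banzhang", value = "ni hao"
```

If a line contains no separator at all, the whole line becomes the key and the value is empty.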

NLineInputFormat

NLineInputFormat lets you specify the number of text lines handed to each map task: each InputSplit is no longer computed from the block size, but from the given line count N.

In other words, number of splits = total lines in the input file / N; if the division is not exact, number of splits = quotient + 1.
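For example, an 11-line input with N = 4 gives 11 / 4 = 2 with remainder 3, so the number of splits is 2 + 1 = 3: two splits of 4 lines and one split with the remaining 3 lines.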


The keys and values here are the same as the ones TextInputFormat produces.

KeyValueTextInputFormat example

Task: for each distinct first word, count how many lines of the input file start with that word.

Input data:

```
banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang
```

Expected output:

```
banzhang	2
xihuan	2
```

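Because KeyValueTextInputFormat with a space separator already delivers the first word of each line as the map input key (and the rest of the line as the value), the mapper only needs to emit (key, 1), and the reducer sums the counts for each key.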

Mapper

```java
package com.map.kv;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class KVTextMapper extends Mapper<Text, Text, Text, LongWritable> {

    // Reused output value: every occurrence counts as 1.
    private final LongWritable v = new LongWritable(1);

    @Override
    protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
        // KeyValueTextInputFormat already supplies the first word as the key,
        // so simply emit (first word, 1).
        context.write(key, v);
    }
}
```
Reducer

```java
package com.map.kv;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class KVTextReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    private final LongWritable v = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the 1s emitted by the mapper for this first word.
        long sum = 0L;
        for (LongWritable value : values) {
            sum += value.get();
        }

        v.set(sum);
        context.write(key, v);
    }
}
```
Driver

```java
package com.map.kv;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class KVTextDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Local test paths; adjust to your own environment.
        args = new String[]{"D:/mapreduceinput/input1", "D:/mapreduceoutput/output1"};

        Configuration conf = new Configuration();
        // Split each line at the first space: key = first word, value = rest of line.
        conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, " ");

        Job job = Job.getInstance(conf);

        job.setJarByClass(KVTextDriver.class);
        job.setMapperClass(KVTextMapper.class);
        job.setReducerClass(KVTextReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Use KeyValueTextInputFormat instead of the default TextInputFormat.
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
```
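Run against the four input lines above, the job should write the expected result to part-r-00000 in the output directory: banzhang 2 and xihuan 2.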

NLineInputFormat example

This example is an ordinary word count over text input, except that it uses NLineInputFormat so that every three lines of the input are assigned to one map task.

Mapper

```java
package com.atguigu.mapreduce.nline;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NLineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private Text k = new Text();
    private LongWritable v = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get the line
        String line = value.toString();

        // 2. Split it into words
        String[] splited = line.split(" ");

        // 3. Emit (word, 1) for each word
        for (int i = 0; i < splited.length; i++) {
            k.set(splited[i]);
            context.write(k, v);
        }
    }
}
```
Reducer

```java
package com.atguigu.mapreduce.nline;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class NLineReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    private LongWritable v = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long sum = 0L;

        // 1. Sum the counts for this word
        for (LongWritable value : values) {
            sum += value.get();
        }

        v.set(sum);

        // 2. Write out (word, total)
        context.write(key, v);
    }
}
```
Driver

```java
package com.atguigu.mapreduce.nline;

import java.io.IOException;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineDriver {

    public static void main(String[] args) throws IOException, URISyntaxException, ClassNotFoundException, InterruptedException {
        // Set the input and output paths to match your own machine.
        args = new String[]{"e:/input/inputword", "e:/output1"};

        // Get the job object
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // Put three lines (records) into each InputSplit
        NLineInputFormat.setNumLinesPerSplit(job, 3);

        // Use NLineInputFormat to split the input by line count
        job.setInputFormatClass(NLineInputFormat.class);

        // Set the jar location and wire up the mapper and reducer
        job.setJarByClass(NLineDriver.class);
        job.setMapperClass(NLineMapper.class);
        job.setReducerClass(NLineReducer.class);

        // Map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job
        job.waitForCompletion(true);
    }
}
```
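With setNumLinesPerSplit(job, 3), an input file of, say, 11 lines would be handled by 11 / 3 = 3 with remainder 2, i.e. 3 + 1 = 4 map tasks, consistent with the split formula given earlier.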