Hadoop MapReduce Getting-Started Program

truelove1030


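Requirements:

Suppose there are two tables, both stored as .txt files:
(1) The trade table: the file name contains "action". Each line is one record, the separator is the double-quote character ("), and the format is:
        ProductID1"TradeID1
        ProductID2"TradeID2
        ……
        ProductIDn"TradeIDn
    Here the trade ID is the ID of the item that the product belongs to.
(2) The pay table: the file name contains "alipay". Each line is one record, the separator is the double-quote character ("), and the format is:
        ProductID1"PayID1
        ProductID2"PayID2
        ……
        ProductIDn"PayIDn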
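
To make the record format concrete, here is a minimal sketch of splitting one line on the double-quote separator. The class and method names (RecordParser, parse) are hypothetical and only illustrate the format; they are not part of the job itself.

```java
import java.util.Optional;

/** Illustrative helper: parses one line of either table. Names are hypothetical. */
final class RecordParser {
    /** The field separator is the double-quote character. */
    static final String SEPARATOR = "\"";

    /** Returns {productId, otherId}, or empty when the line has fewer than two fields. */
    static Optional<String[]> parse(String line) {
        String[] fields = line.split(SEPARATOR);
        if (fields.length < 2) {
            return Optional.empty(); // malformed line: nothing to join on
        }
        return Optional.of(new String[] {fields[0], fields[1]});
    }

    public static void main(String[] args) {
        // A well-formed line yields both fields.
        parse("P1\"T1").ifPresent(f -> System.out.println(f[0] + " -> " + f[1]));
        // A line without the separator is rejected.
        System.out.println(parse("malformed").isPresent());
    }
}
```

Returning an empty Optional for short lines mirrors the "skip bad records" behaviour that a join needs, since a record without both fields cannot contribute to a pair.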
Goal:

Join the two tables on matching product IDs to produce new key/value pairs:
        <TradeID1, PayID1>
        <TradeID2, PayID2>
        ……
        <TradeIDn, PayIDn>

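Solution approach:

Explanation:

Both tables contain the product ID, and it is the primary key in each, so all we need to do is match up the trade ID and the pay ID that belong to the same product ID. The product ID is the join key that links the trade ID to the pay ID.
At its core, MapReduce distributes data in the map phase and, in the reduce phase, collects all values that share the same key. For this problem, the map phase can therefore emit every record from both tables keyed by product ID, wrapping the trade ID or the pay ID (tagged with the table it came from) as the map output value. In the reduce phase, each product ID then arrives together with its trade ID and its pay ID.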
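
The idea can be tried out without a cluster. The following is a minimal in-memory sketch of the same flow in plain Java, not Hadoop code: a map groups the tagged values by product ID (standing in for the shuffle), and a loop over each group plays the role of the reducer. The class name InMemoryJoin and its methods are hypothetical; the "supid" and "buyid" tags mirror the prefixes used in the article's Mapper and Reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of the reduce-side join; all names are illustrative. */
final class InMemoryJoin {
    /** "Map" step: emit (productId, tag + '"' + otherId) and group by key, as the shuffle would. */
    static void mapAndGroup(String tag, List<String> lines, Map<String, List<String>> grouped) {
        for (String line : lines) {
            String[] f = line.split("\"");
            if (f.length < 2) {
                continue; // skip malformed lines
            }
            grouped.computeIfAbsent(f[0], k -> new ArrayList<>()).add(tag + "\"" + f[1]);
        }
    }

    /** "Reduce" step: for each product ID, pair the trade ID with the pay ID when both exist. */
    static List<String[]> reduce(Map<String, List<String>> grouped) {
        List<String[]> joined = new ArrayList<>();
        for (List<String> values : grouped.values()) {
            String tradeId = null;
            String payId = null;
            for (String v : values) {
                String id = v.substring(v.indexOf('"') + 1); // strip the tag
                if (v.startsWith("supid")) {
                    tradeId = id;
                } else if (v.startsWith("buyid")) {
                    payId = id;
                }
            }
            if (tradeId != null && payId != null) {
                joined.add(new String[] {tradeId, payId});
            }
        }
        return joined;
    }

    static List<String[]> join(List<String> tradeLines, List<String> payLines) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        mapAndGroup("supid", tradeLines, grouped);
        mapAndGroup("buyid", payLines, grouped);
        return reduce(grouped);
    }

    public static void main(String[] args) {
        List<String> trade = Arrays.asList("P1\"T1", "P2\"T2", "P3\"T3");
        List<String> pay = Arrays.asList("P1\"A1", "P3\"A3");
        for (String[] pair : join(trade, pay)) {
            System.out.println("<" + pair[0] + ", " + pair[1] + ">");
        }
    }
}
```

With the sample data in main, P2 has no pay record, so only <T1, A1> and <T3, A3> are printed. This is the inner-join behaviour: a product ID that appears in only one table produces no output.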

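Implementation:

(1) The Mapper:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PreMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
	public void map(LongWritable key, Text val,
			OutputCollector<Text, Text> output, Reporter reporter)
			throws IOException {
		Text kr = new Text();
		Text kv = new Text();
		// Get the path of the input file this record came from
		String path = ((FileSplit) reporter.getInputSplit()).getPath()
				.toString();
		String[] line = val.toString().split("\"");
		if (line.length < 2) { // skip malformed records
			return;
		}
		String productID = line[0];
		kr.set(productID); // the key is the product ID
		if (path.indexOf("action") >= 0) { // record comes from the trade table
			String tradeID = line[1];
			kv.set("supid" + "\"" + tradeID); // tag the trade ID with the "supid" prefix
		} else if (path.indexOf("alipay") >= 0) { // record comes from the pay table
			String payID = line[1];
			kv.set("buyid" + "\"" + payID); // tag the pay ID with the "buyid" prefix
		} else {
			return; // unknown source file: emit nothing rather than an empty value
		}
		output.collect(kr, kv);
	}
}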

(2) The Reducer:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CommonReduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
	public void reduce(Text key, Iterator<Text> values,
			OutputCollector<Text, Text> output, Reporter reporter)
			throws IOException {
		String spuid = "";
		String buyerid = "";
		// The key is the product ID; iterate over all values that share it
		while (values.hasNext()) {
			String value = values.next().toString();
			int index = value.indexOf('"');
			if (index == -1) // skip malformed records
				continue;
			String subValue = value.substring(index + 1);
			if (value.startsWith("supid")) { // tagged as a trade ID
				spuid = subValue;
			} else if (value.startsWith("buyid")) { // tagged as a pay ID
				buyerid = subValue;
			}
		}
		// Emit the pair only when both the trade ID and the pay ID are present
		if (!spuid.isEmpty() && !buyerid.isEmpty()) {
			output.collect(new Text(spuid), new Text(buyerid));
		}
	}
}

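(3) The driver class:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class Join {
	public static void main(String[] args) throws IOException {
		if (args.length < 3) { // print usage when arguments are missing
			System.err
					.println("Usage: Join <tradeTableDir> <payTableDir> <output>");
			System.exit(-1);
		}
		String tradeTableDir = args[0];
		String payTableDir = args[1];
		String joinTableDir = args[2];
		// The original snippet used conf without declaring it; create the job configuration here
		JobConf conf = new JobConf(Join.class);
		conf.setJobName("join two tables");
		// Inputs: the trade table and the pay table
		FileInputFormat.addInputPath(conf, new Path(tradeTableDir));
		FileInputFormat.addInputPath(conf, new Path(payTableDir));
		conf.setInputFormat(TextInputFormat.class); // input file format
		conf.setMapperClass(PreMapper.class); // the Mapper
		conf.setOutputKeyClass(Text.class); // output key type
		conf.setOutputValueClass(Text.class); // output value type
		conf.setReducerClass(CommonReduce.class); // the Reducer
		conf.setOutputFormat(TextOutputFormat.class); // output file format
		// Output directory
		FileOutputFormat.setOutputPath(conf, new Path(joinTableDir));
		JobClient.runJob(conf); // run the job
	}
}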

Packaging:

Select the project ---> right-click ---> Export

Then click Finish and ignore any warnings.

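How to run

1. Put your jar on a machine with a Hadoop environment, in an ordinary local path rather than an HDFS directory.

2. Create the two input files for HDFS, one named action and the other named alipay.

3. Copy the two files into an HDFS directory.

4. Run the command.

5. You should see a part-00000 file under the output path you specified, which shows that the job succeeded.

6. If you are really not sure how to run it, search Baidu; plenty of places explain it, and I am about to leave work...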
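
To check the result, each line of part-00000 can be split back into a pair. The sketch below assumes the default behaviour of TextOutputFormat, which writes the key and the value separated by a tab; the class name OutputLineParser is hypothetical, and the sample line is made-up illustration data rather than real job output.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

/** Illustrative helper for reading the job's text output; names are hypothetical. */
final class OutputLineParser {
    /** Splits "tradeId<TAB>payId" into a pair; returns null when the line has no tab. */
    static Map.Entry<String, String> parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return null; // not a key/value line
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        // Made-up sample line in the assumed key<TAB>value layout.
        Map.Entry<String, String> pair = parse("T1\tA1");
        System.out.println("tradeId=" + pair.getKey() + ", payId=" + pair.getValue());
    }
}
```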
