/**
 * Map task for the word-count job.
 *
 * <p>Map output is shipped across the network during the shuffle, so the
 * key/value types must be Hadoop {@code Writable} (serializable) types.
 *
 * <ul>
 *   <li>KEYIN:    byte offset of the line being read ({@code LongWritable})</li>
 *   <li>VALUEIN:  the line content ({@code Text})</li>
 *   <li>KEYOUT:   a single word ({@code Text})</li>
 *   <li>VALUEOUT: the count 1 for that occurrence ({@code IntWritable})</li>
 * </ul>
 *
 * @author 波波烤鸭 (dengpbs@163.com)
 */
public class MyMapperTask extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reuse the output objects across map() calls instead of allocating two
    // fresh objects per word — a standard MapReduce optimization that cuts
    // GC pressure on large inputs. Safe because context.write() serializes
    // the values immediately.
    private final Text outKey = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    /**
     * Invoked once per input line; emits {@code (word, 1)} for every
     * space-separated token on the line.
     *
     * @param key     byte offset of this line within the input split
     * @param value   the line content
     * @param context used to emit the (word, 1) pairs to the shuffle
     */
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Split on single spaces (same tokenization as the original code;
        // consecutive spaces would therefore yield empty tokens).
        String[] words = line.split(" ");
        for (String word : words) {
            // Emit (word, 1) so the shuffle groups identical words together
            // for the reduce phase.
            outKey.set(word);
            context.write(outKey, ONE);
        }
    }
}
==========================================================================
创建java类继承自Reducer父类。
| 参数 | 说明 |
| :-- | :-- |
| KEYIN | 对应的是map阶段的 KEYOUT |
| VALUEIN | 对应的是map阶段的 VALUEOUT |
| KEYOUT | reduce逻辑处理的输出Key类型 |
| VALUEOUT | reduce逻辑处理的输出Value类型 |
/**
 * Reduce task for the word-count job.
 *
 * <p>KEYIN/VALUEIN must match the map phase's KEYOUT/VALUEOUT
 * ({@code Text} / {@code IntWritable}).
 *
 * <ul>
 *   <li>KEYOUT:   the word ({@code Text})</li>
 *   <li>VALUEOUT: total number of occurrences of that word ({@code IntWritable})</li>
 * </ul>
 *
 * @author 波波烤鸭 (dengpbs@163.com)
 */
public class MyReducerTask extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Reused output writable — avoids allocating one IntWritable per key.
    private final IntWritable result = new IntWritable();

    /**
     * Invoked once per distinct word; sums all the 1s the mappers emitted
     * for that word.
     *
     * <p>BUG FIX: the original signature declared the raw type
     * {@code Iterable values}, which (a) fails to compile the for-each over
     * {@code IntWritable} and (b) does not override
     * {@code Reducer.reduce(Text, Iterable<IntWritable>, Context)}, so the
     * framework would never call it. The parameter must be
     * {@code Iterable<IntWritable>}.
     *
     * @param key     a word emitted by the map phase
     * @param values  every count emitted for this word
     * @param context used to emit the final (word, total) pair
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        // Accumulate every partial count for this word.
        for (IntWritable value : values) {
            count += value.get();
        }
        result.set(count);
        context.write(key, result);
    }
}
=====================================================================
package com.bobo.mr.wc;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
/**
 * Driver class: configures and submits the word-count MapReduce job.
 */
public class WcTest {

    /**
     * Entry point. Exits with status 0 on job success, 1 on failure.
     *
     * @param args args[0] = HDFS input path, args[1] = HDFS output path
     *             (the output path must not already exist)
     * @throws Exception if job configuration or submission fails
     */
    public static void main(String[] args) throws Exception {
        // Fail fast with a usage hint instead of an ArrayIndexOutOfBoundsException.
        if (args.length < 2) {
            System.err.println("Usage: WcTest <input path> <output path>");
            System.exit(2);
        }
        // Load the cluster configuration (true = also load *-site.xml resources).
        Configuration conf = new Configuration(true);
        // Create the job handle.
        Job job = Job.getInstance(conf);
        // Let the framework locate the jar containing this class.
        job.setJarByClass(WcTest.class);
        // Wire up the map and reduce phases.
        job.setMapperClass(MyMapperTask.class);
        job.setReducerClass(MyReducerTask.class);
        // Intermediate (map output) key/value types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // BUG FIX: final (reduce output) key/value types were never set, so
        // they defaulted to LongWritable/Text and mismatched the reducer's
        // Text/IntWritable output, failing at write time.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input/output paths are supplied on the command line.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit, wait for completion, and propagate success/failure to the
        // shell so scripts can detect a failed run.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
==================================================================
maven打包为jar包
==================================================================
在HDFS系统中创建wordcount案例文件夹,并测试
hadoop fs -mkdir -p /hdfs/wordcount/input
hadoop fs -put a.txt b.txt /hdfs/wordcount/input/
执行程序测试
hadoop jar hadoop-demo-0.0.1-SNAPSHOT.jar com.bobo.mr.wc.WcTest /hdfs/wordcount/input /hdfs/wordcount/output/
执行成功
[root@hadoop-node01 ~]# hadoop jar hadoop-demo-0.0.1-SNAPSHOT.jar com.bobo.mr.wc.WcTest /hdfs/wordcount/input /hdfs/wordcount/output/
19/04/03 16:56:43 INFO client.RMProxy: Connecting to ResourceManager at hadoop-node01/192.168.88.61:8032
19/04/03 16:56:46 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/04/03 16:56:48 INFO input.FileInputFormat: Total input paths to process : 2
19/04/03 16:56:49 INFO mapreduce.JobSubmitter: number of splits:2
19/04/03 16:56:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1554281786018_0001
19/04/03 16:56:52 INFO impl.YarnClientImpl: Submitted application application_1554281786018_0001
19/04/03 16:56:53 INFO mapreduce.Job: The url to track the job: http://hadoop-node01:8088/proxy/application_1554281786018_0001/
19/04/03 16:56:53 INFO mapreduce.Job: Running job: job_1554281786018_0001
19/04/03 16:57:14 INFO mapreduce.Job: Job job_1554281786018_0001 running in uber mode : false
19/04/03 16:57:14 INFO mapreduce.Job: map 0% reduce 0%
19/04/03 16:57:38 INFO mapreduce.Job: map 100% reduce 0%
19/04/03 16:57:56 INFO mapreduce.Job: map 100% reduce 100%
19/04/03 16:57:57 INFO mapreduce.Job: Job job_1554281786018_0001 completed successfully
19/04/03 16:57:57 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=181
FILE: Number of bytes written=321388
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=325
HDFS: Number of bytes written=87
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=1
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=46511
Total time spent by all reduces in occupied slots (ms)=12763
Total time spent by all map tasks (ms)=46511
Total time spent by all reduce tasks (ms)=12763
Total vcore-milliseconds taken by all map tasks=46511
最后
按照上面的过程,4个月的时间刚刚好。当然Java的体系是很庞大的,还有很多更高级的技能需要掌握,但不要着急,这些完全可以放到以后工作中边用边学。
学习编程就是一个由混沌到有序的过程,所以你在学习过程中,如果一时碰到理解不了的知识点,大可不必沮丧,更不要气馁,这都是正常的不能再正常的事情了,不过是“人同此心,心同此理”的暂时而已。
“道路是曲折的,前途是光明的!”
time spent by all reduce tasks (ms)=12763
Total vcore-milliseconds taken by all map tasks=46511
最后
按照上面的过程,4个月的时间刚刚好。当然Java的体系是很庞大的,还有很多更高级的技能需要掌握,但不要着急,这些完全可以放到以后工作中边用边学。
学习编程就是一个由混沌到有序的过程,所以你在学习过程中,如果一时碰到理解不了的知识点,大可不必沮丧,更不要气馁,这都是正常的不能再正常的事情了,不过是“人同此心,心同此理”的暂时而已。
“道路是曲折的,前途是光明的!”
[外链图片转存中…(img-6nIKzwSf-1718808320239)]
[外链图片转存中…(img-4QfCA78S-1718808320240)]