Linux Word Count (WordCount) with a Code Example

_苏小白

Article link: https://blog.csdn.net/qq_36074043/article/details/78622862

WordCount

First, the command-line version:

WordCount (word count)

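1: Start Hadoop. The start-all.sh command starts HDFS (and YARN).

2: From the Hadoop installation directory, create a new directory in HDFS using the HDFS shell commands

cd /usr/local/hadoop-2.8.0 (change directory)

hdfs dfs -mkdir /input

3: hadoop fs -put LICENSE.txt /input (uploads the LICENSE.txt file from the Hadoop installation directory into the /input directory)

4: Use hadoop fs -ls /input to check that the file was uploaded to the /input directory.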

5: Run the following commands

cd /usr/local/hadoop-2.8.0/share/hadoop/mapreduce (change directory)

hadoop jar hadoop-mapreduce-examples-2.8.0.jar wordcount /input /output2 (run the word count)

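The result is shown in the figure below:

6: List the output directory with hadoop fs -ls /output2. The figure shows the final result files.

7: View the final result with hadoop fs -cat /output2/part-r-00000

If you see the state shown in the figure, WordCount is set up correctly.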
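
The result file printed in step 7 is plain text. With the default TextOutputFormat, each line holds a key and a value separated by a tab, so every line of part-r-00000 is a word, a tab, and its count. The following is a minimal sketch in plain Java (no Hadoop dependency) of how such lines could be read back into a map. The class name PartFileParser, the method parseCounts, and the sample lines are hypothetical and only serve as an illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartFileParser {

    // Parses lines of the form "<word>\t<count>" into a map that keeps
    // the order of the input lines. Lines without a tab are skipped.
    public static Map<String, Long> parseCounts(List<String> lines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            // The count is the last field, so split at the last tab.
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue;
            }
            String word = line.substring(0, tab);
            long count = Long.parseLong(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample lines made up for illustration, not real job output.
        List<String> sample = Arrays.asList("Hello\t2", "Me\t1", "World\t1");
        System.out.println(parseCounts(sample)); // prints {Hello=2, Me=1, World=1}
    }
}
```

In practice the lines could come from the output of hadoop fs -cat /output2/part-r-00000 saved to a local file.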

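Next comes the coding part.

We use Eclipse.

Create a Maven project. A plain Java project is enough; it does not need to be a web project.

First, edit the pom.xml file.

Add the following dependency nodes: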

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>

Code:

package cn.happy.Word;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCount {
	static final String INPUT_PATH = "hdfs://192.168.1.9:9000/input/LICENSE.txt";
	static final String OUTPUT_PATH = "hdfs://192.168.1.9:9000/output";

	// Mapper type parameters:
	// KEYIN    byte offset at which the line starts in the input file
	// VALUEIN  the text of the line, e.g. "Hello World"
	// KEYOUT   a word
	// VALUEOUT the count emitted for that word (always 1 here)
	// KEYOUT/VALUEOUT are the key-value types handed to the Reducer
	// once map() finishes, e.g. (Hello, 1) and (World, 1)
	static class MyMapper extends
			Mapper<LongWritable, Text, Text, LongWritable> {
		// key:   byte offset of the line
		// value: content of the line, e.g. "Hello World"
		@Override
		protected void map(LongWritable key, Text value,
				Mapper<LongWritable, Text, Text, LongWritable>.Context context)
				throws IOException, InterruptedException {
			// Convert the line to a String
			String str = value.toString();
			// Split the line into words on runs of whitespace
			String[] split = str.split("\\s+");
			for (String string : split) {
				// Leading whitespace produces an empty first token; skip it
				if (string.isEmpty()) {
					continue;
				}
				/*
				 * Emit one pair per word, e.g.
				 * Hello 1
				 * World 1
				 * Me 1
				 * Hello 1
				 */
				context.write(new Text(string), new LongWritable(1));
			}
		}
	}

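	// Reducer type parameters:
	// KEYIN    a word emitted by the mapper
	// VALUEIN  a count emitted for that word (1 per occurrence)
	// KEYOUT   the distinct word
	// VALUEOUT the total number of occurrences

	/*
	 * Pairs emitted by the mappers, e.g.
	 * Hello 1
	 * World 1
	 * Me 1
	 * Hello 1
	 * are grouped by key before reduce() is called, e.g. Hello -> [1, 1]
	 */
	static class MyReducer extends
			Reducer<Text, LongWritable, Text, LongWritable> {
		@Override
		protected void reduce(Text key, Iterable<LongWritable> values,
				Reducer<Text, LongWritable, Text, LongWritable>.Context ctx)
				throws IOException, InterruptedException {
			// Sum all counts received for this word
			long sum = 0;
			for (LongWritable value : values) {
				sum += value.get();
			}
			ctx.write(key, new LongWritable(sum));
		}
	}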

	public static void main(String[] args) throws Exception {
		// Local Hadoop directory, needed when launching the job from Windows
		System.setProperty("hadoop.home.dir", "E:\\Y2\\Y2\\Hadoop大数据\\hadoop-2.8.0");
		Configuration conf = new Configuration();
		final FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH), conf);
		final Path outPath = new Path(OUTPUT_PATH);
		// The job fails if the output directory already exists, so delete it first
		if (fileSystem.exists(outPath)) {
			fileSystem.delete(outPath, true);
		}
		// Job.getInstance replaces the deprecated Job constructor
		final Job job = Job.getInstance(conf, WordCount.class.getSimpleName());
		FileInputFormat.setInputPaths(job, new Path(INPUT_PATH)); // 1.1 where to read the input from

		job.setInputFormatClass(TextInputFormat.class); // how to parse the input: each line becomes a key-value pair
		job.setMapperClass(MyMapper.class); // 1.2 the custom map class
		job.setMapOutputKeyClass(Text.class); // map output <k,v> types; may be omitted if the <k3,v3> types match <k2,v2>
		job.setMapOutputValueClass(LongWritable.class);

		job.setPartitionerClass(HashPartitioner.class); // 1.3 partitioning
		job.setNumReduceTasks(1); // run a single reduce task
		// 1.4 TODO sorting and grouping
		// 1.5 TODO combining (combiner)
		job.setReducerClass(MyReducer.class); // 2.2 the custom reduce class
		job.setOutputKeyClass(Text.class); // reduce output types
		job.setOutputValueClass(LongWritable.class);
		FileOutputFormat.setOutputPath(job, outPath); // 2.3 where to write the output

		job.setOutputFormatClass(TextOutputFormat.class); // the class that formats the output files

		job.waitForCompletion(true); // submit the job and wait for it to finish, printing progress
	}
}
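
To make the data flow of the job above easier to follow, here is a minimal sketch in plain Java, with no Hadoop dependency, of the same steps: map each line to (word, 1) pairs, group the values by key in sorted order, and sum them in the reduce step. The class LocalWordCount and its methods are hypothetical names used only for illustration. The partition method mirrors the formula used by HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks, but it uses String.hashCode() in place of Text.hashCode(), so for more than one reducer the partition numbers are an assumption and may not match a real cluster.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {

    // Map step (1.2): emit (word, 1) for every non-empty whitespace-separated token.
    public static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1L));
            }
        }
        return out;
    }

    // Partition step (1.3): HashPartitioner-style formula on String.hashCode().
    public static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Sort and group step (1.4): collect the values of each key; TreeMap keeps keys sorted.
    public static Map<String, List<Long>> group(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step (2.2): sum all values received for one key.
    public static long reduce(Iterable<Long> values) {
        long total = 0;
        for (long value : values) {
            total += value;
        }
        return total;
    }

    // Runs map, group, and reduce over all lines in memory.
    public static Map<String, Long> wordCount(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> entry : group(pairs).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("Hello World", "Me Hello"))); // prints {Hello=2, Me=1, World=1}
        System.out.println(partition("Hello", 1)); // prints 0: with one reducer every key lands in partition 0
    }
}
```

With a single reduce task, as configured in the job, every key goes to partition 0 and the one output file contains all words in sorted key order.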

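We need to replace a piece of the underlying code with a copy we write ourselves, but its package name and code must be the same as in the original: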
