Prerequisites:
http://blog.csdn.net/wuxidemo/article/details/77115931
Set up the distributed deployment as described in the link above, and start it.
Developing a MapReduce program in IDEA:
1. Create a new Maven project and add the Hadoop-related dependencies to pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.yf</groupId>
    <artifactId>mapreduce</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <hadoop.version>3.0.0-alpha4</hadoop.version>
    </properties>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>
</project>
2. Write the MapReduce code:
package com.yf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.HashSet;

public class InvertedIndexMapReduce {

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        private Text documentId;
        private Text word = new Text();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Use the name of the input file as the document id for every record in this split
            String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
            documentId = new Text(filename);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit a (word, documentId) pair for every whitespace-separated token
            for (String token : StringUtils.split(value.toString())) {
                word.set(token);
                context.write(word, documentId);
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        private Text docIds = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Deduplicate document ids; Hadoop reuses the Text instance, so copy each value
            HashSet<Text> uniqueDocIds = new HashSet<Text>();
            for (Text docId : values) {
                uniqueDocIds.add(new Text(docId));
            }
            docIds.set(StringUtils.join(uniqueDocIds, ","));
            context.write(key, docIds);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InvertedIndexMapReduce.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        Path outputPath = new Path(args[1]);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outputPath);
        // Remove any stale output so the job does not fail on an existing directory
        outputPath.getFileSystem(conf).delete(outputPath, true);
        // Submit the job and block until it completes
        job.waitForCompletion(true);
    }
}
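Before deploying, the map/shuffle/reduce flow above can be sanity-checked locally in plain Java, with no Hadoop dependency. This is only a sketch of the same logic (tokenize, group by word, deduplicate document ids); the file contents are an assumption inferred from the sample output at the end of this post:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class InvertedIndexSketch {
    // Simulates: map emits (word, docId), shuffle groups by word, reduce dedupes ids.
    public static Map<String, Set<String>> buildIndex(Map<String, String> docs) {
        Map<String, Set<String>> index = new TreeMap<>(); // sorted keys, like the job output
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {        // "map" phase
                index.computeIfAbsent(word, k -> new LinkedHashSet<>())
                     .add(doc.getKey());                              // "reduce" dedupes
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("file1.txt", "cat sat mat"); // hypothetical contents
        docs.put("file2.txt", "cat sat dog"); // hypothetical contents
        buildIndex(docs).forEach((word, ids) ->
                System.out.println(word + "\t" + String.join(",", ids)));
    }
}
```

With those inputs this prints the same four lines as the cluster run shown later (cat, dog, mat, sat with their file lists).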
3. Compile and package into a jar
Run mvn package in the project root directory; the jar is produced under target/.
4. Run it in the Hadoop environment
Copy the jar to the CentOS machine running Hadoop.
Create a few txt files and upload them to the /input directory on HDFS, for example:
hadoop fs -put file1.txt /input
hadoop fs -put file2.txt /input
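The contents of file1.txt and file2.txt are not shown above; the lines below are hypothetical contents, inferred from the final output at the end of this post, that reproduce that result. Any whitespace-separated text files would work:

```shell
# Hypothetical sample inputs (one line per file, three words each)
echo "cat sat mat" > file1.txt
echo "cat sat dog" > file2.txt
```

These are the files uploaded with the hadoop fs -put commands above.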
Run: hadoop jar xxx.jar com.yf.InvertedIndexMapReduce /input/*.txt /output
The run log looks roughly like this:
2017-08-14 20:46:36,031 INFO client.RMProxy: Connecting to ResourceManager at hadoop.master/192.168.0.116:8040
2017-08-14 20:46:37,022 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2017-08-14 20:46:37,459 INFO input.FileInputFormat: Total input files to process : 2
2017-08-14 20:46:37,614 INFO mapreduce.JobSubmitter: number of splits:2
2017-08-14 20:46:37,829 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2017-08-14 20:46:38,164 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1502708248236_0013
2017-08-14 20:46:38,517 INFO impl.YarnClientImpl: Submitted application application_1502708248236_0013
2017-08-14 20:46:38,645 INFO mapreduce.Job: The url to track the job: http://hadoop.master:8088/proxy/application_1502708248236_0013/
2017-08-14 20:46:38,646 INFO mapreduce.Job: Running job: job_1502708248236_0013
2017-08-14 20:46:48,992 INFO mapreduce.Job: Job job_1502708248236_0013 running in uber mode : false
2017-08-14 20:46:48,995 INFO mapreduce.Job: map 0% reduce 0%
......
2017-08-14 20:47:09,551 INFO mapreduce.Job: map 100% reduce 0%
......
2017-08-14 20:47:34,000 INFO mapreduce.Job: map 100% reduce 100%
2017-08-14 20:47:34,026 INFO mapreduce.Job: Job job_1502708248236_0013 completed successfully
2017-08-14 20:47:34,338 INFO mapreduce.Job: Counters: 57
File System Counters
FILE: Number of bytes read=102
FILE: Number of bytes written=570254
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=224
HDFS: Number of bytes written=76
HDFS: Number of read operations=11
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Failed map tasks=4
Failed reduce tasks=1
Killed map tasks=1
Launched map tasks=6
Launched reduce tasks=2
Other local map tasks=4
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=48685
Total time spent by all reduces in occupied slots (ms)=17000
Total time spent by all map tasks (ms)=48685
Total time spent by all reduce tasks (ms)=17000
Total vcore-milliseconds taken by all map tasks=48685
Total vcore-milliseconds taken by all reduce tasks=17000
Total megabyte-milliseconds taken by all map tasks=49853440
Total megabyte-milliseconds taken by all reduce tasks=17408000
Map-Reduce Framework
Map input records=2
Map output records=6
Map output bytes=84
Map output materialized bytes=108
Input split bytes=200
Combine input records=0
Combine output records=0
Reduce input groups=4
Reduce shuffle bytes=108
Reduce input records=6
Reduce output records=4
Spilled Records=12
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=389
CPU time spent (ms)=2260
Physical memory (bytes) snapshot=564514816
Virtual memory (bytes) snapshot=7572033536
Total committed heap usage (bytes)=301146112
Peak Map Physical memory (bytes)=224354304
Peak Map Virtual memory (bytes)=2521751552
Peak Reduce Physical memory (bytes)=117166080
Peak Reduce Virtual memory (bytes)=2528530432
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=24
File Output Format Counters
Bytes Written=76
Now inspect the output directory on HDFS; the results are as follows:
[glsc@hadoop java_app]$ hadoop fs -ls /output
Found 2 items
-rw-r--r-- 2 glsc supergroup 0 2017-08-14 20:47 /output/_SUCCESS
-rw-r--r-- 2 glsc supergroup 76 2017-08-14 20:47 /output/part-r-00000
[glsc@hadoop java_app]$ hadoop fs -cat /output/part-r-00000
cat file1.txt,file2.txt
dog file2.txt
mat file1.txt
sat file1.txt,file2.txt
This example builds an inverted index, and the output above shows it ran successfully.
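One loose end: the log above warns "Implement the Tool interface and execute your application with ToolRunner". The warning can be addressed by wrapping the driver as below. This is an untested sketch against the same Hadoop APIs used earlier (the class name InvertedIndexDriver is an invention of this sketch); ToolRunner parses generic options such as -D, -files, and -libjars before run() sees the remaining arguments:

```java
package com.yf;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Tool-based driver: reuses the Map and Reduce classes defined above.
public class InvertedIndexDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() carries any -D overrides parsed by ToolRunner
        Job job = Job.getInstance(getConf());
        job.setJarByClass(InvertedIndexMapReduce.class);
        job.setMapperClass(InvertedIndexMapReduce.Map.class);
        job.setReducerClass(InvertedIndexMapReduce.Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        Path outputPath = new Path(args[1]);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outputPath);
        outputPath.getFileSystem(getConf()).delete(outputPath, true);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new InvertedIndexDriver(), args));
    }
}
```

It would be launched the same way, e.g. hadoop jar xxx.jar com.yf.InvertedIndexDriver /input/*.txt /output, and the JobResourceUploader warning should disappear.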