MapReduce

1. Introduction to MapReduce

MapReduce is a key technology from Google. It is, first of all, a programming model for computing over very large data sets. The usual way to handle computations of that scale is parallel processing, and by simplifying the programming model MapReduce lowers the barrier to entry for writing parallel applications.

(1) The Mapper is responsible for "dividing": it breaks a complex task into a number of "simple tasks". "Simple" here means three things: first, the scale of the data or computation is greatly reduced compared with the original task; second, computation is moved close to the data, i.e. each task is assigned to the node that stores the data it needs; third, these small tasks can run in parallel with almost no dependencies between them.

(2) The Reducer is responsible for aggregating the results of the map phase. How many Reducers are needed depends on the specific problem; the user can set the parameter mapred.reduce.tasks in the mapred-site.xml configuration file, whose default value is 1.
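In Hadoop 2.x the same setting can also be made per job in code (the property mapreduce.job.reduces supersedes mapred.reduce.tasks). A minimal sketch, assuming it is placed in the WordcountDriver.main() shown later in this article, right after the Job object is created:

        // Ask for two reduce tasks for this job; this overrides the
        // cluster-wide default of 1 (mapred.reduce.tasks / mapreduce.job.reduces).
        job.setNumReduceTasks(2);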

A vivid way of explaining MapReduce:

We want to count all the books in the library. You count up shelf #1, I count up shelf #2. That's map. The more people we get, the faster it goes.

Now we get together and add our individual counts. That's reduce.

2. Running the MR WordCount in IntelliJ on Windows

1. Create a Maven project

2. Add the Maven dependencies

Add the dependencies to pom.xml. For Hadoop 2.7.3, the required jars are:

  • hadoop-common

  • hadoop-hdfs

  • hadoop-mapreduce-client-core

  • hadoop-mapreduce-client-jobclient

  • log4j (for logging)

    The dependencies in pom.xml are as follows:

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.3</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.3</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.7.3</version>
        </dependency>

        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
    </dependencies>

3. Configure log4j

Add a log4j configuration file named log4j.properties under src/main/resources with the following content:

log4j.rootLogger = debug,stdout

### Log output to the console ###
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern = [%-5p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%n%m%n

4. Start Hadoop

5. Run WordCount (reading the input file locally)

Create an input folder in the project root, add dream.txt inside it, and write in a few words:

I have a  dream
a dream

Create a package under src/main/java and add FileUtil.java, a helper that deletes the output directory so it no longer has to be removed by hand before every run. Contents:

package com.mrtest.hadoop;

import java.io.File;

/**
 * Created by bee on 3/25/17.
 */
public class FileUtil {

    public static boolean deleteDir(String path) {
        File dir = new File(path);
        if (dir.exists()) {
            for (File f : dir.listFiles()) {
                if (f.isDirectory()) {
                    deleteDir(f.getAbsolutePath()); // recurse with the full path, not just the name
                } else {
                    f.delete();
                }
            }
            dir.delete();
            return true;
        } else {
            System.out.println("File or directory does not exist!");
            return false;
        }
    }

}
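A possible use of this helper (assuming the local output directory is simply named output, as in the run configuration described later): call it at the start of the driver's main() before the job is submitted.

        // Remove any previous local output directory; FileOutputFormat refuses
        // to write into a path that already exists.
        FileUtil.deleteDir("output");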

WordCount program implementation
(1) Write the map method

package com.neusoft;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * This class serves as a map task. The four generic type parameters in the class declaration mean:
 *
 * KEYIN:   by default, the starting byte offset of the line of text read by the MR framework, a Long;
 *          Hadoop has its own leaner serialization interface, so LongWritable is used instead of Long
 * VALUEIN: by default, the content of the line of text read by the MR framework, a String; likewise, Text is used
 * KEYOUT:  the key of the output produced by the user-defined logic, here a word; a String, so Text as above
 * VALUEOUT:the value of the output produced by the user-defined logic, here the word count; an Integer, so IntWritable as above
 */
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    /**
     * The business logic of the map phase lives in this overridden map() method;
     * the map task calls map() once for every line of input.
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // First convert the Text handed to us by the map task into a String
        String line = value.toString();
        // Split the line into words on whitespace
        String[] words = line.split("\\s+"); // \s+ also copes with consecutive spaces, as in the sample input

        // Emit each word as <word, 1>
        for (String word : words) {
            // Use the word as the key and 1 as the value, so the data can later be partitioned by word and identical words reach the same reduce task
            context.write(new Text(word), new IntWritable(1));
        }
    }

}

(2) Write the reduce method

package com.neusoft;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * As with the Mapper, four generic type parameters are declared in the class signature.
 * KEYIN, VALUEIN correspond to the Mapper's output types KEYOUT, VALUEOUT.
 * KEYOUT, VALUEOUT are the output types of the user-defined reduce logic; here keyOut is a single word and valueOut its total count.
 */
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
    /**
     * <angelababy,1><angelababy,1><angelababy,1><angelababy,1><angelababy,1>
     * <hello,1><hello,1><hello,1><hello,1><hello,1><hello,1>
     * <banana,1><banana,1><banana,1><banana,1><banana,1><banana,1>
     * The key parameter is the key shared by one group of kv pairs for the same word.
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int count=0;
        for(IntWritable value : values){
            count += value.get();
        }

        context.write(key, new IntWritable(count));     // emit how many times each word occurred
    }

}

Explanation:
In the map() method each word is written out as the key with 1 as the value; for the sample input, the first line produces <I,1>, <have,1>, <a,1>, <dream,1>. The driver class that wires the job together and submits it is shown below:

package com.neusoft;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Acts as a client of the YARN cluster:
 * it packages the runtime parameters of our MR program, specifies the jar,
 * and finally submits everything to YARN.
 */
public class WordcountDriver {

    public static void main(String[] args) throws Exception {
        System.setProperty("hadoop.home.dir", "e:/hadoop-2.8.3");
        if (args == null || args.length == 0) {
            return;
        }

        // This object reads the Hadoop configuration from the environment by default; it can also be overridden via set()
        Configuration conf = new Configuration();

        // A Job is the abstraction of a task in YARN.
        Job job = Job.getInstance(conf);

        /*job.setJar("/home/hadoop/wc.jar");*/
        // Locate the jar that contains this program
        job.setJarByClass(WordcountDriver.class);

        // Specify the Mapper/Reducer classes this job uses
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);

        // Specify the kv types of the mapper output; they must match the Mapper's generic type parameters
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Specify the kv types of the final output; these are also the Reducer's key/value types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Directory containing the job's raw input files
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Directory where the job writes its results
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job's configuration, plus the jar containing its classes, to YARN for execution
        /*job.submit();*/
        boolean res = job.waitForCompletion(true);
        System.exit(res?0:1);
    }
}

You can either hard-code a String array with the input and output paths inside main, or, as in the driver above, take them from main's args array and supply the paths at run time. In the latter case, before running WordCount, open the Run/Debug Configuration and set the input and output paths in Program arguments.
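A minimal sketch of the hard-coded variant (the otherArgs name and the local input/output paths are only examples):

        // Hypothetical hard-coded paths instead of command-line arguments.
        String[] otherArgs = new String[]{"input", "output"};
        FileInputFormat.setInputPaths(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));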

6. Run WordCount (reading the input from HDFS)

Create a directory on HDFS:

hadoop fs -mkdir /worddir

If the NameNode is in safe mode, creating the directory fails with a message like:

mkdir: Cannot create directory /worddir. Name node is in safe mode.

Run the following command to leave safe mode:

hadoop dfsadmin -safemode leave

Upload the local file:

hadoop fs -put dream.txt /worddir

Change the otherArgs parameter so that the input points to the file's path on HDFS:

String[] otherArgs = new String[]{"hdfs://localhost:8020/worddir/dream.txt","output"};

7. Fixing org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z

This error can appear on Windows 7. The workaround: create a package named org.apache.hadoop.io.nativeio in your module, create a class named NativeIO inside it, and paste in the following code.

/**
 * Licensed to the Apache Software Foundation (ASF) under one 
 * or more contributor license agreements.  See the NOTICE file 
 * distributed with this work for additional information 
 * regarding copyright ownership.  The ASF licenses this file 
 * to you under the Apache License, Version 2.0 (the 
 * "License"); you may not use this file except in compliance 
 * with the License.  You may obtain a copy of the License at 
 *
 *     http://www.apache.org/licenses/LICENSE-2.0 
 *
 * Unless required by applicable law or agreed to in writing, software 
 * distributed under the License is distributed on an "AS IS" BASIS, 
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
 * See the License for the specific language governing permissions and 
 * limitations under the License. 
 */
package org.apache.hadoop.io.nativeio;

import java.io.File;
import java.io.FileDescriptor;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CommonConfigurationKeys;
import org.apache.hadoop.fs.HardLink;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SecureIOUtils.AlreadyExistsException;
import org.apache.hadoop.util.NativeCodeLoader;
import org.apache.hadoop.util.Shell;
import org.apache.hadoop.util.PerformanceAdvisory;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import sun.misc.Unsafe;

import com.google.common.annotations.VisibleForTesting;

/**
 * JNI wrappers for various native IO-related calls not available in Java. 
 * These functions should generally be used alongside a fallback to another 
 * more portable mechanism. 
 */
@InterfaceAudience.Private
@InterfaceStability.Unstable
public class NativeIO {
    public static class POSIX {
        // Flags for open() call from bits/fcntl.h
        public static final int O_RDONLY   =    00;
        public static final int O_WRONLY   =    01;
        public static final int O_RDWR     =    02;
        public static final int O_CREAT    =  0100;
        public static final int O_EXCL     =  0200;
        public static final int O_NOCTTY   =  0400;
        public static final int O_TRUNC    = 01000;
        public static final int O_APPEND   = 02000;
        public static final int O_NONBLOCK = 04000;
        public static final int O_SYNC   =  010000;
        public static final int O_ASYNC  =  020000;
        public static final int O_FSYNC = O_SYNC;
        public static final int O_NDELAY = O_NONBLOCK;

        // Flags for posix_fadvise() from bits/fcntl.h
    /* No further special treatment.  */
        public static final int POSIX_FADV_NORMAL = 0;
        /* Expect random page references.  */
        public static final int POSIX_FADV_RANDOM = 1;
        /* Expect sequential page references.  */
        public static final int POSIX_FADV_SEQUENTIAL = 2;
        /* Will need these pages.  */
        public static final int POSIX_FADV_WILLNEED = 3;
        /* Don't need these pages.  */
        public static final int POSIX_FADV_DONTNEED = 4;
        /* Data will be accessed once.  */
        public static final int POSIX_FADV_NOREUSE = 5;


        /* Wait upon writeout of all pages
           in the range before performing the
           write.  */
        public static final int SYNC_FILE_RANGE_WAIT_BEFORE = 1;
        /* Initiate writeout of all those
           dirty pages in the range which are
           not presently under writeback.  */
        public static final int SYNC_FILE_RANGE_WRITE = 2;

        /* Wait upon writeout of all pages in
           the range after performing the
           write.  */
        public static final int SYNC_FILE_RANGE_WAIT_AFTER = 4;

        private static final Log LOG = LogFactory.getLog(NativeIO.class);

        private static boolean nativeLoaded = false;
        private static boolean fadvisePossible = true;
        private static boolean syncFileRangePossible = true;

        static final String WORKAROUND_NON_THREADSAFE_CALLS_KEY =
                "hadoop.workaround.non.threadsafe.getpwuid";
        static final boolean WORKAROUND_NON_THREADSAFE_CALLS_DEFAULT = true;

        private static long cacheTimeout = -1;

        private static CacheManipulator cacheManipulator = new CacheManipulator();

        public static CacheManipulator getCacheManipulator() {
            return cacheManipulator;
        }

        public static void setCacheManipulator(CacheManipulator cacheManipulator) {
            POSIX.cacheManipulator = cacheManipulator;
        }

        /**
         * Used to manipulate the operating system cache.
         */
        @VisibleForTesting
        public static class CacheManipulator {
            public void mlock(String identifier, ByteBuffer buffer,
                              long len) throws IOException {
                POSIX.mlock(buffer, len);
            }

            public long getMemlockLimit() {
                return NativeIO.getMemlockLimit();
            }

            public long getOperatingSystemPageSize() {
                return NativeIO.getOperatingSystemPageSize();
            }

            public void posixFadviseIfPossible(String identifier,
                                               FileDescriptor fd, long offset, long len, int flags)
                    throws NativeIOException {
                NativeIO.POSIX.posixFadviseIfPossible(identifier, fd, offset,
                        len, flags);
            }

            public boolean verifyCanMlock() {
                return NativeIO.isAvailable();
            }
        }

        /**
         * A CacheManipulator used for testing which does not actually call mlock.
         * This allows many tests to be run even when the operating system does not
         * allow mlock, or only allows limited mlocking.
         */
        @VisibleForTesting
        public static class NoMlockCacheManipulator extends CacheManipulator {
            public void mlock(String identifier, ByteBuffer buffer,
                              long len) throws IOException {
                LOG.info("mlocking " + identifier);
            }

            public long getMemlockLimit() {
                return 1125899906842624L;
            }

            public long getOperatingSystemPageSize() {
                return 4096;
            }

            public boolean verifyCanMlock() {
                return true;
            }
        }

        static {
            if (NativeCodeLoader.isNativeCodeLoaded()) {
                try {
                    Configuration conf = new Configuration();
                    workaroundNonThreadSafePasswdCalls = conf.getBoolean(
                            WORKAROUND_NON_THREADSAFE_CALLS_KEY,
                            WORKAROUND_NON_THREADSAFE_CALLS_DEFAULT);

                    initNative();
                    nativeLoaded = true;

                    cacheTimeout = conf.getLong(
                            CommonConfigurationKeys.HADOOP_SECURITY_UID_NAME_CACHE_TIMEOUT_KEY,
                            CommonConfigurationKeys.HADOOP_SECURITY_UID_NAME_CACHE_TIMEOUT_DEFAULT) *
                            1000;
                    LOG.debug("Initialized cache for IDs to User/Group mapping with a " +
                            " cache timeout of " + cacheTimeout/1000 + " seconds.");

                } catch (Throwable t) {
                    // This can happen if the user has an older version of libhadoop.so
                    // installed - in this case we can continue without native IO
                    // after warning
                    PerformanceAdvisory.LOG.debug("Unable to initialize NativeIO libraries", t);
                }
            }
        }

        /**
         * Return true if the JNI-based native IO extensions are available.
         */
        public static boolean isAvailable() {
            return NativeCodeLoader.isNativeCodeLoaded() && nativeLoaded;
        }

        private static void assertCodeLoaded() throws IOException {
            if (!isAvailable()) {
                throw new IOException("NativeIO was not loaded");
            }
        }

        /** Wrapper around open(2) */
        public static native FileDescriptor open(String path, int flags, int mode) throws IOException;
        /** Wrapper around fstat(2) */
        private static native Stat fstat(FileDescriptor fd) throws IOException;

        /** Native chmod implementation. On UNIX, it is a wrapper around chmod(2) */
        private static native void chmodImpl(String path, int mode) throws IOException;

        public static void chmod(String path, int mode) throws IOException {
            if (!Shell.WINDOWS) {
                chmodImpl(path, mode);
            } else {
                try {
                    chmodImpl(path, mode);
                } catch (NativeIOException nioe) {
                    if (nioe.getErrorCode() == 3) {
                        throw new NativeIOException("No such file or directory",
                                Errno.ENOENT);
                    } else {
                        LOG.warn(String.format("NativeIO.chmod error (%d): %s",
                                nioe.getErrorCode(), nioe.getMessage()));
                        throw new NativeIOException("Unknown error", Errno.UNKNOWN);
                    }
                }
            }
        }

        /** Wrapper around posix_fadvise(2) */
        static native void posix_fadvise(
                FileDescriptor fd, long offset, long len, int flags) throws NativeIOException;

        /** Wrapper around sync_file_range(2) */
        static native void sync_file_range(
                FileDescriptor fd, long offset, long nbytes, int flags) throws NativeIOException;

        /**
         * Call posix_fadvise on the given file descriptor. See the manpage
         * for this syscall for more information. On systems where this
         * call is not available, does nothing.
         *
         * @throws NativeIOException if there is an error with the syscall
         */
        static void posixFadviseIfPossible(String identifier,
                                           FileDescriptor fd, long offset, long len, int flags)
                throws NativeIOException {
            if (nativeLoaded && fadvisePossible) {
                try {
                    posix_fadvise(fd, offset, len, flags);
                } catch (UnsupportedOperationException uoe) {
                    fadvisePossible = false;
                } catch (UnsatisfiedLinkError ule) {
                    fadvisePossible = false;
                }
            }
        }

        /**
         * Call sync_file_range on the given file descriptor. See the manpage
         * for this syscall for more information. On systems where this
         * call is not available, does nothing.
         *
         * @throws NativeIOException if there is an error with the syscall
         */
        public static void syncFileRangeIfPossible(
                FileDescriptor fd, long offset, long nbytes, int flags)
                throws NativeIOException {
            if (nativeLoaded && syncFileRangePossible) {
                try {
                    sync_file_range(fd, offset, nbytes, flags);
                } catch (UnsupportedOperationException uoe) {
                    syncFileRangePossible = false;
                } catch (UnsatisfiedLinkError ule) {
                    syncFileRangePossible = false;
                }
            }
        }

        static native void mlock_native(
                ByteBuffer buffer, long len) throws NativeIOException;

        /**
         * Locks the provided direct ByteBuffer into memory, preventing it from
         * swapping out. After a buffer is locked, future accesses will not incur
         * a page fault.
         *
         * See the mlock(2) man page for more information.
         *
         * @throws NativeIOException
         */
        static void mlock(ByteBuffer buffer, long len)
                throws IOException {
            assertCodeLoaded();
            if (!buffer.isDirect()) {
                throw new IOException("Cannot mlock a non-direct ByteBuffer");
            }
            mlock_native(buffer, len);
        }

        /**
         * Unmaps the block from memory. See munmap(2).
         *
         * There isn't any portable way to unmap a memory region in Java.
         * So we use the sun.nio method here.
         * Note that unmapping a memory region could cause crashes if code
         * continues to reference the unmapped code.  However, if we don't
         * manually unmap the memory, we are dependent on the finalizer to
         * do it, and we have no idea when the finalizer will run.
         *
         * @param buffer    The buffer to unmap.
         */
        public static void munmap(MappedByteBuffer buffer) {
            if (buffer instanceof sun.nio.ch.DirectBuffer) {
                sun.misc.Cleaner cleaner =
                        ((sun.nio.ch.DirectBuffer)buffer).cleaner();
                cleaner.clean();
            }
        }

        /** Linux only methods used for getOwner() implementation */
        private static native long getUIDforFDOwnerforOwner(FileDescriptor fd) throws IOException;
        private static native String getUserName(long uid) throws IOException;

        /**
         * Result type of the fstat call
         */
        public static class Stat {
            private int ownerId, groupId;
            private String owner, group;
            private int mode;

            // Mode constants
            public static final int S_IFMT = 0170000;      /* type of file */
            public static final int   S_IFIFO  = 0010000;  /* named pipe (fifo) */
            public static final int   S_IFCHR  = 0020000;  /* character special */
            public static final int   S_IFDIR  = 0040000;  /* directory */
            public static final int   S_IFBLK  = 0060000;  /* block special */
            public static final int   S_IFREG  = 0100000;  /* regular */
            public static final int   S_IFLNK  = 0120000;  /* symbolic link */
            public static final int   S_IFSOCK = 0140000;  /* socket */
            public static final int   S_IFWHT  = 0160000;  /* whiteout */
            public static final int S_ISUID = 0004000;  /* set user id on execution */
            public static final int S_ISGID = 0002000;  /* set group id on execution */
            public static final int S_ISVTX = 0001000;  /* save swapped text even after use */
            public static final int S_IRUSR = 0000400;  /* read permission, owner */
            public static final int S_IWUSR = 0000200;  /* write permission, owner */
            public static final int S_IXUSR = 0000100;  /* execute/search permission, owner */

            Stat(int ownerId, int groupId, int mode) {
                this.ownerId = ownerId;
                this.groupId = groupId;
                this.mode = mode;
            }

            Stat(String owner, String group, int mode) {
                if (!Shell.WINDOWS) {
                    this.owner = owner;
                } else {
                    this.owner = stripDomain(owner);
                }
                if (!Shell.WINDOWS) {
                    this.group = group;
                } else {
                    this.group = stripDomain(group);
                }
                this.mode = mode;
            }

            @Override
            public String toString() {
                return "Stat(owner='" + owner + "', group='" + group + "'" +
                        ", mode=" + mode + ")";
            }

            public String getOwner() {
                return owner;
            }
            public String getGroup() {
                return group;
            }
            public int getMode() {
                return mode;
            }
        }

        /**
         * Returns the file stat for a file descriptor.
         *
         * @param fd file descriptor.
         * @return the file descriptor file stat.
         * @throws IOException thrown if there was an IO error while obtaining the file stat.
         */
        public static Stat getFstat(FileDescriptor fd) throws IOException {
            Stat stat = null;
            if (!Shell.WINDOWS) {
                stat = fstat(fd);
                stat.owner = getName(IdCache.USER, stat.ownerId);
                stat.group = getName(IdCache.GROUP, stat.groupId);
            } else {
                try {
                    stat = fstat(fd);
                } catch (NativeIOException nioe) {
                    if (nioe.getErrorCode() == 6) {
                        throw new NativeIOException("The handle is invalid.",
                                Errno.EBADF);
                    } else {
                        LOG.warn(String.format("NativeIO.getFstat error (%d): %s",
                                nioe.getErrorCode(), nioe.getMessage()));
                        throw new NativeIOException("Unknown error", Errno.UNKNOWN);
                    }
                }
            }
            return stat;
        }

        private static String getName(IdCache domain, int id) throws IOException {
            Map<Integer, CachedName> idNameCache = (domain == IdCache.USER)
                    ? USER_ID_NAME_CACHE : GROUP_ID_NAME_CACHE;
            String name;
            CachedName cachedName = idNameCache.get(id);
            long now = System.currentTimeMillis();
            if (cachedName != null && (cachedName.timestamp + cacheTimeout) > now) {
                name = cachedName.name;
            } else {
                name = (domain == IdCache.USER) ? getUserName(id) : getGroupName(id);
                if (LOG.isDebugEnabled()) {
                    String type = (domain == IdCache.USER) ? "UserName" : "GroupName";
                    LOG.debug("Got " + type + " " + name + " for ID " + id +
                            " from the native implementation");
                }
                cachedName = new CachedName(name, now);
                idNameCache.put(id, cachedName);
            }
            return name;
        }

        static native String getUserName(int uid) throws IOException;
        static native String getGroupName(int uid) throws IOException;

        private static class CachedName {
            final long timestamp;
            final String name;

            public CachedName(String name, long timestamp) {
                this.name = name;
                this.timestamp = timestamp;
            }
        }

        private static final Map<Integer, CachedName> USER_ID_NAME_CACHE =
                new ConcurrentHashMap<Integer, CachedName>();

        private static final Map<Integer, CachedName> GROUP_ID_NAME_CACHE =
                new ConcurrentHashMap<Integer, CachedName>();

        private enum IdCache { USER, GROUP }

        public final static int MMAP_PROT_READ = 0x1;
        public final static int MMAP_PROT_WRITE = 0x2;
        public final static int MMAP_PROT_EXEC = 0x4;

        public static native long mmap(FileDescriptor fd, int prot,
                                       boolean shared, long length) throws IOException;

        public static native void munmap(long addr, long length)
                throws IOException;
    }

    private static boolean workaroundNonThreadSafePasswdCalls = false;


    public static class Windows {
        // Flags for CreateFile() call on Windows
        public static final long GENERIC_READ = 0x80000000L;
        public static final long GENERIC_WRITE = 0x40000000L;

        public static final long FILE_SHARE_READ = 0x00000001L;
        public static final long FILE_SHARE_WRITE = 0x00000002L;
        public static final long FILE_SHARE_DELETE = 0x00000004L;

        public static final long CREATE_NEW = 1;
        public static final long CREATE_ALWAYS = 2;
        public static final long OPEN_EXISTING = 3;
        public static final long OPEN_ALWAYS = 4;
        public static final long TRUNCATE_EXISTING = 5;

        public static final long FILE_BEGIN = 0;
        public static final long FILE_CURRENT = 1;
        public static final long FILE_END = 2;

        public static final long FILE_ATTRIBUTE_NORMAL = 0x00000080L;

        /**
         * Create a directory with permissions set to the specified mode.  By setting
         * permissions at creation time, we avoid issues related to the user lacking
         * WRITE_DAC rights on subsequent chmod calls.  One example where this can
         * occur is writing to an SMB share where the user does not have Full Control
         * rights, and therefore WRITE_DAC is denied.
         *
         * @param path directory to create
         * @param mode permissions of new directory
         * @throws IOException if there is an I/O error
         */
        public static void createDirectoryWithMode(File path, int mode)
                throws IOException {
            createDirectoryWithMode0(path.getAbsolutePath(), mode);
        }

        /** Wrapper around CreateDirectory() on Windows */
        private static native void createDirectoryWithMode0(String path, int mode)
                throws NativeIOException;

        /** Wrapper around CreateFile() on Windows */
        public static native FileDescriptor createFile(String path,
                                                       long desiredAccess, long shareMode, long creationDisposition)
                throws IOException;

        /**
         * Create a file for write with permissions set to the specified mode.  By
         * setting permissions at creation time, we avoid issues related to the user
         * lacking WRITE_DAC rights on subsequent chmod calls.  One example where
         * this can occur is writing to an SMB share where the user does not have
         * Full Control rights, and therefore WRITE_DAC is denied.
         *
         * This method mimics the semantics implemented by the JDK in
         * {@link java.io.FileOutputStream}.  The file is opened for truncate or
         * append, the sharing mode allows other readers and writers, and paths
         * longer than MAX_PATH are supported.  (See io_util_md.c in the JDK.)
         *
         * @param path file to create
         * @param append if true, then open file for append
         * @param mode permissions of new directory
         * @return FileOutputStream of opened file
         * @throws IOException if there is an I/O error
         */
        public static FileOutputStream createFileOutputStreamWithMode(File path,
                                                                      boolean append, int mode) throws IOException {
            long desiredAccess = GENERIC_WRITE;
            long shareMode = FILE_SHARE_READ | FILE_SHARE_WRITE;
            long creationDisposition = append ? OPEN_ALWAYS : CREATE_ALWAYS;
            return new FileOutputStream(createFileWithMode0(path.getAbsolutePath(),
                    desiredAccess, shareMode, creationDisposition, mode));
        }

        /** Wrapper around CreateFile() with security descriptor on Windows */
        private static native FileDescriptor createFileWithMode0(String path,
                                                                 long desiredAccess, long shareMode, long creationDisposition, int mode)
                throws NativeIOException;

        /** Wrapper around SetFilePointer() on Windows */
        public static native long setFilePointer(FileDescriptor fd,
                                                 long distanceToMove, long moveMethod) throws IOException;

        /** Windows only methods used for getOwner() implementation */
        private static native String getOwner(FileDescriptor fd) throws IOException;

        /** Supported list of Windows access right flags */
        public static enum AccessRight {
            ACCESS_READ (0x0001),      // FILE_READ_DATA
            ACCESS_WRITE (0x0002),     // FILE_WRITE_DATA
            ACCESS_EXECUTE (0x0020);   // FILE_EXECUTE

            private final int accessRight;
            AccessRight(int access) {
                accessRight = access;
            }

            public int accessRight() {
                return accessRight;
            }
        };

        /** Windows only method used to check if the current process has requested
         *  access rights on the given path. */
        private static native boolean access0(String path, int requestedAccess);

        /**
         * Checks whether the current process has desired access rights on
         * the given path.
         *
         * Longer term this native function can be substituted with JDK7
         * function Files#isReadable, isWritable, isExecutable.
         *
         * @param path input path
         * @param desiredAccess ACCESS_READ, ACCESS_WRITE or ACCESS_EXECUTE
         * @return true if access is allowed
         * @throws IOException I/O exception on error
         */
        public static boolean access(String path, AccessRight desiredAccess)
                throws IOException {
            return true;
            // return access0(path, desiredAccess.accessRight());
        }

        /**
         * Extends both the minimum and maximum working set size of the current
         * process.  This method gets the current minimum and maximum working set
         * size, adds the requested amount to each and then sets the minimum and
         * maximum working set size to the new values.  Controlling the working set
         * size of the process also controls the amount of memory it can lock.
         *
         * @param delta amount to increment minimum and maximum working set size
         * @throws IOException for any error
         * @see POSIX#mlock(ByteBuffer, long)
         */
        public static native void extendWorkingSetSize(long delta) throws IOException;

        static {
            if (NativeCodeLoader.isNativeCodeLoaded()) {
                try {
                    initNative();
                    nativeLoaded = true;
                } catch (Throwable t) {
                    // This can happen if the user has an older version of libhadoop.so
                    // installed - in this case we can continue without native IO
                    // after warning
                    PerformanceAdvisory.LOG.debug("Unable to initialize NativeIO libraries", t);
                }
            }
        }
    }

    private static final Log LOG = LogFactory.getLog(NativeIO.class);

    private static boolean nativeLoaded = false;

    static {
        if (NativeCodeLoader.isNativeCodeLoaded()) {
            try {
                initNative();
                nativeLoaded = true;
            } catch (Throwable t) {
                // This can happen if the user has an older version of libhadoop.so
                // installed - in this case we can continue without native IO
                // after warning
                PerformanceAdvisory.LOG.debug("Unable to initialize NativeIO libraries", t);
            }
        }
    }

    /**
     * Return true if the JNI-based native IO extensions are available.
     */
    public static boolean isAvailable() {
        return NativeCodeLoader.isNativeCodeLoaded() && nativeLoaded;
    }

    /** Initialize the JNI method ID and class ID cache */
    private static native void initNative();

    /**
     * Get the maximum number of bytes that can be locked into memory at any
     * given point.
     *
     * @return 0 if no bytes can be locked into memory;
     *         Long.MAX_VALUE if there is no limit;
     *         The number of bytes that can be locked into memory otherwise.
     */
    static long getMemlockLimit() {
        return isAvailable() ? getMemlockLimit0() : 0;
    }

    private static native long getMemlockLimit0();

    /**
     * @return the operating system's page size.
     */
    static long getOperatingSystemPageSize() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe)f.get(null);
            return unsafe.pageSize();
        } catch (Throwable e) {
            LOG.warn("Unable to get operating system page size.  Guessing 4096.", e);
            return 4096;
        }
    }

    private static class CachedUid {
        final long timestamp;
        final String username;
        public CachedUid(String username, long timestamp) {
            this.timestamp = timestamp;
            this.username = username;
        }
    }
    private static final Map<Long, CachedUid> uidCache =
            new ConcurrentHashMap<Long, CachedUid>();
    private static long cacheTimeout;
    private static boolean initialized = false;

    /**
     * The Windows logon name has two part, NetBIOS domain name and
     * user account name, of the format DOMAIN\UserName. This method
     * will remove the domain part of the full logon name.
     *
     * @param name the full principal name containing the domain
     * @return name with domain removed
     */
    private static String stripDomain(String name) {
        int i = name.indexOf('\\');
        if (i != -1)
            name = name.substring(i + 1);
        return name;
    }

    public static String getOwner(FileDescriptor fd) throws IOException {
        ensureInitialized();
        if (Shell.WINDOWS) {
            String owner = Windows.getOwner(fd);
            owner = stripDomain(owner);
            return owner;
        } else {
            long uid = POSIX.getUIDforFDOwnerforOwner(fd);
            CachedUid cUid = uidCache.get(uid);
            long now = System.currentTimeMillis();
            if (cUid != null && (cUid.timestamp + cacheTimeout) > now) {
                return cUid.username;
            }
            String user = POSIX.getUserName(uid);
            LOG.info("Got UserName " + user + " for UID " + uid
                    + " from the native implementation");
            cUid = new CachedUid(user, now);
            uidCache.put(uid, cUid);
            return user;
        }
    }

    /**
     * Create a FileInputStream that shares delete permission on the
     * file opened, i.e. other process can delete the file the
     * FileInputStream is reading. Only Windows implementation uses
     * the native interface.
     */
    public static FileInputStream getShareDeleteFileInputStream(File f)
            throws IOException {
        if (!Shell.WINDOWS) {
            // On Linux the default FileInputStream shares delete permission
            // on the file opened.
            //
            return new FileInputStream(f);
        } else {
            // Use Windows native interface to create a FileInputStream that
            // shares delete permission on the file opened.
            //
            FileDescriptor fd = Windows.createFile(
                    f.getAbsolutePath(),
                    Windows.GENERIC_READ,
                    Windows.FILE_SHARE_READ |
                            Windows.FILE_SHARE_WRITE |
                            Windows.FILE_SHARE_DELETE,
                    Windows.OPEN_EXISTING);
            return new FileInputStream(fd);
        }
    }

    /**
     * Create a FileInputStream that shares delete permission on the
     * file opened at a given offset, i.e. other process can delete
     * the file the FileInputStream is reading. Only Windows implementation
     * uses the native interface.
     */
    public static FileInputStream getShareDeleteFileInputStream(File f, long seekOffset)
            throws IOException {
        if (!Shell.WINDOWS) {
            RandomAccessFile rf = new RandomAccessFile(f, "r");
            if (seekOffset > 0) {
                rf.seek(seekOffset);
            }
            return new FileInputStream(rf.getFD());
        } else {
            // Use Windows native interface to create a FileInputStream that
            // shares delete permission on the file opened, and set it to the
            // given offset.
            //
            FileDescriptor fd = NativeIO.Windows.createFile(
                    f.getAbsolutePath(),
                    NativeIO.Windows.GENERIC_READ,
                    NativeIO.Windows.FILE_SHARE_READ |
                            NativeIO.Windows.FILE_SHARE_WRITE |
                            NativeIO.Windows.FILE_SHARE_DELETE,
                    NativeIO.Windows.OPEN_EXISTING);
            if (seekOffset > 0)
                NativeIO.Windows.setFilePointer(fd, seekOffset, NativeIO.Windows.FILE_BEGIN);
            return new FileInputStream(fd);
        }
    }

    /**
     * Create the specified File for write access, ensuring that it does not exist.
     * @param f the file that we want to create
     * @param permissions we want to have on the file (if security is enabled)
     *
     * @throws AlreadyExistsException if the file already exists
     * @throws IOException if any other error occurred
     */
    public static FileOutputStream getCreateForWriteFileOutputStream(File f, int permissions)
            throws IOException {
        if (!Shell.WINDOWS) {
            // Use the native wrapper around open(2)
            try {
                FileDescriptor fd = NativeIO.POSIX.open(f.getAbsolutePath(),
                        NativeIO.POSIX.O_WRONLY | NativeIO.POSIX.O_CREAT
                                | NativeIO.POSIX.O_EXCL, permissions);
                return new FileOutputStream(fd);
            } catch (NativeIOException nioe) {
                if (nioe.getErrno() == Errno.EEXIST) {
                    throw new AlreadyExistsException(nioe);
                }
                throw nioe;
            }
        } else {
            // Use the Windows native APIs to create equivalent FileOutputStream
            try {
                FileDescriptor fd = NativeIO.Windows.createFile(f.getCanonicalPath(),
                        NativeIO.Windows.GENERIC_WRITE,
                        NativeIO.Windows.FILE_SHARE_DELETE
                                | NativeIO.Windows.FILE_SHARE_READ
                                | NativeIO.Windows.FILE_SHARE_WRITE,
                        NativeIO.Windows.CREATE_NEW);
                NativeIO.POSIX.chmod(f.getCanonicalPath(), permissions);
                return new FileOutputStream(fd);
            } catch (NativeIOException nioe) {
                if (nioe.getErrorCode() == 80) {
                    // ERROR_FILE_EXISTS
                    // 80 (0x50)
                    // The file exists
                    throw new AlreadyExistsException(nioe);
                }
                throw nioe;
            }
        }
    }

    private synchronized static void ensureInitialized() {
        if (!initialized) {
            cacheTimeout =
                    new Configuration().getLong("hadoop.security.uid.cache.secs",
                            4*60*60) * 1000;
            LOG.info("Initialized cache for UID to User mapping with a cache" +
                    " timeout of " + cacheTimeout/1000 + " seconds.");
            initialized = true;
        }
    }

    /**
     * A version of renameTo that throws a descriptive exception when it fails.
     *
     * @param src                  The source path
     * @param dst                  The destination path
     *
     * @throws NativeIOException   On failure.
     */
    public static void renameTo(File src, File dst)
            throws IOException {
        if (!nativeLoaded) {
            if (!src.renameTo(dst)) {
                throw new IOException("renameTo(src=" + src + ", dst=" +
                        dst + ") failed.");
            }
        } else {
            renameTo0(src.getAbsolutePath(), dst.getAbsolutePath());
        }
    }

    public static void link(File src, File dst) throws IOException {
        if (!nativeLoaded) {
            HardLink.createHardLink(src, dst);
        } else {
            link0(src.getAbsolutePath(), dst.getAbsolutePath());
        }
    }

    /**
     * A version of renameTo that throws a descriptive exception when it fails.
     *
     * @param src                  The source path
     * @param dst                  The destination path
     *
     * @throws NativeIOException   On failure.
     */
    private static native void renameTo0(String src, String dst)
            throws NativeIOException;

    private static native void link0(String src, String dst)
            throws NativeIOException;

    /**
     * Unbuffered file copy from src to dst without tainting OS buffer cache
     *
     * In POSIX platform:
     * It uses FileChannel#transferTo() which internally attempts
     * unbuffered IO on OS with native sendfile64() support and falls back to
     * buffered IO otherwise.
     *
     * It minimizes the number of FileChannel#transferTo call by passing the the
     * src file size directly instead of a smaller size as the 3rd parameter.
     * This saves the number of sendfile64() system call when native sendfile64()
     * is supported. In the two fall back cases where sendfile is not supported,
     * FileChannle#transferTo already has its own batching of size 8 MB and 8 KB,
     * respectively.
     *
     * In Windows Platform:
     * It uses its own native wrapper of CopyFileEx with COPY_FILE_NO_BUFFERING
     * flag, which is supported on Windows Server 2008 and above.
     *
     * Ideally, we should use FileChannel#transferTo() across both POSIX and Windows
     * platform. Unfortunately, the wrapper(Java_sun_nio_ch_FileChannelImpl_transferTo0)
     * used by FileChannel#transferTo for unbuffered IO is not implemented on Windows.
     * Based on OpenJDK 6/7/8 source code, Java_sun_nio_ch_FileChannelImpl_transferTo0
     * on Windows simply returns IOS_UNSUPPORTED.
     *
     * Note: This simple native wrapper does minimal parameter checking before copy and
     * consistency check (e.g., size) after copy.
     * It is recommended to use wrapper function like
     * the Storage#nativeCopyFileUnbuffered() function in hadoop-hdfs with pre/post copy
     * checks.
     *
     * @param src                  The source path
     * @param dst                  The destination path
     * @throws IOException
     */
    public static void copyFileUnbuffered(File src, File dst) throws IOException {
        if (nativeLoaded && Shell.WINDOWS) {
            copyFileUnbuffered0(src.getAbsolutePath(), dst.getAbsolutePath());
        } else {
            FileInputStream fis = null;
            FileOutputStream fos = null;
            FileChannel input = null;
            FileChannel output = null;
            try {
                fis = new FileInputStream(src);
                fos = new FileOutputStream(dst);
                input = fis.getChannel();
                output = fos.getChannel();
                long remaining = input.size();
                long position = 0;
                long transferred = 0;
                while (remaining > 0) {
                    transferred = input.transferTo(position, remaining, output);
                    remaining -= transferred;
                    position += transferred;
                }
            } finally {
                IOUtils.cleanup(LOG, output);
                IOUtils.cleanup(LOG, fos);
                IOUtils.cleanup(LOG, input);
                IOUtils.cleanup(LOG, fis);
            }
        }
    }

    private static native void copyFileUnbuffered0(String src, String dst)
            throws NativeIOException;
}  

After that it should work.

Running the program on a Hadoop cluster

1. Packaging plugin

<plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <!-- Specify the main class; this setting ends up in the jar's manifest file -->
                        <mainClass>com.roadom.WordcountDriver</mainClass>
                    </manifest>
                </archive>
                <classesDirectory>
                </classesDirectory>
            </configuration>
        </plugin>
    </plugins>

2. Build the jar

3. Upload the jar (without the Hadoop libs) to the Hadoop cluster:

D:\hadoopmr\out\artifacts\hadoopmr_jar\hadoopmr.jar

4. Run the MR job with:

./hadoop jar hadoopmr.jar com.neusoft.WordCount /wc.input /out42

Possible problems

Copy hadoop.dll and winutils.exe into the directory where Hadoop was unpacked.
If double-clicking winutils.exe pops up an error, install
DirectX_Repair_3.7_Enhanced_XiaZaiBa.zip
to update the machine's DLLs.

3. WordCount in depth: the shuffle mechanism

Ten steps in total: the work before map, the map phase itself, the six shuffle steps, the reduce phase, and writing the output files.

Stage 1: before the map method runs

As we know, files in HDFS are stored as blocks spread across the DataNodes, and our mapper code also runs on those nodes. This raises a question: which nodes' mappers read which blocks? Hadoop automatically divides the input file into splits, and each split is read by one mapper on one node. On every node that runs a map task, the split it receives is cut into lines, each line is turned into a key-value pair, and the pair is passed to the map method: the key is the byte offset at which the line starts and the value is the line's content as a string.
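As a small illustration using the dream.txt sample from earlier (assuming Unix line endings, so each newline is one byte), the default TextInputFormat would hand map() pairs like these:

    // <byte offset of line start, line content>
    // <0,  "I have a  dream">
    // <16, "a dream">          // 15 characters + '\n' puts the second line at offset 16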

Stage 2: running the map method

Anyone who has written WordCount knows this step: read each line, split the string, and emit the resulting key-value pairs.

Stage 3: the shuffle, part one (map side)

This part happens on the nodes running map tasks.

1. partition

The emitted key-value pairs are grouped according to some rule; in this example, all pairs whose key starts with "a" go into one group and those starting with "b" into another. This is only a simplification to make the idea clear: in practice there are not necessarily two groups, nor is grouping necessarily by first letter.

2. sort

The key-value pairs within each group are sorted by key.

3. combine

Pairs with the same key are merged into a single new pair: the key stays the same and the value becomes the sum of all the original values (see the sketch below).
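For WordCount the reducer's logic can also serve as this combiner; a minimal sketch, assuming the Job and WordcountReducer from the driver earlier in this article:

        // Pre-sum identical words on the map side before the shuffle,
        // so less data is transferred to the reducers.
        job.setCombinerClass(WordcountReducer.class);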

Stage 4: the shuffle, part two (reduce side)

This part happens on the nodes running reduce tasks.

1. Fetching partitions

When Hadoop decides how many reducers there are, it also fixes the number of partitions, and each reducer pulls over every member of the partition it is responsible for. For example, if one node is to handle all key-value pairs whose key starts with "a", it fetches the entire "a" group from every mapper.

2. merge

On each reducer, pairs with the same key are merged into a new pair whose key is the original key and whose value is the collection of the original values.

3. sort

On each reducer node, the newly formed pairs are sorted by key.

Stage 5: the reduce operation

As anyone who has written WordCount knows, Hadoop hands the reduce method one pair at a time: the key is a word and the value is the collection produced in stage 4. The reduce operation simply sums the values of each pair and writes the result out as a key-value pair.

Stage 6: writing the files to HDFS

Each reducer node writes out its file; physically the data is stored as blocks, but externally it appears as a single large result file.

4. Map and reduce tasks

maptask

The parallelism of a job's map phase is determined by the client when the job is submitted.

The client's basic logic for planning map parallelism:
1. The input files are logically sliced into splits (the number depends on the size of the data files), and each split is assigned to one map task instance running in parallel.
2. The actual split planning is done by the getSplits() method of the FileInputFormat implementation class.

The splitting rules are:
1. Files are split simply by content length.
2. The default split size is the HDFS block size, 128 MB.
3. Splits are computed per file, not over the data set as a whole.
  For example, with two files to process:
    file1.txt 200M
    file2.txt 50M
  FileInputFormat's split mechanism produces the following splits:
    file1.txt.split1 -- 0~128M    --> one maptask
    file1.txt.split2 -- 128M~200M --> one maptask
    file2.txt.split1 -- 0~50M     --> one maptask
3. How to change the split size (parameter settings)
The framework plans the split size with this method:

protected long computeSplitSize(long blockSize, long minSize,
                                long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}

minSize:   default 1;              config parameter: mapreduce.input.fileinputformat.split.minsize
maxSize:   default Long.MAX_VALUE; config parameter: mapreduce.input.fileinputformat.split.maxsize
blockSize: the HDFS block size

How the split size reacts to these parameters:
maxSize (maximum split size): if set smaller than blockSize, the split becomes smaller and equals this parameter's value.
minSize (minimum split size): if set larger than blockSize, the split can become larger than blockSize.
A worked example of the formula follows.
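A quick check of computeSplitSize with some example numbers (blockSize = 128 MB; the other values are hypothetical):

    // splitSize = max(minSize, min(maxSize, blockSize))
    // defaults:               max(1,    min(Long.MAX_VALUE, 128M)) = 128M -> split = block size
    // maxSize = 64M  < block: max(1,    min(64M, 128M))            = 64M  -> more, smaller splits
    // minSize = 256M > block: max(256M, min(Long.MAX_VALUE, 128M)) = 256M -> fewer, larger splits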


The core source code that controls the number of maps:

long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));

// getFormatMinSplitSize() returns 1 by default; getMinSplitSize(job) is the user-configured
// minimum split size, which takes effect when set to a value greater than 1
long maxSize = getMaxSplitSize(job);

// getMaxSplitSize(job) is the user-configured maximum split size,
// Long.MAX_VALUE (9223372036854775807L) by default

long splitSize = computeSplitSize(blockSize, minSize, maxSize);

protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}

From the code above:

maxSize defaults to Long.MAX_VALUE.

blockSize defaults to 128 MB from Hadoop 2.0 onwards.

minSize defaults to 1.

So the default split size splitSize is 128 MB, i.e. equal to the block size.

One split corresponds to one map task, so by default one block corresponds to one map task.

To control the number of maps manually, work with minSize and maxSize: to increase the number of maps, set maxSize smaller than blockSize; to decrease it, set minSize larger than blockSize.

Concretely, add the following to the job configuration:

FileInputFormat.setMinInputSplitSize(job, 301349250); // set minSize
FileInputFormat.setMaxInputSplitSize(job, 10000);     // set maxSize

In one experiment, the test file was 297 MB (311,349,250 bytes) and the block size 128 MB. Test code:

FileInputFormat.setMinInputSplitSize(job, 301349250);

FileInputFormat.setMaxInputSplitSize(job, 10000);

The test produced only 1 map. The split formula above gives a split size of 301,349,250, which is smaller than 311,349,250, so in theory there should be two maps. Why only one? Look at the source:

while (bytesRemaining / splitSize > 1.1D) {
                        int blkIndex = getBlockIndex(blkLocations, length
                                - bytesRemaining);
                        splits.add(makeSplit(path, length - bytesRemaining,
                                splitSize, blkLocations[blkIndex].getHosts()));

                        bytesRemaining -= splitSize;
                    }

As the code shows, as long as the remaining bytes are no more than 1.1 times the split size, they all go into a single split; this avoids launching a second map that would process only a tiny amount of data and waste resources.
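Checking the numbers from this experiment against the 1.1 rule:

    // bytesRemaining / splitSize = 311349250 / 301349250 ≈ 1.03
    // 1.03 is not greater than 1.1, so the loop body never runs and the
    // whole file becomes a single split -> exactly one map task.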

To summarize, the split process is roughly: walk the target files, filter out those that do not qualify, add the rest to a list, split each file by the size computed with the formula above (the tail of a file may be merged into the previous split, as just seen), add the results to the split list, and finally each map task reads the portion of data described by its split.

reducetask

1. We know that the number of maps depends on the number of input files, their sizes, the block size and the split size. Which factors determine the number of reduces?

Setting mapred.tasktracker.reduce.tasks.maximum controls how many reduces a single TaskTracker may run at the same time, but it does not determine the total number of reduces.

conf.setNumReduceTasks(4); this method on the JobConf object sets the total number of reduces. Look at the Job Counters statistics:

    Job Counters 
        Data-local map tasks=2
        Total time spent by all maps waiting after reserving slots (ms)=0
        Total time spent by all reduces waiting after reserving slots (ms)=0
        SLOTS_MILLIS_MAPS=10695
        SLOTS_MILLIS_REDUCES=29502
        Launched map tasks=2
        Launched reduce tasks=4

Four reduces were indeed launched. Now look at the output:

diegoball@diegoball:~/IdeaProjects/test/build/classes$ hadoop fs -ls  /user/diegoball/join_ou1123
11/03/25 15:28:45 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/03/25 15:28:45 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
Found 5 items
-rw-r--r--   1 diegoball supergroup          0 2011-03-25 15:28 /user/diegoball/join_ou1123/_SUCCESS
-rw-r--r--   1 diegoball supergroup        124 2011-03-25 15:27 /user/diegoball/join_ou1123/part-00000
-rw-r--r--   1 diegoball supergroup          0 2011-03-25 15:27 /user/diegoball/join_ou1123/part-00001
-rw-r--r--   1 diegoball supergroup        214 2011-03-25 15:28 /user/diegoball/join_ou1123/part-00002
-rw-r--r--   1 diegoball supergroup          0 2011-03-25 15:28 /user/diegoball/join_ou1123/part-00003

Only 2 of the reduces actually did any work. Why?

During the shuffle, the value of the key determines which reduce each <K,V> pair (a map output record) is sent to. That decision is made by the getPartition() method of the default partitioner, org.apache.hadoop.mapred.lib.HashPartitioner.
The HashPartitioner class:

package org.apache.hadoop.mapred.lib;
 
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.JobConf;
 
/** Partition keys by their {@link Object#hashCode()}. 
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {
 
  public void configure(JobConf job) {}
 
  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K2 key, V2 value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

The value of numReduceTasks can be set on the JobConf; the default is 1, which is clearly too small.
That is why only a single reduce is launched under the default configuration.

getPartition() returns the result of the bitwise AND taken modulo numReduceTasks.

MapReduce uses this return value to decide which reduce each <K,V> pair is sent to.

The key passed in here is a LongWritable; look at LongWritable's hashCode() method:

 public int hashCode() {
    return (int)value;
  }

It simply returns the original long value cast to int.

Because getPartition(K2 key, V2 value, int numReduceTasks) happened to produce only 2 distinct values for these keys, only 2 reduces ended up doing any work. A worked example follows.
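As an illustration with hypothetical key values (not taken from the actual job): LongWritable.hashCode() is just (int)value, so with numReduceTasks = 4 the default partitioner computes

    // getPartition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    // key = 20051208:  20051208 % 4 = 0  -> reducer 0 (part-00000)
    // key = 20051210:  20051210 % 4 = 2  -> reducer 2 (part-00002)
    // If every key in the data set lands on only these two remainders,
    // reducers 1 and 3 receive nothing, matching the empty part files above.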

HashPartitioner is the default partitioner class; we can also write our own:

 package com.alipay.dw.test;
 
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;
 
/**
 * Created by IntelliJ IDEA.
 * User: diegoball
 * Date: 11-3-10
 * Time: 5:26 PM
 * To change this template use File | Settings | File Templates.
 */
public class MyPartitioner implements Partitioner<IntWritable, IntWritable> {
    public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
        /* Pretty ugly hard coded partitioning function. Don't do that in practice, it is just for the sake of understanding. */
        int nbOccurences = key.get();
        if (nbOccurences > 20051210)
            return 0;
        else
            return 1;
    }
 
    public void configure(JobConf arg0) {
 
    }
}

Overriding getPartition() is all that is needed. The custom partitioner class is registered with:
conf.setPartitionerClass(MyPartitioner.class);
Since this partitioner again returns only 2 distinct values, 0 and 1, only 2 reduces will do any work no matter how many are requested with conf.setNumReduceTasks(4).

Because each reduce's output keys are sorted, the custom Partitioner above can also be used to order the result set:


11/03/25 15:24:49 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
Found 5 items
-rw-r--r--   1 diegoball supergroup          0 2011-03-25 15:23 /user/diegoball/opt.del/_SUCCESS
-rw-r--r--   1 diegoball supergroup      24546 2011-03-25 15:23 /user/diegoball/opt.del/part-00000
-rw-r--r--   1 diegoball supergroup      10241 2011-03-25 15:23 /user/diegoball/opt.del/part-00001
-rw-r--r--   1 diegoball supergroup          0 2011-03-25 15:23 /user/diegoball/opt.del/part-00002
-rw-r--r--   1 diegoball supergroup          0 2011-03-25 15:23 /user/diegoball/opt.del/part-00003

part-00000 and part-00001 are the outputs of these 2 reduces. With the custom MyPartitioner shown above, every <K,V> whose key is greater than 20051210 is handled by the first reduce (part-00000) and every key of 20051210 or less by the second (part-00001).
Since each reduce's output is itself sorted by key, the result set as a whole comes out ordered across the part files.

But what happens if we use the custom partitioner above and also call conf.setNumReduceTasks(1)? Look at the Job Counters:

    Job Counters 
        Data-local map tasks=2
        Total time spent by all maps waiting after reserving slots (ms)=0
        Total time spent by all reduces waiting after reserving slots (ms)=0
        SLOTS_MILLIS_MAPS=16395
        SLOTS_MILLIS_REDUCES=3512
        Launched map tasks=2
        Launched reduce tasks=1

Only one reduce was launched.
(1) When setNumReduceTasks(int a) is called with a = 1 (the default), only 1 reduce is launched regardless of how many distinct values b the Partitioner returns; in this case a custom Partitioner has no effect at all.
(2) When a != 1:
a. If a is set smaller than the number of distinct values b that the Partitioner can return, for example:

    public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
        /* Pretty ugly hard coded partitioning function. Don't do that in practice, it is just for the sake of understanding. */
        int nbOccurences = key.get();
        if (nbOccurences < 20051210)
            return 0;
        if (nbOccurences >= 20051210 && nbOccurences < 20061210)
            return 1;
        if (nbOccurences >= 20061210 && nbOccurences < 20081210)
            return 2;
        else
            return 3;
    }

while also calling setNumReduceTasks(2),

an exception is thrown:

  11/03/25 17:03:41 INFO mapreduce.Job: Task Id : attempt_201103241018_0023_m_000000_1, Status : FAILED
java.io.IOException: Illegal partition for 20110116 (3)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:900)
    at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
    at com.alipay.dw.test.KpiMapper.map(Unknown Source)
    at com.alipay.dw.test.KpiMapper.map(Unknown Source)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
    at org.apache.hadoop.mapred.Child.main(Child.java:211) 

Some keys cannot find a reduce to go to, because only a reduces were started.

b. If a is set larger than the number of distinct values b returned by the Partitioner, a reduces are still launched, but only b of them receive any data; the remaining a-b reduces are wasted.

c. The ideal case is a = b, which uses resources sensibly and balances the load better.

5. YARN architecture and workflow

1. Basic YARN architecture

2. YARN workflow

(0) The MR program is submitted from the node where the client runs.

(1) YarnRunner requests an Application from the ResourceManager.

(2) The RM returns the application's resource path to YarnRunner.

(3) The program uploads the resources it needs at run time to HDFS.

(4) Once the resources are uploaded, it requests that an MRAppMaster be started.

(5) The RM turns the user's request into a task.

(6) One of the NodeManagers picks up the task.

(7) That NodeManager creates a Container and starts the MRAppMaster in it.

(8) The Container copies the resources from HDFS to the local node.

(9) The MRAppMaster asks the RM for resources to run the map tasks.

(10) The RM assigns the map tasks to two other NodeManagers, which pick them up and create containers.

(11) The MRAppMaster sends the program start scripts to the two NodeManagers that received the tasks; they start the map tasks, which partition and sort the data.

(12) After all map tasks have finished, the MRAppMaster asks the RM for containers to run the reduce tasks.

(13) The reduce tasks fetch the data of their partitions from the map tasks.

(14) When the program has finished, the MRAppMaster asks the RM to deregister it.
