Hadoop Study Notes

Article link: https://blog.csdn.net/tankles/article/details/7740041

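I recently spent some time learning Hadoop MapReduce. This post is a brief summary, drawn mainly from Hadoop in Action.

The notes that follow are organized in the order in which Hadoop processes a job, and cover:

(1) Hadoop predefined data types;

(2) Hadoop InputFormat;

(3) Hadoop Mapper;

(4) Hadoop Partitioner (shuffle);

(5) Hadoop Reducer;

(6) Hadoop OutputFormat;

(7) Hadoop Driver;

(8) Hadoop Combiner;

(9) Hadoop Pipes;

(10) Hadoop Streaming;

(11) Aggregate.

For more advanced topics such as data joins, please consult books such as Hadoop in Action and Hadoop: The Definitive Guide.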

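I. Hadoop Data Types

Hadoop predefines a set of classes that implement WritableComparable, mainly wrappers for the basic types:

       BooleanWritable   wrapper for a standard boolean value
       ByteWritable      wrapper for a single byte
       DoubleWritable    wrapper for a double-precision floating-point number
       FloatWritable     wrapper for a single-precision floating-point number
       IntWritable       wrapper for an integer
       LongWritable      wrapper for a long integer
       Text              wrapper for text stored in UTF-8 format
       NullWritable      placeholder used when a key or value is not needed

       Keys and values can also be user-defined types. Hadoop provides the Writable and WritableComparable interfaces: Writable provides serialization, and WritableComparable provides both serialization and comparison. Hadoop requires keys to implement WritableComparable<T>, while values need only implement Writable. A class that implements only Writable can be used as a value but not as a key; a class that implements WritableComparable<T> can be used as either.

       The following class represents an edge in a network, for example a flight route between two cities. The Edge class implements the readFields and write methods of the Writable interface, which use Java's DataInput and DataOutput to serialize the object's contents, and the compareTo method of the Comparable interface.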
   import java.io.DataInput ;
   import java.io.DataOutput ;
   import java.io.IOException ;
   import org.apache.hadoop.io.WritableComparable ;

   public class Edge implements WritableComparable<Edge>
   {
       private String departureNode ;
       private String arrivalNode ;
       public String getDepartureNode(){ return departureNode; }
       public String getArrivalNode() { return arrivalNode ; }

       // Deserialize the fields in the same order in which write() emits them.
       @Override
       public void readFields(DataInput in) throws IOException
       {
           departureNode = in.readUTF() ;
           arrivalNode = in.readUTF() ;
       }
       @Override
       public void write(DataOutput out) throws IOException
       {
           out.writeUTF(departureNode) ;
           out.writeUTF(arrivalNode) ;
       }
       // Order by departure node first, then by arrival node.
       @Override
       public int compareTo(Edge o)
       {
           int cmp = departureNode.compareTo(o.departureNode) ;
           return (cmp != 0) ? cmp : arrivalNode.compareTo(o.arrivalNode) ;
       }
   }

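In most cases the predefined types are sufficient. Understanding how Hadoop's data types work also lets us define custom types and extend them as requirements demand.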
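
To make the serialize-then-compare contract concrete without a Hadoop installation, here is a minimal Python sketch of the same idea. It is only an analogue: the names `read_fields` and `round_trip` are made up for illustration, and the length-prefixed UTF-8 encoding is a simplification of what Java's writeUTF does.

```python
import io
import struct
from functools import total_ordering


@total_ordering
class Edge:
    """Python analogue of the Edge key: serializable and comparable."""

    def __init__(self, departure="", arrival=""):
        self.departure = departure
        self.arrival = arrival

    def write(self, out):
        # Length-prefixed UTF-8, loosely mirroring DataOutput.writeUTF.
        for field in (self.departure, self.arrival):
            data = field.encode("utf-8")
            out.write(struct.pack(">H", len(data)))
            out.write(data)

    def read_fields(self, inp):
        # Fields must be read back in the order they were written.
        fields = []
        for _ in range(2):
            (length,) = struct.unpack(">H", inp.read(2))
            fields.append(inp.read(length).decode("utf-8"))
        self.departure, self.arrival = fields

    def _key(self):
        return (self.departure, self.arrival)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        # Compare by departure first, then by arrival, like compareTo.
        return self._key() < other._key()


def round_trip(edge):
    """Serialize an edge to bytes and read it back into a fresh object."""
    buffer = io.BytesIO()
    edge.write(buffer)
    buffer.seek(0)
    copy = Edge()
    copy.read_fields(buffer)
    return copy
```

Sorting a list of such objects with `sorted(...)` orders them by departure and then by arrival, which is the ordering the framework relies on when it sorts keys.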

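II. InputFormat

    The way Hadoop splits and reads input files is defined by an implementation of the InputFormat interface. TextInputFormat is the default implementation.
   Some of the InputFormat classes predefined by Hadoop:
       TextInputFormat       Each line of a text file is a record. The key is the byte offset of the line and the value is the content of the line. Key: LongWritable, Value: Text
       KeyValueTextInputFormat   Each line of a text file is a record. The first separator on the line divides it: everything before the separator is the key and everything after it is the value. The separator is set by the key.value.separator.in.input.line property and defaults to '\t'. Key: Text, Value: Text
       SequenceFileInputFormat<K, V>   An InputFormat for reading sequence files. The key and value types are user-defined. A sequence file is a Hadoop-specific compressed binary format, intended for passing data from one MapReduce job to another.
       NLineInputFormat   Same as TextInputFormat, except that each split is guaranteed to contain N lines. N is set by the mapred.line.input.format.linespermap property and defaults to 1. Key: LongWritable, Value: Text

   The input format of a MapReduce job is set with conf.setInputFormat(KeyValueTextInputFormat.class) ;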
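
The difference between the two text formats is easier to see on a small input. The following Python sketch imitates how records are described above; it is not Hadoop code, and the function names are invented for this illustration.

```python
def text_records(data):
    """Yield (byte_offset, line) pairs, as described for TextInputFormat."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_records(data, separator="\t"):
    """Yield (key, value) pairs split at the first separator on each line."""
    for _, line in text_records(data):
        # partition() splits at the first occurrence only; a line without
        # a separator becomes the key with an empty value.
        key, _, value = line.partition(separator)
        yield key, value
```

For the input `b"a\tb\tc\nfoo\n"`, the first function keys each line by its starting byte, while the second splits only at the first tab, so the remaining tab stays inside the value.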


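2. Creating a custom InputFormat: InputSplit and RecordReader
   If the InputFormat classes that Hadoop provides do not meet your needs, you have to write a custom one. An InputFormat does two things:
   1) It identifies all the files used as input and divides them into input splits; each map task is assigned one split.
   2) It provides a RecordReader object that iterates over the records in a given split and parses each record into a key and a value of predefined types.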

   public interface InputFormat<K, V>
   {
       InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
       RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
   }

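   The FileInputFormat class implements the getSplits method of InputFormat and leaves getRecordReader abstract for subclasses to implement, so a custom InputFormat is best derived from FileInputFormat, which takes care of file splitting. It also has an isSplitable(FileSystem fs, Path filename) method that checks whether a given file may be split. It returns true by default; for files that must not be split, such as some compressed files, override it to return false.
   When using FileInputFormat, you only need to focus on the RecordReader, which parses an input split into individual records and turns them into key/value pairs.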

   public interface RecordReader<K, V>
   {
       boolean next(K key, V value) throws IOException ;
       K createKey() ;
       V createValue() ;
       long getPos() throws IOException ;
       public void close() throws IOException ;
       float getProgress() throws IOException ;
   }
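   Predefined RecordReaders include:
       LineRecordReader, used by TextInputFormat, reads one line at a time, with the byte offset as the key and the line content as the value.
       KeyValueLineRecordReader, used by KeyValueTextInputFormat.

   A custom RecordReader is usually built on an existing implementation, with most of the work done in the next() method.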

   public class TimeUrlTextInputFormat extends FileInputFormat<Text, URLWritable>
   {
       public RecordReader<Text, URLWritable> getRecordReader(InputSplit input, JobConf job, Reporter reporter) throws IOException
       {
           return new TimeUrlLineRecordReader(job, (FileSplit)input) ;
       }
   }

   public class URLWritable implements Writable
   {
       protected URL url ;
       public URLWritable(){}
       public URLWritable(URL url){ this.url = url;}
       public void write(DataOutput out) throws IOException
       {
           out.writeUTF(url.toString()) ;
       }
       public void readFields(DataInput in) throws IOException
       {
           url = new URL(in.readUTF()) ;
       }
       public void set(String s) throws MalformedURLException
       {
           url = new URL(s) ;
       }
   }
   class TimeUrlLineRecordReader implements RecordReader<Text, URLWritable>
   {
       private KeyValueLineRecordReader lineReader ;
       private Text lineKey, lineValue ;
       public TimeUrlLineRecordReader(JobConf job, FileSplit split) throws IOException
       {
           lineReader = new KeyValueLineRecordReader(job, split) ;
           lineKey = lineReader.createKey() ;
           lineValue = lineReader.createValue() ;
       }
       public boolean next(Text key, URLWritable value) throws IOException
       {
           if(!lineReader.next(lineKey, lineValue))
           {
               return false ;
           }
           key.set(lineKey) ;
           value.set(lineValue.toString()) ;
           return true ;
       }
       public Text createKey()
       {
           return new Text("") ;
       }
       public URLWritable createValue()
       {
           return new URLWritable() ;
       }
       public long getPos() throws IOException
       {
           return lineReader.getPos() ;
       }
       public float getProgress() throws IOException
       {
           return lineReader.getProgress() ;
       }
       public void close() throws IOException
       {
           lineReader.close();
       }
   }

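III. Mapper

To serve as a Mapper, a class extends the MapReduceBase base class and implements the Mapper interface.

MapReduceBase mainly provides the following two methods:

            void configure(JobConf job): Extracts parameters from the XML configuration files or from the application's main class. It is called before any data is processed.
            void close(): The last operation before a map task finishes. It performs cleanup work, such as closing database connections or files.
        The Mapper interface is responsible for the data processing step. It is a Java generic of the form Mapper<K1, V1, K2, V2> and has a single method, map, which processes one key/value pair.
      void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException
        This method processes a given key/value pair (K1, V1) and produces a list of key/value pairs (K2, V2), which may be empty. The OutputCollector receives the output of the mapping, and the Reporter lets the mapper record additional information and report task progress.
        Some predefined Mapper implementations provided by Hadoop:
   IdentityMapper<K, V>      implements Mapper<K, V, K, V>; maps the input directly to the output
   InverseMapper<K, V>       implements Mapper<K, V, V, K>; swaps the key and the value
   RegexMapper<K>            implements Mapper<K, Text, Text, LongWritable>; emits a (match, 1) pair for every regular expression match
   TokenCountMapper<K>       implements Mapper<K, Text, Text, LongWritable>; tokenizes the input value and emits a (token, 1) pair for each token

The map part of the word count program is shown below (for the reduce part, see the Reducer section):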
public class WordCount extends Configured implements Tool
{
   public static class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
   {
       private final static IntWritable one = new IntWritable(1) ;
       private Text word = new Text() ;
       public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException
       {
           String line = value.toString() ;
           StringTokenizer itr = new StringTokenizer(line) ;
           while(itr.hasMoreTokens())
           {
               word.set(itr.nextToken()) ;
               output.collect(word, one) ;
           }
       }

   }
   // The enclosing WordCount class continues with the Reduce class in the Reducer section.

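IV. Partitioner

      When multiple reducers are used, the key/value pairs produced by the mappers must be hashed to decide which reducer each pair is sent to. By default, Hadoop applies this policy through the HashPartitioner class, which uses the hash of the mapper's output key. Sometimes HashPartitioner does not do what is needed. For example, when using the Edge class to analyze flight data and count the passengers departing from each airport, we want all Edge objects with the same departure to go to the same reducer. HashPartitioner may send edges that share a departure but differ in arrival to different reducers, which produces incorrect counts. Hashing only the departureNode member of Edge solves the problem.
      A custom partitioner only needs to implement two methods, configure() and getPartition(). The former applies the Hadoop job configuration to the partitioner; the latter returns an integer from 0 (inclusive) to the number of reduce tasks (exclusive), identifying the reducer to which the key/value pair is sent.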
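   public class EdgePartitioner implements Partitioner<Edge, Writable>
   {
       @Override
       public int getPartition(Edge key, Writable value, int numPartitions)
       {
           // Mask the sign bit: hashCode() may be negative, and a negative index is invalid.
           return (key.getDepartureNode().hashCode() & Integer.MAX_VALUE) % numPartitions ;
       }
       @Override
       public void configure(JobConf conf)
       {
           // No job configuration is needed by this partitioner.
       }
   }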
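
The effect of partitioning on the departure node alone can be checked with a small Python sketch. This is an illustration rather than Hadoop's implementation: `zlib.crc32` stands in for Java's String.hashCode(), and both function names are hypothetical.

```python
import zlib


def partition_for(departure, num_partitions):
    """Map a departure node to a reducer index in [0, num_partitions)."""
    # crc32 is a stand-in for Java's String.hashCode(); it is non-negative
    # in Python 3 and stable across runs.
    return zlib.crc32(departure.encode("utf-8")) % num_partitions


def route(edges, num_partitions):
    """Group (departure, arrival) edges by the partition they are sent to."""
    buckets = {}
    for departure, arrival in edges:
        index = partition_for(departure, num_partitions)
        buckets.setdefault(index, []).append((departure, arrival))
    return buckets
```

Because the arrival node never enters the hash, every edge leaving the same airport lands in the same bucket, whatever its destination.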

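V. Reducer

1. Reducer
   A Reducer also extends the MapReduceBase base class and implements the reduce method of the Reducer interface:
   void reduce(K2 key,
               Iterator<V2> values,
               OutputCollector<K3, V3> output,
               Reporter reporter) throws IOException

   Hadoop sorts the key/value pairs emitted by the mappers by key and groups the values that share the same key. It then calls reduce once per key, and the method iterates over the values. The OutputCollector receives the output of the reduce phase and writes it to the output file, and the Reporter lets the reducer record additional information and report task progress.
   Predefined Reducers provided by Hadoop:
   IdentityReducer<K, V>   implements Reducer<K, V, K, V>; maps the input directly to the output
   LongSumReducer<K>       implements Reducer<K, LongWritable, K, LongWritable>; computes the sum of all values for a given key

The reduce part of the word count MapReduce program is as follows: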

   public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
   {
       public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
       {
           int sum = 0 ;
           while(values.hasNext())
           {
               sum += values.next().get() ;
           }
           output.collect(key, new IntWritable(sum)) ;
       }

   }
}
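
The map, sort, group, and reduce flow of the word count job can be simulated in a few lines of Python. This is a single-process sketch for intuition only; the function names are invented, and the real framework distributes each phase across machines.

```python
from itertools import groupby
from operator import itemgetter


def map_words(offset, line):
    """Emit (word, 1) for every whitespace-separated token in the line."""
    for word in line.split():
        yield word, 1


def reduce_counts(word, counts):
    """Sum all counts that share the same word."""
    yield word, sum(counts)


def run_job(lines):
    # Map phase: the line index stands in for the byte offset and is ignored.
    pairs = [kv for i, line in enumerate(lines) for kv in map_words(i, line)]
    # Shuffle phase: sort by key so that equal keys become adjacent.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reduce call per distinct key.
    results = []
    for word, group in groupby(pairs, key=itemgetter(0)):
        results.extend(reduce_counts(word, (count for _, count in group)))
    return results
```

Calling `run_job(["a b a", "b c"])` walks through all three phases and returns one `(word, count)` pair per distinct word, in key order.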

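VI. OutputFormat

    When MapReduce writes data to files, it uses an OutputFormat class. Each reducer simply writes its output to its own file, so the output does not need to be split. A RecordWriter object formats the results and writes them to the file.
   Hadoop provides several standard OutputFormat implementations, most of which inherit from FileOutputFormat. The OutputFormat can be customized through setOutputFormat on JobConf; the default is TextOutputFormat.
       TextOutputFormat<K, V> writes each record as one line. The key and value are written as strings separated by a tab character ('\t'); the separator can be changed with the mapred.textoutputformat.separator property. Its output can be read back by KeyValueTextInputFormat.
       SequenceFileOutputFormat<K, V> writes key/value pairs in Hadoop's own sequence file format and is used together with SequenceFileInputFormat.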

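VII. Driver

    Hadoop provides GenericOptionsParser so that job configuration parameters can be specified at run time, and the framework provides ToolRunner, Tool, and Configured to simplify reading the standard configuration options.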

   public class MyDriver extends Configured implements Tool
   {
       public int run(String[] args) throws Exception
       {
           Configuration conf = getConf() ;
           JobConf job = new JobConf(conf, MyDriver.class) ;
           Path in = new Path(args[0]) ;
           Path out = new Path(args[1]) ;
           FileInputFormat.setInputPaths(job, in) ;
           FileOutputFormat.setOutputPath(job, out) ;

           job.setJobName("MyDriver") ;
           job.setMapperClass(MapperClass.class) ;
           job.setReducerClass(ReducerClass.class) ;

           job.setInputFormat(KeyValueTextInputFormat.class) ;
           job.setOutputFormat(TextOutputFormat.class) ;
           job.setOutputKeyClass(Text.class) ;
           job.setOutputValueClass(Text.class) ;

           job.set("key.value.separator.in.input.line", ",") ;
           JobClient.runJob(job) ;

           return 0 ;
       }
       public static void main(String[] args) throws Exception
       {
           int res = ToolRunner.run(new Configuration(), new MyDriver(), args) ;
           System.exit(res) ;
       }
   }

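   In the run method, a JobConf object is instantiated, configured, given a job name, and passed to JobClient.runJob() to start the MapReduce job. (The JobClient class communicates with the JobTracker to launch the job on the cluster.) The JobConf object holds all the configuration parameters the job needs to run.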

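VIII. Combiner

    A Combiner must be equivalent to the Reducer in terms of data transformation: removing the combiner should leave the reducer's output unchanged. For distributive operations such as maximum, the Combiner is usually the same as the Reducer. For other operations, such as average, a custom combiner is needed. Below are a MapReduce program that computes an average and its Combiner.

   A Combiner must implement the Reducer interface; the combining is done in its reduce method.

   The Combiner for computing the average: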
   public static class CombinerClass extends MapReduceBase implements Reducer<Text, Text, Text, Text>
   {
       public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException
       {
           double sum = 0;
           int count = 0 ;
           while(values.hasNext())
           {
               String[] fields = values.next().toString().split(",") ;
               sum += Double.parseDouble(fields[0]) ;
               count += Integer.parseInt(fields[1]) ;
           }
           output.collect(key, new Text(sum+","+count)) ;
       }
   }
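   The Combiner class is set on the JobConf in the Driver:
   job.setCombinerClass(CombinerClass.class) ;

   The MapReduce framework may apply it zero, one, or several times. A Combiner does not necessarily improve performance; monitor the job's behavior to decide whether it helps.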



   // MapReduce program that computes the average
   public static class MapperClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text>
   {
       public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException
       {
           // A negative limit makes split() keep trailing empty fields.
           String[] fields = value.toString().split(",", -20) ;
           String country = fields[4] ;
           String numClaims = fields[8] ;
           if(numClaims.length()>0 && !numClaims.startsWith("\""))
           {
               output.collect(new Text(country), new Text(numClaims+",1")) ;
           }
       }
   }

   public static class ReducerClass extends MapReduceBase implements Reducer<Text, Text, Text, DoubleWritable>
   {
       public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException
       {
           double sum = 0 ;
           int count = 0 ;
           while(values.hasNext())
           {
               String[] fields = values.next().toString().split(",") ;
               sum += Double.parseDouble(fields[0]) ;
               count += Integer.parseInt(fields[1]) ;
           }
           output.collect(key, new DoubleWritable(sum/count)) ;
       }
   }
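
Why the average needs "sum,count" pairs instead of partial averages can be verified with a short Python sketch. It mirrors the string format used by the Java classes above, but the function names and the sample numbers are made up for this illustration.

```python
def combine(values):
    """Merge "sum,count" strings into one partial "sum,count" string."""
    total, count = 0.0, 0
    for value in values:
        part_sum, part_count = value.split(",")
        total += float(part_sum)
        count += int(part_count)
    return "%s,%d" % (total, count)


def reduce_average(values):
    """Compute the final average from "sum,count" strings."""
    total, count = combine(values).split(",")
    return float(total) / int(count)


# Output of two hypothetical map tasks for the same key.
task_a = ["1,1", "2,1", "3,1"]
task_b = ["10,1"]

without_combiner = reduce_average(task_a + task_b)
with_combiner = reduce_average([combine(task_a), combine(task_b)])
# Averaging the per-task averages would weight task_b too heavily.
average_of_averages = (reduce_average(task_a) + reduce_average(task_b)) / 2
```

Here `without_combiner` and `with_combiner` agree, which is the equivalence a Combiner must preserve, while `average_of_averages` differs because the two tasks saw different numbers of records.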

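IX. Pipes

    Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. It uses sockets as the channel between the tasktracker and the process running the C++ map or reduce function.
   We define the map and reduce functions by extending the Mapper and Reducer classes defined in the HadoopPipes namespace. They use context objects (MapContext and ReduceContext, plus TaskContext in the constructors) to read input, write output, and access the job configuration through JobConf.
   Keys and values in the C++ interface are byte buffers, represented as std::string. The HadoopPipes::runTask function connects to the Java parent process and passes data to and from the mapper and reducer. runTask takes a Factory argument, which it uses to create mapper and reducer instances; overloaded template factories can also be used to set a combiner, partitioner, record reader, and record writer.

   Below is the C++ MapReduce program for the maximum temperature from Hadoop: The Definitive Guide: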
   #include <algorithm>
   #include <climits>
   #include <limits>
   #include <stdint.h>
   #include <string>

   #include "hadoop/Pipes.hh"
   #include "hadoop/TemplateFactory.hh"
   #include "hadoop/StringUtils.hh"

   using namespace std ;

   class MaxTemperatureMapper : public HadoopPipes::Mapper
   {
       public:
           MaxTemperatureMapper(HadoopPipes::TaskContext& context)
           {
           }
           void map(HadoopPipes::MapContext& context)
           {
               string line = context.getInputValue() ;
               string year = line.substr(15, 4) ;
               string airTemperature = line.substr(87, 5) ;
               string quality = line.substr(92, 1) ;
               if (airTemperature != "+9999" && (quality == "0" || quality == "1" || quality == "4" || quality == "5" || quality == "9"))
               {
                   context.emit(year, airTemperature) ;
               }
           }
   } ;
   class MaxTemperatureReducer : public HadoopPipes::Reducer
   {
   public:
       MaxTemperatureReducer(HadoopPipes::TaskContext& context)
       {
       }
       void reduce(HadoopPipes::ReduceContext& context)
       {
           int maxValue = INT_MIN ;
           while(context.nextValue())
           {
               maxValue = std::max(maxValue, HadoopUtils::toInt(context.getInputValue())) ;
           }
           context.emit(context.getInputKey(), HadoopUtils::toString(maxValue)) ;

       }
   };

   int main(int argc, char** argv)
   {
       return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MaxTemperatureMapper, MaxTemperatureReducer>()) ;
   }

   Compile and run the program using a Makefile.
   The Makefile is as follows:
   CC = g++
   CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include
   max_temperature : max_temperature.cpp
       $(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes -lhadooputils -lpthread -g -O2 -o $@
   # end of makefile



   PLATFORM specifies the operating system, architecture, and data model (32-bit or 64-bit). On a 32-bit Linux machine, build as follows:
   % export PLATFORM=Linux-i386-32
   % make # builds the max_temperature executable

   Pipes cannot run in standalone mode, because it relies on Hadoop's distributed cache mechanism, which works only when HDFS is running.

   % hadoop fs -put max_temperature bin/max_temperature # copy the executable to HDFS
   # Run with the hadoop pipes command; the -program argument gives the URI of the executable in HDFS
   % hadoop pipes \
       -D hadoop.pipes.java.recordreader=true \
       -D hadoop.pipes.java.recordwriter=true \
       -input sample.txt \
       -output output \
       -program bin/max_temperature

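X. Streaming

    Hadoop Streaming uses Unix standard input and output as the interface between Hadoop and the application, so a MapReduce program can be written in any language that can read stdin and write stdout.
   The map function reads data from standard input and writes its results to standard output. Each key/value pair it emits is a line with the key and value separated by a tab ('\t').
   The reduce function reads tab-separated key/value pairs from standard input, already sorted by key by the Hadoop framework, and writes its results to standard output.
   Streaming scripts are easy to test in a Linux shell.

   Below is a MapReduce program in Python that uses Hadoop Streaming: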
   #!/usr/bin/env python
   # Python map function

   import re, sys
   for line in sys.stdin:
       val = line.strip()
       (year, temp, quality) = (val[15:19], val[87:92], val[92:93])
       if (temp != "+9999" and re.match("[01459]", quality)):
           print("%s\t%s" % (year, temp))


   #!/usr/bin/env python
   # Python reduce function
   import sys
   # Start below any real reading so that negative temperatures are handled.
   (last_key, max_val) = (None, -sys.maxsize)
   for line in sys.stdin:
       (key, val) = line.strip().split('\t')
       if last_key and last_key != key:
           # The key changed: emit the maximum for the previous key.
           print("%s\t%s" % (last_key, max_val))
           (last_key, max_val) = (key, int(val))
       else:
           (last_key, max_val) = (key, max(max_val, int(val)))

   if last_key:
       print("%s\t%s" % (last_key, max_val))

   Running the Python program in a shell:
   % cat sample.txt | ./max_temperature_map.py | sort | ./max_temperature_reduce.py
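
The same cat, map, sort, reduce pipeline can also be imitated inside one Python process, which is handy for unit testing the logic before touching a cluster. This is a sketch under assumptions: the function names are invented, and `make_record` builds a hypothetical fixed-width line in which only the three fields read by the mapper are meaningful.

```python
import re


def map_lines(lines):
    """Emit "year<TAB>temperature" for readings that pass the quality check."""
    for line in lines:
        val = line.strip()
        year, temp, quality = val[15:19], val[87:92], val[92:93]
        if temp != "+9999" and re.match("[01459]", quality):
            yield "%s\t%s" % (year, temp)


def reduce_lines(sorted_lines):
    """Emit "year<TAB>max" for each run of identical keys in sorted input."""
    last_key, max_val = None, None
    for line in sorted_lines:
        key, val = line.strip().split("\t")
        if last_key is not None and key != last_key:
            yield "%s\t%s" % (last_key, max_val)
            max_val = None
        last_key = key
        max_val = int(val) if max_val is None else max(max_val, int(val))
    if last_key is not None:
        yield "%s\t%s" % (last_key, max_val)


def run_pipeline(lines):
    # Equivalent of: cat input | map | sort | reduce
    return list(reduce_lines(sorted(map_lines(lines))))


def make_record(year, temp, quality):
    """Build a hypothetical fixed-width record with only the used fields set."""
    chars = ["0"] * 93
    chars[15:19] = year
    chars[87:92] = temp
    chars[92:93] = quality
    return "".join(chars)
```

Feeding `run_pipeline` a handful of records built with `make_record`, including a missing reading ("+9999") and a rejected quality code, shows which lines the mapper drops and how the reducer collapses each year to a single line.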

   Running the Python program on Hadoop:
   % hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-1.0.1.jar \
       -input input   \
       -output output \
       -mapper max_temperature_map.py \
       -reducer max_temperature_reduce.py \
       -file max_temperature_map.py \
       -file max_temperature_reduce.py

   # The -file option ships the script files to the cluster.

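XI. Using Streaming with the Aggregate Package

    Hadoop includes a package called Aggregate that simplifies summary statistics over data sets, especially with Streaming. In Streaming, the Aggregate package acts as the reducer that computes the aggregate statistics; you only supply a mapper that processes records and emits output in a specific format. Each line of mapper output has the form:
   function:key \t value
   function is the name of an aggregation function (one of those predefined in the Aggregate package), followed immediately by a colon and a tab-separated key/value pair.
   The output format for ValueHistogram is slightly different:
       ValueHistogram:key \t value \t count
   count defaults to 1 and may be omitted.

   Value aggregator functions supported by the Aggregate package:
       DoubleValueSum:   sum of a sequence of double values
       LongValueSum:     sum of a sequence of long values
       LongValueMax:     maximum of a sequence of long values
       LongValueMin:     minimum of a sequence of long values
       StringValueMax:   alphabetical maximum of a sequence of String values
       StringValueMin:   alphabetical minimum of a sequence of String values
       UniqValueCount:   number of unique values for each key
       ValueHistogram:   count of each value, plus the minimum, median, maximum, mean, and standard deviation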


   AttributeCount.py:
   #!/usr/bin/env python
   import sys
   index = int(sys.argv[1])
   for line in sys.stdin:
       fields = line.split(",")
       print("LongValueSum:" + fields[index] + "\t" + "1")

   hadoop jar hadoop-streaming.jar \
       -input input \
       -output output \
       -file AttributeCount.py \
       -mapper 'AttributeCount.py 1' \
       -reducer aggregate           # the reducer is set to aggregate here
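
To see what the aggregate reducer is expected to produce for LongValueSum records, the mapper and a stand-in for the reducer can be run locally. The sketch below is a simulation written for this note, not the Aggregate package itself, and its function names are hypothetical.

```python
from collections import defaultdict


def attribute_count_mapper(lines, index):
    """Emit one "LongValueSum:<field>\t1" record per input line."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        yield "LongValueSum:%s\t1" % fields[index]


def long_value_sum(records):
    """Sum the values of all "LongValueSum:key\tvalue" records per key."""
    totals = defaultdict(int)
    for record in records:
        function, _, rest = record.partition(":")
        if function != "LongValueSum":
            raise ValueError("unsupported function: %s" % function)
        key, _, value = rest.partition("\t")
        totals[key] += int(value)
    return dict(sorted(totals.items()))
```

With three input lines whose second column is `US`, `CN`, and `US`, the simulated reducer returns a count of 2 for `US` and 1 for `CN`, which is the attribute count the Streaming job computes.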

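   Examples:
       1) Top K records
           Write a program so that a MapReduce job outputs the top K records in sorted order.
       2) Web traffic measurement
           Take a web server log file and use the Aggregate package to write a Streaming program that computes the site's traffic for each hour.
       3) Inner product of two sparse vectors
           A vector is a list of values. Given two vectors X=[x1, x2, ...] and Y=[y1, y2, ...], their inner product is Z=x1*y1+x2*y2+ ... When many entries of X and Y are zero, the vectors are usually stored in a sparse form:
           1, 0.46
           9, 0.21
           17, 0.93
           ...
           The first column is the vector index and the second column is the value; all other entries are 0.
           Write a Streaming job that computes the inner product of two sparse vectors. A post-processing step may be added after the MapReduce job to complete the computation.

       4) Time series processing

   See the exercises at the end of Chapter 4 of the Chinese edition of Hadoop in Action.

        5) Hourly web traffic from a web log: a Streaming program

    Web traffic measurement: take a web server log file and use the Aggregate package to write a Streaming program that computes the site's traffic for each hour.

   Analysis: because this is a Streaming program that uses the Aggregate package, only a Mapper needs to be written; the Reducer is Aggregate's DoubleValueSum (or LongValueSum). The Mapper output format is:
       DoubleValueSum:date \t net_traffic
   where the key is the date (year-month-day-hour) and the value is the traffic recorded by one log entry.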
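
One possible mapper for the hourly traffic exercise is sketched below. It rests on assumptions that the original notes do not state: the log is taken to be in the Apache Common Log Format, the traffic is the response size in bytes, and LongValueSum is used because byte counts are integers. The function and pattern names are hypothetical.

```python
import re

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Assumed layout: host ident user [dd/Mon/yyyy:HH:MM:SS zone] "request" status bytes
LOG_PATTERN = re.compile(
    r'\[(\d{2})/(\w{3})/(\d{4}):(\d{2}):\d{2}:\d{2}[^\]]*\] "[^"]*" \d{3} (\d+|-)')


def hourly_traffic_mapper(lines):
    """Emit "LongValueSum:<year-month-day-hour>\t<bytes>" per parsable line."""
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # Skip lines that do not fit the assumed format.
        day, month, year, hour, size = match.groups()
        if month not in MONTHS or size == "-":
            continue  # "-" means no response body was sent.
        key = "%s-%02d-%s-%s" % (year, MONTHS[month], day, hour)
        yield "LongValueSum:%s\t%s" % (key, size)
```

In a real job the loop would read `sys.stdin` and print each record, and the script would be passed with `-mapper` together with `-reducer aggregate`.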

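Summary: This post follows the order of Hadoop's processing flow and briefly introduces each of the steps, giving a basic understanding of how to write Hadoop MapReduce programs for data analysis and processing.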