Some notes on Pro Hadoop (Apress) (in progress...)


Getting Started with Hadoop Core

    Applications frequently require more resources than are available on an inexpensive machine. Many organizations find themselves with business processes that no longer fit on
a single cost-effective computer.


    A simple but expensive solution has been to buy specialty machines that have a lot of memory and many CPUs. This solution scales as far as what is supported
by the fastest machines available, and usually the only limiting factor is your budget.

   An alternative solution is to build a high-availability cluster.



MapReduce Model:

• Map: An initial ingestion and transformation step, in which individual input records can be processed in parallel.
• Reduce: An aggregation or summarization step, in which all associated records must be processed together by a single entity.
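
To make the two steps concrete, here is a minimal in-memory sketch in plain Java. It does not use Hadoop at all; the class name MapReduceModelSketch, the word-count task, and the helper names are assumptions chosen only for illustration. The map step handles each record independently, and the reduce step sees all values for one key together.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of the map and reduce steps; no Hadoop involved. */
final class MapReduceModelSketch {

    /** Map: turn one input record (a line of text) into (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    /** Reduce: summarize all values that share one key. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Map every record, group the pairs by key, then reduce each group once. */
    static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> pair : map(record)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

In a real cluster the map calls would run in parallel on different nodes; the sequential loop here only shows the data flow.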





The example MapReduce application is a specialized web crawler that received large sets of URLs as input. The job had several steps:

   1. Ingest the URLs.

   2. Normalize the URLs.

   3. Eliminate duplicate URLs.

   4. Filter the URLs.

   5. Fetch the URLs.

   6. Fingerprint the content items.

   7. Update the recently seen set.

   8. Prepare the work list for the next application.


The Hadoop-based application ran faster and worked well.



 Introducing Hadoop

    Hadoop is a top-level Apache project that provides and supports the development of open source software supplying a framework for developing highly scalable distributed computing applications.

    The two fundamental pieces of Hadoop are the MapReduce framework and the Hadoop Distributed File System (HDFS).

     The MapReduce framework requires a shared file system such as HDFS, S3, NFS, or GFS, but HDFS is the best suited.


Introducing MapReduce


    A MapReduce job requires the following:

    1. The locations in the distributed file system of the input.

    2. The locations in the distributed file system for the output.

    3. The input format.

    4. The output format.

    5. The class containing the map function.

    6. Optionally, the class containing the reduce function.

    7. The JAR files containing the above classes.


If a job does not need a reduce function, no reducer class has to be specified and no reduce phase is run. The framework partitions the input, then schedules and executes map tasks across the cluster. If requested, it sorts the results of the map tasks and runs the reduce tasks on the map output. The final output is moved to the output directory, and the job status is reported to the user.


Managing MapReduce:

   There are two processes that manage jobs:

    The TaskTracker manages the execution of individual map and reduce tasks on a compute node in the cluster.

    The JobTracker accepts job submissions, provides job monitoring and control, and manages the distribution of tasks to the TaskTracker nodes.

Note: One nice feature is that you can add TaskTracker nodes to the cluster while a job is running and have the job spread out onto the new nodes.


 Introducing HDFS



 HDFS is designed for use by MapReduce jobs that read input in large chunks and write large chunks of output. For reliability, file data is mirrored to multiple storage nodes; this is referred to as replication in Hadoop.


Installing Hadoop

    The prerequisites:

    1. Fedora 8


    3. Hadoop 0.19 or later

 Go to the Hadoop download site, find the .gz file, download it, and unpack it with tar. Then run export HADOOP_HOME=[yourdirectory] and export PATH=${HADOOP_HOME}/bin:${PATH}.

    Finally, check the installation.

Running examples and tests

      This part demonstrates all the examples... :)


 Chapter 2: The Basics of a MapReduce Job

the chapter





 The user is responsible for handling the job setup and specifying the input locations.


Here is a simple example:

package com.apress.hadoopbook.examples.ch2;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.log4j.Logger;

/**
 * A very simple MapReduce example that reads textual input where
 * each record is a single line, and sorts all of the input lines into
 * a single output file.
 * The records are parsed into Key and Value using the first TAB
 * character as a separator. If there is no TAB character the entire
 * line is the Key.
 *
 * @author Jason Venner
 */
public class MapReduceIntro {
    protected static Logger logger = Logger.getLogger(MapReduceIntro.class);

    /**
     * Configure and run the MapReduceIntro job.
     *
     * @param args Not used.
     */
    public static void main(final String[] args) {
        try {
            /** Construct the job conf object that will be used to submit this job
             * to the Hadoop framework. Ensure that the jar or directory that
             * contains MapReduceIntroConfig.class is made available to all of the
             * TaskTracker nodes that will run maps or reduces for this job. */
            final JobConf conf = new JobConf(MapReduceIntro.class);

            /** Take care of some housekeeping to ensure that this simple example
             * job will run. MapReduceIntroConfig is the book's helper class; it is
             * not reproduced in these notes, so the three method names used on it
             * below are assumptions. */
            MapReduceIntroConfig.exampleHouseKeeping(conf,
                    MapReduceIntroConfig.getInputDirectory(),
                    MapReduceIntroConfig.getOutputDirectory());

            /** This section is the actual job configuration portion.
             * Configure the inputDirectory and the type of input. In this case
             * we are stating that the input is text, and each record is a
             * single line, and the first TAB is the separator between the key
             * and the value of the record. */
            conf.setInputFormat(KeyValueTextInputFormat.class);
            FileInputFormat.setInputPaths(conf, MapReduceIntroConfig.getInputDirectory());

            /** Inform the framework that the mapper class will be the
             * {@link IdentityMapper}. This class simply passes the
             * input Key Value pairs directly to its output, which in
             * our case will be the shuffle. */
            conf.setMapperClass(IdentityMapper.class);

            /** Configure the output of the job to go to the output
             * directory. Inform the framework that the Output Key
             * and Value classes will be {@link Text} and the output
             * file format will be {@link TextOutputFormat}. The
             * TextOutputFormat class produces a record of
             * output for each Key,Value pair, with the following
             * format: Formatter.format( "%s\t%s%n", key.toString(),
             * value.toString() );.
             * In addition indicate to the framework that there will be
             * 1 reduce. This results in all input keys being placed
             * into the same, single, partition, and the final output
             * being a single sorted file. */
            FileOutputFormat.setOutputPath(conf, MapReduceIntroConfig.getOutputDirectory());
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            conf.setNumReduceTasks(1);

            /** Inform the framework that the reducer class will be the {@link
             * IdentityReducer}. This class simply writes an output key,
             * value record for each value in the key, value set it receives as
             * input. The value ordering is arbitrary. */
            conf.setReducerClass(IdentityReducer.class);

            logger.info("Launching the job.");
            /** Send the job configuration to the framework and request that the
             * job be run. */
            final RunningJob job = JobClient.runJob(conf);
            logger.info("The job has completed.");
            if (!job.isSuccessful()) {
                logger.error("The job failed.");
                System.exit(1);
            }
            logger.info("The job completed successfully.");
        } catch (final IOException e) {
            logger.error("The job has failed due to an IO Exception", e);
        }
    }
}


The framework makes one call to your map function for each record in your input.


The framework calls the reduce function one time for each unique key.


If you require the output of your job to be sorted, the reducer function must pass the key
objects to the output.collect() method unchanged. The reduce phase is, however, free to
output any number of records, including zero records, with the same key and different values.
This particular constraint is also why the map tasks may be multithreaded, while the reduce
tasks are explicitly only single-threaded.


Specifying the input formats:




KeyValueTextInputFormat and SequenceFileInputFormat are the most commonly used input formats.
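
As a rough illustration of what KeyValueTextInputFormat does with each line, here is a plain Java sketch of the splitting rule described in the example's comments: the first TAB separates key from value, and a line with no TAB becomes the key with an empty value. The class and method names (KeyValueLineSketch, split) are made up for this sketch and are not part of Hadoop.

```java
/** Plain Java sketch of the "first TAB splits key and value" rule; not Hadoop code. */
final class KeyValueLineSketch {

    /** Returns a two-element array: {key, value}. */
    static String[] split(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // No TAB: the entire line is the key and the value is empty.
            return new String[] { line, "" };
        }
        // Only the first TAB counts; later TABs stay inside the value.
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("10\tten\tmore");
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```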


Setting the output format:


Configuring the reduce phase:

        Five pieces:

    The number of reduce tasks;

    The class supplying the reduce method;

    The input key and value types for the reduce task;

    The output key and value types for the reduce task;

    The output file type for the reduce task output;



Creating a custom mapper and reducer

    As you've seen, your first Hadoop job produced sorted output, but the sorting was not suitable. Let's work out what is required to sort properly, using a custom mapper.
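
These notes do not say exactly why the sort order was unsuitable. One likely reason, assumed here, is that the keys were compared as text, so numeric keys came out in lexical order; the name of the mapper used later (it transforms keys to long) points the same way. The plain Java sketch below, with made-up names and no Hadoop, shows the difference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Compares lexical and numeric ordering of the same keys; not Hadoop code. */
final class KeySortSketch {

    /** Sort keys as text, the way string keys are compared. */
    static List<String> sortAsText(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    /** Parse each key to a long first, then sort numerically. */
    static List<Long> sortAsLong(List<String> keys) {
        List<Long> parsed = new ArrayList<>();
        for (String key : keys) {
            parsed.add(Long.parseLong(key));
        }
        Collections.sort(parsed);
        return parsed;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("10", "9", "100");
        System.out.println(sortAsText(keys)); // prints [10, 100, 9]
        System.out.println(sortAsLong(keys)); // prints [9, 10, 100]
    }
}
```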


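Creating a custom mapper:

You must change your configuration and provide a custom class. This is done by two calls on the JobConf object:

    conf.setOutputKeyClass(xxx.class): informs the framework of the output key type;

    conf.setMapperClass(xxx.class): informs the framework of the custom class that provides the map method.

As below, the class declaration must also state the input and output key/value types:


/** Transform the input Text, Text key value
 * pairs into LongWritable, Text key/value pairs. */
public class TransformKeysToLongMapper
        extends MapReduceBase implements Mapper<Text, Text, LongWritable, Text> {
    // The map() method body is not included in these notes.
}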


Creating a custom reducer:

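    After your work with the custom mapper in the preceding sections, creating a custom reducer will seem familiar.


So add the following single line to the job configuration, conf.setReducerClass(MergeValuesToCSVReducer.class), where the reducer class is declared as:



public class MergeValuesToCSVReducer<K, V>
        extends MapReduceBase implements Reducer<K, V, K, Text> {
    // The reduce() method body is not included in these notes.
}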
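
The reducer's name suggests that it merges all the values for one key into a single comma-separated value. The notes do not include its body, so the following is only a plain Java sketch of that idea under that assumption; MergeValuesSketch and mergeValues are invented names, and no CSV quoting or escaping is attempted.

```java
import java.util.Arrays;
import java.util.List;

/** Joins the values of one key into a comma-separated string; not Hadoop code. */
final class MergeValuesSketch {

    /** Reduce step for one key: merge its values into one CSV-style string. */
    static String mergeValues(List<String> values) {
        return String.join(",", values);
    }

    public static void main(String[] args) {
        // One output record per key, written as key, TAB, merged values.
        System.out.println("42" + "\t" + mergeValues(Arrays.asList("a", "b", "c")));
    }
}
```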




Why do the mapper and reducer extend MapReduceBase?


The class provides basic implementations of two additional methods that the framework requires of a mapper or reducer.


/** Default implementation that does nothing. */
public void close() throws IOException {
}

/** Default implementation that does nothing. */
public void configure(JobConf job) {
}


The configure() method is the way to get access to the JobConf.

The close() method is the place to close resources or do other cleanup.
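
The order of the calls can be sketched in plain Java: configure once before any record, the per-record method once per record, and close once at the end. Everything in this sketch is a stand-in; PrefixingTask, runTask, and the "sketch.prefix" setting are hypothetical names, and a java.util.Map plays the role of the JobConf.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrates the configure / map / close call order; not Hadoop code. */
final class TaskLifecycleSketch {

    /** Hypothetical stand-in for a mapper class. */
    static final class PrefixingTask {
        private String prefix = "";
        private boolean closed = false;
        final List<String> output = new ArrayList<>();

        /** Called once before any record: read settings from the job configuration. */
        void configure(Map<String, String> jobConf) {
            prefix = jobConf.getOrDefault("sketch.prefix", "");
        }

        /** Called once per input record. */
        void map(String record) {
            output.add(prefix + record);
        }

        /** Called once after the last record: release resources here. */
        void close() {
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** Drives the task in the order described above. */
    static PrefixingTask runTask(Map<String, String> jobConf, List<String> records) {
        PrefixingTask task = new PrefixingTask();
        task.configure(jobConf);
        try {
            for (String record : records) {
                task.map(record);
            }
        } finally {
            task.close();
        }
        return task;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("sketch.prefix", "x-");
        System.out.println(runTask(conf, Arrays.asList("a", "b")).output); // prints [x-a, x-b]
    }
}
```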


The makeup of a cluster

   In the context of Hadoop, a node/machine running the TaskTracker or DataNode server is considered a slave node. It is common to have nodes that run both the TaskTracker and
DataNode servers. The Hadoop server processes on the slave nodes are controlled by their respective masters, the JobTracker and NameNode servers.



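Hadoop project home page:

Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without understanding the low-level details of distribution, and make full use of the power of a cluster for high-speed computation and storage.

Origin: Google's cluster system

Google's data centers use clusters built from inexpensive Linux PCs and run all kinds of applications on them. Even newcomers to distributed development can quickly make use of Google's infrastructure. There are three core components:

   1. GFS (Google File System). A distributed file system that hides lower-level details such as load balancing and redundant replication, and offers upper-level programs a unified file system API. Google optimized it specifically for its own needs, including access to very large files, reads far outnumbering writes, and PCs that fail easily and take nodes out of service. GFS splits files into 64MB blocks, distributes them across the machines of the cluster, and stores them using the Linux file system. Each block has at least 3 redundant copies. At the center is a Master node, which locates file blocks through the file index. See the GFS paper published by Google's engineers for details.

   2. MapReduce. Google found that most distributed computations can be abstracted as MapReduce operations. Map breaks the input into intermediate key/value pairs, and Reduce combines the key/value pairs into the final output. The programmer supplies these two functions to the system, and the underlying infrastructure distributes the Map and Reduce operations across the cluster and stores the results in GFS.

   3. BigTable. A large distributed database that is not a relational database. As its name suggests, it is one huge table used to store structured data.

Google has published papers on all three of these facilities.

Open source implementation

Hadoop is the umbrella name of the project; the name comes from a stuffed toy elephant belonging to the author's son. It consists mainly of HDFS, MapReduce, and HBase.

   HDFS is the open source implementation of the Google File System (GFS).

   MapReduce is the open source implementation of Google MapReduce.

   HBase is the open source implementation of Google BigTable.

This distributed framework is very inventive and extremely scalable, which gives Google a strong competitive edge in system throughput. The Apache Foundation therefore implemented an open source version in Java that supports Linux platforms such as Fedora and Ubuntu. Hadoop is currently backed by Yahoo, which has employees working on the project long-term and is also preparing to use Hadoop internally in place of its previous distributed system.

Hadoop implements the HDFS file system and MapReduce. Users only need to extend MapReduceBase, provide two classes that implement Map and Reduce respectively, and register the job; it then runs distributed automatically.

The current release version is 0.20.1. It is not yet mature, but clusters can already reach 4000 nodes, as in the one built at Yahoo! labs. Here is the data for that cluster:

• 4000 nodes
• 2 x quad core Xeons @ 2.5GHz per node
• 4 x 1TB SATA disk per node
• 8G RAM per node
• Gigabit bandwidth per node
• 40 nodes per rack
• 4 gigabit Ethernet uplinks per rack
• Redhat Linux AS4 (Nahant update 5)
• Sun Java JDK1.6.0_05 - b13
• So the whole cluster has more than 30000 CPUs and nearly 16PB of disk space!

HDFS divides nodes into two kinds: NameNode and DataNode. There is only one NameNode; programs communicate with it and then read and write files on the DataNodes. These operations are transparent and no different from an ordinary file system API.

MapReduce is centered on the JobTracker node, which assigns work and handles communication with the user program.

The HDFS and MapReduce implementations are completely separate; it is not the case that MapReduce computations cannot run without HDFS.

Hadoop shares goals with other cloud computing projects: computing over massive amounts of data. Computing at that scale needs a stable, safe data container, which is why the Hadoop Distributed File System (HDFS) exists.

The communication part of HDFS uses org.apache.hadoop.ipc; a node can be set up quickly with RPC.Server.start(), although the actual business functions still have to be implemented yourself. For HDFS, those functions are reading and writing data streams, NameNode/DataNode communication, and so on.

MapReduce lives mainly in org.apache.hadoop.mapred. Implement the interface classes it provides and complete the node communication (which does not have to use Hadoop's communication interface), and you can run MapReduce computations.

At present this project is still ...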




