Reading Notes: Hadoop: The Definitive Guide, 4th Edition

Chapter 1 Meet Hadoop

Data Storage and Analysis

The problem is simple: although the storage capacities of hard drives have increased massively over the years, access speeds—the rate at which data can be read from drives—have not kept up.
Reading from multiple drives in parallel shortens the read time.
The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. This is how RAID works, for instance, although Hadoop’s filesystem, the Hadoop Distributed Filesystem (HDFS), takes a slightly different approach, as you shall see later.
The second problem is that most analysis tasks need to be able to combine the data in some way, and data read from one disk may need to be combined with data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values.
In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and analysis. What’s more, because it runs on commodity hardware and is open source, Hadoop is affordable.
brute-force approach

Comparison with Other Systems

MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative.
By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had, and, furthermore, they were able to use what they had learned to improve the service for their customers. You can read more about how Rackspace uses Hadoop in Chapter 14.

RDBMS

Why can’t we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed? (This passage is very well written.)
The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database. (B-Trees suit small-scale updates; for bulk updates they do not.)

Chapter 2: MapReduce

Example

The chapter first introduces the format of the weather temperature logs, then uses MapReduce to find the highest temperature for each year.
[Figure: MapReduce logical data flow for the max-temperature example]
The figure illustrates what each stage does. After the map output is processed, and before it is passed to reduce, the key-value pairs are sorted and grouped by key.
The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year is followed by a list of readings; the reduce function has to iterate over the list to find the maximum.

Code implementation

The Mapper class is a generic type:
public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // overrides the map() method
}
Rather than use built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io package. Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer).
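To make the fragment above concrete, here is a minimal sketch of the overridden map() method for the NCDC weather records. The field offsets (year at columns 15-19, temperature at 87-92, quality flag at 92) and the MISSING sentinel are assumptions about the record format rather than a verbatim listing from the book:

// Sketch of the mapper for the max-temperature example; record offsets are illustrative.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;   // sentinel for a missing reading (assumption)

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') {
      // skip the leading plus sign before parsing
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}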
The Reducer is also a generic type:
public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // overrides the reduce() method
}
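A matching sketch of the overridden reduce() method, which scans the list of readings for a year and emits the maximum, as described above (a sketch, not the book's verbatim listing):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());  // keep the largest reading for this year
    }
    context.write(key, new IntWritable(maxValue));
  }
}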

Scaling Out

Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.
Hadoop divides a job into two types of tasks: map tasks and reduce tasks. Two kinds of nodes control the job execution process: one jobtracker and a number of tasktrackers.
Hadoop divides the input into fixed-size pieces called input splits and creates one map task per split, which runs the user-defined map function for each record in the split. Smaller splits finish faster and improve load balancing across parallel tasks, but splits that are too small add management overhead, so the default split size is the HDFS block size (64 MB in Hadoop 1; 128 MB by default in later releases). This guarantees that a split's data can be stored on a single node rather than spanning nodes or racks.
Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization.
Map tasks write their output to the local disk rather than to HDFS; storing it in HDFS would mean replicating it unnecessarily.
Map tasks write their output to local disk, not to HDFS. Why is this? Map output is intermediate output: it’s processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill.
A reduce task's input comes from many map tasks, so the sorted map output has to be transferred across the network between nodes.
the sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability. As explained in Chapter 3, for each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes. Thus, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline consumes.
[Figure: MapReduce data flow with multiple reduce tasks]
This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as “the shuffle,” as each reduce task is fed by many map tasks.
When there are multiple reducers, each map task partitions its output, creating one partition per reducer.
Many jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output, but makes no guarantee about how many times it will be called.
job.setCombinerClass(MaxTemperatureReducer.class);
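For context, a minimal driver sketch showing where setCombinerClass() fits; the class name and job name are illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperatureWithCombiner.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);  // the combiner reuses the reducer logic
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}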

Chapter 3: The Hadoop Distributed Filesystem

HDFS Concepts

Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made to be significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.
The first benefit is the most obvious: a file can be larger than any single disk in the network. Second, making the unit of abstraction a block rather than a file simplifies the storage subsystem. Furthermore, blocks fit well with replication for providing fault tolerance and availability.
Like its disk filesystem cousin, HDFS’s fsck command understands blocks. For example, running:
% hadoop fsck / -files -blocks
will list the blocks that make up each file in the filesystem. (See also “Filesystem check (fsck)” on page 281.)

Rather than measuring bandwidth between nodes, which can be difficult to do in practice (it requires a quiet cluster, and the number of pairs of nodes in a cluster grows as the square of the number of nodes), Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor.
• Processes on the same node
• Different nodes on the same rack
• Nodes on different racks in the same data center
• Nodes in different data centers
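As an illustration of the tree-distance idea, the following toy sketch computes the distance for the four cases listed above, using /datacenter/rack/node path strings. The path notation mirrors Hadoop's topology convention, but this is not Hadoop's actual NetworkTopology implementation:

public class TopologyDistance {

  // Distance = sum of each node's hops up to the closest common ancestor in the tree.
  static int distance(String a, String b) {
    String[] pa = a.split("/");   // e.g. {"", "d1", "r1", "n1"}
    String[] pb = b.split("/");
    int common = 0;
    while (common < Math.min(pa.length, pb.length) && pa[common].equals(pb[common])) {
      common++;
    }
    return (pa.length - common) + (pb.length - common);
  }

  public static void main(String[] args) {
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // same node: 0
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // same rack: 2
    System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // same data center: 4
    System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // different data centers: 6
  }
}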

As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline.
Hadoop’s strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Once the replica locations have been chosen, a pipeline is built, taking network topology into account.

This coherency model has implications for the way you design applications. With no calls to sync(), you should be prepared to lose up to a block of data in the event of client or system failure. For many applications, this is unacceptable, so you should call sync() at suitable points, such as after writing a certain number of records or number of bytes.
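A minimal sketch of that pattern: flush after every N records so that at most N records can be lost on failure. In current FileSystem APIs, hflush() plays the role of the sync() call mentioned above (hsync() additionally forces data to the datanodes' disks); the path and record format here are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PeriodicFlushWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    try (FSDataOutputStream out = fs.create(new Path("/tmp/records.txt"))) {
      for (int i = 0; i < 10_000; i++) {
        out.writeBytes("record-" + i + "\n");
        if (i % 1000 == 0) {
          out.hflush();   // make the data visible to new readers, bounding potential loss
        }
      }
    }
  }
}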

Chapter 5: Hadoop I/O

Data Integrity

The usual way of detecting corrupted data is by computing a checksum for the data when it first enters the system, and then whenever it is transmitted across a channel that is unreliable and hence capable of corrupting the data.
A separate checksum is created for every io.bytes.per.checksum bytes of data. The default is 512 bytes, and since a CRC-32 checksum is 4 bytes long, the storage overhead is less than 1%. A client writing data sends it to a pipeline of datanodes (as explained in Chapter 3), and the last datanode in the pipeline verifies the checksum. If it detects an error, the client receives a ChecksumException, a subclass of IOException.
The way this works is that if a client detects an error when reading a block, it reports the bad block and the datanode it was trying to read from to the namenode before throwing a ChecksumException.
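A back-of-the-envelope illustration of the overhead figure quoted above, using Java's built-in CRC-32 (not Hadoop's checksum code): 4 bytes of checksum per 512 bytes of data is 4/512, roughly 0.78%.

import java.util.zip.CRC32;

public class ChecksumOverhead {
  public static void main(String[] args) {
    int bytesPerChecksum = 512;               // default io.bytes.per.checksum
    byte[] chunk = new byte[bytesPerChecksum];

    CRC32 crc = new CRC32();
    crc.update(chunk);                        // one 4-byte CRC-32 per 512-byte chunk
    System.out.printf("checksum = %08x%n", crc.getValue());

    double overhead = 4.0 / bytesPerChecksum; // 4 bytes of checksum per chunk
    System.out.printf("storage overhead = %.2f%%%n", overhead * 100); // ~0.78%
  }
}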

Serialization

In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs). The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message.
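As a small illustration of what serialization means here, the sketch below writes an IntWritable to a byte stream, as would happen for an RPC message, and reads it back; the class name is made up:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
  public static void main(String[] args) throws Exception {
    IntWritable original = new IntWritable(163);

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    original.write(new DataOutputStream(out));     // Writable.write(DataOutput)
    byte[] bytes = out.toByteArray();              // 4 bytes for an int
    System.out.println("serialized length = " + bytes.length);

    IntWritable copy = new IntWritable();
    copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
    System.out.println("round-tripped value = " + copy.get());  // 163
  }
}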

Chapter 6: Developing a MapReduce Application

The recommended order for writing a MapReduce program: write the map and reduce functions; write a driver program to run the job; run it against a small subset of the data in an IDE to verify and debug it; running it on a cluster exposes more problems; debugging on a cluster brings its own challenges, so some general techniques follow.
After the program is working, you may wish to do some tuning, first by running through some standard checks for making MapReduce programs faster and then by doing task profiling. Profiling distributed programs is not trivial, but Hadoop has hooks to aid the process.
(The Hadoop configuration files shown in the book are somewhat dated; refer to the latest official documentation.)
Writing a Unit Test
Both map and reduce write their output to a Context, so the Context has to be replaced with a mock in order to verify the output.
since outputs are written to a Context (or an OutputCollector in the old API), rather than simply being returned from the method call, the Context needs to be replaced with a mock so that its outputs can be verified.
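A sketch of such a test using MRUnit's ReduceDriver, which supplies the input and verifies the expected output without requiring you to build the mock Context by hand (this assumes the MRUnit dependency is available; the test values are arbitrary):

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class MaxTemperatureReducerTest {

  @Test
  public void returnsMaximumIntegerInValues() throws Exception {
    new ReduceDriver<Text, IntWritable, Text, IntWritable>()
        .withReducer(new MaxTemperatureReducer())
        .withInput(new Text("1950"),
            Arrays.asList(new IntWritable(10), new IntWritable(5)))
        .withOutput(new Text("1950"), new IntWritable(10))   // expected (key, max value)
        .runTest();
  }
}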
Running Locally on Test Data
Running on a Cluster: the job ID (e.g., job_200904110811_0002) and the job counters are printed out.
Tasks belong to a job; a task ID looks like task_200904110811_0002_m_000003.
attempt_200904110811_0002_m_000003_0 is the first attempt at running task task_200904110811_0002_m_000003.
The final count in the task attempt ID is incremented by 1,000 if the job is restarted after the jobtracker is restarted and recovers its running jobs (although this behavior is disabled by default—see “Jobtracker Failure” on page 202).
The local job runner is a very different environment from a cluster, and the data flow patterns are very different. Optimizing the CPU performance of your code may be pointless if your MapReduce job is I/O-bound (as many jobs are).

Chapter 7: How MapReduce Works

Anatomy of a MapReduce Job Run

You can run a MapReduce job with a single method call: submit() on a Job object (note that you can also call waitForCompletion(), which will submit the job if it hasn’t been submitted already, then wait for it to finish).

MapReduce 1 (deprecated)

The whole process is illustrated in Figure 6-1. At the highest level, there are four independent entities:
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
• The distributed filesystem (normally HDFS, covered in Chapter 3), which is used for sharing job files between the other entities.
[Figure 6-1: how Hadoop runs a MapReduce job with MapReduce 1]
1. The client runs the job;
2. Once started, the job asks the jobtracker for a new job ID. The output directory is checked: if it already exists, the job fails. The input splits are computed; if the input directory does not exist, the job fails;
3. The job JAR, the computed input splits, and other resources are copied to an HDFS directory named after the job ID;
4. The jobtracker is told that the job is ready for execution;
5. The jobtracker initializes the job;
6. It retrieves the input splits;
7. Tasktrackers keep in touch with the jobtracker through a periodic heartbeat;
8. The jobtracker maintains a priority queue of jobs, picks a job, and assigns one of its tasks to the tasktracker.
A tasktracker has a fixed number of slots for map tasks and reduce tasks.
For data locality considerations, when assigning a map task the jobtracker picks a tasktracker that is as close as possible to the task's input split. Ideally the task is data-local, running on the node where the split is stored;
9. The TaskRunner launches a new child JVM;
10. The child JVM runs the task.
In the case of Streaming, the Streaming task communicates with the process (which may be written in any language) using standard input and output streams. The Pipes task, on the other hand, listens on a socket and passes the C++ process a port number in its environment, so that on startup, the C++ process can establish a persistent socket connection back to the parent Java Pipes task.

For reduce tasks, it’s a little more complex, but the system can still estimate the proportion of the reduce input processed. It does this by dividing the total progress into three parts, corresponding to the three phases of the shuffle (see “Shuffle and Sort” on page 163). For example, if the task has run the reducer on half its input, then the task’s progress is 5⁄6, since it has completed the copy and sort phases (1⁄3 each) and is halfway through the reduce phase (1⁄6).
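A toy illustration of that arithmetic, not Hadoop's actual progress-reporting code: the copy and sort phases each contribute 1/3, and the reduce phase contributes its completed fraction of the final 1/3.

public class ReduceProgressExample {

  // copyDone and sortDone each add 1/3; the reduce phase adds reduceFraction/3.
  static double reduceTaskProgress(boolean copyDone, boolean sortDone, double reduceFraction) {
    double progress = 0.0;
    if (copyDone) progress += 1.0 / 3;
    if (sortDone) progress += 1.0 / 3;
    return progress + reduceFraction / 3;
  }

  public static void main(String[] args) {
    // Halfway through the reduce phase after copy and sort: 1/3 + 1/3 + 1/6 = 5/6
    System.out.println(reduceTaskProgress(true, true, 0.5));  // 0.8333...
  }
}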

YARN (MapReduce 2)

MapReduce 1 as described above hits bottlenecks as the number of nodes grows, which is what led to YARN.
For very large clusters in the region of 4000 nodes and higher, the MapReduce system described in the previous section begins to hit scalability bottlenecks, so in 2010 a group at Yahoo! began to design the next generation of MapReduce. The result was YARN, short for Yet Another Resource Negotiator (or if you prefer recursive acronyms, YARN Application Resource Negotiator).
YARN meets the scalability shortcomings of “classic” MapReduce by splitting the responsibilities of the jobtracker into separate entities. The jobtracker takes care of both job scheduling (matching tasks with tasktrackers) and task progress monitoring.
Remember: ResourceManager, NodeManager, and MRApplicationMaster.
YARN hands the jobtracker's two responsibilities to the ResourceManager and the ApplicationMaster, which manage them separately.
YARN separates these two roles into two independent daemons: a resource manager to manage the use of resources across the cluster, and an application master to manage the lifecycle of applications running on the cluster. The idea is that an application master negotiates with the resource manager for cluster resources—described in terms of a number of containers each with a certain memory limit—then runs application-specific processes in those containers. The containers are overseen by node managers running on cluster nodes, which ensure that the application does not use more resources than it has been allocated.
[Figure: how Hadoop runs a MapReduce job with YARN]
Process: after the job is submitted, it asks the ResourceManager for a new application ID, checks whether the output directory already exists, computes the input splits, copies the job resources to HDFS, and then submits the application to the ResourceManager.
When the ResourceManager receives the submitApplication() call, it hands the request to the YARN scheduler, which allocates a container; the MRApplicationMaster is started in that container, managed by the NodeManager on that node. The application master creates one map task per input split. If the job is small (on the order of 10 map tasks and a single reduce), it is run locally in the application master's JVM.
If the job is not small, the application master requests containers from the ResourceManager; map tasks take priority over reduce tasks (requests for reduce containers are not made until 5% of the map tasks have completed).
The MRApplicationMaster then asks the assigned NodeManagers to start new containers, in which a YarnChild fetches the job resources and runs the MapTask or ReduceTask.
The way memory is allocated is different to MapReduce 1, where tasktrackers have a fixed number of “slots”, set at cluster configuration time, and each task runs in a single slot. Slots have a maximum memory allowance, which again is fixed for a cluster, and which leads both to problems of underutilization when tasks use less memory (since other waiting tasks are not able to take advantage of the unused memory) and problems of job failure when a task can’t complete since it can’t get enough memory to run correctly.
In YARN, resources are more fine-grained, so both these problems can be avoided. In particular, applications may request a memory capability that is anywhere between the minimum allocation and a maximum allocation, and which must be a multiple of the minimum allocation. Default memory allocations are scheduler-specific, and for the capacity scheduler the default minimum is 1024 MB (set by yarn.scheduler.capacity.minimum-allocation-mb), and the default maximum is 10240 MB (set by yarn.scheduler.capacity.maximum-allocation-mb). Thus, tasks can request any memory allocation between 1 and 10 GB (inclusive), in multiples of 1 GB (the scheduler will round to the nearest multiple if needed), by setting mapreduce.map.memory.mb and mapreduce.reduce.memory.mb appropriately.
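A toy sketch of the allocation rule described above. The request is rounded up to a multiple of the minimum here; the exact rounding behaviour belongs to the scheduler, so treat this as an illustration of the arithmetic only:

public class YarnAllocationExample {

  // Round a memory request up to a multiple of the minimum allocation and cap it at the maximum.
  static int normalize(int requestedMb, int minMb, int maxMb) {
    int rounded = ((requestedMb + minMb - 1) / minMb) * minMb;
    return Math.min(Math.max(rounded, minMb), maxMb);
  }

  public static void main(String[] args) {
    int min = 1024, max = 10240;                      // capacity scheduler defaults quoted above
    System.out.println(normalize(1500, min, max));    // 2048
    System.out.println(normalize(4096, min, max));    // 4096
    System.out.println(normalize(20000, min, max));   // 10240 (capped at the maximum)
  }
}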

Shuffle and Sort

MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort—and transfers the map outputs to the reducers as inputs—is known as the shuffle.The shuffle is an area of the codebase where refinements and improvements are continually being made, so the following description necessarily conceals many details (and may change over time, this is for version 0.20). In many ways, the shuffle is the heart of MapReduce, and is where the “magic” happens.

The Map Side

When the map function starts producing output, it is not simply written to disk. The process is more involved, and takes advantage of buffering writes in memory and doing some presorting for efficiency reasons.
The map output is not simply written to disk; it is written to a memory buffer and presorted for efficiency.
[Figure: shuffle and sort in MapReduce]

Each map task has a circular memory buffer that it writes the output to. The buffer is 100 MB by default (the size can be tuned by changing the mapreduce.task.io.sort.mb property). When the contents of the buffer reach a certain threshold size (mapreduce.map.sort.spill.percent, which has the default value 0.80, or 80%), a background thread will start to spill the contents to disk. Map outputs will continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete. Spills are written in round-robin fashion to the directories specified by the mapreduce.cluster.local.dir property, in a job-specific subdirectory.
Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers the data will ultimately be sent to. Within each partition, a background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.
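A small sketch of how the spill-related properties mentioned above could be set from a driver; the values are arbitrary examples, not tuning recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuningExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.task.io.sort.mb", 200);             // size of the in-memory sort buffer
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);  // buffer fill level that triggers a spill
    Job job = Job.getInstance(conf, "Max temperature");
    // ... set mapper/reducer/paths as in the driver sketch earlier ...
  }
}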

The Reduce Side

The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. This is known as the copy phase of the reduce task. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel.(A thread in the reducer periodically asks the master for map output hosts until it has retrieved them all.)
When all the map outputs have been copied, the reduce task moves into the sort phase, which merges the map outputs, maintaining their sort ordering. This is done in rounds. For example, if there were 50 map outputs and the merge factor was 10, there would be five rounds. Each round would merge 10 files into 1, so at the end there would be 5 intermediate files.
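The arithmetic of that example as a toy helper; the real merge is more subtle about how the final round is fed to the reduce, so this only covers the round count:

public class MergeRoundsExample {

  // ceil(files / mergeFactor): 50 files with a merge factor of 10 -> 5 rounds.
  static int mergeRounds(int files, int mergeFactor) {
    return (files + mergeFactor - 1) / mergeFactor;
  }

  public static void main(String[] args) {
    System.out.println(mergeRounds(50, 10));  // 5
  }
}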
