Hadoop: 10 MapReduce tips (repost)

10 MapReduce Tips
This piece is based on the talk “Practical MapReduce” that I gave at Hadoop User Group UK on April 14.

1. Use an appropriate MapReduce language
There are many languages and frameworks that sit on top of MapReduce, so it’s worth thinking up-front which one to use for a particular problem. There is no one-size-fits-all language; each has different strengths and weaknesses.

Java: Good for: speed; control; binary data; working with existing Java or MapReduce libraries.
Pipes: Good for: working with existing C++ libraries.
Streaming: Good for: writing MapReduce programs in scripting languages.
Dumbo (Python), Happy (Jython), Wukong (Ruby), mrtoolkit (Ruby): Good for: Python/Ruby programmers who want quick results, and are comfortable with the MapReduce abstraction.
Pig, Hive, Cascading: Good for: higher-level abstractions; joins; nested data.
While there are no hard and fast rules, in general we recommend pure Java for large, recurring jobs; Hive for SQL-style analysis and data warehousing; and Pig or Streaming for ad hoc analysis.

2. Consider your input data “chunk” size
Are you generating large, unbounded files, like log files? Or lots of small files, like image files? How frequently do you need to run jobs?

Answers to these questions determine how you store and process data in HDFS. For large unbounded files, one approach (until HDFS appends are working) is to write files in batches and merge them periodically. For lots of small files, see The Small Files Problem. HBase is a good abstraction for some of these problems too, so it may be worth considering.

3. Use SequenceFile and MapFile containers
SequenceFiles are a very useful tool. They are:

Splittable. So they work well with MapReduce: each map gets an independent split to work on.
Compressible. By using block compression you get the benefits of compression (less disk space, faster reads and writes) while still keeping the file splittable.
Compact. SequenceFiles are usually used with Hadoop Writable objects, which have a pretty compact format.
A MapFile is an indexed SequenceFile, useful if you want to do look-ups by key.
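
For example, here is a minimal sketch of writing a block-compressed SequenceFile (the output path and the IntWritable/Text record contents are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("numbers.seq"); // hypothetical output path
    // Block compression compresses runs of records together, so the
    // file stays both compact and splittable.
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
        IntWritable.class, Text.class, SequenceFile.CompressionType.BLOCK);
    try {
      for (int i = 0; i < 100; i++) {
        writer.append(new IntWritable(i), new Text("record " + i));
      }
    } finally {
      writer.close();
    }
  }
}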

However, both are Java-centric, so you can't read them with non-Java tools. The Thrift and Avro projects are the places to look for language-neutral container file formats. (For example, see Avro's DataFileWriter, although there is no MapReduce integration yet.)

4. Implement the Tool interface
If you are writing a Java driver, then consider implementing the Tool interface to get the following options for free:

-D to pass in arbitrary properties (e.g. -D mapred.reduce.tasks=7 sets the number of reducers to 7)
-files to put files into the distributed cache
-archives to put archives (tar, tar.gz, zip, jar) into the distributed cache
-libjars to put JAR files on the task classpath
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    JobConf job = new JobConf(getConf(), MyJob.class);
    // configure the job (input/output paths, mapper, reducer, ...) then run it
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new MyJob(), args);
    System.exit(res);
  }
}

By taking this step you also make your driver more testable, since you can inject arbitrary configurations using Configured's setConf() method.
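
With ToolRunner in place, the generic options are parsed before your own arguments reach run(). A hypothetical invocation (the jar name and paths are made up for the example) might look like:

% hadoop jar myjob.jar MyJob -D mapred.reduce.tasks=7 -files lookup.txt input output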

5. Chain your jobs
It's often natural to split a problem into multiple MapReduce jobs. The benefits are a better decomposition of the problem into smaller, more easily understood (and more easily tested) steps. It can also boost re-usability. Also, by using the Fair Scheduler, you can run a small job promptly, and not worry that it will be stuck in a long queue of (other people's) jobs.

ChainMapper and ChainReducer (in 0.20.0) are worth checking out too, as they allow you to use smaller units within one job, effectively allowing multiple mappers before and after the (single) reducer: M+RM*.
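
As a rough sketch of the M+RM* pattern (AMapper, BMapper and MyReducer are hypothetical classes, and this would sit inside a Tool's run() method):

JobConf job = new JobConf(getConf(), MyJob.class);

// First mapper in the chain: reads the job's input records.
ChainMapper.addMapper(job, AMapper.class,
    LongWritable.class, Text.class, Text.class, Text.class,
    true, new JobConf(false));

// The single reducer.
ChainReducer.setReducer(job, MyReducer.class,
    Text.class, Text.class, Text.class, Text.class,
    true, new JobConf(false));

// A mapper that runs after the reducer, on the reducer's output.
ChainReducer.addMapper(job, BMapper.class,
    Text.class, Text.class, Text.class, Text.class,
    true, new JobConf(false));

JobClient.runJob(job);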

Pig and Hive do this kind of thing all the time, and it can be instructive to understand what they are doing behind the scenes by using EXPLAIN, or even by reading their source code, to make you a better MapReduce programmer. Of course, you could always use Pig or Hive in the first place…

6. Favor multiple partitions
We're used to thinking that the output data is contained in one file. This is OK for small datasets, but if the output is large (more than a few tens of gigabytes, say) then it's normally better to have a partitioned file, so you take advantage of the cluster parallelism for the reducer tasks. Conceptually, you should think of your output/part-* files as a single "file": the fact that it is broken up is an implementation detail. Often, the output forms the input to another MapReduce job, so it is naturally processed as a partitioned output, by specifying the output directory as the input path to the second job.

In some cases the partitioning can be exploited. CompositeInputFormat, for example, uses the partitioning to do joins efficiently on the map-side. Another example: if your output is a MapFile, you can use MapFileOutputFormat’s getReaders() method to do lookups on the partitioned output.
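
For instance, a look-up against partitioned MapFile output might be sketched like this (the output directory, key and types are assumptions; getEntry() uses the same partitioner the job used, to find the right part file):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path outputDir = new Path("output"); // hypothetical job output directory
MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, outputDir, conf);
Partitioner<Text, Text> partitioner = new HashPartitioner<Text, Text>();
Text value = new Text();
// Returns the value for the key, or null if the key is not found.
Writable entry = MapFileOutputFormat.getEntry(readers, partitioner,
    new Text("some-key"), value);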

For small outputs you can merge the partitions into a single file, either by setting the number of reducers to 1 (the default), or by using the handy -getmerge option on the filesystem shell:

% hadoop fs -getmerge hdfs-output-dir local-file

This concatenates the HDFS files hdfs-output-dir/part-* into a single local file.

7. Report progress
If your task reports no progress for 10 minutes (see the mapred.task.timeout property) then it will be killed by Hadoop. Most tasks don’t encounter this situation since they report progress implicitly by reading input and writing output. However, some jobs which don’t process records in this way may fall foul of this behavior and have their tasks killed. Simulations are a good example, since they do a lot of CPU-intensive processing in each map and typically only write the result at the end of the computation. They should be written in such a way as to report progress on a regular basis (more frequently than every 10 minutes). This may be achieved in a number of ways:

Call setStatus() on Reporter to set a human-readable description of the task's progress
Call incrCounter() on Reporter to increment a user counter
Call progress() on Reporter to tell Hadoop that your task is still there (and making progress)
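
For instance, a simulation-style mapper might be sketched like this (simulateStep() and the iteration count are hypothetical):

public void map(LongWritable key, Text value,
    OutputCollector<Text, DoubleWritable> output, Reporter reporter)
    throws IOException {
  double result = 0;
  for (int i = 0; i < 1000000; i++) {
    result = simulateStep(result, i); // hypothetical CPU-intensive step
    if (i % 10000 == 0) {
      reporter.progress(); // tell Hadoop the task is still alive
      reporter.setStatus("completed " + i + " steps");
    }
  }
  output.collect(new Text("result"), new DoubleWritable(result)); // emit only at the end
}
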
8. Debug with status and counters
Using the Reporter’s setStatus() and incrCounter() methods is a simple but effective way to debug your jobs. Counters are often better than printing to standard error since they are aggregated centrally, and allow you to see how many times a condition has occurred.
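
For example, you might count (and skip) malformed records with a user counter, sketched here with a hypothetical enum and validation check:

enum Quality { MALFORMED_RECORDS }

public void map(LongWritable key, Text value,
    OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  if (!isWellFormed(value)) { // hypothetical validation
    reporter.incrCounter(Quality.MALFORMED_RECORDS, 1); // aggregated centrally
    reporter.setStatus("Skipping malformed record at offset " + key);
    return; // skip the bad record rather than failing the task
  }
  // normal processing ...
}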

Status descriptions are shown on the web UI so you can monitor a job and keep an eye on the statuses (as long as all the tasks fit on a single page). You can send extra debugging information to standard error, which you can then retrieve through the web UI (click through to the task attempt, and find the stderr file).

You can do more advanced debugging with debug scripts.

9. Tune at the job level before the task level
Before you start profiling tasks there are a number of job-level checks to run through:

Have you set the optimal number of mappers and reducers?
The number of mappers is by default set to one per HDFS block. This is usually a good default, but see tip 2.
The number of reducers is best set to be the number of reduce slots in the cluster (minus a few to allow for failures). This allows the reducers to complete in a single wave.
Have you set a combiner (if your algorithm allows it)?
Have you enabled intermediate compression? (See JobConf.setCompressMapOutput(), or equivalently mapred.compress.map.output).
If using custom Writables, have you provided a RawComparator?
Finally, there are a number of low-level MapReduce shuffle parameters that you can tune to get improved performance.
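
Put together, the checklist translates into a few lines in the driver; a hedged sketch (MyCombiner and MyKeyComparator are hypothetical classes, and 27 is just an illustrative slot count):

JobConf job = new JobConf(getConf(), MyJob.class);
job.setNumReduceTasks(27); // about the number of reduce slots, minus a few
job.setCombinerClass(MyCombiner.class); // hypothetical combiner class
job.setCompressMapOutput(true); // same as mapred.compress.map.output=true
job.setOutputKeyComparatorClass(MyKeyComparator.class); // RawComparator for a custom Writable
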
10. Let someone else do the cluster administration
Getting a cluster up and running can be decidedly non-trivial, so use some of the free tools to get started. For example, Cloudera provides an online configuration tool, RPMs, and Debian packages to set up Hadoop on your own hardware, as well as scripts to run on Amazon EC2.

Do you have a MapReduce tip to share? Please let us know in the comments.

Monday, May 18th, 2009 at 1:40 pm by Tom White, filed under general, hadoop, mapreduce