Hadoop 经典面试问题.

Hadoop is a complex framework. Some interview questions can be really simple like “How do you debug a performance issue or a long running job?” but difficult to answer on the spot if you are not prepared. Below are some sample and common questions you would hear in an interview. The Hadoop Developer Interview guide has more complex and difficult to answer scenario based questions than what is shown below.

How do you debug a performance issue or a long running job?

This is an open ended question and the interviewer is trying to see the level of hands-on experience you have in solving production issues. Use your day to day work experience to answer this question. Here are some of the scenarios and responses to help you construct your answer. On a very high level you will follow the below steps.

 Understand the symptom
 Analyze the situation
 Identify the problem areas
 Propose solution

Scenario 1 – Job with 100 mappers and 1 reducer takes a long time for the reducer to start after all the mappers are complete. One of the reasons could be that reduce is spending a lot of time copying the map outputs. So in this case we can try couple of things.

  1. If possible add a combiner to reduce the amount of output from the mapper to be sent to the reducer
    1. Enable map output compression – this will further reduce the size of the outputs to be transferred to the reducer.

Scenario 2 – A particular task is using a lot of memory which is causing the slowness or failure, I will look for ways to reduce the memory usage.

  1. Make sure the joins are made in an optimal way with memory usage in mind. For e.g. in Pig joins, the LEFT hand side tables are sent to the reducer first and held in memory and the RIGHT most table in streamed to the reducer. So make sure the RIGHT most table is largest of the datasets in the join.
    1. We can also increase the memory requirements needed by the map and reduce tasks by setting – mapreduce.map.memory.mb and mapreduce.reduce.memory.mb

Scenario 3 – Understanding the data helps a lot in optimizing the way we use the datasets in PIG and HIVE scripts.

  1. If you have smaller tables in join, they can be sent to distributed cache and loaded in memory on the Map side and the entire join can be done on the Map side thereby avoiding the shuffle and reduce phase altogether. This will tremendously improve performance. Look up USING REPLICATED in Pig and MAPJOIN or hive.auto.convert.join in Hive
    1. If the data is already sorted you can use USING MERGE which will do a Map Only join
    2. If the data is bucketted in hive, you may use hive.optimize.bucketmapjoin or
      hive.optimize.bucketmapjoin.sortedmerge depending on the characteristics of the data

Scenario 4 – The Shuffle process is the heart of a MapReduce program and it can be tweaked for performance improvement.

  1. If you see lots of records are being spilled to the disk (check for Spilled Records in the counters in your MapReduce output) you can increase the memory available for Map to perform the Shuffle by increasing the value in io.sort.mb. This will reduce the amount of Map Outputs written to the disk so the sorting of the keys can be performed in memory.
    1. On the reduce side the merge operation (merging the output from several mappers) can be done in disk by setting the mapred.inmem.merge.threshold to 0.
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
YOLO高分设计资源源码,详情请查看资源内容中使用说明 YOLO高分设计资源源码,详情请查看资源内容中使用说明 YOLO高分设计资源源码,详情请查看资源内容中使用说明 YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明YOLO高分设计资源源码,详情请查看资源内容中使用说明

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值