Hadoop Basics 1

Collected from several articles on the web =====

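Beginner:
Know the overall MapReduce flow: map, shuffle, reduce
Know what the combiner and the partitioner do, and how to configure compression
Set up a Hadoop cluster; know which services run on the master and on the slaves
HDFS: how replicas are placed and located
Versions: 0.20.2 -> 0.20.203 -> 0.20.205, 0.21, 0.23, 1.0.1
Differences between the old and new APIs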
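
To make the partitioner's role concrete, here is a minimal sketch in plain Python (not Hadoop code). It assumes hash partitioning, where a key's hash modulo the number of reducers picks the reducer. The function name partition_for and the use of zlib.crc32 as the hash are illustrative assumptions.

```python
import zlib

def partition_for(key, num_reducers):
    # zlib.crc32 stands in for the key's hash code. It is stable across runs,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

keys = ["apple", "banana", "apple", "cherry"]
assignments = [(k, partition_for(k, 3)) for k in keys]
print(assignments)
```

The point is that every occurrence of the same key lands on the same reducer, which is what lets a reducer see all values for its keys.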


Intermediate:
Hadoop parameter tuning. Cluster level: JVM, map/reduce slots. Job level: number of reducers,
memory, whether to use a combiner, whether to use compression
Pig Latin and Hive: basic syntax
Setting up HBase and ZooKeeper


Latest developments:
Follow the Cloudera and Hortonworks blogs
The next-generation MR2 framework
High availability: avoiding the NameNode as a single point of failure
Data stream systems: streaming, Storm (Twitter)


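Practice algorithms:
wordcount
Anagrams in a dictionary (grouping words made of the same letters)
Translate a SQL statement into MapReduce: select count(x) from a group by b;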
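
As one way to approach the SQL exercise above, the following is a small plain-Python simulation of the map, shuffle, and reduce steps (not Hadoop code). The sample rows and the function names are made up for illustration. The mapper emits 0 for rows where x is NULL because COUNT(x) skips NULLs.

```python
from collections import defaultdict

def map_phase(rows):
    # Emit (b, 1) when x is present and (b, 0) when x is NULL,
    # so groups whose x values are all NULL still appear with count 0.
    for row in rows:
        yield row["b"], 1 if row["x"] is not None else 0

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the emitted values per key to get the per-group count.
    return {key: sum(values) for key, values in groups.items()}

rows = [
    {"b": "red", "x": 1},
    {"b": "red", "x": None},
    {"b": "blue", "x": 7},
    {"b": "red", "x": 3},
]
counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)  # {'red': 2, 'blue': 1}
```

The group-by column becomes the map output key, and the aggregate becomes the reduce function. Word count has the same shape, with the word as the key.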
Q17. Hadoop achieves parallelism by dividing tasks across many nodes, so a few slow nodes can rate-limit the rest of the program and slow it down. What mechanism does Hadoop provide to combat this?
Speculative execution.
 
Q18. How does speculative execution work in Hadoop?
The JobTracker schedules duplicate copies of the same task, processing the same input, on different TaskTrackers. When a task completes, it announces this to the JobTracker. Whichever copy of the task finishes first becomes the definitive copy. If other copies were still executing speculatively, Hadoop tells their TaskTrackers to abandon those tasks and discard their outputs. The reducers then receive their inputs from whichever copy of the mapper completed successfully first.
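
The "first copy to finish wins" rule can be sketched as a tiny plain-Python simulation. This is not how Hadoop implements it. The node names and finish times are invented, and run_with_speculation is a hypothetical helper.

```python
def run_with_speculation(attempts):
    # attempts: list of (node_name, simulated_finish_time) for copies of ONE task.
    # The earliest finisher becomes the definitive copy.
    winner = min(attempts, key=lambda attempt: attempt[1])
    # Every other copy is abandoned and its output discarded.
    killed = [node for node, _ in attempts if node != winner[0]]
    return winner[0], killed

winner, killed = run_with_speculation([("slow-node", 90.0), ("backup-node", 12.5)])
print(winner, killed)  # backup-node ['slow-node']
```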
Q22. What is the Distributed Cache in Hadoop?
The Distributed Cache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework copies the necessary files to a slave node before any tasks for the job are executed on that node.
Q23. What is the benefit of the Distributed Cache? Why can't we just keep the file in HDFS and have the application read it?
Because the Distributed Cache is much faster. It copies the file to each TaskTracker node once, at the start of the job. Whether that TaskTracker then runs 10 or 100 mappers or reducers, they all use the same local copy. If, on the other hand, the MR job contains code that reads the file from HDFS, every mapper accesses HDFS itself, so a TaskTracker that runs 100 map tasks reads the file from HDFS 100 times. HDFS is also not very efficient when used like this.
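
A rough way to see the difference is to count remote reads in both approaches. The sketch below is plain Python with a stand-in RemoteStore class in place of HDFS. The class, the function names, and the 4 nodes with 100 tasks each are assumptions for illustration.

```python
class RemoteStore:
    """Stands in for HDFS and counts how many times the file is fetched."""

    def __init__(self, data):
        self.data = data
        self.reads = 0

    def read(self):
        self.reads += 1
        return self.data

def run_tasks_reading_directly(store, nodes, tasks_per_node):
    # Every task fetches the file from the remote store itself.
    for _ in range(nodes):
        for _ in range(tasks_per_node):
            store.read()
    return store.reads

def run_tasks_with_node_cache(store, nodes, tasks_per_node):
    # The file is fetched once per node, then all tasks share the local copy.
    for _ in range(nodes):
        local_copy = store.read()
        for _ in range(tasks_per_node):
            _ = local_copy
    return store.reads

direct = run_tasks_reading_directly(RemoteStore("lookup table"), nodes=4, tasks_per_node=100)
cached = run_tasks_with_node_cache(RemoteStore("lookup table"), nodes=4, tasks_per_node=100)
print(direct, cached)  # 400 4
```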
 
Q24. What mechanism does the Hadoop framework provide to synchronize changes made to the Distributed Cache during the runtime of the application?
This is a trick question. There is no such mechanism. The Distributed Cache is by design read-only during job execution.
 
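Q25. Have you ever used counters in Hadoop? Give us an example scenario.
Anybody who claims to have worked on a Hadoop project is expected to have used counters. A typical scenario is counting malformed or skipped input records so that the totals can be checked after the job finishes.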
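
One possible scenario, sketched in plain Python rather than with Hadoop's counter API: a parser that tallies valid and malformed records while it processes input. The counter names, the "key,number" record format, and parse_records are hypothetical.

```python
from collections import Counter

def parse_records(lines, counters):
    # Expect "key,number" lines; count anything else as malformed and skip it.
    for line in lines:
        parts = line.split(",")
        if len(parts) != 2 or not parts[1].strip().isdigit():
            counters["MALFORMED_RECORDS"] += 1
            continue
        counters["VALID_RECORDS"] += 1
        yield parts[0].strip(), int(parts[1])

counters = Counter()
records = list(parse_records(["a,1", "broken", "b,x", "c, 3"], counters))
print(records)         # [('a', 1), ('c', 3)]
print(dict(counters))  # {'VALID_RECORDS': 2, 'MALFORMED_RECORDS': 2}
```

In a real job the framework aggregates such counters across all tasks, so bad input shows up in the job summary without a separate pass over the data.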
 
Q26. Is it possible to provide multiple inputs to Hadoop? If yes, how can you give multiple directories as input to a Hadoop job?
Yes. The input format class provides methods to add multiple directories as input to a Hadoop job, for example FileInputFormat.addInputPath, which can be called once per directory.
 
Q27. Is it possible to have Hadoop job output in multiple directories? If yes, how?
Yes, by using the MultipleOutputs class.
 
Q28. What will a Hadoop job do if you try to run it with an output directory that already exists? Will it
- overwrite it
- warn you and continue
- throw an exception and exit
The Hadoop job will throw an exception and exit.
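
The fail-fast behaviour can be mimicked locally. The sketch below uses plain Python and the local filesystem, not HDFS. The names check_output_dir and run_job are hypothetical, and the temporary directory exists only for the demonstration.

```python
import os
import tempfile

def check_output_dir(path):
    # Refuse to start if the output directory is already there,
    # so an earlier run's results are never silently overwritten.
    if os.path.exists(path):
        raise FileExistsError(f"Output directory {path} already exists")

def run_job(output_dir):
    check_output_dir(output_dir)
    os.makedirs(output_dir)
    return "done"

out = os.path.join(tempfile.mkdtemp(), "out")
first = run_job(out)
try:
    second = run_job(out)
except FileExistsError:
    second = "rejected"
print(first, second)  # done rejected
```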
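1. Write a web crawler.
Requirements: 1. The URL format of the pages to crawl is configurable.
              2. The crawl depth is customizable.
              3. Crawled pages can be stored by a program invoked afterwards (i.e. through events).
2. There is a large batch of URLs to crawl. URL parsing and n-level crawling (multi-level depth) are already implemented on the server side. You are given several servers and a steadily growing number of client machines. An existing mechanism already keeps the URL tasks balanced across the servers, and clients request crawl tasks from the servers and complete them.
     Design a distributed framework that performs single-level URL crawling, so that every server gets as even a share of client resources as possible.
     Note: servers may go down.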


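1. Design a system that can extract data in a specified format from a steadily growing number of different data sources.
Requirements: 1. The run results should give a rough picture of how well extraction works, so that the extraction method can be improved continuously.
              2. Because the data sources differ, provide a flexibly configurable program framework.
              3. Data sources may include MySQL, SQL Server, and so on.
              4. The system should support continued mining, i.e. it can be run repeatedly to extract more information.
2. Write a tool that generates regular expressions for extracting formatted data, based on different document templates.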
=====
1. How does Hadoop work?


2. How does MapReduce work?


3. How does HDFS store data?


4. Give a simple example that shows how a MapReduce job runs.


5. The interviewer gives you some problems and asks you to implement them with MapReduce.


      For example: there are 10 folders, each containing 1000000 URLs. Find the top 1000000 URLs.
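
One possible approach to the example above, sketched in plain Python with tiny made-up data: first count each URL across all folders (the word-count pattern), then take a local top N in each partition of the counted URLs and merge the local results. It assumes "top" means most frequent, which the question does not state. All function names are hypothetical.

```python
import heapq
from collections import Counter

def count_urls(folder):
    # Stage 1, per input split: count occurrences, as in word count.
    return Counter(folder)

def merge_counts(partials):
    # Stage 1 reduce: add up the partial counts for each URL.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def local_top_n(pairs, n):
    # Keep the n largest (url, count) pairs; ties are broken by URL.
    return heapq.nlargest(n, pairs, key=lambda kv: (kv[1], kv[0]))

folders = [
    ["a.com", "b.com", "a.com"],
    ["c.com", "a.com", "b.com"],
    ["d.com"],
]
totals = merge_counts(count_urls(folder) for folder in folders)

# Stage 2: each partition of fully counted URLs keeps its own top N,
# and a single final step merges those candidates.
items = sorted(totals.items())
partitions = [items[0::2], items[1::2]]
candidates = [kv for part in partitions for kv in local_top_n(part, 2)]
top = local_top_n(candidates, 2)
print(top)  # [('a.com', 3), ('b.com', 2)]
```

The local top N is only safe after stage 1, when each URL's full count lives in exactly one partition. Taking a top N of partial counts could drop a URL that is frequent overall.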


6. What is the role of the Combiner in Hadoop?
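
As a hint for this question, the sketch below (plain Python, not Hadoop code) counts how many records would cross the shuffle in a word count with and without a map-side combiner. The sample splits and function names are made up. It assumes the combiner can reuse the reducer's logic, which holds for sums but not for every aggregate.

```python
from collections import Counter

def map_words(split):
    # Emit (word, 1) for every word in the split.
    return [(word, 1) for word in split.split()]

def combine(pairs):
    # Local pre-aggregation on the map side, before data crosses the network.
    local = Counter()
    for word, n in pairs:
        local[word] += n
    return list(local.items())

def reduce_counts(all_pairs):
    # Final aggregation on the reduce side.
    total = Counter()
    for word, n in all_pairs:
        total[word] += n
    return dict(total)

splits = ["to be or not to be", "to do or not to do"]
without = [pair for s in splits for pair in map_words(s)]
with_combiner = [pair for s in splits for pair in combine(map_words(s))]

print(len(without), len(with_combiner))  # 12 8
print(reduce_counts(without) == reduce_counts(with_combiner))  # True
```

The final counts are identical, but fewer records are shuffled, which is the saving a combiner is meant to provide.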