Hadoop Basics 1

shj1119

Collected from several articles on the web =====

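Beginner:
Know the overall MapReduce flow: map, shuffle, reduce
Know what the combiner and the partitioner do, and how to enable compression
Set up a Hadoop cluster; know which services run on the master and on the slaves
HDFS: how replicas are placed and located
Versions: 0.20.2 -> 0.20.203 -> 0.20.205, 0.21, 0.23, 1.0.1
Differences between the old and the new API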
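As a rough illustration of the map, shuffle, reduce flow and of where the combiner and the partitioner fit in, here is a minimal in-memory sketch in plain Python. It is not Hadoop code: the function names (map_fn, combine, partition, reduce_fn, run_job), the two-reducer setup, and the sample input are all assumptions made for this example.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: pre-aggregate one mapper's output locally to cut shuffle traffic."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def partition(key, num_reducers):
    """Partitioner: decide which reducer receives a key (deterministic per key)."""
    return sum(key.encode("utf-8")) % num_reducers

def reduce_fn(key, values):
    """Reduce: sum all counts that were shuffled to this key."""
    return key, sum(values)

def run_job(splits, num_reducers=2):
    # Shuffle target: one bucket per reducer, grouping values by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:                      # one "mapper" per input split
        mapped = [pair for line in split for pair in map_fn(line)]
        for key, value in combine(mapped):
            buckets[partition(key, num_reducers)][key].append(value)
    result = {}
    for bucket in buckets:                    # one "reducer" per bucket
        for key in sorted(bucket):            # a reducer sees its keys in sorted order
            k, total = reduce_fn(key, bucket[key])
            result[k] = total
    return result

counts = run_job([["a b a"], ["b c", "a"]])
print(counts)
```

The combiner here reuses the same summing logic as the reducer, which works because addition is associative and commutative; that is the usual condition for using a combiner safely.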

Intermediate:
Hadoop parameter tuning. Cluster level: JVM, map/reduce slots. Job level: number of reducers,
memory, whether to use a combiner, whether to use compression
Pig Latin and Hive: basic syntax
HBase and ZooKeeper: setup

Latest:
Follow the Cloudera and Hortonworks blogs
The next-generation MR2 framework
High availability, NameNode: avoid a single point of failure
Data stream systems: streaming, Storm (Twitter)

Algorithms to practice:
wordcount
Anagrams in a dictionary
Translating a SQL statement: select count(x) from a group by b;
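One way to read the SQL exercise as a MapReduce job: the mapper emits the group-by column b as the key, with 1 for a row whose x is present and 0 for a row whose x is NULL, and the reducer sums the values per key. The sketch below is plain Python, not Hadoop code; the sample rows and the names mapper, reducer, and group_count are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows of table "a" as (b, x); None stands for SQL NULL.
rows = [("red", 1), ("red", None), ("blue", 7), ("red", 3)]

def mapper(row):
    b, x = row
    # count(x) ignores NULLs; emitting 0 keeps groups whose x is always NULL.
    yield b, (0 if x is None else 1)

def reducer(key, values):
    return key, sum(values)

def group_count(table):
    grouped = defaultdict(list)          # stands in for the shuffle
    for row in table:
        for key, value in mapper(row):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())

result = group_count(rows)
print(result)
```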
Q17. Hadoop achieves parallelism by dividing tasks across many nodes, so a few slow nodes can rate-limit the rest of the program and slow the whole job down. What mechanism does Hadoop provide to combat this?
Speculative execution.

Q18. How does speculative execution work in Hadoop?
The JobTracker makes different TaskTrackers process the same input. When tasks complete, they announce this to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their outputs. The reducers then receive their inputs from whichever mapper completed successfully first.
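The "first copy to finish wins, the others are abandoned" rule can be sketched with two threads in plain Python. This is only a toy model under stated assumptions: both attempts start at the same time for simplicity, a slow node is modelled by a longer delay, and the names run_attempt and run_speculatively are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_attempt(name, delay, abandon):
    """One attempt of the same task; a slow node is modelled by a longer delay."""
    deadline = time.monotonic() + delay
    while time.monotonic() < deadline:
        if abandon.is_set():          # told to give up: discard partial output
            return None
        time.sleep(0.005)
    return f"output from {name}"

def run_speculatively(attempts):
    """Run duplicate attempts and keep the output of whichever finishes first."""
    abandon = threading.Event()
    with ThreadPoolExecutor(max_workers=len(attempts)) as pool:
        futures = [pool.submit(run_attempt, name, delay, abandon)
                   for name, delay in attempts]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        winner = next(iter(done)).result()
        abandon.set()                 # remaining attempts are abandoned
    return winner

winner = run_speculatively([("slow-node", 1.0), ("fast-node", 0.05)])
print(winner)
```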
Q22. What is the Distributed Cache in Hadoop?
The Distributed Cache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework copies the necessary files to the slave node before any tasks for the job are executed on that node.
Q23. What is the benefit of the Distributed Cache? Why can't we just keep the file in HDFS and have the application read it?
Because the Distributed Cache is much faster. It copies the file to every TaskTracker at the start of the job. If a TaskTracker then runs 10 or 100 mappers or reducers, they all use the same local copy. If instead you write code in the MR job to read the file from HDFS, every mapper accesses HDFS itself, so a TaskTracker running 100 map tasks reads the file from HDFS 100 times. HDFS is also not very efficient when used this way.
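The difference in read counts can be made concrete with a small simulation. The following is a plain-Python sketch, not the Hadoop API: FakeHdfs, Node, and the /lookup/countries path are hypothetical stand-ins that only count how often the shared file is fetched remotely.

```python
class FakeHdfs:
    """Stand-in for HDFS that counts how often a file is fetched remotely."""
    def __init__(self, files):
        self.files = files
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self.files[path]

class Node:
    """A worker node that localizes cached files once, before its tasks run."""
    def __init__(self, hdfs, cached_paths):
        self.local = {path: hdfs.read(path) for path in cached_paths}

    def run_task(self, record, path):
        lookup = self.local[path]     # local read, no remote fetch
        return lookup.get(record, "unknown")

hdfs = FakeHdfs({"/lookup/countries": {"cn": "China", "us": "United States"}})

# With the cache: one remote read per node, however many tasks run there.
node = Node(hdfs, ["/lookup/countries"])
outputs = [node.run_task("cn", "/lookup/countries") for _ in range(100)]
reads_with_cache = hdfs.reads

# Without it: every task fetches the file from HDFS itself.
hdfs.reads = 0
for _ in range(100):
    hdfs.read("/lookup/countries").get("cn")
reads_without_cache = hdfs.reads

print(reads_with_cache, reads_without_cache)
```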

Q24. What mechanism does the Hadoop framework provide to synchronize changes made to the Distributed Cache while the application is running?
This is a trick question. There is no such mechanism. The Distributed Cache is, by design, read-only during job execution.

Q25. Have you ever used counters in Hadoop? Give an example scenario.
Anybody who claims to have worked on a Hadoop project is expected to have used counters.
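A typical scenario to offer as an answer is counting the malformed input records that a mapper skips, so that the job summary shows how clean the data was. In Hadoop the counters are incremented through the task context and aggregated by the framework; the plain-Python sketch below only mimics that idea with collections.Counter. The counter names, the "user,amount" record format, and the function parse_records are assumptions made for this example.

```python
from collections import Counter

def parse_records(lines):
    """Parse 'user,amount' lines, tracking record quality in counters."""
    counters = Counter()
    totals = {}
    for line in lines:
        counters["RECORDS_READ"] += 1
        parts = line.split(",")
        if len(parts) != 2 or not parts[1].strip().isdigit():
            counters["MALFORMED_RECORDS"] += 1   # skip the record, but keep count
            continue
        user, amount = parts[0].strip(), int(parts[1])
        totals[user] = totals.get(user, 0) + amount
    return totals, counters

totals, counters = parse_records(["ann,10", "bob,x", "broken line", "ann,5"])
print(totals, dict(counters))
```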

Q26. Is it possible to provide multiple inputs to Hadoop? If yes, how can you give multiple directories as input to a Hadoop job?
Yes. The input format class provides methods to add multiple directories as input to a Hadoop job; for example, FileInputFormat.addInputPath can be called once per directory.

Q27. Is it possible to have a Hadoop job write its output to multiple directories? If yes, how?
Yes, by using the MultipleOutputs class.

Q28. What will a Hadoop job do if you try to run it with an output directory that already exists? Will it
- overwrite it
- warn you and continue
- throw an exception and exit
The Hadoop job will throw an exception and exit.
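The fail-fast behavior can be mimicked in a few lines of plain Python. This is only an analogy on the local filesystem, not Hadoop's own check; check_output_dir and run_job are hypothetical names.

```python
import os
import tempfile

def check_output_dir(path):
    """Fail fast if the output directory already exists, before any work starts."""
    if os.path.exists(path):
        raise FileExistsError(f"Output directory {path} already exists")

def run_job(output_dir):
    check_output_dir(output_dir)      # validated before the job does any work
    os.makedirs(output_dir)
    with open(os.path.join(output_dir, "part-00000"), "w") as out:
        out.write("result\n")

base = tempfile.mkdtemp()
out_dir = os.path.join(base, "out")
run_job(out_dir)                      # first run succeeds
try:
    run_job(out_dir)                  # second run hits the existing directory
    second_run_failed = False
except FileExistsError:
    second_run_failed = True
print(second_run_failed)
```

Refusing to start protects the results of an earlier, possibly expensive, job from being silently overwritten.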
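1. Write a web crawler.
Requirements: 1. The format of the URLs to crawl is configurable.
2. The crawl depth is customizable.
3. Downloaded pages can be stored by a program invoked later (that is, through an event).
2. A large batch of URLs needs to be crawled. URL parsing and n-level crawling (multi-level depth) are already implemented on the server side. You are given several servers and a steadily growing number of client machines. An existing mechanism already keeps the URL tasks balanced across the servers; clients request crawl tasks from the servers and carry them out.
Design a distributed framework that performs the single-level URL crawling, such that every server gets as even a share of the client resources as possible.
Note: servers may go down.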
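A minimal sketch of how the three requirements could map onto code: a regular expression for the URL format, a depth limit, and a callback that acts as the storage event. It is written in plain Python under stated assumptions: crawl, fake_fetch, and the in-memory SITE dictionary are hypothetical, and fetching is injected as a function so the example runs without network access.

```python
import re
from collections import deque

def crawl(seeds, fetch, url_pattern, max_depth, on_page):
    """Breadth-first crawl limited by a URL regex and a maximum depth.

    fetch(url) returns (html, links); on_page(url, html) is the storage hook.
    """
    allowed = re.compile(url_pattern)
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        html, links = fetch(url)
        on_page(url, html)                    # event: let the caller store the page
        if depth >= max_depth:
            continue                          # do not follow links past the limit
        for link in links:
            if link not in seen and allowed.fullmatch(link):
                seen.add(link)
                queue.append((link, depth + 1))

# Hypothetical in-memory "web" so the sketch runs without network access.
SITE = {
    "http://example.com/": ["http://example.com/a", "http://other.org/x"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}

def fake_fetch(url):
    return f"<html>{url}</html>", SITE.get(url, [])

stored = []
crawl(["http://example.com/"], fake_fetch, r"http://example\.com/.*",
      max_depth=1, on_page=lambda url, html: stored.append(url))
print(stored)
```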

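1. Design a system that can extract data in a specified format from a continuously growing set of different data sources.
Requirements: 1. The results must give a rough indication of how well the extraction worked, so that the extraction method can be improved continuously.
2. Because the data sources differ, propose a program framework that can be configured flexibly.
3. The data sources may include MySQL, SQL Server, and so on.
4. The system must support continuous mining, that is, it can repeat the extraction to obtain more information.
2. Write a tool that generates regular expressions for extracting formatted data, based on different document templates.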
=====
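1. How does Hadoop work?

2. How does MapReduce work?

3. How does HDFS store data?

4. Give a simple example that shows how MapReduce runs.

5. The interviewer gives you some problems to implement with MapReduce.

For example: there are 10 folders, each containing 1,000,000 URLs. Find the top 1,000,000 URLs.

6. What is the role of the Combiner in Hadoop?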
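One possible approach, assuming "top" means most frequent: a first job counts each URL across all folders, and a second job lets every partition keep its local top n before a single reducer merges those candidates. The plain-Python sketch below illustrates this on tiny made-up data; top_n_urls, the partition count, and the sample folders are assumptions made for this example.

```python
import heapq
from collections import Counter

def top_n_urls(folders, n, num_partitions=3):
    # Job 1: global count per URL (the shuffle brings equal URLs together).
    totals = Counter()
    for folder in folders:
        totals.update(folder)
    # Job 2, map side: split the totals into partitions; each keeps its local top n.
    partitions = [[] for _ in range(num_partitions)]
    for i, item in enumerate(sorted(totals.items())):
        partitions[i % num_partitions].append(item)
    candidates = []
    for part in partitions:
        candidates.extend(heapq.nlargest(n, part, key=lambda kv: kv[1]))
    # Job 2, single reducer: merge the local winners into the global top n.
    return heapq.nlargest(n, candidates, key=lambda kv: kv[1])

folders = [["u1", "u2", "u1"], ["u3", "u1", "u2"], ["u4"]]
top = top_n_urls(folders, 2)
print(top)
```

Taking the local top n is only safe after the first job, because a URL spread over several folders must have its full count in one place before it can be ranked.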