Hadoop Note

Hadoop:



History:

By Doug Cutting

2002: Nutch started, but it would not scale to the billions of pages on the Web

2003: Google published a paper describing the architecture of GFS

2004: NDFS (Nutch Distributed File System) started; Google published the paper that introduced MapReduce

2005: MapReduce implemented in Nutch

Feb. 2006: Moved out of Nutch to form Hadoop

Jan. 2008: Became a top-level project at Apache

Feb. 2008: Yahoo! turned Hadoop into a system that ran at web scale

April 2008: Sorted 1 TB of data in 209 seconds

May 2009: Sorted 1 TB of data in 62 seconds

 

Note: Hadoop scales impressively to big data volumes. As an analytic platform, it can also store a very broad range of data types in its file system and process analytic queries via MapReduce across those varied data types.

Link:

http://hadoop.apache.org/

HDFS

HDFS High Availability (HA)

NameNodes:

         The namenode manages the filesystem namespace: it maintains the filesystem tree and the metadata for all the files and directories in the tree. The namenode also knows the datanodes on which all the blocks for a given file are located.

DataNodes:

Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks that they are storing.
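The division of labor above can be sketched with plain Python dictionaries (an illustrative toy only; real HDFS metadata handling is far more involved):

```python
# Namenode state: the filesystem tree maps each file to its block list.
namespace = {
    "/logs/2013-01-01.log": ["blk_1", "blk_2"],
    "/logs/2013-01-02.log": ["blk_3"],
}

# Block locations, rebuilt from the periodic datanode block reports.
block_locations = {}

def receive_block_report(datanode, blocks):
    """A datanode tells the namenode which blocks it is storing."""
    for blk in blocks:
        block_locations.setdefault(blk, set()).add(datanode)

# Three datanodes report in (blocks are replicated across nodes).
receive_block_report("dn1", ["blk_1", "blk_3"])
receive_block_report("dn2", ["blk_1", "blk_2"])
receive_block_report("dn3", ["blk_2", "blk_3"])

def locate_file(path):
    """Answer a client's question: which datanodes hold this file's blocks?"""
    return {blk: sorted(block_locations[blk]) for blk in namespace[path]}

print(locate_file("/logs/2013-01-01.log"))
# {'blk_1': ['dn1', 'dn2'], 'blk_2': ['dn2', 'dn3']}
```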

 

MapReduce V1

In MapReduce, records are processed in isolation by tasks called Mappers. The output from the Mappers is then brought together into a second set of tasks called Reducers, where results from different mappers can be merged together.
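The map, shuffle, and reduce steps described above can be simulated in a single Python process (a sketch only; real Hadoop runs mappers and reducers as distributed tasks over HDFS blocks):

```python
# Minimal single-process word count showing the map -> shuffle -> reduce flow.
from itertools import groupby

def mapper(record):
    # Each record is processed in isolation, emitting (key, value) pairs.
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    # Values for the same key from different mappers are merged here.
    return (key, sum(values))

records = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: run every record through the mapper.
pairs = [kv for rec in records for kv in mapper(rec)]

# Shuffle: group all values by key (Hadoop does this between the two phases).
pairs.sort(key=lambda kv: kv[0])
result = dict(reducer(k, (v for _, v in grp))
              for k, grp in groupby(pairs, key=lambda kv: kv[0]))

print(result["the"], result["fox"])  # 3 2
```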



MapReduce v2 (YARN) in Apache Hadoop



 

1.       ResourceManager

a)         Scheduler

Allocates resources to the various running applications, subject to familiar constraints such as capacities and queues. Pluggable schedulers include the CapacityScheduler and the FairScheduler.

 

b)         ApplicationsManager

Accepts job submissions, negotiates the first container for executing the application-specific ApplicationMaster, and provides the service for restarting the ApplicationMaster on failure.

2.       NodeManager

The per-machine agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager/Scheduler.

3.       ApplicationMaster

Negotiates appropriate resource containers from the Scheduler, tracks their status, and monitors progress.
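The interplay of the three roles can be sketched as a toy simulation (the class and method names here are invented for illustration, not the real YARN API):

```python
# Toy walkthrough of the YARN roles: RM accepts the app, grants the first
# container for the AM, then the AM negotiates task containers.

class Scheduler:
    def __init__(self, capacity):
        self.free = capacity  # free containers per node

    def allocate(self, n):
        """Hand out up to n containers, subject to per-node capacity."""
        granted = []
        for node, free in self.free.items():
            while free > 0 and len(granted) < n:
                granted.append(node)
                free -= 1
            self.free[node] = free
        return granted

class ApplicationMaster:
    def __init__(self, node, scheduler):
        self.node, self.scheduler = node, scheduler

    def run_job(self, tasks):
        # Negotiate containers for the tasks from the Scheduler.
        return self.scheduler.allocate(tasks)

class ResourceManager:
    def __init__(self, nodes):
        self.scheduler = Scheduler(dict.fromkeys(nodes, 2))

    def submit_application(self):
        # ApplicationsManager role: first container runs the AM itself.
        (am_node,) = self.scheduler.allocate(1)
        return ApplicationMaster(am_node, self.scheduler)

rm = ResourceManager(["node1", "node2"])
am = rm.submit_application()
containers = am.run_job(3)
print(am.node, containers)  # node1 ['node1', 'node2', 'node2']
```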

 

Hive

Link:

a)         http://hive.apache.org/

b)        https://cwiki.apache.org/confluence/display/Hive/Home;jsessionid=6B780B73CAD2649F078738A9BC32E59D

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

•  Tools to enable easy data extract/transform/load (ETL)

•  A mechanism to impose structure on a variety of data formats

•  Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase

•  Query execution via MapReduce
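The point of HiveQL is that a job which would otherwise need a hand-written mapper/reducer pair becomes one declarative query. As an analogy only (using Python's built-in sqlite3 rather than Hive itself), the statement in the comment below is the kind of query HiveQL accepts:

```python
# Analogy only: once structure is projected onto the data, word count is a
# single GROUP BY query instead of custom map/reduce code. Hive would compile
# such a query into MapReduce jobs; here stdlib sqlite3 stands in for it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (word TEXT)")
for line in ["the quick fox", "the dog"]:
    conn.executemany("INSERT INTO logs VALUES (?)",
                     [(w,) for w in line.split()])

# The HiveQL equivalent: SELECT word, COUNT(*) FROM logs GROUP BY word
rows = conn.execute(
    "SELECT word, COUNT(*) FROM logs GROUP BY word ORDER BY word").fetchall()
print(rows)  # [('dog', 1), ('fox', 1), ('quick', 1), ('the', 2)]
```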

Pig

Link:

         http://pig.apache.org/

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Two pieces:

1.       Pig Latin, the language used to express data flows.

2.       The execution environment to run Pig Latin programs.
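A Pig Latin program is a sequence of dataflow steps (load, filter, group, aggregate). The script in the comment below shows that style; the Python that follows is only an analogy of the same flow, with made-up sample data:

```python
# Analogy only: a Pig Latin script such as
#   records = LOAD 'data' AS (year, temp);
#   good    = FILTER records BY temp != 9999;
#   grouped = GROUP good BY year;
#   maxima  = FOREACH grouped GENERATE group, MAX(good.temp);
# is a sequence of dataflow steps; here is the same flow in plain Python.
from collections import defaultdict

records = [(1950, 0), (1950, 22), (1949, 111), (1949, 78), (1950, 9999)]

good = [(y, t) for (y, t) in records if t != 9999]   # FILTER out bad readings
grouped = defaultdict(list)                          # GROUP BY year
for y, t in good:
    grouped[y].append(t)
maxima = {y: max(ts) for y, ts in grouped.items()}   # FOREACH ... MAX

print(maxima)  # {1950: 22, 1949: 111}
```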

 

HBase

Link:

http://hbase.apache.org/

Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables (billions of rows × millions of columns) atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
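The Bigtable data model that HBase borrows (a sparse, sorted map from row, column, and timestamp to a value, with multiple versions per cell) can be sketched as follows. Illustrative only, not the HBase client API:

```python
# Toy sketch of the HBase/Bigtable data model:
# (row, column) -> {timestamp: value}, i.e. versioned cells.
from collections import defaultdict

table = defaultdict(dict)

def put(row, column, value, ts):
    """Write one version of a cell."""
    table[(row, column)][ts] = value

def get(row, column):
    """Return the newest version of a cell (HBase's default behavior)."""
    versions = table[(row, column)]
    return versions[max(versions)]

put("row1", "info:name", "old-name", ts=1)
put("row1", "info:name", "new-name", ts=2)  # newer version of the same cell
print(get("row1", "info:name"))  # new-name
```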

History:

The HBase story begins in 2006, when the San Francisco-based startup Powerset was trying to build a natural-language search engine for the Web.

Zookeeper

Link:

 http://zookeeper.apache.org/

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, a lot of work goes into fixing the bugs and race conditions that are inevitable. Because these services are so difficult to implement, applications initially tend to skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
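ZooKeeper's core abstraction is a hierarchical namespace of "znodes" that clients can read, write, and watch for changes. A toy sketch of that idea (illustrative only, not the ZooKeeper client API):

```python
# Toy sketch of ZooKeeper's data model: path -> data, plus one-shot watches
# that notify interested clients when a znode changes.
znodes = {}
watches = {}
notified = []

def set_data(path, data):
    """Write a znode and fire any one-shot watches registered on it."""
    znodes[path] = data
    for client in watches.pop(path, []):
        notified.append((client, path))

def get_data(path, watcher=None):
    """Read a znode, optionally registering a watch for the next change."""
    if watcher is not None:
        watches.setdefault(path, []).append(watcher)
    return znodes[path]

set_data("/config/db_url", "db1:3306")
get_data("/config/db_url", watcher="worker-1")  # worker-1 watches the znode
set_data("/config/db_url", "db2:3306")          # the change triggers the watch
print(notified)  # [('worker-1', '/config/db_url')]
```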

 

Sqoop

Cassandra
