Hadoop:
History:
By Doug Cutting
2002: Nutch started, but its architecture wouldn't scale to the billions of pages on the Web
2003: Google published the paper describing the architecture of GFS (Google File System)
2004: Work started on NDFS (Nutch Distributed File System); Google published the paper that introduced MapReduce
2005: MapReduce implemented in Nutch
Feb. 2006: NDFS and MapReduce moved out of Nutch to form Hadoop
Jan. 2008: Hadoop became a top-level project at Apache
Feb. 2008: Yahoo! had turned Hadoop into a system that ran at web scale
April 2008: Sorted 1 TB of data in 209 s
May 2009: Sorted 1 TB of data in 62 s
Note: Hadoop scales impressively to big data volumes. As an analytic platform, it can also manage a very broad range of data types in its file system and process analytic queries via MapReduce across numerous unusual data types.
Link:
HDFS
HDFS High Availability (HA)
NameNodes:
DataNodes:
Datanodes are the workhorses of the filesystem. They store and retrieve blocks when
they are told to (by clients or the namenode), and they report back to the namenode
periodically with lists of blocks that they are storing.
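The store/retrieve/report cycle described above can be sketched with a toy in-memory model. The class and method names here (NameNode, DataNode, send_block_report, and so on) are illustrative assumptions, not the HDFS API, and real block reports travel over the network on a timer.

```python
from collections import defaultdict


class NameNode:
    """Toy namenode: tracks which datanodes hold which block IDs."""

    def __init__(self):
        self.block_locations = defaultdict(set)  # block_id -> {datanode_id}

    def receive_block_report(self, datanode_id, block_ids):
        # Drop stale entries for this datanode, then record the fresh list.
        for holders in self.block_locations.values():
            holders.discard(datanode_id)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode_id)

    def locate(self, block_id):
        return sorted(self.block_locations.get(block_id, set()))


class DataNode:
    """Toy datanode: stores blocks in memory and reports them on request."""

    def __init__(self, datanode_id):
        self.datanode_id = datanode_id
        self.blocks = {}  # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def retrieve(self, block_id):
        return self.blocks[block_id]

    def send_block_report(self, namenode):
        namenode.receive_block_report(self.datanode_id, list(self.blocks))


namenode = NameNode()
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.store("blk_1", b"hello")
dn2.store("blk_1", b"hello")  # second replica of the same block
dn2.store("blk_2", b"world")
for dn in (dn1, dn2):
    dn.send_block_report(namenode)

print(namenode.locate("blk_1"))
print(namenode.locate("blk_2"))
```

In this sketch the namenode holds only metadata (which node has which block), while the block contents stay on the datanodes.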
MapReduce V1
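In MapReduce, records are processed in isolation by tasks called Mappers. The output of the Mappers is then brought together by a second set of tasks called Reducers, where results from different Mappers can be merged.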
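A minimal single-process sketch of the model, using word count as the example: a map function sees one record at a time with no knowledge of the others, and a reduce function merges all values emitted for one key. The function names are illustrative and this is not Hadoop's API; a real job would distribute the map and reduce tasks across machines.

```python
from collections import defaultdict


def mapper(record):
    """Process one record in isolation: emit (word, 1) pairs."""
    for word in record.lower().split():
        yield word, 1


def reducer(key, values):
    """Merge all values emitted for one key."""
    return key, sum(values)


def run_job(records):
    # Shuffle step: group mapper output by key before reducing.
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())


counts = run_job(["the quick fox", "the lazy dog", "the fox"])
print(counts)
```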
Map/Reduce v2 (YARN) in Apache
1. ResourceManager, with two main components:
a) Scheduler
Allocates resources to the various running applications, subject to familiar constraints of capacities, queues, etc. Plug-ins: CapacityScheduler, FairScheduler
b) ApplicationsManager
Accepts job submissions, negotiates the first container for executing the application-specific ApplicationMaster, and provides the service for restarting the ApplicationMaster on failure.
2. NodeManager
The per-machine agent responsible for containers: it monitors their resource usage (CPU, memory, disk, network) and reports it to the ResourceManager/Scheduler.
3. ApplicationMaster
Negotiates appropriate resource containers from the Scheduler, tracks their status, and monitors their progress.
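The division of labor above can be sketched as a toy simulation. Everything here is an assumption for illustration: the class interfaces are invented, a "container" is just a tuple, and the scheduler is plain first-fit by memory rather than the CapacityScheduler or FairScheduler.

```python
class NodeManager:
    """Toy per-machine agent: tracks memory used by its containers."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.memory_mb = memory_mb
        self.used_mb = 0

    def free_mb(self):
        return self.memory_mb - self.used_mb

    def launch(self, memory_mb):
        self.used_mb += memory_mb
        return (self.name, memory_mb)  # a "container" is just a tuple here


class ResourceManager:
    """Toy ResourceManager whose scheduler is first-fit over the nodes."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        for node in self.nodes:
            if node.free_mb() >= memory_mb:
                return node.launch(memory_mb)
        return None  # no capacity available right now


class ApplicationMaster:
    """Toy per-application master: negotiates containers and tracks them."""

    def __init__(self, resource_manager):
        self.rm = resource_manager
        self.containers = []

    def request(self, count, memory_mb):
        for _ in range(count):
            container = self.rm.allocate(memory_mb)
            if container is None:
                break
            self.containers.append(container)
        return len(self.containers)


rm = ResourceManager([NodeManager("node1", 2048), NodeManager("node2", 1024)])
am = ApplicationMaster(rm)
granted = am.request(count=4, memory_mb=1024)
print(granted, am.containers)
```

With 3072 MB in the toy cluster, only three of the four requested 1024 MB containers are granted; the fourth request finds no node with enough free memory.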
Hive
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
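The idea of projecting structure onto raw data and then querying it can be sketched in plain Python. The sample lines, the schema, and the table name `logs` in the comment are made up for illustration; this shows the concept only and does not use Hive.

```python
from collections import Counter

# Raw tab-delimited lines, as they might sit in a file (made-up sample data).
raw_lines = [
    "alice\t/home\t200",
    "bob\t/cart\t500",
    "alice\t/cart\t200",
]

# Hypothetical schema projected onto the lines at read time.
schema = [("user", str), ("page", str), ("status", int)]


def project(line):
    """Turn one raw line into a typed row according to the schema."""
    fields = line.split("\t")
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}


rows = [project(line) for line in raw_lines]

# Roughly what a query such as
#   SELECT page, COUNT(*) FROM logs WHERE status = 200 GROUP BY page
# would compute over a table with this schema.
hits = Counter(row["page"] for row in rows if row["status"] == 200)
print(dict(hits))
```

The raw lines are never rewritten; the schema is applied only when the data is read.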
Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
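Pig is made up of two pieces:
1. The language used to express data flows, called Pig Latin.
2. The execution environment that runs Pig Latin programs.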
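To illustrate why a data-flow program of this kind parallelizes well, here is a small Python sketch (not Pig Latin, and not how Pig executes programs). It assumes made-up (user, amount) records already split into partitions; each partition runs the same filter, group, and sum steps independently, and the partial results are merged at the end.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Made-up (user, amount) records split into two partitions.
partitions = [
    [("alice", 30), ("bob", 5), ("alice", 20)],
    [("bob", 50), ("carol", 7), ("carol", 40)],
]


def process_partition(records):
    """Filter, then group and sum, within a single partition."""
    totals = defaultdict(int)
    for user, amount in records:
        if amount >= 10:            # filter step
            totals[user] += amount  # group + aggregate step
    return dict(totals)


def merge(partials):
    """Combine the per-partition totals into one result."""
    merged = defaultdict(int)
    for partial in partials:
        for user, total in partial.items():
            merged[user] += total
    return dict(merged)


# Each partition is independent, so the steps can run side by side.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(process_partition, partitions))

totals = merge(partials)
print(totals)
```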
HBase
Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
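The "versioned, column-oriented" data model can be pictured as a nested map from row key to column to timestamped values. The sketch below is a toy in-memory version under that assumption; the ToyTable class and its put/get signatures are invented for illustration and are not the HBase client API.

```python
import time


class ToyTable:
    """Toy versioned table: row key -> column -> {timestamp: value}."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.rows = {}

    def put(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time_ns()
        versions = self.rows.setdefault(row, {}).setdefault(column, {})
        versions[ts] = value
        # Keep only the newest max_versions cells.
        for old in sorted(versions)[:-self.max_versions]:
            del versions[old]

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before timestamp, else None."""
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)] if candidates else None


table = ToyTable(max_versions=2)
table.put("row1", "info:name", "v1", timestamp=1)
table.put("row1", "info:name", "v2", timestamp=2)
table.put("row1", "info:name", "v3", timestamp=3)

latest = table.get("row1", "info:name")
as_of_2 = table.get("row1", "info:name", timestamp=2)
as_of_1 = table.get("row1", "info:name", timestamp=1)  # oldest version was evicted
print(latest, as_of_2, as_of_1)
```

Rows store only the columns that were actually written, which is what makes very wide, sparse tables practical in this kind of model.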
History:
The HBase story begins in 2006, when the San Francisco-based startup Powerset was trying to build a natural language search engine for the Web.
Zookeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, a lot of work goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially tend to skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
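One of the race conditions mentioned above is the lost update: two clients read the same configuration value and both write back. A common guard is a version-checked write, sketched here with a toy in-memory store. The ToyCoordinator class, its method names, and the `/config/db_url` path are assumptions for illustration; this is not a ZooKeeper client.

```python
class BadVersionError(Exception):
    """Raised when a conditional update sees an unexpected version."""


class ToyCoordinator:
    """Toy in-memory store of path -> (data, version); not a real client."""

    def __init__(self):
        self.nodes = {}

    def create(self, path, data):
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = (data, 0)

    def get(self, path):
        return self.nodes[path]

    def set(self, path, data, expected_version):
        _, version = self.nodes[path]
        if version != expected_version:
            raise BadVersionError(path)
        self.nodes[path] = (data, version + 1)


store = ToyCoordinator()
store.create("/config/db_url", "db1:5432")

# Two clients read the same version, then both try to update.
_, seen_by_a = store.get("/config/db_url")
_, seen_by_b = store.get("/config/db_url")
store.set("/config/db_url", "db2:5432", seen_by_a)  # succeeds
try:
    store.set("/config/db_url", "db3:5432", seen_by_b)
    lost_update = True
except BadVersionError:
    lost_update = False  # the stale write is rejected

print(store.get("/config/db_url"), lost_update)
```

Getting details like this right once, in a shared service, is the argument for not reimplementing them in every application.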