Hadoop Note

funny_king

Published 2014-09-13 18:52:46

Hadoop:

History:

By Doug Cutting

2002: Nutch started, but its architecture would not scale to the billions of pages on the Web

2003: Google published the paper describing the architecture of GFS (Google File System)

2004: Nutch Distributed File System (NDFS) implemented; Google published the paper that introduced MapReduce

2005: MapReduce implemented in Nutch

Feb. 2006: NDFS and MapReduce moved out of Nutch to form Hadoop

Jan. 2008: Hadoop became a top-level project at Apache

Feb. 2008: Yahoo! had turned Hadoop into a system that ran at web scale

April 2008: Hadoop sorted 1 TB of data in 209 seconds

May 2009: Hadoop sorted 1 TB of data in 62 seconds

Note: Hadoop scales impressively to big data volumes. As an analytic platform, Hadoop can also manage a very broad range of data types in its file system and process analytic queries via MapReduce across numerous unusual data types.

Link:

http://hadoop.apache.org/

HDFS

HDFS High Availability (HA)

NameNodes:

The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. The namenode also knows the datanodes on which all the blocks for a given file are located.

DataNodes:

Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
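
The division of labour between namenode and datanodes can be sketched in a few lines of Python. This is only a toy model, not the HDFS API: the class names, method names, block ids and the path /logs/a.txt are all made up for illustration. It shows the namenode holding metadata only (file to blocks, block to datanodes), while datanodes hold the data and send block reports.

```python
from collections import defaultdict


class NameNode:
    """Toy namenode (hypothetical): namespace metadata plus block locations."""

    def __init__(self):
        self.file_blocks = {}                    # path -> ordered list of block ids
        self.block_locations = defaultdict(set)  # block id -> datanode ids

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def receive_block_report(self, datanode_id, block_ids):
        # Forget what this datanode reported earlier, then record the new report.
        for holders in self.block_locations.values():
            holders.discard(datanode_id)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode_id)

    def locate(self, path):
        # For each block of the file, list the datanodes that hold a copy.
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]


class DataNode:
    """Toy datanode (hypothetical): stores block contents and reports them."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def send_block_report(self, namenode):
        namenode.receive_block_report(self.node_id, list(self.blocks))


namenode = NameNode()
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
namenode.add_file("/logs/a.txt", ["blk_1", "blk_2"])
dn1.store("blk_1", b"first half")
dn2.store("blk_1", b"first half")   # second replica of blk_1
dn2.store("blk_2", b"second half")
for dn in (dn1, dn2):
    dn.send_block_report(namenode)

locations = namenode.locate("/logs/a.txt")
print(locations)  # [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn2'])]
```

A client would ask the namenode where the blocks are and then read the bytes from the datanodes directly; the file contents never pass through the namenode in this model.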

MapReduce V1

In MapReduce, records are processed in isolation by tasks called Mappers. The output from the Mappers is then brought together into a second set of tasks called Reducers, where results from different mappers can be merged together.
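
A minimal in-memory sketch of this flow, using word count as the example. It runs in a single Python process and does not use the Hadoop API; the function names mapper, reducer and run_job are assumptions made for illustration, and the grouping step stands in for the shuffle that Hadoop performs between the two phases.

```python
from collections import defaultdict


def mapper(record):
    # Each record (here, a line of text) is processed in isolation.
    for word in record.lower().split():
        yield word, 1


def reducer(key, values):
    # Merge every value emitted for one key, whichever mapper produced it.
    return key, sum(values)


def run_job(records):
    # Group mapper output by key (a stand-in for the shuffle), then reduce.
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(grouped.items()))


counts = run_job(["Hadoop stores data", "Hadoop processes data"])
print(counts)  # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

Because the mapper looks at one record at a time and the reducer at one key at a time, both phases can be split across many machines without changing the result.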

Map/Reduce v2 (YARN) in Apache

1. ResourceManager

a) Scheduler

Allocates resources to the various running applications, subject to familiar constraints of capacities, queues, etc. Pluggable schedulers: CapacityScheduler, FairScheduler

b) ApplicationsManager

Accepts job submissions, negotiates the first container for executing the application-specific ApplicationMaster, and provides the service for restarting the ApplicationMaster on failure.

2. NodeManager

The per-machine agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager/Scheduler

3. ApplicationMaster

Negotiates appropriate resource containers from the Scheduler, tracks their status, and monitors their progress
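
How the three roles interact can be sketched with a toy model. None of this is the YARN API: the classes, the first-fit placement rule and the memory-only resource model are simplifying assumptions made for illustration. The real Scheduler applies the capacity and queue constraints mentioned above, and the ApplicationMaster itself runs in a container.

```python
class NodeManager:
    """Toy per-machine agent (hypothetical): tracks containers and free memory."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []

    def launch(self, app_id, memory_mb):
        self.free_mb -= memory_mb
        self.containers.append((app_id, memory_mb))

    def report(self):
        # What a node would report back to the ResourceManager/Scheduler.
        return {"node": self.name, "free_mb": self.free_mb,
                "containers": len(self.containers)}


class ResourceManager:
    """Toy scheduler (hypothetical): first node with enough free memory wins."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.launch(app_id, memory_mb)
                return node.name
        return None  # the request cannot be satisfied right now


class ApplicationMaster:
    """Toy per-application master (hypothetical): asks for containers, tracks grants."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm
        self.granted = []

    def request_containers(self, count, memory_mb):
        for _ in range(count):
            node = self.rm.allocate(self.app_id, memory_mb)
            if node is not None:
                self.granted.append(node)
        return self.granted


nodes = [NodeManager("node1", 2048), NodeManager("node2", 1024)]
rm = ResourceManager(nodes)
am = ApplicationMaster("app_1", rm)
granted = am.request_containers(3, 1024)
reports = [n.report() for n in nodes]
print(granted)  # ['node1', 'node1', 'node2']
print(reports)
```

A fourth 1024 MB request would return None here, since both nodes are full; a real scheduler would queue the request until resources are released.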

Hive

Link:

a) http://hive.apache.org/

b) https://cwiki.apache.org/confluence/display/Hive/Home

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Hive provides:

- Tools to enable easy data extract/transform/load (ETL)

- A mechanism to impose structure on a variety of data formats

- Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase

- Query execution via MapReduce

Pig

Link:

http://pig.apache.org/

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig is made up of two pieces:

1. Pig Latin, the language used to express data flows

2. The execution environment to run Pig Latin programs.

HBase

Link:

http://hbase.apache.org/

Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
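
The "versioned, column-oriented" part can be pictured as a nested map: row key, then column, then timestamped values. The sketch below is an in-memory toy, not the HBase client API; the ToyTable class, its put/get methods, the max_versions parameter and the sample row key are assumptions made for illustration.

```python
from collections import defaultdict


class ToyTable:
    """Toy versioned table (hypothetical): row -> column -> timestamped cells."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # row key -> "family:qualifier" -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        cells = self.rows[row][column]
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)
        del cells[self.max_versions:]  # keep only the newest versions

    def get(self, row, column, versions=1):
        return self.rows[row][column][:versions]


table = ToyTable(max_versions=2)
table.put("com.example/index", "contents:html", "<html>v1</html>", timestamp=1)
table.put("com.example/index", "contents:html", "<html>v2</html>", timestamp=2)
table.put("com.example/index", "contents:html", "<html>v3</html>", timestamp=3)

latest = table.get("com.example/index", "contents:html")
history = table.get("com.example/index", "contents:html", versions=5)
print(latest)   # [(3, '<html>v3</html>')]
print(history)  # [(3, '<html>v3</html>'), (2, '<html>v2</html>')]
```

Rows only store the columns that were actually written, which is why a table can be billions of rows by millions of columns without most cells taking any space.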

History:

The HBase story begins in 2006, when the San Francisco-based startup Powerset was trying to build a natural language search engine for the Web.

Zookeeper

Link:

http://zookeeper.apache.org/

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, a lot of work goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications usually skimp on them at first, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

Sqoop

Cassandra
