Hadoop Introduction

Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is

Accessible – Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).

Robust – Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.

Scalable – Hadoop scales linearly to handle larger data by adding more nodes to the cluster.

Simple – Hadoop allows users to quickly write efficient parallel code.

SQL (Structured Query Language) is by design targeted at structured data. Many of Hadoop’s initial applications deal with unstructured data such as text, so from this perspective Hadoop provides a more general paradigm than SQL. A machine with four times the power of a standard PC costs a lot more than putting four such PCs in a cluster. Hadoop is instead designed as a scale-out architecture operating on a cluster of commodity PC machines: adding more resources means adding more machines to the Hadoop cluster. Hadoop clusters with tens to hundreds of machines are standard. In fact, other than for development purposes, there’s no reason to run Hadoop on a single server.

Large data sets are often unstructured or semistructured. Hadoop uses key/value pairs as its basic data unit, which is flexible enough to work with these less-structured data types. In Hadoop, data can originate in any form, but it eventually transforms into (key/value) pairs for the processing functions to work on.
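As a small illustration of this transformation, the plain-Java sketch below models a text document as (key, value) pairs the same way Hadoop’s default TextInputFormat does: the key is the byte offset of each line within the file, and the value is the line itself. The class and method names here are ours for illustration, not part of Hadoop’s API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: turning raw text into (key, value) pairs, mirroring what
// Hadoop's default TextInputFormat produces (key = byte offset of the
// line in the file, value = the line's contents).
public class KeyValueSketch {
    public static List<Map.Entry<Long, String>> toPairs(String document) {
        List<Map.Entry<Long, String>> pairs = new ArrayList<>();
        long offset = 0;
        for (String line : document.split("\n")) {
            pairs.add(new SimpleEntry<>(offset, line));
            offset += line.length() + 1; // +1 for the newline character
        }
        return pairs;
    }

    public static void main(String[] args) {
        String doc = "hello hadoop\nhello mapreduce";
        toPairs(doc).forEach(p ->
            System.out.println("(" + p.getKey() + ", \"" + p.getValue() + "\")"));
        // Prints:
        // (0, "hello hadoop")
        // (13, "hello mapreduce")
    }
}
```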

Hadoop is best used as a write-once, read-many-times type of data store. In this aspect it’s similar to data warehouses in the SQL world.

Other data processing models include pipelines and message queues. Pipelines can help the reuse of processing primitives; simple chaining of existing modules creates new ones. Message queues can help the synchronization of processing primitives; the programmer writes her data processing task as processing primitives in the form of either a producer or a consumer, and the timing of their execution is managed by the system. Similarly, MapReduce is also a data processing model. Its greatest advantage is the easy scaling of data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers.
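As a minimal sketch of these two primitives, here are a mapper and a reducer for the classic word count task, written against Hadoop’s org.apache.hadoop.mapreduce API. The class names WordCountMapper and WordCountReducer are illustrative; the types and method signatures are the standard API.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the counts emitted for each word.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

For an input line such as "hello hadoop", the mapper emits ("hello", 1) and ("hadoop", 1); the framework then groups all values by key, so the reducer receives something like ("hello", [1, 1, ...]) and emits the total count for each word.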

Consider, for example, word counting over a large collection of documents: if the documents are all stored on one central storage server, then the bottleneck is the bandwidth of that server.

In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. In simple terms, the mapper is meant to filter and transform the input into something that the reducer can aggregate over.
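Continuing the word count sketch above, a minimal driver wires the mapping and reducing phases together. The Job calls below are standard Hadoop; the class names and the command-line input/output paths are assumptions of this example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapping phase: filter/transform each input record into (word, 1).
        job.setMapperClass(WordCountMapper.class);
        // Reducing phase: aggregate all counts emitted for the same word.
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations taken from the command line (assumed).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```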

