Hadoop Tutorial

This Hadoop tutorial is a comprehensive guide to Big Data Hadoop. It covers what Hadoop is, why we need Apache Hadoop, why Apache Hadoop is so popular, and how Apache Hadoop works.

Apache Hadoop is an open-source, scalable, and fault-tolerant framework written in Java. It efficiently processes large volumes of data on a cluster of commodity hardware. Hadoop is not only a storage system but a platform for large-scale data storage as well as processing. This Big Data Hadoop tutorial provides a thorough Hadoop introduction.

In this Hadoop tutorial we will also learn about the Hadoop architecture, Hadoop daemons, and the different flavors of Hadoop. Finally, we will introduce Hadoop components such as HDFS, MapReduce, and YARN.

Apache Hadoop Tutorial For Beginners

 

What is Hadoop Technology?

Hadoop is an open-source tool from the ASF (Apache Software Foundation). Being an open-source project means it is readily available, and we can even change its source code as per our requirements: if certain functionality does not fulfill your need, you can modify it accordingly. Much of the Hadoop code has been contributed by Yahoo, IBM, Facebook, and Cloudera.

It provides an efficient framework for running jobs on multiple nodes of a cluster. A cluster is a group of systems connected via a LAN. Apache Hadoop processes data in parallel, as it works on multiple machines simultaneously.

Hadoop took its inspiration from Google, which had published papers on the technologies it was using: the MapReduce programming model and its file system (GFS). Hadoop was originally written for the Nutch search engine project, on which Doug Cutting and his team were working, and it very soon became a top-level Apache project due to its huge adoption. Let us understand the Hadoop definition and meaning.

Apache Hadoop is an open-source framework written in Java. The basic Hadoop programming language is Java, but this does not mean you can code only in Java: you can also code in C, C++, Perl, Python, Ruby, etc. You can use the Hadoop framework from any of these languages, but it is best to code in Java, as it gives you lower-level control of the code.
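One common way non-Java languages plug in is Hadoop Streaming, where any executable that reads input lines and writes tab-separated key/value pairs can act as a mapper. The sketch below shows the idea for a word-count mapper; the file name `mapper.py` and the sample input are illustrative only, and a real job would read from `sys.stdin`:

```python
#!/usr/bin/env python3
# mapper.py -- sketch of a Hadoop Streaming mapper: in a real job it reads
# input lines from sys.stdin and prints tab-separated key/value pairs.
import sys

def map_lines(lines):
    """Emit 'word<TAB>1' for every word -- the Streaming key/value format."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

if __name__ == "__main__":
    # In a real job this would be: for pair in map_lines(sys.stdin): ...
    # A small in-memory sample keeps the sketch runnable on its own.
    for pair in map_lines(["hadoop runs python via streaming"]):
        print(pair)
```

Such a script would be submitted together with a matching reducer via the Hadoop Streaming jar, which pipes HDFS data through these executables.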

Hadoop is for processing huge volumes of data. Commodity hardware is low-end, inexpensive hardware; hence, Hadoop is very economical to run.

Hadoop can be set up on a single machine (pseudo-distributed mode), but it shows its real power with a cluster of machines. We can scale it to thousands of nodes on the fly, i.e., without any downtime; therefore, we need not take any system down to add more nodes to the cluster. Follow this guide to learn about Hadoop installation on a multi-node cluster.
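For reference, pseudo-distributed mode is enabled largely through configuration. A minimal `core-site.xml` pointing HDFS at the local machine looks roughly like the sketch below; port 9000 is a common convention, and the exact value may differ in your distribution:

```xml
<!-- core-site.xml: run all daemons on one machine (pseudo-distributed mode) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```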

Hadoop consists of three key parts –

  • Hadoop Distributed File System  (HDFS) – It is the storage layer of Hadoop.
  • MapReduce – It is the data processing layer of Hadoop.
  • YARN – It is the resource management layer of Hadoop.

Why Hadoop?

Apache Hadoop is not only a storage system but a platform for data storage as well as processing. It is scalable (we can add more nodes on the fly) and fault-tolerant (even if a node goes down, its data can be processed by another node).

Following characteristics of Hadoop make it a unique platform:

  • It is not bound to a single schema. It has the flexibility to store and mine any type of data, whether structured, semi-structured, or unstructured.
  • It excels at processing data of a complex nature. Its scale-out architecture divides workloads across many nodes. Another advantage is that its flexible file system eliminates ETL bottlenecks.
  • It scales economically; as discussed, it can be deployed on commodity hardware. Apart from this, its open-source nature guards against vendor lock-in.

What is Hadoop Architecture?

After understanding what Apache Hadoop is, let us now understand the Big Data Hadoop architecture in detail in this Hadoop tutorial.

Hadoop Architecture

Hadoop works in a master-slave fashion. There is a master node, and there are n slave nodes, where n can be in the 1000s. The master manages, maintains, and monitors the slaves, while the slaves are the actual worker nodes. In the Hadoop architecture, the master should be deployed on good-configuration hardware, not just commodity hardware, as it is the centerpiece of the Hadoop cluster.

The master stores the metadata (data about data), while the slaves are the nodes that store the actual data, which is distributed across the cluster. The client connects to the master node to perform any task. Now, in this Hadoop tutorial for beginners, we will discuss the different components of Hadoop in detail.

Hadoop Components

There are three core Apache Hadoop components. In this Hadoop tutorial, you will learn what HDFS is, what Hadoop MapReduce is, and what YARN in Hadoop is. Let us discuss them one by one.

What is HDFS?

Hadoop HDFS (Hadoop Distributed File System) is the file system that provides storage in Hadoop in a distributed fashion.

In the Hadoop architecture, a daemon called the NameNode runs on the master node for HDFS, and a daemon called the DataNode runs on each slave node; thus, the slaves are also called DataNodes. The NameNode stores metadata and manages the DataNodes, while the DataNodes store the data and do the actual work.

Hadoop Tutorial - Hadoop HDFS Architecture

 

HDFS is a highly fault-tolerant, distributed, reliable, and scalable file system for data storage. First follow this guide to learn more about the features of HDFS, and then proceed with the Hadoop tutorial.

HDFS is developed to handle huge volumes of data; file sizes are expected to be in the range of GBs to TBs. A file is split into blocks (128 MB by default) that are stored distributedly across multiple machines. These blocks are replicated as per the replication factor, and the replicas are stored on different nodes; this handles the failure of a node in the cluster. So a file of 640 MB breaks down into 5 blocks of 128 MB each (if we use the default value).
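The block arithmetic above can be sketched as follows. This is only an illustration of HDFS-style splitting, not actual HDFS code:

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes of the blocks a file would be split into.
    The last block may be smaller than the block size."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full
    if remainder:
        blocks.append(remainder)
    return blocks

print(split_into_blocks(640))  # five full 128 MB blocks
print(split_into_blocks(200))  # one 128 MB block plus a smaller final block
```

Note that a file that is not an exact multiple of the block size simply gets a smaller final block; blocks are a unit of storage, not padding.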

What is MapReduce?

In this Hadoop basics tutorial, it is now time to understand one of the most important pillars of Hadoop, i.e., Hadoop MapReduce. Hadoop MapReduce is a programming model designed to process large volumes of data in parallel by dividing the work into a set of independent tasks. MapReduce is the heart of Hadoop: it moves computation to the data, since moving a huge volume of data would be very costly. It allows massive scalability across hundreds or thousands of servers in a Hadoop cluster.

Thus, Hadoop MapReduce is a framework for the distributed processing of huge volumes of data over a cluster of nodes. As data is stored in a distributed manner in HDFS, MapReduce provides the Map and Reduce phases to perform parallel processing.
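The Map, shuffle, and Reduce phases can be sketched with a local word-count simulation. This is a minimal illustration of the programming model, not the Hadoop framework itself, and the sample data is made up:

```python
from collections import defaultdict

def map_phase(records):
    """Map: turn each input line into (word, 1) pairs, independently per line."""
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as Hadoop does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values of each key into a final result."""
    return {key: sum(values) for key, values in groups.items()}

data = ["hadoop stores big data", "hadoop processes big data"]
print(reduce_phase(shuffle_phase(map_phase(data))))
```

Because every line is mapped independently and every key is reduced independently, both phases can run on many nodes at once; the shuffle is the only step that moves data between them.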

 What is YARN Hadoop?

YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop. In a multi-node cluster, it becomes very complex to manage, allocate, and release resources (CPU, memory, disk). Hadoop YARN manages these resources quite efficiently and allocates them on request from any application.
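As a sketch of what YARN manages, each node declares the resources it offers to containers in `yarn-site.xml`. The values below are illustrative, not defaults:

```xml
<!-- yarn-site.xml: how much of this node YARN may hand out to containers -->
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>  <!-- RAM (MB) available for containers on this node -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>     <!-- virtual cores available for containers -->
  </property>
</configuration>
```

YARN then carves these declared resources into containers and grants them to applications on request.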

To learn more, follow the guide on Hadoop daemons.

Reposted from: https://www.cnblogs.com/Dataflair/p/8583187.html
