AN INTRODUCTION TO HADOOP

https://www.mindtory.com/an-introduction-to-hadoop/

Big data is a topic that is ever evolving. Data cannot be of any use if it is not stored and analyzed properly. Hadoop is a technology by Apache that helps make large volumes of data useful, so it is an important technology to look at when one wants to know more about Big Data. This article will help you know more about Hadoop and what it actually is. It will delve a bit into the history and different versions of Hadoop and, for the uninitiated, it will also look at the high-level architecture of Hadoop and its different modules. Shachi Marathe introduces you to the concept of Hadoop for Big Data, touching on its basic concepts, history, advantages, uses, architecture and components.

By reading this article you will learn:

  • What is Hadoop?
  • History and versions of Hadoop
  • Advantages of using Hadoop applications
  • Where Hadoop can be used
  • Architecture / structure
  • Cloud-based systems
  • Distributed file systems (HDFS)
  • MapReduce
  • YARN
  • Brief on additional components

What is Hadoop?

Hadoop is an open-source framework that enables the storage and processing of large data-sets in a clustered environment.

The current version is Hadoop 3.0, and it is developed by a group of developers and contributors under the Apache Software Foundation.

The Hadoop framework enables the handling and processing of thousands of terabytes of data by making it possible to run and use applications on multiple commodity nodes. Hadoop has a distributed file system that ensures that the failure of one node does not bring the whole system down.

This robust nature of the Hadoop framework has played a key role in the processing of Big Data and in its adoption by organizations that use large volumes of data in their business.

A little bit of History

At the end of the 20th century, Internet usage started picking up worldwide. This led to a lot of data being generated, shared and used across the world. Most of the data was text-based and so, basic search engines with indexes on data were created to make data accessibility easier. But as data grew in size, simple manual indexing and search started failing, giving way to web crawlers and search-engine companies.

Nutch was an open-source web search engine, nurtured by Doug Cutting and Mike Cafarella. This search engine distributed the data and calculations across multiple different computers, so that multiple tasks were performed simultaneously, ensuring faster results. This was the first stint related to automated distributed data and processing.

Around 2006, the Nutch project was divided into a web crawler component and a processing component. The distributed computing and processing component was renamed as Hadoop (which was in fact the name of Doug Cutting’s toddler’s toy elephant).

In 2008, Hadoop was released as an open-source project by Yahoo. Over the years, Hadoop has evolved, and its framework and technology modules are currently maintained by the Apache Software Foundation, a global community of software enthusiasts, both developers and contributors.

Hadoop 1.0 became available to the public in November 2012. The framework was based on Google's MapReduce, in which an application is split into multiple smaller parts, called blocks or fragments, which can then be run on any node in a cluster environment (a group of resources, including servers, that ensures high availability, parallel processing and load balancing).

Hadoop 2.0, released in 2013, focused on better resource management and scheduling.

In January 2017, Hadoop 3.0 was released with some more enhancements.

Advantages of Using Hadoop

Some very basic advantages of Hadoop are –

  • Hadoop is a low-cost framework that can be used either as an in-house deployment or over the Cloud.
  • Hadoop provides a high-performance framework that is highly reliable, scalable and has massive data-management capabilities. All of these form the backbone of an effective application.
  • Hadoop can work on and process large volumes of data, even when that data is volatile.
  • Hadoop works effectively and efficiently on structured as well as unstructured data.

Systems where Hadoop can be used

Hadoop can be used across multiple applications and systems, based on an organization's requirements. Here is a list of some such applications:

  • Web Analytics – helps companies study visitors to their websites and their activities on the site. This can then help improve marketing strategies.
  • Operational Intelligence – data is captured and studied to find lacunae in processes and improve on them.
  • Risk Management and Security Management – data trends can be collected and studied to check on patterns of transactions and fraudulent activities.
  • Geo-spatial applications – data from satellites can be gathered and studied. This data is usually voluminous, and co-relational study is paramount.

There are many other applications that can use Hadoop implementations, to optimize marketing, etc.

The Hadoop Ecosystem

The main components of the Hadoop ecosystem are:

  • Hadoop Common – As the name suggests, this component contains the common libraries and utilities needed by the other Hadoop modules and components.
  • Hadoop Distributed File System (HDFS) – This is Hadoop's distributed file system, which stores data and files in blocks across many nodes in a cluster.
  • Hadoop YARN – A yarn, as we know, is a thread that can be used to tie things together. Similarly, Hadoop YARN ties the cluster together by managing its resources with appropriate scheduling.
  • Hadoop MapReduce – Hadoop MapReduce enables a programmer to work within a framework and build programs for large-scale data processing.

Over and above the components and modules stated above, the Hadoop platform comes with additional related applications or modules. These are Apache Oozie, Apache Hive, Apache HBase, Mahout, Apache Pig, etc.

Hadoop modules are designed and implemented such that, in a cluster, even if there are hardware failures, the framework will handle them and ensure there is no stoppage of service. The framework is written mostly in Java, with some parts in C and shell scripts. This makes it easier to adopt and implement based on your requirements.

What is a cluster?

A cluster means a group. When multiple, independent servers are grouped to work together, to ensure high availability, scalability and reliability, they form a server cluster. Applications based on server clusters ensure that users have uninterrupted access to important resources running on the server, thus improving working efficiency.

If one machine in this cluster fails, the workload is distributed across the other servers and this enables uninterrupted services to clients.
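That failover idea can be sketched in a few lines. This is an illustrative pure-Python model, not Hadoop code; the server and task names are hypothetical, and real clusters detect failures via heartbeats rather than a single function call.

```python
def redistribute(assignments, failed):
    """Reassign a failed server's tasks round-robin across the survivors.

    assignments: dict mapping server name -> list of task names
    failed: name of the server that went down
    """
    survivors = [s for s in assignments if s != failed]
    orphaned = assignments[failed]
    new_assignments = {s: list(t) for s, t in assignments.items() if s != failed}
    for i, task in enumerate(orphaned):
        new_assignments[survivors[i % len(survivors)]].append(task)
    return new_assignments

cluster = {"server1": ["t1", "t2"], "server2": ["t3"], "server3": ["t4"]}
after = redistribute(cluster, "server1")
# "server1" is gone, yet t1 and t2 continue on the surviving servers,
# so clients see no interruption of service.
```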

What is a Hadoop Cluster?

A Hadoop cluster, as mentioned above, is a cluster of servers and machines designed to handle large volumes of structured as well as unstructured data in a distributed environment. Typically there will be a NameNode and a JobTracker; these are the masters that control the slave machines, called DataNodes and TaskTrackers.

Figure # 4: Hadoop Cluster

As we already know, Big Data deals with large volumes of data that is volatile and unstructured, and the results need to be processed in real time. This requires higher processing power, multi-tasking, scalability and reliability, which means no failure conditions. A Hadoop cluster ensures all this by providing the ability to add more cluster nodes, and data is duplicated across nodes to ensure zero loss of data in case of failures.

What is a Distributed File System?

In a distributed file system (DFS), data is stored on a server. These data and files can then be accessed by multiple users from multiple locations. Access to the data can be controlled over the network. The user gets a feeling of working on local machines, but is actually working on the shared server.

 What is HDFS?

The Hadoop Distributed File System (HDFS) uses the Hadoop cluster in a master/slave architecture. HDFS is scalable and portable, and is written in Java specifically for the Hadoop framework. Typically, each main node is called a NameNode and it controls a cluster of DataNodes. Each NameNode need not have a DataNode, but a DataNode cannot exist without a NameNode.

So, in simpler terms, large files of data are stored across multiple machines. Data is replicated across multiple DataNodes, which then work together to ensure the integrity of the data. This works best when the Hadoop Distributed File System is used with its own modules, as this gives maximum utilization; otherwise, the effectiveness may not be optimal.

Figure # 5: HDFS
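The storage scheme can be made concrete with a toy sketch. The following pure-Python model (not real HDFS code) splits a file into fixed-size blocks and places each block on several DataNodes round-robin; the tiny block size and node names are purely illustrative, whereas real HDFS defaults to 128 MB blocks and a replication factor of 3.

```python
def split_into_blocks(data, block_size):
    """Split the file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes, replication=3):
    """Place each block on `replication` distinct DataNodes, round-robin."""
    placement = {node: [] for node in datanodes}
    for i, block in enumerate(blocks):
        for r in range(replication):
            placement[datanodes[(i + r) % len(datanodes)]].append(block)
    return placement

blocks = split_into_blocks(b"hadoop stores big files", block_size=8)
layout = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"])
# Any single DataNode can fail and every block still exists on other nodes.
```

Because every block lives on three of the four hypothetical DataNodes, losing any one node leaves at least two copies of each block, which is the intuition behind HDFS's zero-data-loss claim above.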

 What is MapReduce?

MapReduce is the framework that enables a user to code applications to process data.

MapReduce works in a Job-Task system. So, for each MapReduce Job, multiple tasks are created and managed. These are split such that they can be performed in parallel.

The Hadoop framework is implemented in Java, but MapReduce applications need not be written in Java.
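For instance, Hadoop's Streaming utility lets mappers and reducers be any program that reads standard input. The pure-Python sketch below (meant to run on its own, outside Hadoop, with made-up input splits) walks through the three conceptual phases of the classic word-count job: map, shuffle and reduce.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit one (word, 1) pair for every word in an input split."""
    for word in split.split():
        yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: the framework groups intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine all values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}

splits = ["Hadoop stores data", "Hadoop processes data"]  # two input splits
pairs = [p for s in splits for p in map_phase(s)]  # map tasks run in parallel in Hadoop
counts = reduce_phase(shuffle_phase(pairs))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In a real job, each split would be a block of a file in HDFS, map tasks would run on the nodes holding those blocks, and the framework itself would perform the shuffle between map and reduce tasks.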

Earlier, MapReduce managed resources in the cluster as well as data processing, but now there is one more layer between MapReduce and HDFS, called YARN. YARN does the resource management, leaving data processing to the MapReduce component.

Figure # 6: MapReduce


What is YARN?


YARN stands for Yet Another Resource Negotiator. The YARN layer was introduced in Hadoop 2.0, as an intermediate layer to manage resources within the cluster. It works like a large-scale distributed operating system.


YARN manages the applications running on Hadoop such that multiple applications can run in parallel with effective resource sharing, which was not always feasible with MapReduce alone. This makes Hadoop more suitable for running interactive applications and operations, instead of functioning only in a batch-processing scenario. It has also done away with the JobTracker and TaskTracker scenario and introduced two daemons, the ResourceManager and the NodeManager.
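To illustrate the idea of resource negotiation, here is a toy first-fit allocator in Python. It is not YARN's actual scheduling algorithm (YARN ships pluggable schedulers such as the Capacity and Fair schedulers); all application names and memory figures below are hypothetical.

```python
def grant_containers(requests, node_memory_mb):
    """First-fit sketch: give each container request to the first node with room.

    requests: list of (application, memory_needed_mb) tuples
    node_memory_mb: dict of node name -> free memory in MB
    Returns the granted (application, node) pairs; ungranted requests would wait.
    """
    free = dict(node_memory_mb)
    granted = []
    for app, needed in requests:
        for node in free:
            if free[node] >= needed:
                free[node] -= needed
                granted.append((app, node))
                break
    return granted

requests = [("app-1", 1024), ("app-2", 2048), ("app-1", 1024)]
granted = grant_containers(requests, {"node1": 2048, "node2": 2048})
# app-1 and app-2 run in parallel: node1 hosts two 1024 MB containers
# for app-1, while node2 hosts app-2's single 2048 MB container.
```

The point of the sketch is the resource-sharing behaviour described above: two applications obtain containers side by side instead of one job monopolizing the cluster.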


 Other Hadoop Components


Other than HDFS, MapReduce and YARN, there are multiple components that Hadoop offers that can be used –


  • Apache Flume – a tool that can be used to collect, move and aggregate large amounts of data into the Hadoop Distributed File System.
  • Apache HBase – as the name suggests, HBase is an open-source, non-relational, distributed database that is part of Apache Hadoop.
  • Apache Hive – Apache Hive is like a data warehouse, providing summarization, query and analysis of data.
  • Cloudera Impala – originally developed by Cloudera, a software company, Impala is a parallel-processing database used with Hadoop.
  • Apache Oozie – a server-based workflow-scheduling system, primarily used to manage jobs in Hadoop.
  • Apache Phoenix – based on Apache HBase, Phoenix is an open-source, parallel-processing, relational database engine.
  • Apache Pig – programs can be created on Hadoop using Apache Pig.
  • Apache Spark – a fast engine for processing Big Data, especially graphs and machine learning.
  • Apache Sqoop – used to transfer large volumes of data between Hadoop and structured data stores, such as relational databases.
  • Apache ZooKeeper – an open-source service used for configuration and synchronization of large distributed systems.
  • Apache Storm – an open-source data-processing system.

 

Figure # 8: Hadoop Ecosystem

Conclusion

Hadoop is a very useful open-source framework that can be used by organizations implementing Big Data techniques for their business. Hadoop was traditionally used by organizations as an in-house installation but now, considering the rising costs of infrastructure and management requirements, organizations are shifting towards Hadoop over the Cloud.
