Introduction to Hadoop and BigData Life-Cycle Management

Before reading this post, please go through my previous post, “Introduction to BigData”, to get some BigData basics. In this post, we will discuss Hadoop basics.

Post’s Brief Table of Contents:

  • Introduction to Hadoop
  • What is Apache Hadoop?
  • Why Apache Hadoop to Solve BigData Problems?
  • Hadoop Advantages
  • Hadoop is Suitable For
  • Hadoop is NOT Suitable For
  • Hadoop BigData Solutions
  • Hadoop Deployment Modes
  • Hadoop 2.x Components
  • Hadoop 2.x Components Responsibilities
  • BigData Life-Cycle Management

Introduction to Hadoop

We are living in the “BigData” era. Most organizations are facing BigData problems.

Hadoop is an open-source framework from the Apache Software Foundation for solving BigData problems. It is written entirely in the Java programming language.

Google published two tech papers: one on the Google File System (GFS) in October 2003, and another on the MapReduce algorithm in December 2004. The Google File System is Google’s proprietary distributed file system for storing and managing data efficiently and reliably on commodity hardware. MapReduce is a parallel, distributed programming model used for processing and generating large datasets.

Google solved its BigData problems using these two components: GFS and the MapReduce algorithm.
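Hadoop MapReduce jobs are normally written in Java against the Hadoop API, but the programming model itself is easy to sketch without any cluster. The following self-contained Python simulation (not Hadoop code, just an illustration of the model) runs the three classic phases — map, shuffle, reduce — for a word count:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input record."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data is big", "hadoop handles big data"]
counts = reduce_phase(shuffle_phase(map_phase(documents)))
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'handles': 1}
```

In a real cluster the map and reduce phases run in parallel across many nodes, and the shuffle moves data between them over the network; the logic per phase is the same.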

Hadoop was initially inspired by, and designed and developed following, Google’s papers on the MapReduce algorithm and the Google File System (GFS).

All Apache Hadoop core modules are developed in Java. The latest Hadoop version is 2.x.

The image above is the logo of the Apache Hadoop software.

What is Apache Hadoop?

Apache Hadoop is an open-source BigData solution framework for distributed storage and distributed computing (including cloud deployments) on commodity hardware.

Apache Hadoop official website: https://hadoop.apache.org/

NOTE: What is commodity hardware?
Commodity hardware means inexpensive, ordinary hardware, built from standard components for general-purpose computing. It is cheap, non-enterprise hardware.

Hadoop is a data-management software framework with scale-out storage and distributed processing.

It uses commodity hardware and delivers a very cost-effective BigData solution through distributed computing. Some vendors also offer cloud-based BigData Hadoop solutions, for example AWS (Amazon Web Services).

Any BigData Hadoop solution mainly provides two kinds of services:

  1. Storage service
  2. Computation service

Why Apache Hadoop to Solve BigData Problems?

Apache Hadoop is open-source BigData solution software. We should use it for the following reasons:

  • Open source
  • Very reliable
  • Highly scalable
  • Uses commodity hardware

As existing tools are not able to handle such huge and varied data, we can use the Apache Hadoop BigData solution to solve these problems.

Hadoop Advantages

Apache Hadoop provides the following benefits in solving BigData problems:

  • Open Source

    Apache Hadoop is an open-source BigData solution with a free license from the Apache Software Foundation.

  • High Availability

    Hadoop uses a replication technique. By default, the replication factor is 3; if required, we can change this value.

    If one node goes down for some reason, Hadoop automatically picks up the data from another nearby, available node. The Hadoop system detects the failed node automatically and does what is necessary to bring it back up, so the data remains highly available.

    Apache Hadoop therefore provides BigData solutions with no downtime.
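For reference, the replication factor is an ordinary HDFS setting. A minimal sketch of the relevant `hdfs-site.xml` entry (the property name `dfs.replication` is standard; the value shown is the default):

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- default: each HDFS block is stored on 3 different nodes -->
    <value>3</value>
  </property>
</configuration>
```

The factor can also be changed per file after the fact with `hdfs dfs -setrep`.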

  • Highly Scalable

    Hadoop is highly scalable because it can store and distribute very large amounts of data across hundreds or thousands of commodity machines operating in parallel. We can scale it horizontally or vertically based on our project requirements.

  • Better Performance

    Even though Hadoop uses commodity hardware, it distributes work across different nodes and performs those tasks in parallel, so it can process petabytes (PB) or more of data in just a few minutes and deliver better performance.

    NOTE: A node means any commodity computer in a Hadoop cluster.

  • Handles Huge and Varied Types of Data

    Hadoop handles very large volumes of varied data by using parallel-computing techniques.

  • Cost-Effective BigData Solutions

    Unlike traditional relational databases and tools, Hadoop uses inexpensive, non-enterprise commodity hardware to set up its clusters. We don’t need to buy very expensive, high-capacity, high-performance hardware to solve our BigData problems. Hadoop uses cheap hardware and delivers very effective solutions.

  • Increases Profits

    Building our BigData infrastructure from very cheap commodity hardware increases profits. If we use cloud technology to solve BigData problems, we can improve our profits even further.

  • Very Flexible

    Hadoop can accept data in any format from different data sources. We can integrate new data sources with a Hadoop system and use them very easily.

  • Fault-Tolerant Architecture

    Hadoop provides a very fault-tolerant architecture for solving BigData problems. We will discuss how it is resilient to failures in my coming posts.

  • Solves Complex Problems

    Because Hadoop follows distributed/parallel processing, it solves complex problems very easily.

Hadoop is Suitable For

Apache Hadoop is suitable for solving the following kinds of BigData problems:

  • Recommendation systems
  • Processing very big datasets
  • Processing a diversity of data
  • Log processing
  • Processing data at rest, which is where it works best

Hadoop is NOT Suitable For

Hadoop is not suitable for all BigData solutions. The following are a few scenarios where a BigData Hadoop solution is not a good fit:

  • Processing small datasets
  • Executing complex queries
  • Processing data in motion (streaming data), which Hadoop handles poorly

Hadoop BigData Solutions

We have many BigData Hadoop solutions on the current market. The most popular solutions are:

  • Apache Hadoop
  • Cloudera BigData Hadoop solution
  • Hortonworks
  • Google Cloud BigData solution
  • AWS EMR (Amazon Web Services – Elastic MapReduce)
  • MapR

NOTE:
All of the above BigData Hadoop solutions are implemented on top of the Apache Hadoop software.

Hadoop Deployment Modes

Apache Hadoop software is deployed, operated, or installed in one of the following three modes:

  • Standalone Mode

    Used for simple analysis or debugging. It is not a distributed or clustered architecture; Hadoop is simply installed on a single node for testing purposes.

  • Pseudo-Distributed Mode

    Installed on a single node, but simulated as if installed on multiple servers. It creates a simulated Hadoop cluster of nodes, but is not really distributed. It is mainly useful for preparing a POC (proof of concept) to test a multi-node, clustered Hadoop system.

  • Fully Distributed Mode

    A real, fully distributed Hadoop clustered architecture. It is used in live BigData solution systems.
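As a sketch of what distinguishes the modes in configuration: pseudo-distributed mode is typically enabled by pointing the default filesystem at a local HDFS daemon in `core-site.xml`, whereas standalone mode leaves `fs.defaultFS` at its local-filesystem default. The host and port below are the conventional single-node values, not requirements:

```xml
<!-- core-site.xml: pseudo-distributed mode runs the HDFS daemons on localhost -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```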

Hadoop 2.x Components

Apache Hadoop version 2.x has the following three major components:

  • HDFS
  • YARN
  • MapReduce
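To make the storage component concrete: HDFS splits every file into fixed-size blocks (128 MB by default in Hadoop 2.x) and stores each block on several nodes according to the replication factor. A small Python sketch of that arithmetic, assuming the default block size and replication factor:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the Hadoop 2.x default block size
REPLICATION = 3                  # default HDFS replication factor

def hdfs_footprint(file_size_bytes):
    """Return (number of HDFS blocks, total raw bytes stored across the cluster)."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return blocks, file_size_bytes * REPLICATION

# A 1 GB file occupies 8 blocks and 3 GB of raw cluster storage.
blocks, raw_bytes = hdfs_footprint(1024 * 1024 * 1024)
print(blocks, raw_bytes)  # 8 3221225472
```

This is why Hadoop is poor at small files: a million 1 KB files still cost a million blocks of metadata, even though each block is mostly empty.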

We will discuss the major differences between Hadoop 1.x and 2.x, and how these components work together in a Hadoop environment to deliver BigData solutions, in my coming posts.

Hadoop 2.x Components Responsibilities

The main responsibilities of the Apache Hadoop 2.x components are:

  • Data Storage
  • Resource Management
  • Data Integration
  • Data Governance
  • Data and Batch Processing
  • Data Analysis
  • Real-time Computing

We will discuss which individual Hadoop component is responsible for each of these tasks in detail in my coming posts.

BigData Life-Cycle Management

Generally, Hadoop systems use the following life-cycle to manage their BigData:

First, data sources create BigData. Data sources can be anything: social media, the Internet, mobile devices, computers, documents, audio and video, cameras, sensors, and so on. Once systems create BigData, it is captured and processed into formats suitable for storage in the Hadoop storage system.

After the BigData is stored in Hadoop storage, it is transformed and loaded into a NoSQL or Hadoop database.

Then we use Hadoop tools to analyze the BigData and prepare reports.

Businesses and organizations go through those reports or visualizations to understand their needs and take the necessary actions to improve business value.
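The life-cycle above — capture, store, analyze, report — can be sketched as a simple pipeline. The stage names below are illustrative placeholders, not actual Hadoop APIs:

```python
def capture(raw_events):
    """Capture: normalize raw records arriving from the data sources."""
    return [event.strip().lower() for event in raw_events]

def store(events):
    """Store: stand-in for writing into HDFS or a NoSQL database."""
    return {"records": events}

def analyze(dataset):
    """Analyze: produce a small report from the stored data."""
    records = dataset["records"]
    return {"total": len(records), "unique": len(set(records))}

raw = ["Click ", "view", "click", "purchase "]
report = analyze(store(capture(raw)))
print(report)  # {'total': 4, 'unique': 3}
```

In a real Hadoop deployment each stage is a separate tool (ingestion, HDFS/NoSQL storage, analysis jobs), but the flow of data from sources to reports follows this shape.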

NOTE:
If you don’t understand some terminology at this stage, don’t worry. Read the next posts and practice some programs. Once you are clear on the Hadoop architecture and how the Hadoop components work, come back to this page and read it again. I bet you will then have a clear idea about these concepts.

That’s all about the Hadoop introduction and BigData life-cycle management. We will discuss the Hadoop architecture and how the Hadoop components work in my coming posts.

Please drop me a comment if you like my post or have any issues/suggestions.

Translated from: https://www.journaldev.com/8795/introduction-to-hadoop
