An Overview of the Hadoop Ecosystem

This article introduces the main components of the Hadoop ecosystem, including Hadoop itself, Ambari, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, Spark, Tez, ZooKeeper, Sqoop, and Flume. Together these tools cover data import and export, storage, processing, analysis, and monitoring, forming a powerful big-data processing ecosystem. Understanding what each component does and how the pieces relate to one another makes it much easier to apply Hadoop to big-data workloads.

Drawing on the introductions on the official Hadoop website and on the software actually used in practice, this post briefly introduces the main tools in the Hadoop ecosystem, to broaden your view of the ecosystem as a whole.

This figure traces how the Hadoop ecosystem evolved from Google's three papers (GFS, MapReduce, and BigTable) into a full ecosystem, one that is still developing vigorously.


This is the Hadoop ecosystem diagram from the official website; it covers most of the commonly used Hadoop-related tools.




This diagram presents the Hadoop ecosystem as a stack, from the bottom layer up, making clear where each tool sits in the overall architecture.



This diagram shows the dependencies between Hadoop's core components and the surrounding system.



The sections below give a brief introduction to some of the tools in the Hadoop ecosystem.

Hadoop

From the official website:
What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.


The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.


The project includes these modules:


Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
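
To make the MapReduce programming model concrete, here is a minimal word-count job in Java against the org.apache.hadoop.mapreduce API, in the spirit of the classic tutorial example. It is a sketch, not part of the official text quoted here; the class name and the input/output paths passed on the command line are placeholders.

// Minimal word-count job: the map phase emits (word, 1) pairs, the reduce phase sums them.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation to cut shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input: a directory of text files in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory; must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this would typically be launched with hadoop jar wordcount.jar WordCount /some/input /some/output, with both paths in HDFS; YARN schedules the map and reduce tasks across the cluster.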
Other Hadoop-related projects at Apache include:


Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
ZooKeeper™: A high-performance coordination service for distributed applications.
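
ZooKeeper's one-line description is terse, so a small client sketch helps show what "coordination service" means in practice: applications read and write small znodes and watch them for changes, and recipes such as locks and leader election are built on those primitives. The snippet below uses the standard org.apache.zookeeper client API; the connection string, session timeout, and znode path are made-up values for illustration.

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkQuickstart {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to the ensemble; the host list and session timeout are illustrative values.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create a persistent top-level znode holding a small piece of configuration.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "hello".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read the data back; setting the watch flag instead would deliver a notification on change.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));

        zk.close();
    }
}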






/****************************************************************************/


Ambari

The Ambari monitoring page:

From the official website:

The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.


Ambari enables System Administrators to:


Provision a Hadoop Cluster
    Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
    Ambari handles configuration of Hadoop services for the cluster.
Manage a Hadoop Cluster
    Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
Monitor a Hadoop Cluster
    Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
    Ambari leverages Ambari Metrics System for metrics collection.
    Ambari leverages Ambari Alert Framework for system alerting and will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc).
Ambari enables Application Developers and System Integrators to:


Easily integrate Hadoop provisioning, management, and monitoring capabilities to their own applications with the Ambari REST APIs.
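
As a rough sketch of what integrating against the Ambari REST APIs can look like, the snippet below issues an authenticated GET against the clusters resource using Java's built-in HTTP client (Java 11+). The host name, the default port 8080, and the admin/admin credentials are assumptions taken from a default Ambari install; a real deployment would differ.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class AmbariClustersQuery {
    public static void main(String[] args) throws Exception {
        // Assumed defaults: Ambari server reachable on port 8080 with admin/admin credentials.
        String ambari = "http://ambari-host:8080";
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(ambari + "/api/v1/clusters"))   // list the clusters this Ambari server manages
                .header("Authorization", "Basic " + auth)
                .header("X-Requested-By", "ambari")             // header Ambari expects on modifying requests; harmless here
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The response body is JSON describing each cluster; a real integration would parse it
        // and drill into sub-resources of the cluster (services, hosts, and so on).
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}

This is the same API that, as the description above notes, backs the Ambari web UI itself.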


