Hadoop结构认识以及相关作用

最新推荐文章于 2024-10-09 15:24:44 发布

爱折腾的Albert

最新推荐文章于 2024-10-09 15:24:44 发布

阅读量188

点赞数

分类专栏：机器学习文章标签： hadoop

本文链接：https://blog.csdn.net/qq_36325121/article/details/108761575

版权

机器学习专栏收录该内容

1 篇文章 0 订阅

订阅专栏

Apache Hadoop是一个开源的分布式计算框架，包括HDFS、YARN和MapReduce。HDFS是高容错、低成本的分布式文件系统，提供高吞吐量的数据访问。YARN将资源管理和任务调度分离，实现更高效的集群资源利用。MapReduce则为大规模数据处理提供编程模型。这三个组件共同构建了Hadoop的大数据处理能力。

摘要由CSDN通过智能技术生成

文章目录

Hadoop

Hadoop

官方介绍
http://hadoop.apache.org/

The Apache™ Hadoop® project develops open-source software for
reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models. It is designed to scale up from
single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver
high-availability, the library itself is designed to detect and handle
failures at the application layer, so delivering a highly-available
service on top of a cluster of computers, each of which may be prone
to failures.

hadoop是一个开源的软件库

可靠
可扩展
分布式

主要模块

HDFS
MapReduce
Yarn

HDFS

介绍
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/.

总结几点

高容错能力
低硬件成本
高吞吐量
适用于大数据集

HDFS结构
在这里插入图片描述

Namenode
- 一个HDFS集群只有一个Namenode 。
- 管理命名空间及规范文件的访问。
Datanode
- 一个HDFS存在多个Datanode。
- 一个节点有一个Datanode。
- 文件的读取，写入，删除，以及分块的添加和删除。

YARN

在这里插入图片描述
介绍
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

yarn将资源管理功能和任务调度/监控功能分开

组成

ResourceManager：主要组件Scheduler and ApplicationsManager
- Scheduler ：给各种运行的程序分配资源
- ApplicationsManager：根据Scheuler发布的一些任务，执行程序

MapReduce

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.