Hadoop Study Notes (2): HDFS

This article describes the design goals of HDFS, Hadoop's distributed file system: running on commodity hardware, supporting large data sets, streaming data access, and so on. It also covers the HDFS architecture, including the roles of the NameNode and DataNode, the file system namespace, and the data replication mechanism, and introduces HDFS installation and common HDFS shell commands such as put, ls, mkdir, rm, and get.

Design Goals of HDFS

The previous article introduced what HDFS is and how it uses a multi-replica mechanism to provide high reliability. From that, the design goals of HDFS can be summarized as follows:

  • A very large distributed file system
  • Runs on commodity, low-cost hardware
  • Easy to scale, providing users with a file storage service of solid performance

The Architecture of HDFS

Let's look at the basic architecture of HDFS through the official documentation (http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html):

Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/.

This passage is a basic introduction to HDFS. The Hadoop Distributed File System is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences are what matter: HDFS is highly fault-tolerant and can be deployed on low-cost machines. It provides high-throughput access to data, which makes it well suited to applications with large data sets, and it relaxes a few POSIX requirements to support streaming access. HDFS was originally built as infrastructure for the Apache Nutch search engine and is now part of the Apache Hadoop Core project.

Assumptions and Goals

Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Hardware failure: hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of servers, each storing part of the file system's data. With so many components, each with a non-trivial probability of failing, some part of HDFS is effectively always non-functional. Therefore, fault detection and quick, automatic recovery are core architectural goals of HDFS.
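The multi-replica mechanism behind this recovery is configurable per cluster. As a minimal illustration (the file and values below are the standard Hadoop configuration, not code from this article), the replication factor lives in hdfs-site.xml, and 3 is the default:

```xml
<!-- hdfs-site.xml: illustrative snippet, default value shown -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- each block is stored on 3 DataNodes -->
  </property>
</configuration>
```

With 3 replicas, the loss of any single machine (or even a whole rack, given rack-aware placement) still leaves a live copy of every block.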

Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.

Streaming data access: applications that run on HDFS need streaming access to their data sets; they are not general-purpose applications running on general-purpose file systems. HDFS is designed for batch processing rather than interactive use, so the emphasis is on high throughput of data access rather than low latency. POSIX imposes many hard requirements that such applications do not need, and POSIX semantics in a few key areas have been traded away to increase data throughput.
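"Streaming access" here simply means large sequential reads rather than random seeks. A minimal sketch of that access pattern in plain Python (local in-memory streams stand in for HDFS input/output streams; this is an illustration, not the article's code):

```python
import io

def stream_copy(src, dst, chunk_size=128 * 1024):
    """Copy by large sequential reads -- the access pattern HDFS is tuned for."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total

# In-memory buffers simulate a 300 KB file being streamed end to end.
src = io.BytesIO(b"x" * 300_000)
dst = io.BytesIO()
print(stream_copy(src, dst))  # 300000
```

Every read is sequential and chunk-sized; there is no seek, which is exactly the pattern that lets HDFS favor throughput over latency.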

Large Data Sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

Large data sets: applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.
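To make the "large files" point concrete: HDFS splits each file into fixed-size blocks (128 MB by default since Hadoop 2.x), and the NameNode must track metadata for every block, which is why HDFS prefers a few huge files over millions of tiny ones. A quick back-of-the-envelope sketch (illustrative, not from the article):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size (Hadoop 2.x+)

def num_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks a file occupies; the last block may be partial."""
    return max(1, math.ceil(file_size / block_size))

print(num_blocks(1024**4))  # a 1 TB file -> 8192 blocks
print(num_blocks(1024))     # a tiny 1 KB file still costs one block entry
```

A single 1 TB file costs 8192 metadata entries; storing the same terabyte as a million 1 MB files would cost a million entries on the NameNode.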

Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed except for appends and truncates. Appending the content to the end of the files is supported but cannot be updated at arbitrary point. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model.

Simple coherency model: HDFS applications need a write-once-read-many access model. Once a file is created, written, and closed, it need not be changed except by appends and truncates: data can be appended at the end of a file, but it cannot be updated at an arbitrary position. This assumption simplifies data coherency and enables high-throughput access. A MapReduce job or a web crawler fits this model perfectly.
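The write-once-read-many rule can be sketched as a toy in-memory file object (a hypothetical model for illustration only; real HDFS enforces these rules on the NameNode/DataNode side):

```python
class WriteOnceFile:
    """Toy model of the HDFS coherency rules: create/write/close,
    then only append or truncate -- no updates at arbitrary offsets."""

    def __init__(self):
        self._data = bytearray()
        self._closed = False

    def write(self, chunk: bytes):
        if self._closed:
            raise IOError("file is closed; only append/truncate are allowed")
        self._data.extend(chunk)

    def close(self):
        self._closed = True

    def append(self, chunk: bytes):
        # New data may land only at the end of the file.
        self._data.extend(chunk)

    def truncate(self, size: int):
        # Truncation may only shorten the file.
        del self._data[size:]

    def read(self) -> bytes:
        return bytes(self._data)
```

There is deliberately no `seek`-and-overwrite operation: once closed, a file can only grow at the tail or shrink, which is what keeps replica coherency cheap.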

“Moving Computation is Cheaper than Moving Data”

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

"Moving computation is cheaper than moving data": a computation requested by an application is much more efficient when it executes near the data it operates on, especially when the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running, and HDFS provides interfaces for applications to move themselves closer to the data.
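The trade-off is easy to put in rough numbers: shipping a job (a few MB of code) to the node that holds a block is far cheaper than pulling gigabytes of data across the network. A hypothetical cost comparison (the sizes are made up for illustration):

```python
def network_bytes(data_size: int, code_size: int, move_computation: bool) -> int:
    """Bytes that must cross the network under each strategy."""
    return code_size if move_computation else data_size

data = 10 * 1024**3  # a 10 GB data set
code = 2 * 1024**2   # a 2 MB application jar

print(network_bytes(data, code, move_computation=True))   # ship the job: 2 MB
print(network_bytes(data, code, move_computation=False))  # ship the data: 10 GB
```

In this toy comparison, moving the computation transfers roughly 5000x fewer bytes, which is why MapReduce schedules tasks on (or near) the DataNodes that hold their input blocks.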

Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

Portability across heterogeneous hardware and software platforms: HDFS is designed to be easily ported from one platform to another, which helps it gain wide adoption as the storage platform of choice for a large set of applications.
