A Quick Overview of the Apache Hadoop Framework

Apache Hadoop is an open-source framework for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale from single servers to thousands of machines while delivering a highly available service. Hadoop includes HDFS (a distributed file system) and MapReduce, which processes data in parallel to reduce processing time. The Hadoop ecosystem also contains a range of complementary software, such as Apache Hive, Spark, and ZooKeeper, that extends its functionality.

Hadoop, now known as Apache Hadoop, was named after a toy elephant that belonged to co-founder Doug Cutting’s son. Doug chose the name for the open-source project as it was easy to spell, pronounce, and find in search results. The original yellow stuffed elephant that inspired the name appears in Hadoop’s logo.

What is Apache Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Source: Apache Hadoop

In 2003, Google released its paper on the Google File System (GFS). It detailed a proprietary distributed file system intended to provide efficient access to large amounts of data using commodity hardware. A year later, Google released another paper entitled “MapReduce: Simplified Data Processing on Large Clusters.” At the time, Doug was working at Yahoo. These papers were the inspiration for his open-source project Apache Nutch. In 2006, the project components then known as Hadoop moved out of Apache Nutch and were released as their own project.

Why is Hadoop useful?

Every day, billions of gigabytes of data are created in a variety of forms. Some examples of frequently created data are:

  • Metadata from phone usage
  • Website logs
  • Credit card purchase transactions
  • Social media posts
  • Videos
  • Information gathered from medical devices

“Big data” refers to data sets that are too large or complex to process using traditional software applications. Factors that contribute to the complexity of data are the size of the data set, speed of available processors, and the data’s format.

At the time of its release, Hadoop was capable of processing data on a larger scale than traditional software.

Core Hadoop

Data is stored in the Hadoop Distributed File System (HDFS). Using MapReduce, Hadoop processes data in parallel chunks (processing several parts at the same time) rather than in a single queue. This reduces the time needed to process large data sets.

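To make the pattern concrete, here is a minimal, single-process Python sketch of the MapReduce idea applied to word counting. It is purely illustrative: it does not use the Hadoop APIs, and the function names and sample input are invented for this example.

```python
from collections import defaultdict

# Sample input split into "chunks", standing in for blocks of a file in HDFS.
chunks = [
    ["the quick brown fox", "jumps over the lazy dog"],
    ["the dog barks", "the fox runs"],
]

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key, as Hadoop does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key (here, sum the per-word counts)."""
    return {word: sum(counts) for word, counts in grouped.items()}

# In Hadoop each chunk would be handled by a separate map task, usually on the
# machine that already stores that chunk; here we simply loop over the chunks.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'the': 4, 'quick': 1, 'brown': 1, ...}
```

In a real cluster, the framework schedules the map tasks, performs the shuffle over the network, retries failed tasks, and writes the reduced output back to HDFS.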

HDFS works by storing large files divided into chunks, and replicating them across many servers. Having multiple copies of files creates redundancy, which protects against data loss.

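The sketch below illustrates that blocking-and-replication idea in plain Python. The 128 MB block size and replication factor of 3 mirror common HDFS defaults, but the server names and the round-robin placement are invented for illustration; the HDFS NameNode uses its own, rack-aware placement policy.

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default block size
REPLICATION = 3                  # a common HDFS default replication factor
SERVERS = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical DataNodes

def split_into_blocks(file_size_bytes):
    """Return the byte ranges a file of the given size would be split into."""
    blocks, start = [], 0
    while start < file_size_bytes:
        end = min(start + BLOCK_SIZE, file_size_bytes)
        blocks.append((start, end))
        start = end
    return blocks

def place_replicas(blocks, servers):
    """Assign each block to REPLICATION distinct servers (round-robin, for illustration)."""
    ring = itertools.cycle(servers)
    return {i: [next(ring) for _ in range(REPLICATION)] for i in range(len(blocks))}

file_size = 300 * 1024 * 1024    # a 300 MB file splits into three blocks
blocks = split_into_blocks(file_size)
print(place_replicas(blocks, SERVERS))
# Every block lives on three different servers, so losing any one server
# still leaves two copies of each block.
```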

Hadoop Ecosystem

Many other software packages exist to complement Hadoop. These programs comprise the Hadoop Ecosystem. Some programs make it easier to load data into the Hadoop cluster, while others make Hadoop easier to use; a short example follows the list below.

The Hadoop Ecosystem includes:

  • Apache Hive
  • Apache Pig
  • Apache HBase
  • Apache Phoenix
  • Apache Spark
  • Apache ZooKeeper
  • Cloudera Impala
  • Apache Flume
  • Apache Sqoop
  • Apache Oozie
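
As one example of how an ecosystem tool builds on top of Hadoop, the following sketch uses PySpark (the Python API for Apache Spark) to count words in a file stored on HDFS. The NameNode address and file path are placeholders, and the sketch assumes a working Spark installation that can reach the HDFS cluster.

```python
from pyspark.sql import SparkSession

# Placeholder HDFS location; replace with a real NameNode address and file path.
INPUT_PATH = "hdfs://namenode:9000/data/sample.txt"

spark = SparkSession.builder.appName("EcosystemExample").getOrCreate()

# Spark reads the file directly from HDFS and applies the same
# map/shuffle/reduce pattern as MapReduce, keeping intermediate
# data in memory where possible.
counts = (
    spark.sparkContext.textFile(INPUT_PATH)
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```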

More Information:

Translated from: https://www.freecodecamp.org/news/a-quick-overview-of-the-apache-hadoop-framework/
