A Quick Overview of the Apache Hadoop Framework

Apache Hadoop is an open-source framework for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale from single servers to thousands of machines while delivering a highly available service. Hadoop includes HDFS (a distributed file system) and MapReduce, which processes data in parallel to reduce processing time. The Hadoop ecosystem also contains a range of complementary software, such as Apache Hive, Spark, and ZooKeeper, that extends its functionality.

Hadoop, now known as Apache Hadoop, was named after a toy elephant that belonged to co-founder Doug Cutting’s son. Doug chose the name for the open-source project as it was easy to spell, pronounce, and find in search results. The original yellow stuffed elephant that inspired the name appears in Hadoop’s logo.

What is Apache Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Source: Apache Hadoop

In 2003, Google released its paper on the Google File System (GFS). It detailed a proprietary distributed file system intended to provide efficient access to large amounts of data using commodity hardware. A year later, Google released another paper entitled “MapReduce: Simplified Data Processing on Large Clusters.” At the time, Doug was working at Yahoo. These papers were the inspiration for his open-source project Apache Nutch. In 2006, the project components then known as Hadoop moved out of Apache Nutch and were released as their own project.

Why is Hadoop useful?

Every day, billions of gigabytes of data are created in a variety of forms. Some examples of frequently created data are:

  • Metadata from phone usage
  • Website logs
  • Credit card purchase transactions
  • Social media posts
  • Videos
  • Information gathered from medical devices

“Big data” refers to data sets that are too large or complex to process using traditional software applications. Factors that contribute to the complexity of data are the size of the data set, speed of available processors, and the data’s format.

At the time of its release, Hadoop was capable of processing data on a larger scale than traditional software.

Core Hadoop

Data is stored in the Hadoop Distributed File System (HDFS). Using MapReduce, Hadoop processes data in parallel chunks (processing several parts at the same time) rather than in a single queue. This reduces the time needed to process large data sets.

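To make the pattern concrete, here is a minimal, single-process Python sketch of the MapReduce idea applied to word counting. It is purely illustrative: it does not use the Hadoop APIs, and the function names and sample input are invented for this example.

```python
from collections import defaultdict

# Sample input split into "chunks", standing in for blocks of a file in HDFS.
chunks = [
    ["the quick brown fox", "jumps over the lazy dog"],
    ["the dog barks", "the fox runs"],
]

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key, as Hadoop does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key (here, sum the per-word counts)."""
    return {word: sum(counts) for word, counts in grouped.items()}

# In Hadoop each chunk would be handled by a separate map task, usually on the
# machine that already stores that chunk; here we simply loop over the chunks.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'the': 4, 'quick': 1, 'brown': 1, ...}
```

In a real cluster, the framework schedules the map tasks, performs the shuffle over the network, retries failed tasks, and writes the reduced output back to HDFS.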

HDFS works by storing large files divided into chunks, and replicating them across many servers. Having multiple copies of files creates redundancy, which protects against data loss.

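The sketch below illustrates that blocking-and-replication idea in plain Python. The 128 MB block size and replication factor of 3 mirror common HDFS defaults, but the server names and the round-robin placement are invented for illustration; the HDFS NameNode uses its own, rack-aware placement policy.

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default block size
REPLICATION = 3                  # a common HDFS default replication factor
SERVERS = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical DataNodes

def split_into_blocks(file_size_bytes):
    """Return the byte ranges a file of the given size would be split into."""
    blocks, start = [], 0
    while start < file_size_bytes:
        end = min(start + BLOCK_SIZE, file_size_bytes)
        blocks.append((start, end))
        start = end
    return blocks

def place_replicas(blocks, servers):
    """Assign each block to REPLICATION distinct servers (round-robin, for illustration)."""
    ring = itertools.cycle(servers)
    return {i: [next(ring) for _ in range(REPLICATION)] for i in range(len(blocks))}

file_size = 300 * 1024 * 1024    # a 300 MB file splits into three blocks
blocks = split_into_blocks(file_size)
print(place_replicas(blocks, SERVERS))
# Every block lives on three different servers, so losing any one server
# still leaves two copies of each block.
```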

Hadoop Ecosystem

Many other software packages exist to complement Hadoop. These programs comprise the Hadoop Ecosystem. Some programs make it easier to load data into the Hadoop cluster, while others make Hadoop easier to use; a short example follows the list below.

The Hadoop Ecosystem includes:

  • Apache Hive
  • Apache Pig
  • Apache HBase
  • Apache Phoenix
  • Apache Spark
  • Apache ZooKeeper
  • Cloudera Impala
  • Apache Flume
  • Apache Sqoop
  • Apache Oozie
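
As one example of how an ecosystem tool builds on top of Hadoop, the following sketch uses PySpark (the Python API for Apache Spark) to count words in a file stored on HDFS. The NameNode address and file path are placeholders, and the sketch assumes a working Spark installation that can reach the HDFS cluster.

```python
from pyspark.sql import SparkSession

# Placeholder HDFS location; replace with a real NameNode address and file path.
INPUT_PATH = "hdfs://namenode:9000/data/sample.txt"

spark = SparkSession.builder.appName("EcosystemExample").getOrCreate()

# Spark reads the file directly from HDFS and applies the same
# map/shuffle/reduce pattern as MapReduce, keeping intermediate
# data in memory where possible.
counts = (
    spark.sparkContext.textFile(INPUT_PATH)
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```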

More Information:

Translated from: https://www.freecodecamp.org/news/a-quick-overview-of-the-apache-hadoop-framework/
