Hadoop Operations(Hadoop操作) 详解(一) 简介

最新推荐文章于 2024-04-13 08:18:49 发布

Noseparte

最新推荐文章于 2024-04-13 08:18:49 发布

阅读量1.3w

点赞数 1

分类专栏： hadoop Java 文章标签： hadoop 结构

本文链接：https://blog.csdn.net/noseparte/article/details/77413055

版权

本文为Hadoop Operations系列的第一部分，主要介绍了Hadoop操作的基础概念和重要性，旨在为读者提供Hadoop集群管理和运维的初步理解。

摘要由CSDN通过智能技术生成

Hadoop Operations 详解

文章目录结构如下：

Chapter 1. Introduction

Over the past few years, there has been a fundamental shift in data storage, management, and processing. Companies are storing more data from more sources in more formats than ever before. This isn’t just about being a “data packrat” but rather building products, features, and intelligence predicated on knowing more about the world (where the world can be users, searches, machine logs, or whatever is relevant to an organization). Organizations are finding new ways to use data that was previously believed to be of little value, or far too expensive to retain, to better serve their constituents. Sourcing and storing data is one half of the equation. Processing that data to produce information is fundamental to the daily operations of every modern business.

在过去的几年里，数据存储、管理和处理都发生了根本性的转变。公司从更多的资源中存储的数据比以往任何时候都多。这不仅仅是一个“数据包装器”，更重要的是构建产品、特性和基于对世界了解更多的信息(世界可以是用户、搜索、机器日志，或者任何与组织相关的东西)。组织正在寻找新的方法来使用以前被认为没有价值的数据，或者是为了更好地服务他们的选民而花费的昂贵的数据。

Data storage and processing isn’t a new problem, though. Fraud detection in commerce and finance, anomaly detection in operational systems, demographic analysis in advertising, and many other applications have had to deal with these issues for decades. What has happened is that the volume, velocity, and variety of this data has changed, and in some cases, rather dramatically. This makes sense, as many algorithms benefit from access to more data. Take, for instance, the problem of recommending products to a visitor of an ecommerce website. You could simply show each visitor a rotating list of products they could buy, hoping that one would appeal to them. It’s not exactly an informed decision, but it’s a start. The question is what do you need to improve the chance of showing the right person the right product? Maybe it makes sense to show them what you think they like, based on what they’ve previously looked at. For some products, it’s useful to know what they already own. Customers who already bought a specific brand of laptop computer from you may be interested in compatible accessories and upgrades. [ 1 ] One of the most common techniques is to cluster users by similar behavior (such as purchase patterns) and recommend products purchased by “similar” users. No matter the solution, all of the algorithms behind these options require data and generally improve in quality with more of it. Knowing more about a problem space generally leads to better decisions (or algorithm efficacy), which in turn leads to happier users, more money, reduced fraud, healthier people, safer conditions, or whatever the desired result might be.

不过，数据存储和处理并不是一个新问题。在商业和金融领域的欺诈检测、操作系统的异常检测、广告的人口分析以及许多其他的应用程序都需要几十年的时间来处理这些问题。所发生的情况是，这些数据的体积、速度和多样性发生了变化，在某些情况下，发生了相当大的变化。这是有意义的，因为许多算法从访问更多数据中受益。举个例子，向一个电子商务网站的访问者推荐产品的问题。你可以简单地向每个访问者展示一个p的旋转列表

Apache Hadoop is a platform that provides pragmatic, cost-effective, scalable infrastructure for building many of the types of applications described earlier. Made up of a distributed filesystem called the Hadoop Distributed Filesystem (HDFS) and a computation layer that implements a processing paradigm called MapReduce, Hadoop is an open source, batch data processing system for enormous amounts of data. We live in a flawed world, and Hadoop is designed to survive in it by not only tolerating hardware and software failures, but also treating them as first-class conditions that happen regularly. Hadoop uses a cluster of plain old commodity servers with no specialized hardware or network infrastructure to form a single, logical, storage and compute platform, or cluster , that can be shared by multiple individuals or groups. Computation in Hadoop MapReduce is performed in parallel, automatically, with a simple abstraction for developers that obviates complex synchronization and network programming. Unlike many other distributed data processing systems, Hadoop runs the user-provided processing logic on the machine where the data lives rather than dragging the data across the network; a huge win for performance.

Apache Hadoop是一个平台，它提供了实用的、具有成本效益的、可伸缩的基础设施，用于构建前面描述的许多应用程序类型。由称为Hadoop分布式文件系统(HDFS)的分布式文件系统和实现称为MapReduce的处理范例的计算层组成，Hadoop是一种开放源码的批处理数据处理系统，用于处理大量数据。我们生活在一个有缺陷的世界里，Hadoop的设计是为了生存，它不仅容忍硬件和软件的失败，而且还把它们当作是经常发生的一流条件。