Data Science and Big Data Analytics Study Notes 12: MapReduce and Hadoop

This chapter gives only a brief theoretical introduction aimed at the exam content; the in-depth material is left to other chapters.

The Apache Hadoop software library
– A framework that allows for the distributed processing of large datasets by a compute cluster.
Unstructured data (text, image, video, etc.)
– Data that has no inherent, consistent structure.

Apache Hadoop implements MapReduce.
• The MapReduce paradigm offers the means to:

  1. Break a large task into smaller tasks.
  2. Run tasks in parallel.
  3. Consolidate the outputs of the individual tasks into the final output.

MapReduce

As noted above, the MapReduce computing model breaks a large task into smaller tasks, runs the tasks in parallel, and finally consolidates the output of each task into the final result.

MapReduce consists of two basic steps:

  1. Map step
    • Applies an operation to a piece of data.
    • Provides some intermediate output.
  2. Reduce step
    • Consolidates the intermediate outputs from the map step.
    • Provides the final output.

Each step uses key/value pairs, denoted as <key, value>, as both its input and its output.
• The pairs can take complex forms
  – For example, the key is a filename, and the value is the entire content of the file.

Consider the classic word-count example, which is similar to the bag-of-words representation covered earlier.
• The map step parses the input string into individual words and emits a set of key/value pairs of the form <word, 1>.
• For each unique key, the reduce step sums the 1 values and outputs the <word, count> key/value pairs. A sketch of this job in the Hadoop Java API is given below.
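
The map and reduce functions for this word-count job are usually written against Hadoop's Java MapReduce API. The sketch below mirrors the widely published WordCount example; the whitespace tokenization and the input/output HDFS paths passed as command-line arguments are assumptions here, not something prescribed by the notes above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: parse each input line into words and emit a <word, 1> pair per word.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // intermediate output: <word, 1>
      }
    }
  }

  // Reduce step: for each unique word, sum the 1 values and emit <word, count>.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result); // final output: <word, count>
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation of <word, 1> pairs
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this would typically be launched with `hadoop jar wordcount.jar WordCount <input_dir> <output_dir>`.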

MapReduce
– Has the advantage of being able to distribute the workload over a cluster of computers and run the tasks in parallel.
– The documents, or even their pieces, could be processed simultaneously during the map step.
– The processing of one portion of the input can be carried out independently of the others.

Although MapReduce is easy to understand, it is not easy to implement, especially in a distributed system. Executing a MapReduce job (that is, running MapReduce code on some particular data) requires the management and coordination of several activities:

• MapReduce jobs need to be scheduled based on the system's workload.
• Jobs need to be monitored and managed to ensure that any errors encountered are properly handled, so that the job can continue to execute even if parts of the system fail.
• The input data needs to be spread across the cluster nodes.
• The map step processing of the input needs to be conducted across the distributed system, preferably on the same machines where the data resides.
• The intermediate outputs from the numerous map steps need to be collected and provided to the proper machines to execute the reduce step.
• The final output needs to be made available for use by other users, other applications, or possibly other MapReduce jobs.

• Scheduling and monitoring jobs
• Spreading the data and conducting the map step across the cluster
• Collecting the numerous intermediate outputs
• Making the final output available

Apache Hadoop handles all of these activities.

– Handles these activities and more, and makes most of them transparent to the user.
– An open source project managed and licensed by the Apache Software Foundation.

Hadoop Distributed File System (HDFS)

– Provides the capability to distribute data across a cluster to take advantage of the parallel processing of MapReduce
– HDFS is not an alternative to common file systems, but depends on each disk drive’s file system
– HDFS breaks a file into blocks and stores the blocks across the clusters.
• The blocks of a file are stored on different machines
• By default, creates three copies of each block across the cluster (redundancy for fault tolerance).

For a given file, HDFS breaks the file into 64 MB blocks and stores the blocks across the cluster. If a file is 300 MB, it is stored in five blocks: four 64 MB blocks and one 44 MB block. If a file is smaller than 64 MB, the single block is the size of the file.
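
As a quick check of this arithmetic, the small sketch below (an illustration, not part of the source text) computes the block layout for the 300 MB example file under the 64 MB default block size mentioned above.

```java
public class BlockLayout {
  public static void main(String[] args) {
    final long MB = 1024L * 1024;
    long blockSize = 64 * MB;   // default HDFS block size assumed in the text
    long fileSize = 300 * MB;   // example file size from the text

    long fullBlocks = fileSize / blockSize;                    // 4 blocks of 64 MB
    long remainderMb = (fileSize % blockSize) / MB;            // 44 MB left over
    long totalBlocks = fullBlocks + (remainderMb > 0 ? 1 : 0); // 5 blocks in total

    System.out.printf("%d blocks: %d x 64 MB plus one %d MB block%n",
        totalBlocks, fullBlocks, remainderMb);
  }
}
```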

Whenever possible, HDFS tries to store the blocks of a file on different machines, so that the map step can operate on each block of a file in parallel. Also, by default, HDFS creates three replicas of each block scattered across the cluster to provide the necessary redundancy in case of a failure. If a machine fails, HDFS replicates an accessible copy of the affected blocks to another available machine. HDFS is also rack aware, meaning that it distributes the blocks across multiple racks to prevent an entire rack failure from making the data unavailable. Moreover, the three replicas of each block give Hadoop flexibility in deciding which machine to use to process a particular block during the map step.

HDFS uses three Java daemons (background processes)

– NameNode (the master node) determines and tracks where the blocks of a data file are stored. It runs on a single machine, and the block location information resides in its memory.
– DataNode (worker nodes) manages data stored on each machine.
– Secondary NameNode provides the capability to perform some of the NameNode tasks to reduce the load of NameNode.

The NameNode daemon runs on a single machine and determines and tracks where the various blocks of a data file are stored. The DataNode daemon manages the data stored on each machine. If a client application wants to access a particular file stored in HDFS, the application contacts the NameNode, and the NameNode provides the application with the locations of the file's blocks. The application can then communicate with the appropriate DataNodes to access the file.

The third daemon, the Secondary NameNode, provides the capability to perform some of the NameNode tasks to reduce the load on the NameNode. Such tasks include updating the file system image with the contents of the file system edit logs. It is important to note that the Secondary NameNode is not a backup or redundant NameNode. If the NameNode goes down, the NameNode must be restarted and initialized with the last file system image file and the contents of the edit logs.
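
To make the client-side flow concrete, below is a minimal sketch of reading a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem). The NameNode address and the file path are hypothetical; the API hides the exchange described above, first obtaining the block locations from the NameNode and then streaming the data from the appropriate DataNodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // hypothetical NameNode address

    // FileSystem.get() talks to the NameNode; fs.open() then reads the blocks
    // from the DataNodes that hold them.
    try (FileSystem fs = FileSystem.get(conf);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/example.txt"))))) { // hypothetical path
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```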
The figure (omitted here) shows a Hadoop cluster of 10 machines storing one large file that spans three HDFS data blocks, with each block replicated three times. The machines running the NameNode and the Secondary NameNode are considered the master nodes.

Because the DataNodes accept commands from the master nodes, the machines running the DataNodes are referred to as worker nodes.

The Hadoop Ecosystem

Hadoop’s proprietary and open source tools
– Make Apache Hadoop easier to use.
– Provide additional functionality and features.

Hadoop-related Apache projects
– Pig: A high-level data-flow programming language.
– Hive: Provides SQL-like access.
– Mahout: Provides analytical tools.
– HBase: Provides real-time reads and writes.

• Apache Pig

– Consists of a data flow language, Pig Latin, and an environment to execute the Pig code.
– Benefit is simplifying the tasks of developing and executing a MapReduce job.
– When Pig commands are executed, the MapReduce job running in the background is transparent to the user.
– Three main characteristics
• Ease of programming, behind-the-scenes code optimization, and extensibility of capabilities.

• Apache Hive

– Similar to Pig, Hive enables users to process data without explicitly writing MapReduce code.
– One key difference from Pig
• Hive language (HiveQL) resembles Structured Query Language (SQL) rather than a script language.
– Hive may be a good tool to use if
• Data easily fits into a table structure.
• Data is already in HDFS.
• Developers are comfortable with SQL.

• Apache HBase

– Pig and Hive are intended for batch applications.
– In contrast, HBase provides real-time read and write access to large-scale datasets.
– Is built upon HDFS.
– Shares the workload over a large number of nodes in a distributed cluster.
– Uses a key/value structure to store the contents of an HBase table.
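
To make the key/value access pattern concrete, here is a minimal sketch using the standard HBase Java client API. The table name "users", column family "info", qualifier "name", and row key "user42" are hypothetical, and the table is assumed to already exist on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseKeyValueExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {

      // Write one cell: the row key plus (column family, qualifier) addresses the value.
      Put put = new Put(Bytes.toBytes("user42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);

      // Read the same cell back in real time; no batch MapReduce job is involved.
      Result result = table.get(new Get(Bytes.toBytes("user42")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(name));
    }
  }
}
```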

• Apache Mahout

– Tools such as R may suffer from performance issues with the large datasets stored in Hadoop.
– Supports the application of analytical techniques within the Hadoop environment.
– Provides Java libraries to apply analytical techniques in a scalable manner to Big Data.
– Implemented algorithms
• Classification, clustering, and collaborative filtering.

This Apache project provides Java libraries for applying analytical techniques to Big Data in a scalable manner. A mahout is a person who controls an elephant; Apache Mahout is the toolset that steers the Hadoop elephant toward meaningful analytical results.

The Java code provided by Mahout implements algorithms for several techniques, which fall into the following three categories.

Classification:
• Logistic regression
• Naive Bayes
• Random forests
• Hidden Markov models

Clustering:
• Canopy clustering
• K-means clustering
• Fuzzy k-means
• Expectation maximization (EM)

Recommenders/collaborative filtering:
• Non-distributed recommenders
• Distributed item-based collaborative filtering

NoSQL (Not only Structured Query Language)

– A term used to describe those data stores that are applied to unstructured data.
– As the size of data grows, the solution can scale by adding machines to the distributed system.
– Four major categories of NoSQL tools
• Key/value store, Document store, Column family store, Graph databases.
– The choice of data store is task-dependent.

References

  1. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, EMC Education Services, John Wiley & Sons, 27 Jan. 2015

  2. Data Mining: The Textbook by Charu C. Aggarwal, Springer 2015

  3. C.D. Manning, P. Raghavan and H. Schütze. Introduction to Information Retrieval, Cambridge University Press, 2008.

  4. Computer Vision: A Modern Approach (2nd Edition), by David A. Forsyth and Jean Ponce, Pearson, 2011.

The figures are from the course slides and my own notes; the Chinese-language figures are from the Internet.
