Big Data English Exam Review: Chapter 6 (Big Data Processing Concepts)

Article link: https://blog.csdn.net/weixin_63441326/article/details/135408516

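This article introduces the concepts of parallel processing and distributed data processing, and focuses on how Hadoop and MapReduce are used in big data processing. It covers how MapReduce works and a hands-on MapReduce programming experiment that counts word occurrences in a text file. It also covers the SCV principle in distributed systems and the experiment steps: uploading files, setting permissions, starting Hadoop, and running the MapReduce job.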

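Preface

Chapter 5 covered how big data is stored. This chapter covers how big data is processed, and uses the experiment done in class to show how to write map and reduce programs.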

1. Parallel data processing:

Parallel data processing breaks a task down into multiple subtasks that are executed simultaneously, typically on a single machine with multiple processors or cores.
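To make the definition concrete, here is a minimal sketch of splitting one task into subtasks that run concurrently on a single machine. The function names and the chunking scheme are illustrative assumptions, not part of the course material. A thread pool is used to keep the sketch simple; for CPU-bound work in CPython, `ProcessPoolExecutor` offers the same interface and is the usual choice.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Subtask: count the words in one chunk of lines."""
    return sum(len(line.split()) for line in chunk)


def parallel_word_total(lines, workers=4):
    """Split the lines into chunks, process the chunks concurrently, combine."""
    # Ceiling division so that at most `workers` chunks are created
    size = max(1, -(-len(lines) // workers))
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # Combining the partial results completes the original task
    return sum(partial_counts)
```

The same divide, process, combine shape reappears in distributed processing; the difference is where the subtasks run.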

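2. Distributed data processing:

Distributed data processing also divides a task into subtasks, but distributes them across multiple computers for execution. This is the point that separates it from parallel processing on a single machine.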

3. Hadoop and MapReduce

Hadoop supports both data parallelism and task parallelism.

MapReduce is a distributed computing model and programming framework. It applies the idea of parallel computing: the data is divided into multiple blocks, which are processed on multiple computers at the same time.

In the Map phase, the input data is split into small blocks that are processed in parallel by multiple Mapper tasks. Each Mapper task converts its input into key-value pairs and emits them as intermediate results.

In the Reduce phase, the intermediate results are merged and sorted, then processed in parallel by multiple Reducer tasks. Each Reducer task aggregates, filters, and computes over the intermediate results by key to produce the final output.

Key-value pairs are the only means of communication between map and reduce.
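The flow described above can be imitated in memory on one machine. This is only a conceptual sketch: the helper names (`map_phase`, `shuffle_and_sort`, `reduce_phase`) are invented for illustration and are not Hadoop APIs, and a real cluster runs the map and reduce tasks on different nodes.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply the map function to every record and collect key-value pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle_and_sort(pairs):
    """Group values by key and sort by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reduce_fn):
    """Apply the reduce function once per key."""
    return [reduce_fn(key, values) for key, values in groups]


def word_mapper(line):
    """Map: emit (word, 1) for every whitespace-separated word."""
    return [(word, 1) for word in line.split()]


def sum_reducer(word, counts):
    """Reduce: add up the counts collected for one word."""
    return (word, sum(counts))


def run_word_count(lines):
    pairs = map_phase(lines, word_mapper)
    groups = shuffle_and_sort(pairs)
    return reduce_phase(groups, sum_reducer)
```

Note that `word_mapper` and `sum_reducer` never call each other; the only thing passed between them is the list of key-value pairs.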

4. SCV principle

The CAP theorem applies to distributed data storage, whereas the SCV principle applies to distributed data processing.

The SCV principle states that a distributed data processing system cannot satisfy all three of Speed, Consistency, and Volume at the same time; at most two can be achieved together.

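5. Experiment: MapReduce programming

5.1 Experiment content:

Write a MapReduce program that counts the occurrences of every word in the input file and outputs the word that occurs most often. The lab report requirements are:

(1) The report must explain the key code and your own understanding of the program, and analyze why the program suits a big data environment.

(2) Student ID, name, course name, experiment name, and similar information must appear on the first page of the report.

(3) The report uses the SimSun (宋体) font at size Small Four (小四) with 1.25 line spacing, and a neat, tidy layout.

(4) Submit a PDF version.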
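The mapper.py and reducer.py used in the experiment are not reproduced in these notes. As a rough sketch of what such a pair might contain, the two functions below follow the usual Hadoop Streaming convention of tab-separated `key<TAB>value` text lines. The function names, the whitespace-based word splitting, and the single-file layout are assumptions made for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same word.

    Hadoop Streaming sorts mapper output by key before the reducer reads it,
    so lines for the same word arrive next to each other.
    """
    parsed = (
        line.rstrip("\n").split("\t", 1)
        for line in sorted_lines
        if "\t" in line  # skip blank or malformed lines
    )
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"
```

In the experiment each function would live in its own script, iterating over `sys.stdin` and printing every yielded line. Because each reducer only needs the lines for its own keys, more mapper and reducer tasks can be added as the input grows, which is one way to argue that the program suits a big data environment.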

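5.2 Experiment procedure:

1. Upload the experiment files:

Command format: docker cp <source path> <container name>:<destination path>

For example: docker cp D:\mapper.py bgsvr0:/exp/mapper.py

Upload the other files the same way:

docker cp D:\reducer.py bgsvr0:/exp/reducer.py
docker cp D:\text.txt bgsvr0:/exp/text.txt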

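2. Make the scripts executable. The /exp paths are inside the container, so the commands are run there through docker exec:

docker exec bgsvr0 chmod +x /exp/mapper.py
docker exec bgsvr0 chmod +x /exp/reducer.py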

3. Enter the container and start Hadoop:

docker exec -it bgsvr0 /bin/bash
/bgsys/hadoop-3.3.6/sbin/start-dfs.sh
/bgsys/hadoop-3.3.6/sbin/start-yarn.sh

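4. Copy the file into Hadoop:

(1) First change to the Hadoop directory:

cd /bgsys
cd hadoop-3.3.6

(2) Then set JAVA_HOME and add the directory of Hadoop's executables to PATH, so that Hadoop commands can be run directly in the terminal: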

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd6
export PATH=$PATH:/bgsys/hadoop-3.3.6/bin

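(3) Use the mkdir command to create a directory in HDFS:

hdfs dfs -mkdir /exp_input

(4) Use the put command to upload the test file to the new directory (the destination path comes last):

hdfs dfs -put /exp/text.txt /exp_input/

(5) Use the ls command to check that the upload succeeded:

hdfs dfs -ls /exp_input/text.txt

(6) Run the job. The command is assembled from these parts, in order:

The path and version of the Hadoop Streaming tool to run (a jar launched with hadoop jar)
-files: the files to pass to the MapReduce job
-mapper: the command for the Mapper phase, here the Python 3 interpreter running the /exp/mapper.py script
-reducer: the command for the Reducer phase, here the Python 3 interpreter running the /exp/reducer.py script
-input: the path and file name of the input data
-output: the path and folder name for the output

(7) View the output file: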

hdfs dfs -cat /exp_output/part-00000
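The experiment also asks for the most frequent word, which can be picked out of the job output afterwards. The helper below is a sketch under the assumption that each output line has the form `word<TAB>count`, the default key-value separator in Hadoop Streaming output; the function name is hypothetical.

```python
def most_frequent(output_lines):
    """Return (word, count) for the highest count, or (None, -1) if no lines parse."""
    best_word, best_count = None, -1
    for line in output_lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        word, count = line.rsplit("\t", 1)
        # On ties the first word encountered is kept
        if int(count) > best_count:
            best_word, best_count = word, int(count)
    return best_word, best_count
```

One way to use it is to pipe the output of the cat command above into a small script that calls `most_frequent(sys.stdin)`. When the job runs with several reducers there are several part files, and all of them would need to be fed in.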

5.3 Exam answer procedure (in English):

docker cp D:\mapper.py bgsvr0:/exp/mapper.py

docker cp D:\reducer.py bgsvr0:/exp/reducer.py

docker cp D:\text.txt bgsvr0:/exp/text.txt

docker exec bgsvr0 chmod +x /exp/mapper.py
docker exec bgsvr0 chmod +x /exp/reducer.py

docker exec -it bgsvr0 /bin/bash
/bgsys/hadoop-3.3.6/sbin/start-dfs.sh
/bgsys/hadoop-3.3.6/sbin/start-yarn.sh

cd /bgsys
cd hadoop-3.3.6

Add the directory of Hadoop's executables to PATH so that Hadoop commands can be run directly in the terminal:

export PATH=$PATH:/bgsys/hadoop-3.3.6/bin

hdfs dfs -mkdir /exp_input

hdfs dfs -put /exp/text.txt /exp_input/

hdfs dfs -ls /exp_input/text.txt

<path and version of the Hadoop Streaming tool> -files /exp/mapper.py,/exp/reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input /exp_input/text.txt -output /exp_output1

hdfs dfs -cat /exp_output1/part-00000