Steps to create a Java project for Hadoop: Create Java MapReduce for Apache Hadoop - Azure HDInsight | Microsoft Docs

weixin_39862097

Article link: https://blog.csdn.net/weixin_39862097/article/details/114673683

Develop Java MapReduce programs for Apache Hadoop on HDInsight

01/16/2020

Learn how to use Apache Maven to create a Java-based MapReduce application, then run it with Apache Hadoop on Azure HDInsight.

Prerequisites

Apache Maven, properly installed according to Apache's instructions. Maven is a project build system for Java projects.

Configure development environment

The environment used for this article was a computer running Windows 10. The commands were executed in a command prompt, and the various files were edited with Notepad. Modify the steps accordingly for your environment.

From a command prompt, enter the commands below to create a working environment:

IF NOT EXIST C:\HDI MKDIR C:\HDI

cd C:\HDI

Create a Maven project

Enter the following command to create a Maven project named wordcountjava:

mvn archetype:generate -DgroupId=org.apache.hadoop.examples -DartifactId=wordcountjava -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

This command creates a directory with the name specified by the artifactId parameter (wordcountjava in this example). This directory contains the following items:

pom.xml - The Project Object Model (POM) that contains information and configuration details used to build the project.

src\main\java\org\apache\hadoop\examples: Contains your application code.

src\test\java\org\apache\hadoop\examples: Contains tests for your application.
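The directory names above follow from the groupId passed to Maven: each dot-separated segment of the package name becomes one folder under src/main/java (or src/test/java). The sketch below only illustrates that mapping. The class and method names are hypothetical, and it prints forward slashes, whereas the Windows paths above use backslashes.

```java
public class PackageLayout {
    // Converts a Java package name into the Maven source directory that holds it.
    // Hypothetical helper for illustration only; Maven does this for you.
    static String sourceDirFor(String packageName) {
        return "src/main/java/" + packageName.replace('.', '/');
    }

    public static void main(String[] args) {
        // prints src/main/java/org/apache/hadoop/examples
        System.out.println(sourceDirFor("org.apache.hadoop.examples"));
    }
}
```

This is why WordCount.java, which declares package org.apache.hadoop.examples, is later created under src\main\java\org\apache\hadoop\examples.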

Remove the generated example code. Delete the generated test and application files, AppTest.java and App.java, by entering the commands below:

cd wordcountjava

DEL src\main\java\org\apache\hadoop\examples\App.java

DEL src\test\java\org\apache\hadoop\examples\AppTest.java

Update the Project Object Model

For a full reference of the pom.xml file, see https://maven.apache.org/pom.html. Open pom.xml by entering the command below:

notepad pom.xml

Add dependencies

In pom.xml, add the following text in the <dependencies> section:

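<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-examples</artifactId>
    <version>2.7.3</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-common</artifactId>
    <version>2.7.3</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.3</version>
    <scope>provided</scope>
</dependency>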

This defines the required libraries (listed within <artifactId>) with a specific version (listed within <version>). At compile time, these dependencies are downloaded from the default Maven repository. You can use the Maven repository search to view more.

The <scope>provided</scope> setting tells Maven that these dependencies should not be packaged with the application, because the HDInsight cluster provides them at run time.

Important

The version used should match the version of Hadoop present on your cluster. For more information on versions, see the HDInsight component versioning document.

Build configuration

Maven plug-ins allow you to customize the build stages of the project. This section is used to add plug-ins, resources, and other build configuration options.

Add the following code to the pom.xml file, and then save and close the file. This text must be inside the <project>...</project> tags in the file, for example, between </dependencies> and </project>.

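<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.3</version>
            <configuration>
                <transformers>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer">
                    </transformer>
                </transformers>
            </configuration>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
    </plugins>
</build>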

This section configures the Apache Maven Compiler Plugin and the Apache Maven Shade Plugin. The compiler plug-in is used to compile the application. The shade plug-in is used to prevent license duplication in the JAR package that Maven builds; duplicated license files otherwise cause a "duplicate license files" error at run time on the HDInsight cluster. Using maven-shade-plugin with the ApacheLicenseResourceTransformer implementation prevents the error.

The maven-shade-plugin also produces an uber jar that bundles the application with any dependencies it needs at run time that are not marked provided.

Save the pom.xml file.

Create the MapReduce application

Enter the command below to create and open a new file named WordCount.java. Select Yes at the prompt to create the new file.

notepad src\main\java\org\apache\hadoop\examples\WordCount.java

Copy and paste the Java code below into the new file, then save and close the file.

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Mapper: splits each input line into whitespace-separated tokens
    // and emits a (word, 1) pair for every token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also registered as the combiner): sums the counts of each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Let Hadoop consume its generic options; the remainder must be <in> <out>.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        // Job.getInstance replaces the deprecated Job(conf, name) constructor.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        // Exit with 0 on success and 1 on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Notice that the package name is org.apache.hadoop.examples and the class name is WordCount. You use these names when you submit the MapReduce job.
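To make the data flow of the job easier to follow, here is a minimal sketch that imitates the map, combine, and reduce steps in plain Java, with no Hadoop dependency. It is not how Hadoop executes the job; the class LocalWordCount, its method names, and the two sample "splits" are assumptions made for illustration. It also shows why the same summing logic can serve as both combiner and reducer: addition is associative and commutative, so summing partial sums gives the same totals.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {
    // Map step: emit one (word, 1) pair per whitespace-separated token,
    // mirroring what TokenizerMapper does for each input value.
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Reduce step: sum the values for each key, as IntSumReducer does.
    // A TreeMap keeps the keys sorted, similar to the sorted job output.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    // Runs map and a local combine on each split, then one final reduce
    // over the combined partial sums.
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> partials = new ArrayList<>();
        for (String split : splits) {
            partials.addAll(reduce(map(split)).entrySet());
        }
        return reduce(partials);
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(run(Arrays.asList("to be or not", "to be")));
    }
}
```

In the real job, Hadoop handles the splitting, shuffling, and sorting between these steps, and the combiner may run zero or more times per map task.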

Build and package the application

From the wordcountjava directory, use the following command to build a JAR file that contains the application:

mvn clean package

This command cleans any previous build artifacts, downloads any dependencies that have not already been installed, and then builds and packages the application.

Once the command finishes, the wordcountjava/target directory contains a file named wordcountjava-1.0-SNAPSHOT.jar.

Note

The wordcountjava-1.0-SNAPSHOT.jar file is an uber jar, which contains not only the WordCount job but also the dependencies that the job requires at run time and that the cluster does not already provide.

Upload the JAR and run jobs (SSH)

The following steps use scp to copy the JAR to the primary head node of your Apache Hadoop on HDInsight cluster. The ssh command is then used to connect to the cluster and run the example directly on the head node.

Upload the JAR to the cluster. Replace CLUSTERNAME with your HDInsight cluster name and then enter the following command:

scp target/wordcountjava-1.0-SNAPSHOT.jar sshuser@CLUSTERNAME-ssh.azurehdinsight.net:

Connect to the cluster. Replace CLUSTERNAME with your HDInsight cluster name and then enter the following command:

ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net

From the SSH session, use the following command to run the MapReduce application:

yarn jar wordcountjava-1.0-SNAPSHOT.jar org.apache.hadoop.examples.WordCount /example/data/gutenberg/davinci.txt /example/data/wordcountout

This command starts the WordCount MapReduce application. The input file is /example/data/gutenberg/davinci.txt, and the output directory is /example/data/wordcountout. Both the input file and the output are stored in the default storage for the cluster.

Once the job completes, use the following command to view the results:

hdfs dfs -cat /example/data/wordcountout/*

You should receive a list of words and counts, with values similar to the following text:

zeal 1

zelus 1

zenith 2
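If you want to post-process this output, each line holds a word and its count. The sketch below parses such lines into a map. It assumes the key and value are separated by a tab, which is the default for Hadoop's text output format; the class WordCountOutput and the sample string are hypothetical, and the code does no error handling for malformed lines.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordCountOutput {
    // Parses lines of the form "word<TAB>count" into a map that keeps
    // the order in which the lines appeared.
    static Map<String, Integer> parse(String output) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : output.split("\n")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split("\t");
            counts.put(parts[0], Integer.parseInt(parts[1].trim()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample text modeled on the output shown above.
        String sample = "zeal\t1\nzelus\t1\nzenith\t2\n";
        // prints {zeal=1, zelus=1, zenith=2}
        System.out.println(parse(sample));
    }
}
```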

Next steps

In this document, you have learned how to develop a Java MapReduce job.
