Environment used:
Ubuntu: 20.04
Hadoop: 3.2.1
Spark: 3.0.0-preview2
Java: 1.8.0_222
For setting up this environment, see Installing Hadoop + Spark on Ubuntu 👈
1. Java version:
- Create a new Maven project in IDEA
- Add the Spark Core dependency to pom.xml (see the reference pom.xml below)
- Create a .java file under src/main/java and write the program in it
P.S. When the program is written in Java, the spark-core dependency in pom.xml only needs to match the Spark version installed on the system; the Scala version does not matter.
Reference pom.xml (the exclusions keep Spark's transitive Hadoop jars from clashing with the locally installed Hadoop 3.2.1):
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>myTest</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <hadoop.version>3.2.1</hadoop.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.0.0-preview2</version>
            <exclusions>
                <exclusion>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>hadoop-client</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>hadoop-hdfs</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>
</project>
Reference Java program:
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class Tess {
    public static void main(String[] args) {
        // The input path has to be written like this, otherwise running on Spark may fail with
        // java.net.ConnectException: Connection refused
        // String logFile = "file:///home/lxb/Documents/spark-2.4.4-bin-hadoop2.7/README.md";
        // String logFile = "hdfs://192.168.1.115:9000/ncdc_data/2002"; // reading data stored in HDFS
        //   (run `hdfs getconf -confKey fs.default.name` in a terminal to get the reachable address)
        // String logFile = "C:\\Users\\LXB\\Desktop\\Spark_number.txt";
        String logFile = "hdfs://localhost:9000//user//README.txt";

        SparkConf conf = new SparkConf().setMaster("yarn").setAppName("Tess");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        // Count the lines containing "0" and the lines containing "1"
        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("0"); }
        }).count();
        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("1"); }
        }).count();

        System.out.println("Lines with 0: " + numAs + ", lines with 1: " + numBs);
        sc.stop();
    }
}
P.S. Running hdfs getconf -confKey fs.default.name in a terminal prints the reachable HDFS address.
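The same address can also be read programmatically from the Hadoop configuration. This is only a minimal sketch, assuming core-site.xml (or HADOOP_CONF_DIR) is visible on the JVM classpath; the class name PrintDefaultFs is just an illustrative placeholder:

import org.apache.hadoop.conf.Configuration;

public class PrintDefaultFs {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath;
        // if no core-site.xml is found, the built-in default "file:///" is returned.
        Configuration conf = new Configuration();
        System.out.println(conf.get("fs.defaultFS"));
    }
}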
Let IDEA re-import the dependencies declared in pom.xml, and then this Spark (Java) program can be run from inside IDEA (see the sketch below for a purely local run).
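For a quick smoke test inside IDEA without a reachable YARN cluster, a local master is a common alternative. This is only a hedged sketch, not the setup used above: it swaps setMaster("yarn") for local[*] and reuses the local README.md path from the commented-out example in the program; adjust the path to any text file on your machine.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalSmokeTest {
    public static void main(String[] args) {
        // local[*] runs Spark inside this JVM using all available cores,
        // so no Hadoop/YARN services need to be running.
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("LocalSmokeTest");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // A plain local file (hypothetical path; replace with one that exists).
        long lines = sc.textFile("file:///home/lxb/Documents/spark-2.4.4-bin-hadoop2.7/README.md").count();
        System.out.println("Total lines: " + lines);

        sc.stop();
    }
}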
Exporting the jar:
The exported jar can then be passed to spark-submit; see Installing Hadoop + Spark on Ubuntu 👈.
P.S. If you run into the [Hadoop] exception class org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface…, see this article 👈.