Create a New Project
Select Maven
Click Next
The default GroupId is fine (it can be changed as needed)
After clicking Finish, the following error appears: Cannot resolve plugin org.apache.maven.plugins:maven-clean-plugin:2.5
This error occurs because the local Maven settings file and the local repository path configured in IDEA do not match.
How to fix the error
- Step 1: File -> Settings
- Step 2: Click Build, Execution, Deployment
- Step 3: Select Build Tools -> Maven; here you can see that the Maven settings file and the local repository path do not match
Switch to C:\Users\32429\.m2 and note that this directory only contains the repository folder and no settings.xml, as shown below:
First copy the repository folder from C:\Users\32429\.m2 to E:\maven\apache-maven-3.6.3-bin\apache-maven-3.6.3, as shown below:
Then open the conf directory of the locally installed Maven,
find settings.xml in the conf directory and copy it to the parent directory,
i.e. copy settings.xml from E:\maven\apache-maven-3.6.3-bin\apache-maven-3.6.3\conf to E:\maven\apache-maven-3.6.3-bin\apache-maven-3.6.3.
Finally, back in IDEA, change the Maven settings file and local repository paths to point to these locations.
After clicking OK, the problem is solved.
Create the scala directories
Under src/main create a scala directory and mark it as Sources Root; then under src/test create another scala directory and mark it as Test Sources Root.
Configure pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>Testspark</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>15</maven.compiler.source>
        <maven.compiler.target>15</maven.compiler.target>
        <scala.version>2.12.10</scala.version>
        <hadoop.version>3.2.0</hadoop.version>
        <spark.version>3.1.1</spark.version>
        <hanlp.version>portable-1.8.0</hanlp.version>
        <scopt.version>3.3.0</scopt.version>
        <slf4j-api.version>1.7.21</slf4j-api.version>
        <slf4j-log4j12.version>1.7.21</slf4j-log4j12.version>
        <log4j.version>1.2.17</log4j.version>
        <junit.version>4.12</junit.version>
    </properties>

    <dependencies>
        <!-- Scala environment; can be omitted once the Spark dependencies are declared -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-compiler</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-reflect</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!-- Spark environment -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.12</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <!-- Hadoop -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- HanLP Chinese NLP package -->
        <dependency>
            <groupId>com.hankcs</groupId>
            <artifactId>hanlp</artifactId>
            <version>${hanlp.version}</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>${junit.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>${slf4j-api.version}</version>
        </dependency>
        <!-- Logging framework -->
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>${log4j.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.ansj/ansj_seg -->
        <dependency>
            <groupId>org.ansj</groupId>
            <artifactId>ansj_seg</artifactId>
            <version>5.1.6</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>4.5.1</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Make sure the directory structure is consistent with the configuration above (sourceDirectory and testSourceDirectory).
Reload the project.
Configure the Scala SDK
Be sure that the Scala version selected here (downloaded via Ivy) matches the Scala version of spark-core (2.12 in this pom); a quick runtime check is sketched below.
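To confirm at runtime which Scala and Spark versions are actually on the classpath, a minimal sketch can print both (the VersionCheck object is hypothetical, not part of the original project):

import org.apache.spark.SPARK_VERSION

object VersionCheck {
  def main(args: Array[String]): Unit = {
    // Scala version this code was compiled with, e.g. "version 2.12.10"
    println(scala.util.Properties.versionString)
    // Spark version on the classpath, e.g. "3.1.1"
    println(SPARK_VERSION)
  }
}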
Create a data directory named data, and create word.txt inside it (illustrative contents are shown below).
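word.txt just needs a few lines of space-separated words; for example, it could contain:

hello spark hello scala
hello hadoop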
Under src/main/scala, create WordCount.scala:
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf object and set the appName and master URL.
    //    local[2] means run locally, simulating 2 threads.
    val conf = new SparkConf().setMaster("local[2]").setAppName("wordcount")
    // 2. Create the SparkContext object, the source of all task computation;
    //    it creates the DAGScheduler and TaskScheduler and is the entry point of every Spark program.
    val sc = new SparkContext(conf)
    // 3. Read the data file; an RDD can be loosely understood as a collection whose elements are Strings.
    val inputRdd = sc.textFile("./data/word.txt")
    // 4. Split each line to get all the words.
    val splitRdd = inputRdd.flatMap(_.split(" "))
    // 5. Map each word to a (word, 1) pair.
    val pairRDD = splitRdd.map(x => (x, 1))
    // 6. Sum the counts of identical words; the first underscore is the accumulated value, the second is the next value.
    val resultRdd = pairRDD.reduceByKey(_ + _)
    // 7. Collect the results and print them.
    resultRdd.collect().foreach(println)
    // 8. Stop the SparkContext.
    sc.stop()
  }
}
The result is as follows:
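With the example word.txt contents assumed above, the collected (word, count) pairs printed to the console (interleaved with Spark's INFO logging, in no fixed order) would look like:

(hello,3)
(spark,1)
(scala,1)
(hadoop,1)

Since spark-sql_2.12 is already declared in pom.xml, the same count can also be expressed with the DataFrame/Dataset API. This is only a sketch assuming the same ./data/word.txt file, not part of the original example:

import org.apache.spark.sql.SparkSession

object WordCountSQL {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("wordcount-sql").getOrCreate()
    import spark.implicits._
    // Read each line as a Dataset[String] (column name "value"), split into words,
    // then group identical words and count them.
    spark.read.textFile("./data/word.txt")
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()
      .show()
    spark.stop()
  }
}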