Notes:
Spark version: 3.0.3
Scala version: 2.12.11
1. Create a Maven project and add the Scala plugin:
Create a new Maven project with version 1.0.0; in the next step, give the project the same name as the ArtifactId.
Then create a package com.sparkcore under src/main/java:
Add the Scala SDK to the project so that it has a Scala environment:
Next, add framework support to the project: in the Add Framework Support dialog, check Scala.
To verify that Scala is set up correctly, write a small Scala program:
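For example, a minimal object with a main method is enough to confirm the environment (the object name Hello is just an illustration):

object Hello {
  def main(args: Array[String]): Unit = {
    // If this prints to the console, the Scala SDK is wired up correctly
    println("Hello, Scala!")
  }
}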
You can see that the Scala program runs successfully.
2. Add the Spark dependency:
Add the following to pom.xml:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId> <!-- Scala version -->
        <version>3.0.3</version>                 <!-- Spark version -->
    </dependency>
</dependencies>

<build>
    <plugins>
        <!-- This plugin compiles Scala code into class files -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <!-- Bind to Maven's compile and test-compile phases -->
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.1.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
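With the assembly plugin bound to the package phase as above, the standard Maven build also produces a *-jar-with-dependencies.jar that bundles the project's dependencies alongside the regular artifact:

mvn package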
Then write a Spark program, spark01_wordCount, to count word occurrences. The input files live in the datas directory; the run output is shown in the screenshot below (log messages still included at this point).
package com.sparkcore

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object spark01_wordCount {
  def main(args: Array[String]): Unit = {
    // Run locally and name the application
    val sparkConf = new SparkConf().setMaster("local").setAppName("wordCountApp")
    val sc = new SparkContext(sparkConf)

    // Read every file under the datas directory, one RDD element per line
    val lines: RDD[String] = sc.textFile("datas")
    // Split each line on spaces to get individual words
    val words: RDD[String] = lines.flatMap(_.split(" "))
    // Map each word to (word, 1) and sum the counts per word
    val wordGroup1: RDD[(String, Int)] = words.map(word => (word, 1)).reduceByKey((a, b) => a + b)

    // Collect the results to the driver and print them
    wordGroup1.collect().foreach(println)

    sc.stop()
  }
}
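As a hypothetical illustration (the actual files in datas are not shown here): if datas contained a single file with the two lines "Hello Spark" and "Hello Scala", the program would print pairs such as:

(Hello,2)
(Spark,1)
(Scala,1)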
3. Remove log messages from the console:
Create a log4j.properties file in the project's resources directory and add the following logging configuration:
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
The effect after removing the log messages (only the program's output is kept):
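As an alternative to the log4j.properties file (not what this tutorial uses), Spark also lets you raise the log level programmatically for a single application, right after the SparkContext is created:

// Alternative: suppress INFO/WARN output in code instead of via log4j.properties
sc.setLogLevel("ERROR")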