Implementing WordCount in IDEA
Preparation
Create a Maven project
pom.xml
Sections that need to be modified or added:
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.11.8</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.1.1</version>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.17</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.21</version>
    </dependency>
</dependencies>
Then open Project Settings --> Libraries --> the + button.
If the Scala SDK is empty here, it needs to be imported manually.
Main code
import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // System.setProperty("hadoop.home.dir", "D:\\Hadoop2.6.0") // set HADOOP_HOME manually if your environment needs it
    val conf = new SparkConf().setMaster("local[2]").setAppName("wordcount")
    val sc: SparkContext = SparkContext.getOrCreate(conf)

    // Build an RDD from an in-memory list, split into 3 partitions
    val rdd1: RDD[String] = sc.parallelize(List("hello world", "hello java", "Hello scala java"), 3)
    // Classic word count: split lines into words, pair each word with 1, sum per key
    rdd1.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)

    val partitions: Array[Partition] = rdd1.partitions
    println(partitions.length) // 3

    println("--------------------")
    // Read from a local file (path relative to the project root)
    val lines = sc.textFile("in\\word.txt")
    lines.collect.foreach(println)

    // Read the same kind of file from HDFS
    val linesHDFS: RDD[String] = sc.textFile("hdfs://HadoopY:9000/kb09workspace/word.txt")
    println("--------------------")
    linesHDFS.collect.foreach(println)

    sc.stop()
  }
}
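The heart of the program is the flatMap --> map --> reduceByKey chain. As a sanity check, the same counting logic can be reproduced on plain Scala collections without Spark (a rough local equivalent of what reduceByKey computes per key; note that "Hello" and "hello" count as different words because the split is case-sensitive):

```scala
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val lines = List("hello world", "hello java", "Hello scala java")
    // Split each line into words, pair each with 1, then sum counts per key
    val counts: Map[String, Int] = lines
      .flatMap(_.split(" "))
      .map((_, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    counts.foreach(println) // e.g. (hello,2), (java,2), (Hello,1), (world,1), (scala,1)
  }
}
```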
Adjusting the log4j configuration
In External Libraries, find org.apache.spark:spark-core_2.11:2.1.1.
Open the jar, navigate into the org folder, then apache.spark, and copy log4j-defaults.properties from that path.
Create a folder at the same level as src, paste the file into it, and rename the file to drop "-defaults" (leaving log4j.properties).
Then edit the copied properties file; on line 19 change the root logger to:
log4j.rootCategory=ERROR, console
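For reference, a minimal sketch of what the renamed log4j.properties can look like after the change. The appender lines below follow Spark's shipped defaults from memory; verify them against the file you actually copied:

```properties
# Log only ERRORs from Spark to the console
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

With rootCategory set to ERROR, Spark's INFO startup banner and per-task progress logging are suppressed, so only the program's own println output and real errors reach the console.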