The First Spark Program: WordCount
1. Using spark-shell
- Prepare the data: create an `input` directory and a `Words.txt` file inside it.

  ```
  [zgl@hadoop101 spark-2.1.1]$ mkdir input
  [zgl@hadoop101 input]$ vim Words.txt
  ```

  Enter the following data into the file:

  ```
  hello spark hello scala hello world
  ```
- Start `spark-shell`:

  ```
  [zgl@hadoop101 spark-2.1.1]$ bin/spark-shell
  ```
- Write and run the WordCount program (spark-shell already provides a `SparkContext` as `sc`):

  ```
  scala> sc.textFile("input/").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
  res0: Array[(String, Int)] = Array((scala,1), (hello,3), (world,1), (spark,1))
  ```
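For readers new to the RDD API, the one-liner above can be unpacked into named steps. The following is a minimal sketch only; the intermediate variable names are assumptions, but the transformations are exactly the ones chained in the shell session.

```scala
import org.apache.spark.rdd.RDD

// Illustrative breakdown of the shell one-liner above (variable names are hypothetical).
val lines: RDD[String]         = sc.textFile("input/")        // one RDD element per line of Words.txt
val words: RDD[String]         = lines.flatMap(_.split(" "))  // split every line into words
val pairs: RDD[(String, Int)]  = words.map((_, 1))            // pair each word with an initial count of 1
val counts: RDD[(String, Int)] = pairs.reduceByKey(_ + _)     // sum the counts for each distinct word
counts.collect().foreach(println)                             // collect to the driver and print
```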
2. Using the IDEA Development Tool
- Create a Maven project and add the following dependencies:

  ```xml
  <dependencies>
      <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.11</artifactId>
          <version>2.1.1</version>
      </dependency>
  </dependencies>

  <build>
      <plugins>
          <!-- Packaging plugin; without it the Scala classes are not compiled or packaged -->
          <plugin>
              <groupId>net.alchim31.maven</groupId>
              <artifactId>scala-maven-plugin</artifactId>
              <version>3.4.6</version>
              <executions>
                  <execution>
                      <goals>
                          <goal>compile</goal>
                          <goal>testCompile</goal>
                      </goals>
                  </execution>
              </executions>
          </plugin>
      </plugins>
  </build>
  ```
- Create a `WordCount.scala` file with the following code (a variation that writes the output to a file instead of collecting it to the driver is sketched after the run result below):

  ```scala
  package com.guli

  import org.apache.spark.{SparkConf, SparkContext}

  object WordCount {
    def main(args: Array[String]): Unit = {
      val conf: SparkConf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
      val sc = new SparkContext(conf)
      val wcArray: Array[(String, Int)] = sc.textFile("/Users/zgl/Desktop/input")
        .flatMap(_.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)
        .collect()
      wcArray.foreach(println)
      sc.stop()
    }
  }
  ```
- Run the program; the console output is:

  ```
  (scala,1)
  (hello,3)
  (world,1)
  (spark,1)
  ```
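Collecting all results to the driver works for this tiny data set, but a more common pattern for larger inputs is to write the counts back out to storage. The following is a minimal sketch of that variation, assuming a hypothetical `output` directory (which must not already exist); everything else mirrors the program above.

```scala
package com.guli

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical variation of WordCount: save the counts to a directory
// with saveAsTextFile instead of collecting them into driver memory.
object WordCountToFile {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountToFile").setMaster("local[*]")
    val sc = new SparkContext(conf)

    sc.textFile("/Users/zgl/Desktop/input")  // same input directory as above
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("output")              // assumed output path; Spark creates the directory
    sc.stop()
  }
}
```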