Running locally
- 1. Create a new Maven project and add the following dependencies (spark-core_2.10 is built against Scala 2.10, so the scala-library version must match):
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.6</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.3</version>
</dependency>
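Maven does not compile Scala sources out of the box, so the project also needs a Scala build plugin. A minimal sketch using scala-maven-plugin (the plugin version here is an assumption; pick whatever is current for your Maven):

```xml
<build>
    <plugins>
        <!-- Compiles src/main/scala during the normal Maven lifecycle. -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```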
- 2. Create the following Scala class. Note that System.setProperty("HADOOP_USER_NAME", "hdfs") must name a user that has the required permissions in HDFS; then run it directly.
import org.apache.spark.{SparkConf, SparkContext}

object TestWordCount {
  def main(args: Array[String]): Unit = {
    // Impersonate an HDFS user that can read the input and write the output.
    System.setProperty("HADOOP_USER_NAME", "hdfs")
    val time = System.currentTimeMillis()
    val inPath = "hdfs://t1:8020/user/admin/test_in/word.txt"
    // Note: saveAsTextFile fails if this directory already exists; delete it between runs.
    val outPath = "hdfs://t1:8020/user/admin/test_out/"
    val conf = new SparkConf().setAppName("word_count").setMaster("local")
    val sc = new SparkContext(conf)
    // Split lines into words, count each word, sort by count (ascending), and save.
    sc.textFile(inPath)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _, 1)
      .sortBy(_._2)
      .saveAsTextFile(outPath)
    sc.stop()
    val cost = System.currentTimeMillis() - time
    println(s"cost $cost ms")
  }
}
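If the RDD chain is hard to read at first, the same word-count logic can be sketched on plain Scala collections (the sample lines below are made up for illustration); each step mirrors one stage of the pipeline above:

```scala
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("spark hadoop spark", "hdfs spark hadoop")
    val counts = lines
      .flatMap(_.split(" "))                          // lines -> words (flatMap)
      .map((_, 1))                                    // word -> (word, 1)
      .groupBy(_._1)                                  // reduceByKey analogue: group pairs by word
      .map { case (w, ps) => (w, ps.map(_._2).sum) }  // sum the 1s per word
      .toSeq
      .sortBy(_._2)                                   // ascending by count, like sortBy(_._2)
    counts.foreach(println)                           // (word, count) pairs, least frequent first
  }
}
```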
Running on the cluster
- 1. Switch to the hdfs user, then start spark-shell (if you are using Scala):
su hdfs
/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/bin/spark-shell
- 2. Type the code you want to run directly at the prompt and press Enter to see the result:
sc.textFile("hdfs://t1:8020/user/admin/test_in/word.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, 1).sortBy(_._2).saveAsTextFile("hdfs://t1:8020/user/admin/test_out/")
Here sc is the SparkContext that spark-shell creates for you.
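While experimenting in the shell, you can also skip writing to HDFS and collect a small result to the driver instead. A sketch, reusing the same example input path as above (collect() pulls everything to the driver, so only do this on small results):

```scala
// Prints the (word, count) pairs to the shell instead of saving files.
sc.textFile("hdfs://t1:8020/user/admin/test_in/word.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .collect()
  .foreach(println)
```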