I. Finding the top N values
1. First prepare the data files. Suppose there are three files whose records are comma-separated fields orderid,userid,payment,productid, and the task is to find the top N payment values. The contents of file1.txt, file2.txt, and file3.txt are given below:
(1) Contents of file1.txt:
1,1734,43,155
2,4323,12,34223
3,5442,32,3453
4,1243,34,342
5,1223,20,342
6,542,570,64
7,122,10,123
8,42,30,345
9,152,40,1123
(2) Contents of file2.txt:
10,435,67,155
11,567,9,34223
12,765,67,3453
13,78,7,342
14,234,3,342
15,567,344,64
16,78,3422,123
17,96,3425,345
18,345,565,1123
(3) Contents of file3.txt:
19,435,2342,155
20,567,345,34223
21,765,4332,3453
22,78,231,342
23,234,5463,342
24,567,2342,64
25,78,45634,123
26,96,23,345
27,345,456,1123
2. Upload the three files to hdfs://master:9000/example with: hadoop fs -put file* hdfs://master:9000/example
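As an optional sanity check, the upload can be verified from spark-shell before packaging anything; this assumes spark-shell on the cluster can reach the same HDFS namenode:

// Run in spark-shell: confirm all 27 sample records are readable from HDFS.
val lines = sc.textFile("hdfs://master:9000/example/")
println(lines.count())          // expected: 27
lines.take(3).foreach(println)  // peek at a few records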
3. Create a new project and add the pom.xml and TopN.scala files:
(1) pom.xml is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>comprehensive-example</groupId>
    <artifactId>comprehensiveexample</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.1</version>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>TopN</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
(2) TopN.scala is as follows:
import org.apache.spark.{SparkConf, SparkContext}

object TopN {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TopN").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val lines = sc.textFile("hdfs://master:9000/example/", 3)
    var num = 0
    lines.filter(l => l.length > 0 && l.split(",").length == 4) // keep only well-formed records
      .map(l => l.split(",")(2))   // extract the payment field
      .map(x => (x.toInt, ""))     // convert it to Int and wrap it as a key-value pair
      .sortByKey(false)            // sort by key in descending order
      .map(x => x._1)              // keep only the key
      .take(5)                     // take the first 5
      .foreach(x => {              // print each value with a running rank
        num = num + 1
        println(num + "\t" + x)
      })
  }
}
(3) Build the jar with mvn package, then run it with: spark-submit comprehensive-example/target/comprehensiveexample-1.0-SNAPSHOT-jar-with-dependencies.jar
With the sample data above, the output is:
1    45634
2    5463
3    4332
4    3425
5    3422
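As an aside, the sort-then-take pipeline above can be replaced with the RDD top() action, which returns the N largest elements directly without a full sort. A minimal sketch under the same input path (TopNWithTop is just an illustrative name):

import org.apache.spark.{SparkConf, SparkContext}

object TopNWithTop {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TopNWithTop").setMaster("local"))
    sc.setLogLevel("ERROR")
    val payments = sc.textFile("hdfs://master:9000/example/")
      .filter(l => l.trim.nonEmpty && l.split(",").length == 4)
      .map(_.split(",")(2).toInt)
    // top(5) uses the implicit Ordering[Int]; no explicit sortByKey is needed
    payments.top(5).zipWithIndex.foreach { case (p, i) => println((i + 1) + "\t" + p) }
  }
}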
(4) Modify the code so that it prints the full records directly. The modified code is:
import org.apache.spark.{SparkConf, SparkContext}

object TopN {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TopN").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val lines = sc.textFile("hdfs://master:9000/example/")
    var num = 0
    lines.filter(l => l.length > 0 && l.split(",").length == 4) // keep only well-formed records
      .map(l => (l.split(",")(2).toInt, l)) // use payment as the key and the whole record as the value
      .sortByKey(false)                     // sort by key in descending order
      .take(5)                              // take the first 5
      .foreach(x => {                       // print each record with a running rank
        num = num + 1
        println(num + "\t" + x._2)
      })
  }
}
(5) Rebuild the jar with mvn package and submit it with spark-submit again. The program now prints the full top-5 records:
1    25,78,45634,123
2    23,234,5463,342
3    21,765,4332,3453
4    17,96,3425,345
5    16,78,3422,123
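The same record-level result can also be obtained with top() and an Ordering on the payment key, which avoids the full sortByKey shuffle. A sketch under the same assumptions (TopNRecords is an illustrative name):

import org.apache.spark.{SparkConf, SparkContext}

object TopNRecords {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TopNRecords").setMaster("local"))
    sc.setLogLevel("ERROR")
    val records = sc.textFile("hdfs://master:9000/example/")
      .filter(l => l.trim.nonEmpty && l.split(",").length == 4)
      .map(l => (l.split(",")(2).toInt, l))
    // Compare the pairs by their payment key only
    records.top(5)(Ordering.by[(Int, String), Int](_._1))
      .zipWithIndex
      .foreach { case ((_, line), i) => println((i + 1) + "\t" + line) }
  }
}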
II. File sorting
1. Prepare the files file1.txt, file2.txt, and file3.txt
(1) Contents of file1.txt:
33
37
12
40
(2) Contents of file2.txt:
4
16
39
5
(3) Contents of file3.txt:
1
45
25
2. Create a FileSort.scala file with the following contents:
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object FileSort {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FileSort")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("file:///sunxj/work/spark/data", 3) // read the input files
    var idx = 0
    val result = lines.filter(l => l.trim.length > 0) // drop blank lines
      .map(x => (x.trim.toInt, ""))                   // convert to Int and wrap as a key-value pair
      .partitionBy(new HashPartitioner(1))            // merge the three partitions into one
      .sortByKey()                                    // sort by key in ascending order
      .map(x => {
        idx = idx + 1
        (idx, x._1)                                   // return (rank, value) pairs
      })
    result.saveAsTextFile("file:///sunxj/work/spark/result/") // write to the local directory /sunxj/work/spark/result/
  }
}
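A sketch of an equivalent version that avoids the mutable idx counter (which only works above because everything was forced into a single partition): sort globally, then let zipWithIndex assign the rank. FileSortZip and the result_zip output directory are illustrative names; with the sample data, both versions should write lines of the form (1,1), (2,4), up to (11,45).

import org.apache.spark.{SparkConf, SparkContext}

object FileSortZip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FileSortZip"))
    sc.textFile("file:///sunxj/work/spark/data", 3)
      .filter(_.trim.nonEmpty)           // drop blank lines
      .map(_.trim.toInt)                 // parse the values
      .sortBy(identity)                  // global ascending sort
      .zipWithIndex()                    // attach a zero-based index to each element
      .map { case (v, i) => (i + 1, v) } // make the rank 1-based and put it first
      .coalesce(1)                       // write a single output file
      .saveAsTextFile("file:///sunxj/work/spark/result_zip/")
  }
}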
3. Change the mainClass tag in pom.xml from TopN to FileSort, rebuild the jar with mvn package, submit it to the Spark cluster, and inspect the output files under /sunxj/work/spark/result/.
Note: the output directory must not be placed inside the input directory. For example, with the input at file:///sunxj/work/spark/data, the output must not go under it, such as file:///sunxj/work/spark/data/result; otherwise the job fails with a "Not a file" error like the one below (a sketch of how to guard against this follows the log):
2019-10-06 13:15:52 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-10-06 13:15:52 INFO SparkContext:54 - Running Spark version 2.4.0
2019-10-06 13:15:52 INFO SparkContext:54 - Submitted application: FileSort
2019-10-06 13:15:52 INFO SecurityManager:54 - Changing view acls to: sunxiaoju
2019-10-06 13:15:52 INFO SecurityManager:54 - Changing modify acls to: sunxiaoju
2019-10-06 13:15:52 INFO SecurityManager:54 - Changing view acls groups to:
2019-10-06 13:15:52 INFO SecurityManager:54 - Changing modify acls groups to:
2019-10-06 13:15:52 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sunxiaoju); groups with view permissions: Set(); users with modify permissions: Set(sunxiaoju); groups with modify permissions: Set()
2019-10-06 13:15:53 INFO Utils:54 - Successfully started service 'sparkDriver' on port 52397.
2019-10-06 13:15:53 INFO SparkEnv:54 - Registering MapOutputTracker
2019-10-06 13:15:53 INFO SparkEnv:54 - Registering BlockManagerMaster
2019-10-06 13:15:53 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2019-10-06 13:15:53 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2019-10-06 13:15:53 INFO DiskBlockManager:54 - Created local directory at /private/var/folders/7m/ls3n9dj958g25ktsw8d9cym80000gn/T/blockmgr-9484bbd8-e040-4fac-a652-42d8c1561641
2019-10-06 13:15:53 INFO MemoryStore:54 - MemoryStore started with capacity 366.3 MB
2019-10-06 13:15:53 INFO SparkEnv:54 - Registering OutputCommitCoordinator
2019-10-06 13:15:53 INFO log:192 - Logging initialized @2900ms
2019-10-06 13:15:53 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
2019-10-06 13:15:53 INFO Server:419 - Started @3040ms
2019-10-06 13:15:53 INFO AbstractConnector:278 - Started ServerConnector@3caa4757{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-10-06 13:15:53 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040.
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@62c5bbdc{/jobs,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3c321bdb{/jobs/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@24855019{/jobs/job,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4d4d8fcf{/jobs/job/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@610db97e{/stages,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6f0628de{/stages/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3fabf088{/stages/stage,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4ced35ed{/stages/stage/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2c22a348{/stages/pool,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7bd69e82{/stages/pool/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@74d7184a{/storage,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@51b01960{/storage/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6831d8fd{/storage/rdd,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@27dc79f7{/storage/rdd/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6b85300e{/environment,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3aaf4f07{/environment/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5cbf9e9f{/executors,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@18e8473e{/executors/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5a2f016d{/executors/threadDump,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a38ba58{/executors/threadDump/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3ad394e6{/static,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4acf72b6{/,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7561db12{/api,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6e9c413e{/jobs/job/kill,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@57a4d5ee{/stages/stage/kill,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://192.168.3.33:4040
2019-10-06 13:15:53 INFO SparkContext:54 - Added JAR file:/sunxj/work/spark/comprehensive-example/target/comprehensiveexample-1.0-SNAPSHOT-jar-with-dependencies.jar at spark://192.168.3.33:52397/jars/comprehensiveexample-1.0-SNAPSHOT-jar-with-dependencies.jar with timestamp 1570338953865
2019-10-06 13:15:53 INFO Executor:54 - Starting executor ID driver on host localhost
2019-10-06 13:15:54 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52398.
2019-10-06 13:15:54 INFO NettyBlockTransferService:54 - Server created on 192.168.3.33:52398
2019-10-06 13:15:54 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2019-10-06 13:15:54 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, 192.168.3.33, 52398, None)
2019-10-06 13:15:54 INFO BlockManagerMasterEndpoint:54 - Registering block manager 192.168.3.33:52398 with 366.3 MB RAM, BlockManagerId(driver, 192.168.3.33, 52398, None)
2019-10-06 13:15:54 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, 192.168.3.33, 52398, None)
2019-10-06 13:15:54 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, 192.168.3.33, 52398, None)
2019-10-06 13:15:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@34523d46{/metrics/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:55 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 242.5 KB, free 366.1 MB)
2019-10-06 13:15:55 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.4 KB, free 366.0 MB)
2019-10-06 13:15:55 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 192.168.3.33:52398 (size: 23.4 KB, free: 366.3 MB)
2019-10-06 13:15:55 INFO SparkContext:54 - Created broadcast 0 from textFile at FileSort.scala:6
2019-10-06 13:15:56 INFO deprecation:1173 - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
2019-10-06 13:15:56 INFO HadoopMapRedCommitProtocol:54 - Using output committer class org.apache.hadoop.mapred.FileOutputCommitter
2019-10-06 13:15:56 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2019-10-06 13:15:56 INFO SparkContext:54 - Starting job: runJob at SparkHadoopWriter.scala:78
2019-10-06 13:15:56 INFO FileInputFormat:249 - Total input paths to process : 4
2019-10-06 13:15:56 WARN DAGScheduler:87 - Creating new stage failed due to exception - job: 0
java.io.IOException: Not a file: file:/sunxj/work/spark/data/result
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:94)
at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:87)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:240)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:238)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.dependencies(RDD.scala:238)
at org.apache.spark.scheduler.DAGScheduler.getShuffleDependencies(DAGScheduler.scala:512)
at org.apache.spark.scheduler.DAGScheduler.getMissingAncestorShuffleDependencies(DAGScheduler.scala:479)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getOrCreateShuffleMapStage(DAGScheduler.scala:346)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:462)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:461)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:46)
at scala.collection.SetLike$class.map(SetLike.scala:92)
at scala.collection.mutable.AbstractSet.map(Set.scala:46)
at org.apache.spark.scheduler.DAGScheduler.getOrCreateParentStages(DAGScheduler.scala:461)
at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:448)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:962)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2065)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
2019-10-06 13:15:56 INFO DAGScheduler:54 - Job 0 failed: runJob at SparkHadoopWriter.scala:78, took 0.105820 s
2019-10-06 13:15:56 ERROR SparkHadoopWriter:91 - Aborting job job_20191006131556_0007.
java.io.IOException: Not a file: file:/sunxj/work/spark/data/result
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:94)
at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:87)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:240)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:238)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.dependencies(RDD.scala:238)
at org.apache.spark.scheduler.DAGScheduler.getShuffleDependencies(DAGScheduler.scala:512)
at org.apache.spark.scheduler.DAGScheduler.getMissingAncestorShuffleDependencies(DAGScheduler.scala:479)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getOrCreateShuffleMapStage(DAGScheduler.scala:346)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:462)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:461)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:46)
at scala.collection.SetLike$class.map(SetLike.scala:92)
at scala.collection.mutable.AbstractSet.map(Set.scala:46)
at org.apache.spark.scheduler.DAGScheduler.getOrCreateParentStages(DAGScheduler.scala:461)
at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:448)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:962)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2065)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1499)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1478)
at FileSort$.main(FileSort.scala:18)
at FileSort.main(FileSort.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Exception in thread "main" org.apache.spark.SparkException: Job aborted.
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:100)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1499)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1478)
at FileSort$.main(FileSort.scala:18)
at FileSort.main(FileSort.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Not a file: file:/sunxj/work/spark/data/result
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:94)
at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:87)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:240)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:238)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.dependencies(RDD.scala:238)
at org.apache.spark.scheduler.DAGScheduler.getShuffleDependencies(DAGScheduler.scala:512)
at org.apache.spark.scheduler.DAGScheduler.getMissingAncestorShuffleDependencies(DAGScheduler.scala:479)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getOrCreateShuffleMapStage(DAGScheduler.scala:346)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:462)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:461)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:46)
at scala.collection.SetLike$class.map(SetLike.scala:92)
at scala.collection.mutable.AbstractSet.map(Set.scala:46)
at org.apache.spark.scheduler.DAGScheduler.getOrCreateParentStages(DAGScheduler.scala:461)
at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:448)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:962)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2065)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
... 42 more
2019-10-06 13:15:56 INFO SparkContext:54 - Invoking stop() from shutdown hook
2019-10-06 13:15:56 INFO AbstractConnector:318 - Stopped Spark@3caa4757{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-10-06 13:15:56 INFO SparkUI:54 - Stopped Spark web UI at http://192.168.3.33:4040
2019-10-06 13:15:56 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-10-06 13:15:56 INFO MemoryStore:54 - MemoryStore cleared
2019-10-06 13:15:56 INFO BlockManager:54 - BlockManager stopped
2019-10-06 13:15:56 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2019-10-06 13:15:56 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-10-06 13:15:56 INFO SparkContext:54 - Successfully stopped SparkContext
2019-10-06 13:15:56 INFO ShutdownHookManager:54 - Shutdown hook called
2019-10-06 13:15:56 INFO ShutdownHookManager:54 - Deleting directory /private/var/folders/7m/ls3n9dj958g25ktsw8d9cym80000gn/T/spark-d4a7bc0f-087d-4865-998e-e45846256092
2019-10-06 13:15:56 INFO ShutdownHookManager:54 - Deleting directory /private/var/folders/7m/ls3n9dj958g25ktsw8d9cym80000gn/T/spark-fe4530b9-164f-48fd-aaa4-c943d4a165b0
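One way to avoid this failure (and the related "output directory already exists" error on re-runs) is to keep the output path outside the input directory and remove any stale output before writing. A minimal sketch using the Hadoop FileSystem API; OutputDirUtil is just an illustrative helper name:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

object OutputDirUtil {
  // Deletes the output directory if it already exists, so saveAsTextFile will not fail.
  def prepare(sc: SparkContext, outputDir: String): Unit = {
    val fs  = FileSystem.get(new URI(outputDir), sc.hadoopConfiguration)
    val out = new Path(outputDir)
    if (fs.exists(out)) fs.delete(out, true)
  }
}

// Usage inside FileSort.main, before saving (output kept outside the input directory):
//   OutputDirUtil.prepare(sc, "file:///sunxj/work/spark/result/")
//   result.saveAsTextFile("file:///sunxj/work/spark/result/")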
III. Secondary sort
1. First prepare the test data file1.txt with the following contents:
5 3
1 6
4 9
8 3
4 7
5 6
3 2
2. Create a custom sort-key class in SecondarySortKey.scala with the following contents:
class SecondarySortKey(val first: Int, val second: Int) extends Ordered[SecondarySortKey] with Serializable {
  def compare(other: SecondarySortKey): Int = {
    if (this.first - other.first != 0) { // if the first fields differ, order by first
      this.first - other.first
    } else {                             // otherwise order by second
      this.second - other.second
    }
  }
}
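Because SecondarySortKey extends Ordered, it can be sorted anywhere an Ordering is expected. A quick local check (no Spark needed; SecondarySortKeyCheck is an illustrative name) confirms that compare orders by the first field and then by the second:

object SecondarySortKeyCheck {
  def main(args: Array[String]): Unit = {
    val keys = List(new SecondarySortKey(5, 3), new SecondarySortKey(5, 6), new SecondarySortKey(1, 6))
    // sorted picks up the Ordered-based comparison: ascending by first, then by second
    keys.sorted.foreach(k => println(k.first + " " + k.second)) // prints: 1 6, 5 3, 5 6
  }
}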
3. SecondarySortApp.scala is as follows:
import org.apache.spark.{SparkConf, SparkContext}

object SecondarySortApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SecondarySort").setMaster("local")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("file:///sunxj/work/spark/data")
    // Pair each line with a SecondarySortKey built from its two fields
    val partSortKey = lines.map(x => (new SecondarySortKey(x.split(" ")(0).toInt, x.split(" ")(1).toInt), x))
    // Sort by key; the key is a SecondarySortKey, so its compare method is used:
    // order by the first field, and fall back to the second field when the first fields are equal
    val sorted = partSortKey.sortByKey(false)
    // Keep only the original line from each sorted key-value pair
    val sortedResult = sorted.map(sortedLine => sortedLine._2)
    // Collect and print
    sortedResult.collect().foreach(println)
  }
}
4. Change the mainClass tag in pom.xml from FileSort to SecondarySortApp, rebuild the jar with mvn package, submit it to the Spark cluster, and check the printed output.
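For reference, the same secondary sort can be written without a custom key class: Scala already provides an implicit Ordering for tuples that compares element by element (and, unlike a subtraction-based compare, it cannot overflow). A sketch under the same input path; with the test data above, both versions should print 8 3, 5 6, 5 3, 4 9, 4 7, 3 2, 1 6:

import org.apache.spark.{SparkConf, SparkContext}

object SecondarySortWithTuples {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SecondarySortWithTuples").setMaster("local"))
    sc.textFile("file:///sunxj/work/spark/data")
      .map { x =>
        val fields = x.split(" ")
        ((fields(0).toInt, fields(1).toInt), x) // (first, second) tuple as the sort key
      }
      .sortByKey(false) // implicit Ordering[(Int, Int)]: first field, then second, descending
      .map(_._2)        // keep the original line
      .collect()
      .foreach(println)
  }
}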