Big Data - Spark: Comprehensive Examples (Top N Values, File Sorting, Secondary Sort)

I. Top N example

1. First, prepare the data files. Suppose there are three files whose fields are comma-separated: orderid, userid, payment, productid. The goal is to find the top N payment values. The contents of file1.txt, file2.txt, and file3.txt are given below:

(1) Contents of file1.txt:

1,1734,43,155
2,4323,12,34223
3,5442,32,3453
4,1243,34,342
5,1223,20,342
6,542,570,64
7,122,10,123
8,42,30,345
9,152,40,1123

(2) Contents of file2.txt:

10,435,67,155
11,567,9,34223
12,765,67,3453
13,78,7,342
14,234,3,342
15,567,344,64
16,78,3422,123
17,96,3425,345
18,345,565,1123

(3) Contents of file3.txt:

19,435,2342,155
20,567,345,34223
21,765,4332,3453
22,78,231,342
23,234,5463,342
24,567,2342,64
25,78,45634,123
26,96,23,345
27,345,456,1123

2. Upload the three files to hdfs://master:9000/example with: hadoop fs -put file* hdfs://master:9000/example, as shown in the figure below:

3. Create a new project and set up the pom.xml and TopN.scala files:

(1) The pom.xml is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>comprehensive-example</groupId>
    <artifactId>comprehensiveexample</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.1</version>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>TopN</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

(2) TopN.scala is as follows:

import org.apache.spark.{SparkConf, SparkContext}

object TopN {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TopN").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val lines = sc.textFile("hdfs://master:9000/example/", 3)
    var num = 0
    lines.filter(l => l.length > 0 && l.split(",").length == 4) // keep only well-formed records with 4 fields
      .map(l => l.split(",")(2))                                // extract the payment field
      .map(x => (x.toInt, ""))                                  // convert to Int and wrap as a key-value pair
      .sortByKey(false)                                         // sort by key in descending order
      .map(x => x._1)                                           // keep only the key
      .take(5)                                                  // take the top 5
      .foreach(x => {                                           // print each value with a running index
        num = num + 1
        println(num + "\t" + x)
      })
  }
}

(3) Build the jar with mvn package, then run it with: spark-submit comprehensive-example/target/comprehensiveexample-1.0-SNAPSHOT-jar-with-dependencies.jar

The output is as follows:
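Based on the sample data above, the five largest payment values are 45634, 5463, 4332, 3425, and 3422, so the expected output is:

1	45634
2	5463
3	4332
4	3425
5	3422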

(4) Modify the code to print the full records directly. The modified code is as follows:

import org.apache.spark.{SparkConf, SparkContext}

object TopN {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TopN").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val lines = sc.textFile("hdfs://master:9000/example/")
    var num = 0
    lines.filter(l => l.length > 0 && l.split(",").length == 4) // keep only well-formed records with 4 fields
      .map(l => (l.split(",")(2).toInt, l))                     // key each record by its payment value
      .sortByKey(false)                                         // sort by key in descending order
      .take(5)                                                  // take the top 5
      .foreach(x => {                                           // print each record with a running index
        num = num + 1
        println(num + "\t" + x._2)
      })
  }
}

(5) Rebuild the jar and submit it with spark-submit again. The result is shown in the figure below:
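Based on the sample data, the five records with the largest payment values should be printed as:

1	25,78,45634,123
2	23,234,5463,342
3	21,765,4332,3453
4	17,96,3425,345
5	16,78,3422,123

As a side note, the same Top N result can also be obtained without a full sortByKey by using RDD.top, which aggregates only the N largest elements from each partition. The following is a minimal sketch of that variant (not part of the original walkthrough; it assumes the same input path and relies on the default tuple ordering, which compares the payment key first):

import org.apache.spark.{SparkConf, SparkContext}

object TopNWithTop {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TopNWithTop").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    // alternative sketch, not from the original post: same input path as above
    val lines = sc.textFile("hdfs://master:9000/example/")
    val top5 = lines.filter(l => l.length > 0 && l.split(",").length == 4)
      .map(l => (l.split(",")(2).toInt, l)) // key each record by its payment value
      .top(5)                               // largest 5 pairs under the default tuple ordering (payment first)
    top5.zipWithIndex.foreach { case ((_, record), i) => println((i + 1) + "\t" + record) }
    sc.stop()
  }
}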

II. File sorting

1. Prepare three files: file1.txt, file2.txt, and file3.txt

(1) Contents of file1.txt:

33
37
12
40

(2) Contents of file2.txt:

4
16
39
5

(3) Contents of file3.txt:

1
45
25

2. Create a new file FileSort.scala with the following contents:

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object FileSort {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FileSort")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("file:///sunxj/work/spark/data", 3) // read the input files as three partitions
    var idx = 0
    val result = lines.filter(l => l.trim.length > 0)           // drop empty lines
      .map(x => (x.trim.toInt, ""))                             // convert to Int and wrap as a key-value pair
      .partitionBy(new HashPartitioner(1))                      // merge the three partitions into one
      .sortByKey()                                              // sort by key in ascending order
      .map(x => {
        idx = idx + 1                                           // running index; correct only because there is a single partition
        (idx, x._1)                                             // return (index, value) pairs
      })
    result.saveAsTextFile("file:///sunxj/work/spark/result/")   // save to the local directory /sunxj/work/spark/result/
  }
}
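As a minor variant (my own sketch, not from the original), the explicit HashPartitioner step can be folded into sortByKey, which accepts a target partition count, so the transformation chain in FileSort above could equivalently be written as:

    val result = lines.filter(l => l.trim.length > 0)
      .map(x => (x.trim.toInt, ""))
      .sortByKey(ascending = true, numPartitions = 1)  // sort and collapse to one partition in a single step
      .map(x => { idx = idx + 1; (idx, x._1) })        // same running index as above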

3. Change the mainClass tag in pom.xml from TopN to FileSort, rebuild the jar with mvn package, submit it to the Spark cluster, and inspect the output file, as shown in the figure below:
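Based on the eleven numbers in the three input files, the single output file (part-00000) should contain:

(1,1)
(2,4)
(3,5)
(4,12)
(5,16)
(6,25)
(7,33)
(8,37)
(9,39)
(10,40)
(11,45)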

Note: the output directory must not be located inside the input directory. For example, if the input is file:///sunxj/work/spark/data, the output must not be placed under data, such as file:///sunxj/work/spark/data/result; otherwise a "Not a file" error occurs, as in the following log:

2019-10-06 13:15:52 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-10-06 13:15:52 INFO  SparkContext:54 - Running Spark version 2.4.0
2019-10-06 13:15:52 INFO  SparkContext:54 - Submitted application: FileSort
2019-10-06 13:15:52 INFO  SecurityManager:54 - Changing view acls to: sunxiaoju
2019-10-06 13:15:52 INFO  SecurityManager:54 - Changing modify acls to: sunxiaoju
2019-10-06 13:15:52 INFO  SecurityManager:54 - Changing view acls groups to: 
2019-10-06 13:15:52 INFO  SecurityManager:54 - Changing modify acls groups to: 
2019-10-06 13:15:52 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(sunxiaoju); groups with view permissions: Set(); users  with modify permissions: Set(sunxiaoju); groups with modify permissions: Set()
2019-10-06 13:15:53 INFO  Utils:54 - Successfully started service 'sparkDriver' on port 52397.
2019-10-06 13:15:53 INFO  SparkEnv:54 - Registering MapOutputTracker
2019-10-06 13:15:53 INFO  SparkEnv:54 - Registering BlockManagerMaster
2019-10-06 13:15:53 INFO  BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2019-10-06 13:15:53 INFO  BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2019-10-06 13:15:53 INFO  DiskBlockManager:54 - Created local directory at /private/var/folders/7m/ls3n9dj958g25ktsw8d9cym80000gn/T/blockmgr-9484bbd8-e040-4fac-a652-42d8c1561641
2019-10-06 13:15:53 INFO  MemoryStore:54 - MemoryStore started with capacity 366.3 MB
2019-10-06 13:15:53 INFO  SparkEnv:54 - Registering OutputCommitCoordinator
2019-10-06 13:15:53 INFO  log:192 - Logging initialized @2900ms
2019-10-06 13:15:53 INFO  Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
2019-10-06 13:15:53 INFO  Server:419 - Started @3040ms
2019-10-06 13:15:53 INFO  AbstractConnector:278 - Started ServerConnector@3caa4757{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-10-06 13:15:53 INFO  Utils:54 - Successfully started service 'SparkUI' on port 4040.
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@62c5bbdc{/jobs,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3c321bdb{/jobs/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@24855019{/jobs/job,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4d4d8fcf{/jobs/job/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@610db97e{/stages,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6f0628de{/stages/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3fabf088{/stages/stage,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4ced35ed{/stages/stage/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2c22a348{/stages/pool,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7bd69e82{/stages/pool/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@74d7184a{/storage,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@51b01960{/storage/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6831d8fd{/storage/rdd,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@27dc79f7{/storage/rdd/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6b85300e{/environment,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3aaf4f07{/environment/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5cbf9e9f{/executors,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@18e8473e{/executors/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5a2f016d{/executors/threadDump,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a38ba58{/executors/threadDump/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3ad394e6{/static,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4acf72b6{/,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7561db12{/api,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6e9c413e{/jobs/job/kill,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@57a4d5ee{/stages/stage/kill,null,AVAILABLE,@Spark}
2019-10-06 13:15:53 INFO  SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://192.168.3.33:4040
2019-10-06 13:15:53 INFO  SparkContext:54 - Added JAR file:/sunxj/work/spark/comprehensive-example/target/comprehensiveexample-1.0-SNAPSHOT-jar-with-dependencies.jar at spark://192.168.3.33:52397/jars/comprehensiveexample-1.0-SNAPSHOT-jar-with-dependencies.jar with timestamp 1570338953865
2019-10-06 13:15:53 INFO  Executor:54 - Starting executor ID driver on host localhost
2019-10-06 13:15:54 INFO  Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52398.
2019-10-06 13:15:54 INFO  NettyBlockTransferService:54 - Server created on 192.168.3.33:52398
2019-10-06 13:15:54 INFO  BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2019-10-06 13:15:54 INFO  BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, 192.168.3.33, 52398, None)
2019-10-06 13:15:54 INFO  BlockManagerMasterEndpoint:54 - Registering block manager 192.168.3.33:52398 with 366.3 MB RAM, BlockManagerId(driver, 192.168.3.33, 52398, None)
2019-10-06 13:15:54 INFO  BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, 192.168.3.33, 52398, None)
2019-10-06 13:15:54 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, 192.168.3.33, 52398, None)
2019-10-06 13:15:54 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@34523d46{/metrics/json,null,AVAILABLE,@Spark}
2019-10-06 13:15:55 INFO  MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 242.5 KB, free 366.1 MB)
2019-10-06 13:15:55 INFO  MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.4 KB, free 366.0 MB)
2019-10-06 13:15:55 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 192.168.3.33:52398 (size: 23.4 KB, free: 366.3 MB)
2019-10-06 13:15:55 INFO  SparkContext:54 - Created broadcast 0 from textFile at FileSort.scala:6
2019-10-06 13:15:56 INFO  deprecation:1173 - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
2019-10-06 13:15:56 INFO  HadoopMapRedCommitProtocol:54 - Using output committer class org.apache.hadoop.mapred.FileOutputCommitter
2019-10-06 13:15:56 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2019-10-06 13:15:56 INFO  SparkContext:54 - Starting job: runJob at SparkHadoopWriter.scala:78
2019-10-06 13:15:56 INFO  FileInputFormat:249 - Total input paths to process : 4
2019-10-06 13:15:56 WARN  DAGScheduler:87 - Creating new stage failed due to exception - job: 0
java.io.IOException: Not a file: file:/sunxj/work/spark/data/result
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:94)
	at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:87)
	at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:240)
	at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:238)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.dependencies(RDD.scala:238)
	at org.apache.spark.scheduler.DAGScheduler.getShuffleDependencies(DAGScheduler.scala:512)
	at org.apache.spark.scheduler.DAGScheduler.getMissingAncestorShuffleDependencies(DAGScheduler.scala:479)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getOrCreateShuffleMapStage(DAGScheduler.scala:346)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:462)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:461)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:46)
	at scala.collection.SetLike$class.map(SetLike.scala:92)
	at scala.collection.mutable.AbstractSet.map(Set.scala:46)
	at org.apache.spark.scheduler.DAGScheduler.getOrCreateParentStages(DAGScheduler.scala:461)
	at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:448)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:962)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2065)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
2019-10-06 13:15:56 INFO  DAGScheduler:54 - Job 0 failed: runJob at SparkHadoopWriter.scala:78, took 0.105820 s
2019-10-06 13:15:56 ERROR SparkHadoopWriter:91 - Aborting job job_20191006131556_0007.
java.io.IOException: Not a file: file:/sunxj/work/spark/data/result
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:94)
	at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:87)
	at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:240)
	at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:238)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.dependencies(RDD.scala:238)
	at org.apache.spark.scheduler.DAGScheduler.getShuffleDependencies(DAGScheduler.scala:512)
	at org.apache.spark.scheduler.DAGScheduler.getMissingAncestorShuffleDependencies(DAGScheduler.scala:479)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getOrCreateShuffleMapStage(DAGScheduler.scala:346)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:462)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:461)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:46)
	at scala.collection.SetLike$class.map(SetLike.scala:92)
	at scala.collection.mutable.AbstractSet.map(Set.scala:46)
	at org.apache.spark.scheduler.DAGScheduler.getOrCreateParentStages(DAGScheduler.scala:461)
	at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:448)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:962)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2065)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
	at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
	at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1499)
	at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
	at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1478)
	at FileSort$.main(FileSort.scala:18)
	at FileSort.main(FileSort.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Exception in thread "main" org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:100)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
	at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1499)
	at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
	at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1478)
	at FileSort$.main(FileSort.scala:18)
	at FileSort.main(FileSort.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Not a file: file:/sunxj/work/spark/data/result
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:94)
	at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:87)
	at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:240)
	at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:238)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.dependencies(RDD.scala:238)
	at org.apache.spark.scheduler.DAGScheduler.getShuffleDependencies(DAGScheduler.scala:512)
	at org.apache.spark.scheduler.DAGScheduler.getMissingAncestorShuffleDependencies(DAGScheduler.scala:479)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getOrCreateShuffleMapStage(DAGScheduler.scala:346)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:462)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:461)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:46)
	at scala.collection.SetLike$class.map(SetLike.scala:92)
	at scala.collection.mutable.AbstractSet.map(Set.scala:46)
	at org.apache.spark.scheduler.DAGScheduler.getOrCreateParentStages(DAGScheduler.scala:461)
	at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:448)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:962)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2065)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
	at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
	... 42 more
2019-10-06 13:15:56 INFO  SparkContext:54 - Invoking stop() from shutdown hook
2019-10-06 13:15:56 INFO  AbstractConnector:318 - Stopped Spark@3caa4757{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-10-06 13:15:56 INFO  SparkUI:54 - Stopped Spark web UI at http://192.168.3.33:4040
2019-10-06 13:15:56 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-10-06 13:15:56 INFO  MemoryStore:54 - MemoryStore cleared
2019-10-06 13:15:56 INFO  BlockManager:54 - BlockManager stopped
2019-10-06 13:15:56 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2019-10-06 13:15:56 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-10-06 13:15:56 INFO  SparkContext:54 - Successfully stopped SparkContext
2019-10-06 13:15:56 INFO  ShutdownHookManager:54 - Shutdown hook called
2019-10-06 13:15:56 INFO  ShutdownHookManager:54 - Deleting directory /private/var/folders/7m/ls3n9dj958g25ktsw8d9cym80000gn/T/spark-d4a7bc0f-087d-4865-998e-e45846256092
2019-10-06 13:15:56 INFO  ShutdownHookManager:54 - Deleting directory /private/var/folders/7m/ls3n9dj958g25ktsw8d9cym80000gn/T/spark-fe4530b9-164f-48fd-aaa4-c943d4a165b0

III. Secondary sort

1. First, prepare the test data file1.txt with the following contents:

5 3
1 6
4 9
8 3
4 7
5 6
3 2

2. Create a custom sort key file SecondarySortKey.scala with the following contents:

class SecondarySortKey(val first: Int, val second: Int) extends Ordered[SecondarySortKey] with Serializable {
  // Serializable is required because the keys are shuffled during the sort
  def compare(other: SecondarySortKey): Int = {
    if (this.first - other.first != 0) { // if the first fields differ, order by first
      this.first - other.first
    } else {                             // otherwise, order by second
      this.second - other.second
    }
  }
}

3. Create SecondarySortApp.scala with the following contents:

import org.apache.spark.{SparkConf, SparkContext}

object SecondarySortApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SecondarySort").setMaster("local")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("file:///sunxj/work/spark/data")
    // use a SecondarySortKey built from the two fields as the key of each key-value pair
    val pairWithSortKey = lines.map(x => (new SecondarySortKey(x.split(" ")(0).toInt, x.split(" ")(1).toInt), x))
    // sort by key; the key is a SecondarySortKey, whose compare method orders by the first field
    // and falls back to the second field when the first fields are equal
    val sorted = pairWithSortKey.sortByKey(false)
    // take the value (the original line) from each sorted key-value pair
    val sortedResult = sorted.map(sortedLine => sortedLine._2)
    // trigger the computation and print
    sortedResult.collect().foreach(println)
  }
}
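For comparison, a custom key class is not strictly required for this particular case: the following sketch (my own variant, not from the original) sorts by a (first, second) tuple with RDD.sortBy and yields the same ordering:

import org.apache.spark.{SparkConf, SparkContext}

object SecondarySortWithSortBy {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SecondarySortWithSortBy").setMaster("local")
    val sc = new SparkContext(conf)
    // variant sketch, not from the original post: same input path as above
    val lines = sc.textFile("file:///sunxj/work/spark/data")
    // sort descending by (first, second); the tuple Ordering compares the first field,
    // then falls back to the second field on ties
    val sorted = lines.filter(_.trim.nonEmpty)
      .sortBy(x => (x.split(" ")(0).toInt, x.split(" ")(1).toInt), ascending = false)
    sorted.collect().foreach(println)
    sc.stop()
  }
}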

4. Change the mainClass tag in pom.xml from FileSort to SecondarySortApp, rebuild the jar with mvn package, submit it to the Spark cluster, and inspect the output, as shown in the figure below:
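Assuming the data directory contains only the file1.txt shown above, the descending secondary sort should print:

8 3
5 6
5 3
4 9
4 7
3 2
1 6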
