I wrote a simple word-count statement that has not been tidied up yet:
scala> sc.
| textFile("/etc/profile").
| flatMap((s:String)=>s.split("\\s")).
| map(_.toUpperCase).
| map((s:String)=>(s, 1)).
| filter((pair)=>pair._1.forall((ch)=>ch>='A'&&ch<='Z')).
| reduceByKey(_+_).
| sortByKey().
| foreach(println)
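The bounds in the character check decide which words survive the filter. With strict comparisons (`>` and `<`) the letters 'A' and 'Z' themselves are rejected, while inclusive comparisons (`>=` and `<=`) accept every ASCII uppercase letter. The following is a minimal sketch in plain Scala, without Spark; the object and method names are made up for illustration:

```scala
// Hypothetical helper object, only here to compare the two predicates.
object UpperCheck {
  // Strict bounds: a word containing 'A' or 'Z' fails the check.
  def exclusive(s: String): Boolean = s.forall(ch => ch > 'A' && ch < 'Z')

  // Inclusive bounds: every ASCII uppercase letter passes.
  def inclusive(s: String): Boolean = s.forall(ch => ch >= 'A' && ch <= 'Z')

  def main(args: Array[String]): Unit = {
    println(exclusive("PATH")) // false, because 'A' is not greater than 'A'
    println(inclusive("PATH")) // true
    println(inclusive("$PATH")) // false, '$' is not an uppercase letter
  }
}
```

Assuming the intent is to keep only words made entirely of uppercase letters, the inclusive form is the one that matches it.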
Note that the Spark statement can be shortened further with Scala's placeholder syntax:
scala> sc.
| textFile("/etc/profile").
| flatMap(_.split("\\s")).
| map(_.toUpperCase).
| map((_, 1)).
| filter(_._1.forall((ch)=>ch>='A'&&ch<='Z')).
| reduceByKey(_+_).
| sortByKey().
| foreach(println)
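To experiment with the pipeline without a Spark shell, the same chain of steps can be sketched on an ordinary Scala collection. This is only an approximation under stated assumptions: `groupBy` plus a sum stands in for `reduceByKey`, `sortBy` stands in for `sortByKey`, the character check uses inclusive bounds, and the sample lines are invented rather than taken from /etc/profile.

```scala
// Plain-collections sketch of the word-count pipeline (no Spark required).
// WordCountSketch and its sample input are hypothetical.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(_.split("\\s"))                           // split on single whitespace characters
      .map(_.toUpperCase)
      .map((_, 1))                                       // pair each word with a count of 1
      .filter(_._1.forall(ch => ch >= 'A' && ch <= 'Z')) // keep all-uppercase-letter words
      .groupBy(_._1)                                     // stand-in for reduceByKey
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
      .toSeq
      .sortBy(_._1)                                      // stand-in for sortByKey

  def main(args: Array[String]): Unit =
    wordCount(Seq("export PATH", "if then fi", "export path")).foreach(println)
}
```

One edge case is worth knowing: splitting on `"\\s"` yields empty strings wherever two whitespace characters are adjacent, and `forall` is true for an empty string, so an empty key can end up in the counts. Splitting on `"\\s+"` and dropping empty tokens would avoid that.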
Running the statement in the Spark shell produces the following log output:
15/03/06 08:50:44 INFO MemoryStore: ensureFreeSpace(75904) called with curMem=259812, maxMem=277842493
15/03/06 08:50:44 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 74.1 KB, free 264.7 MB)
15/03/06 08:50:44 INFO FileInputFormat: Total input paths to process : 1
15/03/06 08:50:44 INFO SparkContext: Starting job: sortByKey at <console>:20
15/03/06 08:50:44 INFO DAGScheduler: Registering RDD 25 (filter at <console>:18)
15/03/06 08:50:44 INFO DAGScheduler: Got job 4 (sortByKey at <console>:20) with 2 output partitions (allowLocal=false)
15/03/06 08:50:44 INFO DAGScheduler: Final stage: Stage 10(sortByKey at <console>:20)
15/03/06 08:50:44 INFO DAGScheduler: Parents of final stage: List(Stage 11)
15/03/06 08:50:44 INFO DAGScheduler: Missing parents: List(Stage 11)
15/03/06 08:50:44 INFO DAGScheduler: Submitting Stage 11 (FilteredRDD[25] at filter at <console>:18), which has no missing parents
15/03/06 08:50:44 INFO MemoryStore: ensureFreeSpace(3736) called with curMem=335716, maxMem=277842493
15/03/06 08:50:44 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 3.6 KB, free 264.6 MB)
15/03/06 08:50:44 INFO DAGScheduler: Submitting 2 missing tasks from Stage 11 (FilteredRDD[25] at filter at <console>:18)
15/03/06 08:50:44 INFO TaskSchedulerImpl: Adding task set 11.0 with 2 tasks
15/03/06 08:50:44 INFO TaskSetManager: Starting task 0.0 in stage 11.0 (TID 16, localhost, PROCESS_LOCAL, 1162 bytes)
15/03/06 08:50:44 INFO TaskSetManager: Starting task 1.0 in stage 11.0 (TID 17, localhost, PROCESS_LOCAL, 1162 bytes)
15/03/06 08:50:44 INFO Executor: Running task 1.0 in stage 11.0 (TID 17)
15/03/06 08:50:44 INFO Executor: Running task 0.0 in stage 11.0 (TID 16)
15/03/06 08:50:44 INFO HadoopRDD: Input split: file:/etc/profile:1189+1189
15/03/06 08:50:44 INFO HadoopRDD: Input split: file:/etc/profile:0+