Counting word frequencies in Spark

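This post shows how to count word frequencies in Spark and notes that the code still has room for optimization. With a specific Spark command, you can also see the names of the nodes that take part in the computation.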

I wrote a simple statement that has not been optimized yet:

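scala> sc.
     | textFile("/etc/profile").
     | flatMap((s:String)=>s.split("\\s")). // split each line on whitespace
     | map(_.toUpperCase).
     | map((s:String)=>(s, 1)).
     | filter((pair)=>pair._1.nonEmpty&&pair._1.forall((ch)=>ch>='A'&&ch<='Z')). // keep non-empty tokens made only of A-Z
     | reduceByKey(_+_). // sum the counts per word
     | sortByKey(). // order alphabetically by word
     | foreach(println)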

Note that this code can be written more concisely with Scala's underscore placeholder syntax:

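scala> sc.
     | textFile("/etc/profile").
     | flatMap(_.split("\\s")). // split each line on whitespace
     | map(_.toUpperCase).
     | map((_, 1)).
     | filter(p=>p._1.nonEmpty&&p._1.forall(ch=>ch>='A'&&ch<='Z')). // keep non-empty tokens made only of A-Z
     | reduceByKey(_+_). // sum the counts per word
     | sortByKey(). // order alphabetically by word
     | foreach(println)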

The output begins as follows:

15/03/06 08:50:44 INFO MemoryStore: ensureFreeSpace(75904) called with curMem=259812, maxMem=277842493
15/03/06 08:50:44 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 74.1 KB, free 264.7 MB)
15/03/06 08:50:44 INFO FileInputFormat: Total input paths to process : 1
15/03/06 08:50:44 INFO SparkContext: Starting job: sortByKey at <console>:20
15/03/06 08:50:44 INFO DAGScheduler: Registering RDD 25 (filter at <console>:18)
15/03/06 08:50:44 INFO DAGScheduler: Got job 4 (sortByKey at <console>:20) with 2 output partitions (allowLocal=false)
15/03/06 08:50:44 INFO DAGScheduler: Final stage: Stage 10(sortByKey at <console>:20)
15/03/06 08:50:44 INFO DAGScheduler: Parents of final stage: List(Stage 11)
15/03/06 08:50:44 INFO DAGScheduler: Missing parents: List(Stage 11)
15/03/06 08:50:44 INFO DAGScheduler: Submitting Stage 11 (FilteredRDD[25] at filter at <console>:18), which has no missing parents
15/03/06 08:50:44 INFO MemoryStore: ensureFreeSpace(3736) called with curMem=335716, maxMem=277842493
15/03/06 08:50:44 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 3.6 KB, free 264.6 MB)
15/03/06 08:50:44 INFO DAGScheduler: Submitting 2 missing tasks from Stage 11 (FilteredRDD[25] at filter at <console>:18)
15/03/06 08:50:44 INFO TaskSchedulerImpl: Adding task set 11.0 with 2 tasks
15/03/06 08:50:44 INFO TaskSetManager: Starting task 0.0 in stage 11.0 (TID 16, localhost, PROCESS_LOCAL, 1162 bytes)
15/03/06 08:50:44 INFO TaskSetManager: Starting task 1.0 in stage 11.0 (TID 17, localhost, PROCESS_LOCAL, 1162 bytes)
15/03/06 08:50:44 INFO Executor: Running task 1.0 in stage 11.0 (TID 17)
15/03/06 08:50:44 INFO Executor: Running task 0.0 in stage 11.0 (TID 16)
15/03/06 08:50:44 INFO HadoopRDD: Input split: file:/etc/profile:1189+1189
15/03/06 08:50:44 INFO HadoopRDD: Input split: file:/etc/profile:0+
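
To see what each stage of the pipeline above does without starting a cluster, the same steps can be tried on an ordinary Scala collection. This is only a rough sketch: the object name `WordCountSketch`, the helper `countWords`, and the sample lines are made up for illustration and do not come from the original post. `groupBy` and `sortBy` stand in for Spark's `reduceByKey` and `sortByKey`, and the filter assumes that the intended behaviour is to keep non-empty tokens consisting only of the letters A through Z, inclusive.

```scala
object WordCountSketch {
  // Hypothetical helper: mirrors the Spark pipeline on an in-memory Seq
  // instead of an RDD, so it runs in any Scala REPL.
  def countWords(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(_.split("\\s"))   // split each line on single whitespace characters
      .map(_.toUpperCase)        // normalise case so "path" and "PATH" match
      .filter(w => w.nonEmpty && w.forall(ch => ch >= 'A' && ch <= 'Z')) // letters only
      .groupBy(identity)         // local stand-in for reduceByKey(_ + _)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toSeq
      .sortBy(_._1)              // local stand-in for sortByKey()

  def main(args: Array[String]): Unit = {
    // Made-up sample lines; the double space produces an empty token
    // and "#" is a non-alphabetic token, both of which are filtered out.
    val sample = Seq("export PATH", "if then  fi", "export path # done")
    countWords(sample).foreach(println)
  }
}
```

On a real RDD, `foreach(println)` runs on the executors, so when the goal is to read the sorted result in the shell it is usually safer to bring it to the driver first, for example with `collect().foreach(println)`.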