1. Using Spark (spark-shell), create an RDD from a local file

scala> val textFile=sc.textFile("README.md")



16/07/07 07:41:40 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
16/07/07 07:41:40 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 24.2 KB, free 24.2 KB)
16/07/07 07:41:40 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.4 KB, free 29.6 KB)
16/07/07 07:41:40 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:58644 (size: 5.4 KB, free: 517.4 MB)
16/07/07 07:41:41 INFO spark.SparkContext: Created broadcast 0 from textFile at <console>:21
textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:21
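
Note: sc.textFile is lazy, so no data is read at this point; the file is only touched when an action (count, first, collect, ...) runs. With a Hadoop configuration on the classpath, a bare path such as "README.md" is resolved against the default filesystem (here HDFS, as the later log lines show). A minimal sketch of making the filesystem explicit (the paths below are illustrative, not taken from this session):

scala> val localRdd = sc.textFile("file:///usr/local/spark/README.md")   // force the local filesystem
scala> val hdfsRdd = sc.textFile("hdfs://192.168.147.129:9000/user/root/input")   // explicit HDFS URI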


2. Return the number of elements in the RDD:
scala> textFile.count()

16/07/07 07:49:09 INFO mapred.FileInputFormat: Total input paths to process : 1
16/07/07 07:49:19 INFO spark.SparkContext: Starting job: count at <console>:24
16/07/07 07:49:20 INFO scheduler.DAGScheduler: Got job 0 (count at <console>:24) with 1 output partitions
16/07/07 07:49:20 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (count at <console>:24)
16/07/07 07:49:20 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/07 07:49:20 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/07 07:49:20 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (input MapPartitionsRDD[5] at textFile at <console>:21), which has no missing parents
16/07/07 07:49:20 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.9 KB, free 206.4 KB)
16/07/07 07:49:20 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1759.0 B, free 208.1 KB)
16/07/07 07:49:20 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:58644 (size: 1759.0 B, free: 517.4 MB)
16/07/07 07:49:20 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
16/07/07 07:49:21 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (input MapPartitionsRDD[5] at textFile at <console>:21)
16/07/07 07:49:21 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/07/07 07:49:21 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2146 bytes)
16/07/07 07:49:21 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
16/07/07 07:49:21 INFO rdd.HadoopRDD: Input split: hdfs://192.168.147.129:9000/user/root/input:0+1366
16/07/07 07:49:24 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2082 bytes result sent to driver
16/07/07 07:49:24 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3420 ms on localhost (1/1)
16/07/07 07:49:24 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/07/07 07:49:24 INFO scheduler.DAGScheduler: ResultStage 0 (count at <console>:24) finished in 3.575 s
16/07/07 07:49:25 INFO scheduler.DAGScheduler: Job 0 finished: count at <console>:24, took 5.288920 s
res2: Long = 31    // (an RDD created by sc.textFile is a text RDD with one element per line; 31 is the number of lines in the file)

scala>
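
count() is an action, so it is what actually triggers the read shown in the log above. Other simple actions can be issued the same way; for example (a sketch, output not from this session):

scala> textFile.collect().foreach(println)   // pull all 31 lines back to the driver and print them; only sensible for small files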

3. If the input file does not exist, the action fails with the following error:
scala> textFile.count()
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://192.168.147.129:9000/user/root/input.txt
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
    at $iwC$$iwC$$iwC.<init>(<console>:33)
    at $iwC$$iwC.<init>(<console>:35)
    at $iwC.<init>(<console>:37)
    at <init>(<console>:39)
    at .<init>(<console>:43)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
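
Because textFile is lazy, this InvalidInputException only appears when the action runs, not when the RDD is defined. Before retrying, confirm the path exists (for instance with hadoop fs -ls /user/root on this cluster) and point textFile at it; a sketch, assuming the path from the earlier logs:

scala> sc.textFile("hdfs://192.168.147.129:9000/user/root/input").count()   // a path verified to exist on HDFS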


4. Get the first element of the RDD
scala> textFile.first()

16/07/07 07:58:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/07 07:58:10 WARN snappy.LoadSnappy: Snappy native library not loaded
16/07/07 07:58:10 INFO mapred.FileInputFormat: Total input paths to process : 1
16/07/07 07:58:20 INFO spark.SparkContext: Starting job: first at <console>:24
16/07/07 07:58:20 INFO scheduler.DAGScheduler: Got job 0 (first at <console>:24) with 1 output partitions
16/07/07 07:58:20 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (first at <console>:24)
16/07/07 07:58:20 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/07 07:58:20 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/07 07:58:20 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (input MapPartitionsRDD[1] at textFile at <console>:21), which has no missing parents
16/07/07 07:58:21 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 32.7 KB)
16/07/07 07:58:21 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1814.0 B, free 34.5 KB)
16/07/07 07:58:21 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:34256 (size: 1814.0 B, free: 517.4 MB)
16/07/07 07:58:21 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/07/07 07:58:21 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (input MapPartitionsRDD[1] at textFile at <console>:21)
16/07/07 07:58:21 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/07/07 07:58:21 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2146 bytes)
16/07/07 07:58:21 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
16/07/07 07:58:21 INFO rdd.HadoopRDD: Input split: hdfs://192.168.147.129:9000/user/root/input:0+1366
16/07/07 07:58:23 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2116 bytes result sent to driver
16/07/07 07:58:23 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2537 ms on localhost (1/1)
16/07/07 07:58:24 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/07/07 07:58:24 INFO scheduler.DAGScheduler: ResultStage 0 (first at <console>:24) finished in 3.203 s
16/07/07 07:58:24 INFO scheduler.DAGScheduler: Job 0 finished: first at <console>:24, took 3.902921 s
res1: String = For the latest information about Hadoop, please visit our website at:

scala>
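
first() simply returns the first element; take(n) generalizes it to the first n elements, e.g. (a sketch, output omitted):

scala> textFile.take(3)   // Array[String] holding the first 3 lines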


5. Count the lines in a document (using input as the example) that contain the keyword Spark

scala> val textFile=sc.textFile("input")

16/07/08 00:03:43 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
16/07/08 00:03:44 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 24.2 KB, free 24.2 KB)
16/07/08 00:03:44 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.4 KB, free 29.6 KB)
16/07/08 00:03:44 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:45231 (size: 5.4 KB, free: 517.4 MB)
16/07/08 00:03:44 INFO spark.SparkContext: Created broadcast 0 from textFile at <console>:21
textFile: org.apache.spark.rdd.RDD[String] = input MapPartitionsRDD[1] at textFile at <console>:21

scala> val lineswithspark=textFile.filter(line=>line.contains("Spark"))

lineswithspark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:23

scala> lineswithspark.count()

16/07/08 00:10:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/08 00:10:50 WARN snappy.LoadSnappy: Snappy native library not loaded
16/07/08 00:10:50 INFO mapred.FileInputFormat: Total input paths to process : 1
16/07/08 00:10:51 INFO spark.SparkContext: Starting job: count at <console>:26
16/07/08 00:10:51 INFO scheduler.DAGScheduler: Got job 0 (count at <console>:26) with 1 output partitions
16/07/08 00:10:51 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (count at <console>:26)
16/07/08 00:10:51 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/08 00:10:51 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/08 00:10:51 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at <console>:23), which has no missing parents
16/07/08 00:10:52 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 32.8 KB)
16/07/08 00:10:52 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1869.0 B, free 34.6 KB)
16/07/08 00:10:52 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:45231 (size: 1869.0 B, free: 517.4 MB)
16/07/08 00:10:52 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/07/08 00:10:53 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at <console>:23)
16/07/08 00:10:53 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/07/08 00:10:57 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2146 bytes)
16/07/08 00:10:57 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
16/07/08 00:10:58 INFO rdd.HadoopRDD: Input split: hdfs://192.168.147.129:9000/user/root/input:0+1366
16/07/08 00:10:59 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2082 bytes result sent to driver
16/07/08 00:10:59 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2444 ms on localhost (1/1)
16/07/08 00:11:00 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/07/08 00:11:00 INFO scheduler.DAGScheduler: ResultStage 0 (count at <console>:26) finished in 6.924 s
16/07/08 00:11:00 INFO scheduler.DAGScheduler: Job 0 finished: count at <console>:26, took 9.399461 s
res1: Long = 0

scala> textFile.filter(line=>line.contains("Spark")).count()

16/07/08 00:11:58 INFO spark.SparkContext: Starting job: count at <console>:24
16/07/08 00:11:58 INFO scheduler.DAGScheduler: Got job 2 (count at <console>:24) with 1 output partitions
16/07/08 00:11:58 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 (count at <console>:24)
16/07/08 00:11:58 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/08 00:11:58 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/08 00:11:58 INFO scheduler.DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[4] at filter at <console>:24), which has no missing parents
16/07/08 00:11:58 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.1 KB, free 42.7 KB)
16/07/08 00:11:58 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1869.0 B, free 44.5 KB)
16/07/08 00:11:58 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:45231 (size: 1869.0 B, free: 517.4 MB)
16/07/08 00:11:58 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
16/07/08 00:11:58 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[4] at filter at <console>:24)
16/07/08 00:11:58 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
16/07/08 00:11:58 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2146 bytes)
16/07/08 00:11:58 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 2)
16/07/08 00:11:58 INFO rdd.HadoopRDD: Input split: hdfs://192.168.147.129:9000/user/root/input:0+1366
16/07/08 00:11:58 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 2). 2082 bytes result sent to driver
16/07/08 00:11:58 INFO scheduler.DAGScheduler: ResultStage 2 (count at <console>:24) finished in 0.262 s
16/07/08 00:11:58 INFO scheduler.DAGScheduler: Job 2 finished: count at <console>:24, took 0.380457 s
res3: Long = 0
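
The count is 0 because this input file is a Hadoop document (its first line mentions Hadoop, see step 4) and no line contains the exact string "Spark"; also note that contains is case-sensitive. Two quick checks (a sketch, output not from this session):

scala> textFile.filter(line => line.toLowerCase.contains("spark")).count()   // case-insensitive match
scala> textFile.filter(line => line.contains("Hadoop")).count()              // a keyword that does occur in this file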



