A fault-tolerance behavior of Spark Streaming on HDFS

The job runs a wordcount continuously over a specified directory on HDFS. If a file is in the middle of being copied into that directory and has not finished copying, Spark Streaming can still detect it and treat it as a newly arrived file; it then runs the wordcount against a file that is still being written, and this causes an error.
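
For reference, here is a minimal sketch of the kind of streaming wordcount described above. The original source is not shown in the post; the 2-second batch interval and the input directory are taken from the log below, while the app name and output prefix are invented for the example:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HdfsWordCount") // app name assumed
    // 2-second batches, matching the batch timestamps in the log below
    val ssc = new StreamingContext(conf, Seconds(2))

    // Every batch, FileInputDStream lists the watched directory and builds
    // an RDD from each file whose modification time marks it as new
    val lines = ssc.textFileStream("hdfs://192.168.178.181:9000/123")

    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()                                               // the "take at DStream.scala" job in the log
    counts.saveAsTextFiles("hdfs://192.168.178.181:9000/wc-out") // output prefix assumed

    ssc.start()
    ssc.awaitTermination()
  }
}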

src938.jpeg._COPYING_ is exactly such a file: it was still being written into HDFS when the batch was generated. (hadoop fs -put first writes to a temporary file carrying the ._COPYING_ suffix and renames it once the copy completes, so by the time Spark tries to read the recorded path it no longer exists.) The first read therefore failed and an error was reported. After finishing the current batch, Spark detected the now fully copied src938.jpeg again and processed it in the next batch.
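
One common way to avoid the race entirely, sketched below as an assumption rather than what the post's job actually did: use the lower-level fileStream API, which accepts a path filter, and reject names ending in ._COPYING_ so a half-written file is never registered as new (ssc is the StreamingContext from the sketch above):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Reject HDFS copy-in-progress temp files; they will be picked up
// under their final name once the rename completes
def notCopying(path: Path): Boolean = !path.getName.endsWith("._COPYING_")

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs://192.168.178.181:9000/123",
    notCopying _,
    newFilesOnly = true
  ).map { case (_, text) => text.toString }

The alternative the post itself demonstrates is simply to rely on the next batch: the failed ._COPYING_ path reappears as the finished src938.jpeg two seconds later and is processed normally.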

The run log below shows the details.

In the original post the lines announcing newly discovered files were highlighted in green and the failing file in red; in the plain-text log below, watch the new-file listings and the ._COPYING_ path among them.

14/04/19 14:10:26 INFO FileInputDStream: Finding new files took 32 ms
14/04/19 14:10:26 INFO FileInputDStream: New files at time 1397887826000 ms:
hdfs://192.168.178.181:9000/123/src923.jpeg
hdfs://192.168.178.181:9000/123/src924.jpeg
hdfs://192.168.178.181:9000/123/src925.jpeg
hdfs://192.168.178.181:9000/123/src926.jpeg
hdfs://192.168.178.181:9000/123/src927.jpeg
hdfs://192.168.178.181:9000/123/src928.jpeg
hdfs://192.168.178.181:9000/123/src929.jpeg
hdfs://192.168.178.181:9000/123/src930.jpeg
hdfs://192.168.178.181:9000/123/src931.jpeg
hdfs://192.168.178.181:9000/123/src932.jpeg
hdfs://192.168.178.181:9000/123/src933.jpeg
hdfs://192.168.178.181:9000/123/src934.jpeg
hdfs://192.168.178.181:9000/123/src935.jpeg
hdfs://192.168.178.181:9000/123/src936.jpeg
hdfs://192.168.178.181:9000/123/src937.jpeg
hdfs://192.168.178.181:9000/123/src938.jpeg._COPYING_

14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=138682568, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_780 stored as values to memory (estimated size 173.6 KB, free 140.2 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=138860366, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_781 stored as values to memory (estimated size 173.6 KB, free 140.0 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=139038164, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_782 stored as values to memory (estimated size 173.6 KB, free 139.9 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=139215962, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_783 stored as values to memory (estimated size 173.6 KB, free 139.7 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=139393760, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_784 stored as values to memory (estimated size 173.6 KB, free 139.5 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=139571558, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_785 stored as values to memory (estimated size 173.6 KB, free 139.3 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=139749356, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_786 stored as values to memory (estimated size 173.6 KB, free 139.2 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=139927154, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_787 stored as values to memory (estimated size 173.6 KB, free 139.0 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=140104952, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_788 stored as values to memory (estimated size 173.6 KB, free 138.8 MB)
14/04/19 14:10:26 INFO TaskSetManager: Finished TID 841 in 718 ms on slave1 (progress: 9/10)
14/04/19 14:10:26 INFO DAGScheduler: Completed ShuffleMapTask(199, 1)
14/04/19 14:10:26 INFO DAGScheduler: Stage 199 (combineByKey at ShuffledDStream.scala:42) finished in 0.718 s
14/04/19 14:10:26 INFO DAGScheduler: looking for newly runnable stages
14/04/19 14:10:26 INFO DAGScheduler: running: Set()
14/04/19 14:10:26 INFO DAGScheduler: waiting: Set(Stage 198)
14/04/19 14:10:26 INFO DAGScheduler: failed: Set()
14/04/19 14:10:26 INFO DAGScheduler: Missing parents for Stage 198: List()
14/04/19 14:10:26 INFO DAGScheduler: Submitting Stage 198 (MapPartitionsRDD[1145] at combineByKey at ShuffledDStream.scala:42), which is now runnable
14/04/19 14:10:26 INFO TaskSchedulerImpl: Remove TaskSet 199.0 from pool
14/04/19 14:10:26 INFO DAGScheduler: Submitting 1 missing tasks from Stage 198 (MapPartitionsRDD[1145] at combineByKey at ShuffledDStream.scala:42)
14/04/19 14:10:26 INFO TaskSchedulerImpl: Adding task set 198.0 with 1 tasks
14/04/19 14:10:26 INFO TaskSetManager: Starting task 198.0:0 as TID 850 on executor 2: slave3 (PROCESS_LOCAL)
14/04/19 14:10:26 INFO TaskSetManager: Serialized task 198.0:0 as 2099 bytes in 0 ms
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=140282750, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_789 stored as values to memory (estimated size 173.6 KB, free 138.7 MB)
14/04/19 14:10:26 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 45 to spark@slave3:45214
14/04/19 14:10:26 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 45 is 172 bytes
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=140460548, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_790 stored as values to memory (estimated size 173.6 KB, free 138.5 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=140638346, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_791 stored as values to memory (estimated size 173.6 KB, free 138.3 MB)
14/04/19 14:10:26 INFO TaskSetManager: Finished TID 850 in 47 ms on slave3 (progress: 0/1)
14/04/19 14:10:26 INFO TaskSchedulerImpl: Remove TaskSet 198.0 from pool
14/04/19 14:10:26 INFO DAGScheduler: Completed ResultTask(198, 0)
14/04/19 14:10:26 INFO DAGScheduler: Stage 198 (take at DStream.scala:586) finished in 0.046 s
14/04/19 14:10:26 INFO SparkContext: Job finished: take at DStream.scala:586, took 0.78418979 s

-------------------------------------------
Time: 1397887824000 ms
-------------------------------------------
(,622)
[the remaining (word, count) pairs are raw JPEG bytes and print as unreadable binary; elided]
...

14/04/19 14:10:26 INFO JobScheduler: Finished job streaming job 1397887824000 ms.0 from job set of time 1397887824000 ms
14/04/19 14:10:26 INFO JobScheduler: Starting job streaming job 1397887824000 ms.1 from job set of time 1397887824000 ms
14/04/19 14:10:26 INFO SparkContext: Starting job: saveAsTextFile at DStream.scala:762
14/04/19 14:10:26 INFO DAGScheduler: Got job 100 (saveAsTextFile at DStream.scala:762) with 2 output partitions (allowLocal=false)
14/04/19 14:10:26 INFO DAGScheduler: Final stage: Stage 200 (saveAsTextFile at DStream.scala:762)
14/04/19 14:10:26 INFO DAGScheduler: Parents of final stage: List(Stage 201)
14/04/19 14:10:26 INFO DAGScheduler: Missing parents: List()
14/04/19 14:10:26 INFO DAGScheduler: Submitting Stage 200 (MappedRDD[1159] at saveAsTextFile at DStream.scala:762), which has no missing parents
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=140816144, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_792 stored as values to memory (estimated size 173.6 KB, free 138.2 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=140993942, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_793 stored as values to memory (estimated size 173.6 KB, free 138.0 MB)
14/04/19 14:10:26 INFO DAGScheduler: Submitting 2 missing tasks from Stage 200 (MappedRDD[1159] at saveAsTextFile at DStream.scala:762)
14/04/19 14:10:26 INFO TaskSchedulerImpl: Adding task set 200.0 with 2 tasks
14/04/19 14:10:26 INFO TaskSetManager: Starting task 200.0:0 as TID 851 on executor 2: slave3 (PROCESS_LOCAL)
14/04/19 14:10:26 INFO TaskSetManager: Serialized task 200.0:0 as 11653 bytes in 0 ms
14/04/19 14:10:26 INFO TaskSetManager: Starting task 200.0:1 as TID 852 on executor 1: slave1 (PROCESS_LOCAL)
14/04/19 14:10:26 INFO TaskSetManager: Serialized task 200.0:1 as 11653 bytes in 0 ms
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=141171740, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_794 stored as values to memory (estimated size 173.6 KB, free 137.8 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177822) called with curMem=141349538, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_795 stored as values to memory (estimated size 173.7 KB, free 137.7 MB)
14/04/19 14:10:26 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 45 to spark@slave1:38185
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
14/04/19 14:10:26 ERROR JobScheduler: Error generating jobs for time 1397887826000 ms

org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://192.168.178.181:9000/123/src938.jpeg._COPYING_
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:75)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
  at org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
  at org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
  at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:292)
  at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:292)
  at org.apache.spark.streaming.dstream.FlatMappedDStream.compute(FlatMappedDStream.scala:35)
  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:292)
  at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:292)
  at org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:292)
  at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
  at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
  at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
  at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:115)
  at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:160)
  at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:160)
  at scala.util.Try$.apply(Try.scala:161)
  at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:160)
  at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:104)
  at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:69)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

14/04/19 14:10:26 INFO TaskSetManager: Finished TID 851 in 513 ms on slave3 (progress: 0/2)
14/04/19 14:10:26 INFO DAGScheduler: Completed ResultTask(200, 0)
14/04/19 14:10:27 INFO TaskSetManager: Finished TID 852 in 577 ms on slave1 (progress: 1/2)
14/04/19 14:10:27 INFO TaskSchedulerImpl: Remove TaskSet 200.0 from pool
14/04/19 14:10:27 INFO DAGScheduler: Completed ResultTask(200, 1)
14/04/19 14:10:27 INFO DAGScheduler: Stage 200 (saveAsTextFile at DStream.scala:762) finished in 0.573 s
14/04/19 14:10:27 INFO SparkContext: Job finished: saveAsTextFile at DStream.scala:762, took 0.678888891 s
14/04/19 14:10:27 INFO JobScheduler: Finished job streaming job 1397887824000 ms.1 from job set of time 1397887824000 ms
14/04/19 14:10:27 INFO JobScheduler: Total delay: 3.070 s for time 1397887824000 ms (execution: 1.514 s)
14/04/19 14:10:27 INFO FileInputDStream: Cleared 2 old files that were older than 1397887822000 ms: 1397887818000 ms, 1397887820000 ms
14/04/19 14:10:27 INFO FileInputDStream: Cleared 0 old files that were older than 1397887822000 ms:
14/04/19 14:10:28 INFO FileInputDStream: Finding new files took 13 ms
14/04/19 14:10:28 INFO FileInputDStream: New files at time 1397887828000 ms:
hdfs://192.168.178.181:9000/123/src938.jpeg
hdfs://192.168.178.181:9000/123/src939.jpeg
hdfs://192.168.178.181:9000/123/src940.jpeg
hdfs://192.168.178.181:9000/123/src941.jpeg
hdfs://192.168.178.181:9000/123/src942.jpeg
hdfs://192.168.178.181:9000/123/src943.jpeg
hdfs://192.168.178.181:9000/123/src944.jpeg
hdfs://192.168.178.181:9000/123/src945.jpeg
hdfs://192.168.178.181:9000/123/src946.jpeg
hdfs://192.168.178.181:9000/123/src947.jpeg
hdfs://192.168.178.181:9000/123/src948.jpeg
hdfs://192.168.178.181:9000/123/src949.jpeg
hdfs://192.168.178.181:9000/123/src950.jpeg
