A fault-tolerance behavior of Spark Streaming on HDFS
A wordcount job runs continuously against a watched directory on HDFS. If a file is still being copied into HDFS — not yet fully written — when Spark's directory scan runs, Spark treats it as a newly arrived file and tries to run wordcount on it. Because the file is still mid-write, the read fails with an error.
src938.jpeg._COPYING_ is exactly such a file: it was still being written to HDFS when Spark first tried to read it, so the read failed and an error was logged. After the current batch finished, the next scan detected src938.jpeg again — by then fully copied — and it was processed normally in the next batch.
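One common way to avoid picking up half-written files is to use the `fileStream` variant of `StreamingContext` that accepts a path filter, and skip anything still carrying the `._COPYING_` suffix that `hdfs dfs -put` uses while copying. This is only a sketch against the Spark Streaming API of that era; the application name, directory, and batch interval are placeholders taken from the log:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FilteredWordCount {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("wordcount"), Seconds(2))

    // Skip files that "hdfs dfs -put" is still writing (they end in ._COPYING_).
    def notCopying(path: Path): Boolean = !path.getName.endsWith("._COPYING_")

    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "hdfs://192.168.178.181:9000/123", notCopying _, newFilesOnly = true)

    lines.map(_._2.toString)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With the filter in place, the scan that produced the error below would simply not list src938.jpeg._COPYING_, and the completed src938.jpeg would still be picked up on a later scan.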
The full run log is shown below. The lines listing new files are the directory-scan output (highlighted green in the original post); the file that triggers the error (red in the original) is the one ending in ._COPYING_.
14/04/19 14:10:26 INFO FileInputDStream: Finding new files took 32 ms
14/04/19 14:10:26 INFO FileInputDStream: New files at time 1397887826000 ms:
hdfs://192.168.178.181:9000/123/src923.jpeg
hdfs://192.168.178.181:9000/123/src924.jpeg
hdfs://192.168.178.181:9000/123/src925.jpeg
hdfs://192.168.178.181:9000/123/src926.jpeg
hdfs://192.168.178.181:9000/123/src927.jpeg
hdfs://192.168.178.181:9000/123/src928.jpeg
hdfs://192.168.178.181:9000/123/src929.jpeg
hdfs://192.168.178.181:9000/123/src930.jpeg
hdfs://192.168.178.181:9000/123/src931.jpeg
hdfs://192.168.178.181:9000/123/src932.jpeg
hdfs://192.168.178.181:9000/123/src933.jpeg
hdfs://192.168.178.181:9000/123/src934.jpeg
hdfs://192.168.178.181:9000/123/src935.jpeg
hdfs://192.168.178.181:9000/123/src936.jpeg
hdfs://192.168.178.181:9000/123/src937.jpeg
hdfs://192.168.178.181:9000/123/src938.jpeg._COPYING_
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=138682568, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_780 stored as values to memory (estimated size 173.6 KB, free 140.2 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=138860366, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_781 stored as values to memory (estimated size 173.6 KB, free 140.0 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=139038164, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_782 stored as values to memory (estimated size 173.6 KB, free 139.9 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=139215962, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_783 stored as values to memory (estimated size 173.6 KB, free 139.7 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=139393760, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_784 stored as values to memory (estimated size 173.6 KB, free 139.5 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=139571558, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_785 stored as values to memory (estimated size 173.6 KB, free 139.3 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=139749356, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_786 stored as values to memory (estimated size 173.6 KB, free 139.2 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=139927154, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_787 stored as values to memory (estimated size 173.6 KB, free 139.0 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=140104952, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_788 stored as values to memory (estimated size 173.6 KB, free 138.8 MB)
14/04/19 14:10:26 INFO TaskSetManager: Finished TID 841 in 718 ms on slave1 (progress: 9/10)
14/04/19 14:10:26 INFO DAGScheduler: Completed ShuffleMapTask(199, 1)
14/04/19 14:10:26 INFO DAGScheduler: Stage 199 (combineByKey at ShuffledDStream.scala:42) finished in 0.718 s
14/04/19 14:10:26 INFO DAGScheduler: looking for newly runnable stages
14/04/19 14:10:26 INFO DAGScheduler: running: Set()
14/04/19 14:10:26 INFO DAGScheduler: waiting: Set(Stage 198)
14/04/19 14:10:26 INFO DAGScheduler: failed: Set()
14/04/19 14:10:26 INFO DAGScheduler: Missing parents for Stage 198: List()
14/04/19 14:10:26 INFO DAGScheduler: Submitting Stage 198 (MapPartitionsRDD[1145] at combineByKey at ShuffledDStream.scala:42), which is now runnable
14/04/19 14:10:26 INFO TaskSchedulerImpl: Remove TaskSet 199.0 from pool
14/04/19 14:10:26 INFO DAGScheduler: Submitting 1 missing tasks from Stage 198 (MapPartitionsRDD[1145] at combineByKey at ShuffledDStream.scala:42)
14/04/19 14:10:26 INFO TaskSchedulerImpl: Adding task set 198.0 with 1 tasks
14/04/19 14:10:26 INFO TaskSetManager: Starting task 198.0:0 as TID 850 on executor 2: slave3 (PROCESS_LOCAL)
14/04/19 14:10:26 INFO TaskSetManager: Serialized task 198.0:0 as 2099 bytes in 0 ms
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=140282750, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_789 stored as values to memory (estimated size 173.6 KB, free 138.7 MB)
14/04/19 14:10:26 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 45 to spark@slave3:45214
14/04/19 14:10:26 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 45 is 172 bytes
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=140460548, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_790 stored as values to memory (estimated size 173.6 KB, free 138.5 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=140638346, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_791 stored as values to memory (estimated size 173.6 KB, free 138.3 MB)
14/04/19 14:10:26 INFO TaskSetManager: Finished TID 850 in 47 ms on slave3 (progress: 0/1)
14/04/19 14:10:26 INFO TaskSchedulerImpl: Remove TaskSet 198.0 from pool
14/04/19 14:10:26 INFO DAGScheduler: Completed ResultTask(198, 0)
14/04/19 14:10:26 INFO DAGScheduler: Stage 198 (take at DStream.scala:586) finished in 0.046 s
14/04/19 14:10:26 INFO SparkContext: Job finished: take at DStream.scala:586, took 0.78418979 s
-------------------------------------------
Time: 1397887824000 ms
-------------------------------------------
(,622)
... (remaining output omitted: the "words" counted here are raw JPEG bytes, so the (key, count) pairs are binary garbage) ...
14/04/19 14:10:26 INFO JobScheduler: Finished job streaming job 1397887824000 ms.0 from job set of time 1397887824000 ms
14/04/19 14:10:26 INFO JobScheduler: Starting job streaming job 1397887824000 ms.1 from job set of time 1397887824000 ms
14/04/19 14:10:26 INFO SparkContext: Starting job: saveAsTextFile at DStream.scala:762
14/04/19 14:10:26 INFO DAGScheduler: Got job 100 (saveAsTextFile at DStream.scala:762) with 2 output partitions (allowLocal=false)
14/04/19 14:10:26 INFO DAGScheduler: Final stage: Stage 200 (saveAsTextFile at DStream.scala:762)
14/04/19 14:10:26 INFO DAGScheduler: Parents of final stage: List(Stage 201)
14/04/19 14:10:26 INFO DAGScheduler: Missing parents: List()
14/04/19 14:10:26 INFO DAGScheduler: Submitting Stage 200 (MappedRDD[1159] at saveAsTextFile at DStream.scala:762), which has no missing parents
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=140816144, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_792 stored as values to memory (estimated size 173.6 KB, free 138.2 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=140993942, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_793 stored as values to memory (estimated size 173.6 KB, free 138.0 MB)
14/04/19 14:10:26 INFO DAGScheduler: Submitting 2 missing tasks from Stage 200 (MappedRDD[1159] at saveAsTextFile at DStream.scala:762)
14/04/19 14:10:26 INFO TaskSchedulerImpl: Adding task set 200.0 with 2 tasks
14/04/19 14:10:26 INFO TaskSetManager: Starting task 200.0:0 as TID 851 on executor 2: slave3 (PROCESS_LOCAL)
14/04/19 14:10:26 INFO TaskSetManager: Serialized task 200.0:0 as 11653 bytes in 0 ms
14/04/19 14:10:26 INFO TaskSetManager: Starting task 200.0:1 as TID 852 on executor 1: slave1 (PROCESS_LOCAL)
14/04/19 14:10:26 INFO TaskSetManager: Serialized task 200.0:1 as 11653 bytes in 0 ms
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177798) called with curMem=141171740, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_794 stored as values to memory (estimated size 173.6 KB, free 137.8 MB)
14/04/19 14:10:26 INFO MemoryStore: ensureFreeSpace(177822) called with curMem=141349538, maxMem=285868032
14/04/19 14:10:26 INFO MemoryStore: Block broadcast_795 stored as values to memory (estimated size 173.7 KB, free 137.7 MB)
14/04/19 14:10:26 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 45 to spark@slave1:38185
14/04/19 14:10:26 INFO FileInputFormat: Total input paths to process : 1
(the line above is repeated 15 times, once per input file)
14/04/19 14:10:26 ERROR JobScheduler: Error generating jobs for time 1397887826000 ms
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://192.168.178.181:9000/123/src938.jpeg._COPYING_
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:75)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
    at org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
    at org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
    at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:292)
    at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:292)
    at org.apache.spark.streaming.dstream.FlatMappedDStream.compute(FlatMappedDStream.scala:35)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:292)
    at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:292)
    at org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:292)
    at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
    at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:115)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:160)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:160)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:160)
    at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:104)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:69)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/04/19 14:10:26 INFO TaskSetManager: Finished TID 851 in 513 ms on slave3 (progress: 0/2)
14/04/19 14:10:26 INFO DAGScheduler: Completed ResultTask(200, 0)
14/04/19 14:10:27 INFO TaskSetManager: Finished TID 852 in 577 ms on slave1 (progress: 1/2)
14/04/19 14:10:27 INFO TaskSchedulerImpl: Remove TaskSet 200.0 from pool
14/04/19 14:10:27 INFO DAGScheduler: Completed ResultTask(200, 1)
14/04/19 14:10:27 INFO DAGScheduler: Stage 200 (saveAsTextFile at DStream.scala:762) finished in 0.573 s
14/04/19 14:10:27 INFO SparkContext: Job finished: saveAsTextFile at DStream.scala:762, took 0.678888891 s
14/04/19 14:10:27 INFO JobScheduler: Finished job streaming job 1397887824000 ms.1 from job set of time 1397887824000 ms
14/04/19 14:10:27 INFO JobScheduler: Total delay: 3.070 s for time 1397887824000 ms (execution: 1.514 s)
14/04/19 14:10:27 INFO FileInputDStream: Cleared 2 old files that were older than 1397887822000 ms: 1397887818000 ms, 1397887820000 ms
14/04/19 14:10:27 INFO FileInputDStream: Cleared 0 old files that were older than 1397887822000 ms:
14/04/19 14:10:28 INFO FileInputDStream: Finding new files took 13 ms
14/04/19 14:10:28 INFO FileInputDStream: New files at time 1397887828000 ms:
hdfs://192.168.178.181:9000/123/src938.jpeg
hdfs://192.168.178.181:9000/123/src939.jpeg
hdfs://192.168.178.181:9000/123/src940.jpeg
hdfs://192.168.178.181:9000/123/src941.jpeg
hdfs://192.168.178.181:9000/123/src942.jpeg
hdfs://192.168.178.181:9000/123/src943.jpeg
hdfs://192.168.178.181:9000/123/src944.jpeg
hdfs://192.168.178.181:9000/123/src945.jpeg
hdfs://192.168.178.181:9000/123/src946.jpeg
hdfs://192.168.178.181:9000/123/src947.jpeg
hdfs://192.168.178.181:9000/123/src948.jpeg
hdfs://192.168.178.181:9000/123/src949.jpeg
hdfs://192.168.178.181:9000/123/src950.jpeg
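As the last scan shows, the completed src938.jpeg is picked up cleanly two seconds later. Another way to avoid the error entirely (not shown in the log; a sketch of a common convention) is to never write directly into the watched directory: upload into a temporary directory on the same HDFS filesystem, then rename into place. HDFS rename is atomic, so the scanner only ever sees complete files. The helper below is hypothetical; the paths are placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: make a file visible to the watched directory atomically.
object AtomicDrop {
  def dropIntoWatchedDir(localFile: String, tmpDir: String, watchedDir: String): Unit = {
    val fs = FileSystem.get(new Configuration())
    val name = new Path(localFile).getName
    val tmp = new Path(tmpDir, name)
    // The slow copy happens outside the watched directory...
    fs.copyFromLocalFile(new Path(localFile), tmp)
    // ...and the atomic rename makes it appear fully formed.
    fs.rename(tmp, new Path(watchedDir, name))
  }
}
```

The temporary directory should live on the same HDFS instance as the watched one, since rename is only a metadata operation (and only atomic) within a single filesystem.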