Hadoop word count job fails, unresolved

While running a word count job with Hadoop in local mode, the job fails with a series of issues: Hadoop command-line option parsing is not performed, no job.jar file is set, several warnings report ignored attempts to override final parameters, and the run finally dies with an error in the shuffle phase. The failure stems from the map output file not being found, which aborts the shuffle fetch: the stack trace shows that during the local fetch, the file D:/tmp/hadoop-qw%20song/mapred/local/localRunner/qw%20song/jobcache/job_local1813302185_0001/attempt_local1813302185_0001_m_000000_0/output/file.out.index is missing.

```
17/08/21 19:57:34 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
17/08/21 19:57:34 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
17/08/21 19:57:34 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/08/21 19:57:34 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
17/08/21 19:57:34 INFO input.FileInputFormat: Total input paths to process : 1
17/08/21 19:57:34 INFO mapreduce.JobSubmitter: number of splits:1
17/08/21 19:57:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1813302185_0001
17/08/21 19:57:34 WARN conf.Configuration: file:/tmp/hadoop-qw song/mapred/staging/qw song1813302185/.staging/job_local1813302185_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
17/08/21 19:57:34 WARN conf.Configuration: file:/tmp/hadoop-qw song/mapred/staging/qw song1813302185/.staging/job_local1813302185_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
17/08/21 19:57:34 WARN conf.Configuration: file:/tmp/hadoop-qw song/mapred/local/localRunner/qw song/job_local1813302185_0001/job_local1813302185_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
17/08/21 19:57:34 WARN conf.Configuration: file:/tmp/hadoop-qw song/mapred/local/localRunner/qw song/job_local1813302185_0001/job_local1813302185_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
17/08/21 19:57:34 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
17/08/21 19:57:34 INFO mapreduce.Job: Running job: job_local1813302185_0001
17/08/21 19:57:34 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/08/21 19:57:34 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
17/08/21 19:57:34 INFO mapred.LocalJobRunner: Waiting for map tasks
17/08/21 19:57:34 INFO mapred.LocalJobRunner: Starting task: attempt_local1813302185_0001_m_000000_0
17/08/21 19:57:34 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
17/08/21 19:57:35 INFO mapred.Task: Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@78c45142
17/08/21 19:57:35 INFO mapred.MapTask: Processing split: hdfs://linux:8020/input/wordcount.txt:0+37
17/08/21 19:57:35 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
17/08/21 19:57:35 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)  
17/08/21 19:57:35 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100  
17/08/21 19:57:35 INFO mapred.MapTask: soft limit at 83886080  
17/08/21 19:57:35 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600  
17/08/21 19:57:35 INFO mapred.MapTask: kvstart = 26214396; length = 6553600  
17/08/21 19:57:35 INFO mapred.LocalJobRunner:   
17/08/21 19:57:35 INFO mapred.MapTask: Starting flush of map output  
17/08/21 19:57:35 INFO mapred.MapTask: Spilling map output  
17/08/21 19:57:35 INFO mapred.MapTask: bufstart = 0; bufend = 49; bufvoid = 104857600  
17/08/21 19:57:35 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214388(104857552); length = 9/6553600  
17/08/21 19:57:35 INFO mapred.MapTask: Finished spill 0  
17/08/21 19:57:35 INFO mapred.Task: Task:attempt_local1813302185_0001_m_000000_0 is done. And is in the process of committing  
17/08/21 19:57:35 INFO mapred.LocalJobRunner: map  
17/08/21 19:57:35 INFO mapred.Task: Task 'attempt_local1813302185_0001_m_000000_0' done.
17/08/21 19:57:35 INFO mapred.LocalJobRunner: Finishing task: attempt_local1813302185_0001_m_000000_0  
17/08/21 19:57:35 INFO mapred.LocalJobRunner: map task executor complete.  
17/08/21 19:57:35 INFO mapred.LocalJobRunner: Waiting for reduce tasks  
17/08/21 19:57:35 INFO mapred.LocalJobRunner: Starting task: attempt_local1813302185_0001_r_000000_0  
17/08/21 19:57:35 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.  
17/08/21 19:57:35 INFO mapred.Task:  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@4dbff46f  
17/08/21 19:57:35 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@4a052c9e  
17/08/21 19:57:35 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=1303589632, maxSingleShuffleLimit=325897408, mergeThreshold=860369216, ioSortFactor=10, memToMemMergeOutputsThreshold=10  
17/08/21 19:57:35 INFO reduce.EventFetcher: attempt_local1813302185_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events  
17/08/21 19:57:35 INFO mapred.LocalJobRunner: reduce task executor complete.  
17/08/21 19:57:35 WARN mapred.LocalJobRunner: job_local1813302185_0001  
java.lang.Exception: org.apache.hadoop.mapreduce.task.reduce.Shuffle
```
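Two of the warnings have standard fixes: "No job jar file set" means the driver never calls Job#setJarByClass, and "Hadoop command-line option parsing not performed" means it does not implement the Tool interface and go through ToolRunner. The shuffle failure itself looks like the well-known space-in-username problem on Windows: the local fetcher looks for file.out.index under the URL-encoded path (qw%20song), while the job.xml warnings earlier in the log show the files were written under the unencoded "hadoop-qw song". The most reliable cure is running under a user name without spaces; relocating Hadoop's scratch space to a space-free path is a commonly suggested mitigation. Below is a minimal driver sketch along those lines, assuming a standard WordCount job (the original poster's driver is not shown, and D:/hadooptmp is an illustrative placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;
import java.util.StringTokenizer;

public class WordCount extends Configured implements Tool {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Tokenize each line and emit (word, 1) for every token
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all partial counts for this word
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Commonly suggested mitigation: keep local scratch space on a path
        // without spaces instead of the default /tmp/hadoop-${user.name}.
        // Note LocalJobRunner still embeds the user name elsewhere in the
        // path, so a user name without spaces may also be required.
        conf.set("hadoop.tmp.dir", "D:/hadooptmp");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);   // silences "No job jar file set"
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // Running through ToolRunner silences the "command-line option
        // parsing not performed" warning and enables -D style overrides.
        System.exit(ToolRunner.run(new Configuration(), new WordCount(), args));
    }
}
```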

Word-frequency counting with Hadoop means using the Hadoop framework to count word occurrences in large-scale text data. The steps are:

1. Prepare the data: store the text to be counted in HDFS.
2. Write the Mapper: the Mapper tokenizes the input text and emits each word as the key with its occurrence count as the value.
3. Write the Reducer: the Reducer merges the key-value pairs emitted by the Mapper to obtain the total number of times each word appears in the text.
4. Configure and submit the job: package the Mapper and Reducer and submit the job to the YARN cluster with the commands Hadoop provides.
5. Inspect the results: once the job completes, read the output back from HDFS with the commands Hadoop provides.

Here is a simple word-count Mapper for Hadoop Streaming, in Python:

```python
#!/usr/bin/env python
import sys

# Read input data line by line from standard input
for line in sys.stdin:
    # Strip leading and trailing whitespace
    line = line.strip()
    # Tokenize on whitespace
    words = line.split()
    # Emit each word with a count of 1
    for word in words:
        print('%s\t%s' % (word, 1))
```

And the matching Reducer, which relies on Hadoop Streaming sorting the Mapper output by key before it reaches the Reducer:

```python
#!/usr/bin/env python
import sys

# Track the word currently being aggregated
current_word = None
current_count = 0

# Process the sorted key-value pairs from standard input
for line in sys.stdin:
    # Strip leading and trailing whitespace
    line = line.strip()
    # Parse the tab-separated word and count
    word, count = line.split('\t', 1)
    count = int(count)
    # When the word changes, emit the previous word's total
    if current_word and current_word != word:
        print('%s\t%s' % (current_word, current_count))
        current_count = 0
    # Accumulate the count for the current word
    current_word = word
    current_count += count

# Emit the total for the last word
if current_word:
    print('%s\t%s' % (current_word, current_count))
```
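To make steps 4 and 5 concrete, the two scripts might be exercised as sketched below; the streaming jar location is version-dependent, and the input and output paths are illustrative assumptions:

```bash
# Quick local smoke test: the sort step stands in for Hadoop's shuffle,
# which is what guarantees the reducer sees each word's pairs contiguously.
cat wordcount.txt | python mapper.py | sort -k1,1 | python reducer.py

# Submit via Hadoop Streaming (jar path varies with the Hadoop version),
# then read the results back out of HDFS.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /input/wordcount.txt \
    -output /output/wordcount \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py
hadoop fs -cat /output/wordcount/part-*
```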