The Java invocation code:
// input/output are HDFS paths, e.g. /tmp/mahout-work-java-sh/20news-all
String[] params = new String[]{"-i", input, "-o", output, "-ow"};
SequenceFilesFromDirectory sffd = new SequenceFilesFromDirectory();
sffd.run(params);
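For reference, a fuller, self-contained version of that driver. This is only a sketch: the class name and the namenode/jobtracker addresses below are my assumptions for a typical single-node setup, not values from the original code, and must be adjusted to the actual cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.text.SequenceFilesFromDirectory;

public class SeqDirDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed single-node addresses -- replace with your own values.
        conf.set("fs.default.name", "hdfs://localhost:9000");
        conf.set("mapred.job.tracker", "localhost:9001");
        String[] params = new String[]{
                "-i", "/tmp/mahout-work-java-sh/20news-all",
                "-o", "/tmp/mahout-work-java-sh/20news-seq", "-ow"};
        // SequenceFilesFromDirectory implements Tool (via AbstractJob),
        // so ToolRunner passes conf through to the Mahout job.
        System.exit(ToolRunner.run(conf, new SequenceFilesFromDirectory(), params));
    }
}

Note that when mapred.job.tracker is left unset (or set to "local"), Hadoop falls back to the in-process LocalJobRunner, which is exactly what appears in the stack trace below.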
Hadoop is running in single-node mode. Executing the following command on the Hadoop box:
mahout seqdirectory -i /tmp/mahout-work-java-sh/20news-all -o /tmp/mahout-work-java-sh/20news-seq -ow
The command above runs successfully. But when the equivalent Java code is invoked from Eclipse, it fails with the following error:
2014-07-17 10:16:27 [Thread-3] - [INFO] Map task executor complete.
2014-07-17 10:16:27 [Thread-3] - [WARN] job_local455429773_0001java.lang.Exception: java.io.IOException: Could not obtain block: blk_3191358095169972286_1769 file=/tmp/mahout-work-java-sh/20news-all/rec.sport.baseball/104457
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.io.IOException: Could not obtain block: blk_3191358095169972286_1769 file=/tmp/mahout-work-java-sh/20news-all/rec.sport.baseball/104457
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2460)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2252)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2415)
at java.io.DataInputStream.read(DataInputStream.java:132)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:175)
at org.apache.mahout.text.WholeFileRecordReader.nextKeyValue(WholeFileRecordReader.java:117)
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:531)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
2014-07-17 10:16:28 [main] - [INFO] Job complete: job_local455429773_0001
2014-07-17 10:16:28 [main] - [INFO] Counters: 11
2014-07-17 10:16:28 [main] - [INFO] File Output Format Counters
2014-07-17 10:16:28 [main] - [INFO] Bytes Written=8007465
2014-07-17 10:16:28 [main] - [INFO] File Input Format Counters
2014-07-17 10:16:28 [main] - [INFO] Bytes Read=0
2014-07-17 10:16:28 [main] - [INFO] FileSystemCounters
2014-07-17 10:16:28 [main] - [INFO] FILE_BYTES_READ=106022950
2014-07-17 10:16:28 [main] - [INFO] HDFS_BYTES_READ=14604249
2014-07-17 10:16:28 [main] - [INFO] FILE_BYTES_WRITTEN=106912382
2014-07-17 10:16:28 [main] - [INFO] HDFS_BYTES_WRITTEN=8007465
2014-07-17 10:16:28 [main] - [INFO] Map-Reduce Framework
2014-07-17 10:16:28 [main] - [INFO] Map input records=8953
2014-07-17 10:16:28 [main] - [INFO] Spilled Records=0
2014-07-17 10:16:28 [main] - [INFO] Total committed heap usage (bytes)=47063040
2014-07-17 10:16:28 [main] - [INFO] SPLIT_RAW_BYTES=1861408
2014-07-17 10:16:28 [main] - [INFO] Map output records=8953
I have been stuck on this error for three days. I have searched Baidu and Google exhaustively without finding a working solution. The usual suggestions, such as raising the ulimit or increasing the number of RPC connections, are all things I have tried, and every one of them still ends in the same error.
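(As a concrete example of the kind of fix the online advice refers to: the datanode-side limit usually cited for "Could not obtain block" errors is dfs.datanode.max.xcievers -- the misspelling is the actual Hadoop 1.x property name -- raised in the datanode's hdfs-site.xml, followed by a datanode restart.)

<!-- hdfs-site.xml on the datanode; requires a datanode restart -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>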
So I ran some tests. Executing the command line directly on Linux works. Packaging the Eclipse Java code into a jar and running it on the Linux server also works.
Later I reduced the amount of test data in the 20news-all folder, and the job then completed successfully as well. So the problem seems to lie on the client side, in some setting that limits the number of file-read connections.
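To test that hypothesis directly, a small sketch like the following (class name and namenode address are mine, not from the original code) reads every file under 20news-all through a single DFS client, bypassing MapReduce entirely. If it reproduces the "Could not obtain block" error, the problem is in the client/datanode connection handling rather than in Mahout:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadCheck {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Assumed namenode address -- adjust to your cluster.
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);
        readAll(fs, new Path("/tmp/mahout-work-java-sh/20news-all"));
    }

    private static void readAll(FileSystem fs, Path dir) throws IOException {
        for (FileStatus st : fs.listStatus(dir)) {
            if (st.isDir()) {            // Hadoop 1.x API
                readAll(fs, st.getPath());
            } else {
                FSDataInputStream in = fs.open(st.getPath());
                try {
                    byte[] buf = new byte[(int) st.getLen()];
                    // Same IOUtils.readFully call as the failing stack frame.
                    IOUtils.readFully(in, buf, 0, buf.length);
                } finally {
                    IOUtils.closeStream(in);
                }
            }
        }
    }
}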
But I have changed many of the Hadoop client connection parameters in Eclipse and none of them made any difference. At this point I am truly at a dead end. Sigh...
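One more thing worth checking (again only a sketch, with an assumed Hadoop install path): making the Eclipse client load exactly the same configuration files as the working command-line run, instead of relying on whatever copies happen to be on the Eclipse classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// Assumed install path -- point these at the same conf/ directory
// that the working command-line run uses.
conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/conf/mapred-site.xml"));
sffd.setConf(conf);   // AbstractJob (Configured) picks this up in run(params)
sffd.run(params);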
If any expert has run into this problem, please point me in the right direction!