Bayes example data: error when converting 20news-all to 20news-seq

Java invocation code:

import org.apache.mahout.text.SequenceFilesFromDirectory;

// input and output are HDFS paths (the 20news directories shown below)
String[] params = new String[]{"-i", input, "-o", output, "-ow"};
SequenceFilesFromDirectory sffd = new SequenceFilesFromDirectory();
sffd.run(params);
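
For comparison, here is a minimal sketch of driving the same conversion through Hadoop's ToolRunner with an explicit Configuration, so that the Eclipse-side client loads the same core-site.xml and hdfs-site.xml the command line uses. SequenceFilesFromDirectory extends Mahout's AbstractJob, which implements Tool, so it can be handed to ToolRunner; the config file paths below are assumptions, substitute your own:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.text.SequenceFilesFromDirectory;

public class SeqDirDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Load the cluster's client configs so the Eclipse client sees the
        // same HDFS settings as the command line (paths are assumptions).
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

        String[] params = {"-i", "/tmp/mahout-work-java-sh/20news-all",
                           "-o", "/tmp/mahout-work-java-sh/20news-seq", "-ow"};
        System.exit(ToolRunner.run(conf, new SequenceFilesFromDirectory(), params));
    }
}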


Hadoop is running as a single node. Executing the following command on Hadoop:

mahout seqdirectory -i /tmp/mahout-work-java-sh/20news-all -o /tmp/mahout-work-java-sh/20news-seq -ow 

The command above runs successfully. But when it is invoked instead from Java code in Eclipse, it fails with the error below:


2014-07-17 10:16:27 [Thread-3] - [INFO] Map task executor complete.

2014-07-17 10:16:27 [Thread-3] - [WARN] job_local455429773_0001
java.lang.Exception: java.io.IOException: Could not obtain block: blk_3191358095169972286_1769 file=/tmp/mahout-work-java-sh/20news-all/rec.sport.baseball/104457
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.io.IOException: Could not obtain block: blk_3191358095169972286_1769 file=/tmp/mahout-work-java-sh/20news-all/rec.sport.baseball/104457
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2460)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2252)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2415)
at java.io.DataInputStream.read(DataInputStream.java:132)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:175)
at org.apache.mahout.text.WholeFileRecordReader.nextKeyValue(WholeFileRecordReader.java:117)
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:531)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
2014-07-17 10:16:28 [main] - [INFO] Job complete: job_local455429773_0001
2014-07-17 10:16:28 [main] - [INFO] Counters: 11
2014-07-17 10:16:28 [main] - [INFO]   File Output Format Counters 
2014-07-17 10:16:28 [main] - [INFO]     Bytes Written=8007465
2014-07-17 10:16:28 [main] - [INFO]   File Input Format Counters 
2014-07-17 10:16:28 [main] - [INFO]     Bytes Read=0
2014-07-17 10:16:28 [main] - [INFO]   FileSystemCounters
2014-07-17 10:16:28 [main] - [INFO]     FILE_BYTES_READ=106022950
2014-07-17 10:16:28 [main] - [INFO]     HDFS_BYTES_READ=14604249
2014-07-17 10:16:28 [main] - [INFO]     FILE_BYTES_WRITTEN=106912382
2014-07-17 10:16:28 [main] - [INFO]     HDFS_BYTES_WRITTEN=8007465
2014-07-17 10:16:28 [main] - [INFO]   Map-Reduce Framework
2014-07-17 10:16:28 [main] - [INFO]     Map input records=8953
2014-07-17 10:16:28 [main] - [INFO]     Spilled Records=0
2014-07-17 10:16:28 [main] - [INFO]     Total committed heap usage (bytes)=47063040
2014-07-17 10:16:28 [main] - [INFO]     SPLIT_RAW_BYTES=1861408
2014-07-17 10:16:28 [main] - [INFO]     Map output records=8953




I have been wrestling with this error for three days. I searched Baidu and Google through everything I could find and never found a working fix: raising ulimit, raising the RPC connection count, and so on. In short, I tried every remedy suggested online, none of them worked, and the same error keeps appearing.
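
Two knobs worth naming explicitly (a suggestion based on the usual causes of this error, not something I have verified for this exact setup): the datanode-side transceiver limit dfs.datanode.max.xcievers (the misspelling is historical), which must be raised in the datanode's hdfs-site.xml and takes effect after a datanode restart, and the client-side retry count dfs.client.max.block.acquire.failures, after which DFSClient gives up with exactly this "Could not obtain block" message. The client-side one can be set in code, roughly:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// DFSClient throws "Could not obtain block" after this many failed
// attempts to fetch a block's data (default is 3). Raising it only buys
// more retries; the datanode-side xcievers limit is the more common
// root cause when many small files are read concurrently.
conf.setInt("dfs.client.max.block.acquire.failures", 10);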


So I tested further: running the command line on Linux works fine, and packaging the Eclipse Java code into a jar and running it on the Linux server also works fine.


Later I reduced the amount of test data in the 20news-all folder, and the run then completed successfully. So the problem seems to lie on the client side, in some setting governing the number of connections used to read files.
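
To tell a genuinely unreadable block apart from a connection-exhaustion problem, one quick check (a sketch; it assumes the file named in the stack trace still exists) is to read that single file directly from the same Eclipse client. Running hadoop fsck /tmp/mahout-work-java-sh/20news-all -files -blocks -locations from the shell answers the same question from the cluster's point of view.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadOneFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // The exact file named in the "Could not obtain block" error.
        Path p = new Path("/tmp/mahout-work-java-sh/20news-all/rec.sport.baseball/104457");
        // If this lone read succeeds, the block itself is healthy and the
        // failure under the full job is more likely connection/load related.
        IOUtils.copyBytes(fs.open(p), System.out, conf, true);
    }
}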

But I changed many of the Hadoop client connection configuration parameters in Eclipse, and none of them had any effect. At this point I am truly at a dead end!

If any expert has hit this problem too, please share some pointers!

