mahout lucene vector errors

michzel

Category: MAHOUT study. Tags: lucene, vector, processing, thread, output, hadoop

Article link: https://blog.csdn.net/michzel/article/details/7052464

Yesterday Mahout kept failing when I tried to convert a Lucene index into vectors. My notes follow.

First, when building the index, the field must be created with term vectors enabled, for example:

Field fld = new Field("text", "foo", Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES);

Next, start Hadoop, change to the MAHOUT_HOME directory, and run:

bin/mahout lucene.vector --dir <PATH TO INDEX>/example/solr/data/index/ --output /tmp/foo/part-out.vec --field title-clustering --idField id --dictOut /tmp/foo/dict.out --norm 2

Earlier versions had a bug that produced the following error:

WARNING: No lucene.vector.props found on classpath, will use command-line arguments only
Aug 5, 2010 11:17:40 AM org.slf4j.impl.JCLLoggerAdapter error
SEVERE: Exception
org.apache.commons.cli2.OptionException: Unexpected 2 while processing Options
at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:125)

Fix: rename lucenevector.props under mahout/conf to lucene.vector.props.

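The error I actually hit yesterday had a different cause: I had typed the --flag options in the command above as -flag. Embarrassing. After searching through a lot of material without success, I finally emailed sean.owen, who replied quickly and pointed out my mistake. Thanks again.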
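The single-dash mistake is easy to miss by eye, so a small pre-flight check on the argument list can help. The following is only a sketch and is not part of Mahout: the class name FlagCheck and the method suspiciousFlags are made up for illustration. It assumes long options are written with two ASCII hyphens. It also flags arguments that start with an en dash or em dash, which can appear when a command is copied from a web page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: flags arguments that look like
// long options written with the wrong kind of dash.
final class FlagCheck {

    static List<String> suspiciousFlags(String[] args) {
        List<String> suspicious = new ArrayList<>();
        for (String arg : args) {
            if (arg.length() < 3) {
                continue; // too short to be a long option name (e.g. "-d" or "2")
            }
            char first = arg.charAt(0);
            // "-dir": one ASCII hyphen followed by a letter and more characters.
            // Values such as "-1.5" are not flagged because '1' is not a letter.
            boolean singleHyphen = first == '-'
                    && arg.charAt(1) != '-'
                    && Character.isLetter(arg.charAt(1));
            // En dash (U+2013) or em dash (U+2014) instead of ASCII hyphens.
            boolean typographicDash = first == '\u2013' || first == '\u2014';
            if (singleHyphen || typographicDash) {
                suspicious.add(arg);
            }
        }
        return suspicious;
    }

    public static void main(String[] args) {
        for (String arg : suspiciousFlags(args)) {
            System.out.println("Suspicious option, did you mean \"--\"? " + arg);
        }
    }
}
```

Passing the same arguments you intend to give to bin/mahout prints one line for each option that should probably start with two hyphens.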

Today I ran into the following error:

lucene.LuceneIterator: There are too many documents that do not have a term vector for title-clustering
Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for title-clustering.

If I change --field from contents to a different field name, the program runs. I do not know why.

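Today I finally worked out that the analyzer is the problem. With the Lucene analyzers that ship with Mahout there is no error, even when using Lucene's own Chinese tokenizer. I was using the Chinese Academy of Sciences Chinese word segmenter, and it may not have produced any tokens for the content. That seems the most likely explanation, and sorting it out is the next step.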
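One way to test the "segmenter produced no tokens" hypothesis before re-indexing is to run the segmenter over a sample of documents and count how many come back empty. The sketch below does that under stated assumptions: the segmenter is represented as a plain Function from text to a token list, and whitespaceTokenizer is only a stand-in, not the Chinese Academy of Sciences segmenter or any Lucene analyzer. All names here are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical diagnostic: counts sample documents for which a tokenizer
// yields no usable tokens. It does not use Lucene or Mahout.
final class EmptyTokenCheck {

    static int countEmpty(List<String> docs, Function<String, List<String>> tokenizer) {
        int empty = 0;
        for (String doc : docs) {
            List<String> tokens = tokenizer.apply(doc);
            // A document counts as empty if the tokenizer returns null,
            // an empty list, or only null/blank tokens.
            boolean hasToken = tokens != null
                    && tokens.stream().anyMatch(t -> t != null && !t.trim().isEmpty());
            if (!hasToken) {
                empty++;
            }
        }
        return empty;
    }

    // Stand-in tokenizer (assumption): splits on whitespace. Replace this
    // with a call into the real segmenter when running the check.
    static List<String> whitespaceTokenizer(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("foo bar", "   ", "baz");
        int empty = countEmpty(docs, EmptyTokenCheck::whitespaceTokenizer);
        System.out.println(empty + " of " + docs.size() + " documents produced no tokens");
    }
}
```

If a large share of the sample comes back empty with the real segmenter plugged in, that would be consistent with the missing term vectors reported in the error above, though it does not prove the cause on its own.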