Twenty Newsgroups Classification, Part 2: seq2sparse (1)


wbj0110

Category: Mahout. Tags: Mahout

Article link: https://blog.csdn.net/wbj0110/article/details/84608020


seq2sparse corresponds to org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles in Mahout. The job-monitoring page from yesterday's run shows that this step consists of 7 jobs: (1) DocumentTokenizer, (2) WordCount, (3) MakePartialVectors, (4) MergePartialVectors, (5) VectorTfIdf Document Frequency Count, (6) MakePartialVectors, (7) MergePartialVectors. Printing the help for SparseVectorsFromSequenceFiles shows the following:

```text
Usage:
 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
<chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma
<maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
--minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
--overwrite --help --sequentialAccessVector --namedVector --logNormalize]
Options
  --minSupport (-s) minSupport        (Optional) Minimum Support. Default
                                      Value: 2
  --analyzerName (-a) analyzerName    The class name of the analyzer
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output                The directory pathname for output.
  --input (-i) input                  Path to job input directory.
  --minDF (-md) minDF                 The minimum document frequency.  Default
                                      is 1
  --maxDFSigma (-xs) maxDFSigma       What portion of the tf (tf-idf) vectors
                                      to be used, expressed in times the
                                      standard deviation (sigma) of the
                                      document frequencies of these vectors.
                                      Can be used to remove really high
                                      frequency terms. Expressed as a double
                                      value. Good value to be specified is 3.0.
                                      In case the value is less then 0 no
                                      vectors will be filtered out. Default is
                                      -1.0.  Overrides maxDFPercent
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.
                                      Can be used to remove really high
                                      frequency terms. Expressed as an integer
                                      between 0 and 100. Default is 99.  If
                                      maxDFSigma is also set, it will override
                                      this value.
  --weight (-wt) weight               The kind of weight to use. Currently TF
                                      or TFIDF
  --norm (-n) norm                    The norm to use, expressed as either a
                                      float or "INF" if you want to use the
                                      Infinite norm.  Must be greater or equal
                                      to 0.  The default is not to normalize
  --minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood
                                      Ratio(Float)  Default is 1.0
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
                                      Default Value: 1
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
                                      create (2 = bigrams, 3 = trigrams, etc)
                                      Default Value:1
  --overwrite (-ow)                   If set, overwrite the output directory
  --help (-h)                         Print out help
  --sequentialAccessVector (-seq)     (Optional) Whether output vectors should
                                      be SequentialAccessVectors. If set true
                                      else false
  --namedVector (-nv)                 (Optional) Whether output vectors should
                                      be NamedVectors. If set true else false
  --logNormalize (-lnorm)             (Optional) Whether output vectors should
                                      be logNormalize. If set true else false
```
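To make the document-frequency options more concrete, here is a minimal plain-JDK sketch of how a minimum DF and a maximum DF percentage could prune a vocabulary. It is only an illustration of the idea behind `--minDF` and `--maxDFPercent`, not Mahout's implementation: the class `DfPruningSketch`, the method `keptTerms`, and the inclusive handling of both bounds are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration; not part of Mahout.
public class DfPruningSketch {

    /** Returns the terms whose document frequency lies within both bounds (inclusive, by assumption). */
    public static Set<String> keptTerms(List<List<String>> docs, int minDf, int maxDfPercent) {
        // Document frequency: in how many documents does each term occur at least once?
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            int count = e.getValue();
            boolean frequentEnough = count >= minDf;
            // Integer form of: count / docs.size() <= maxDfPercent / 100
            boolean notTooCommon = count * 100 <= maxDfPercent * docs.size();
            if (frequentEnough && notTooCommon) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "cat", "sat"),
                Arrays.asList("the", "dog", "sat"),
                Arrays.asList("the", "bird", "flew"),
                Arrays.asList("the", "cat", "ran"));
        System.out.println(keptTerms(docs, 2, 75));
    }
}
```

With these four toy documents, "the" occurs in every document and is dropped by the 75% ceiling, while terms that occur only once fall below the minimum of 2, so the program prints `[cat, sat]`.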

In the terminal output of yesterday's run, this step was invoked with the following command:

```shell
./bin/mahout seq2sparse -i /home/mahout/mahout-work-mahout/20news-seq -o /home/mahout/mahout-work-mahout/20news-vectors -lnorm -nv -wt tfidf
```

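Looking only at the parameters used here: -lnorm means the output vectors are log-normalized (true if the flag is set); -nv means the output vectors are NamedVectors. What "named" means here was not yet clear to me at the time of writing (Mahout's NamedVector class wraps a vector together with a string name). -wt tfidf selects the weighting scheme; for details see http://zh.wikipedia.org/wiki/TF-IDF .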
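As a small worked example of what a TF-IDF weight is, the sketch below computes the textbook form tf × ln(N / df) over a toy corpus. The class `TfIdfSketch` and its methods are hypothetical names for this illustration, and the formula is the common textbook variant; the weighting Mahout actually applies for `-wt tfidf` may differ in detail.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of textbook TF-IDF; not Mahout's weighting code.
public class TfIdfSketch {

    /** Counts, for every term, the number of documents that contain it. */
    public static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Textbook weight: term frequency times the natural log of (N / df). */
    public static double tfIdf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    /** Weighs one document; the document is assumed to be a member of docs. */
    public static Map<String, Double> weigh(List<String> doc, List<List<String>> docs) {
        Map<String, Integer> df = documentFrequencies(docs);
        Map<String, Integer> tf = new TreeMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum);
        }
        Map<String, Double> weights = new TreeMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            weights.put(e.getKey(), tfIdf(e.getValue(), df.get(e.getKey()), docs.size()));
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("mahout", "vector", "vector"),
                Arrays.asList("mahout", "hadoop"));
        // "mahout" occurs in every document, so its weight is 0;
        // "vector" occurs twice in one document only, so its weight is 2 * ln(2).
        System.out.println(weigh(docs.get(0), docs));
    }
}
```

The point of the example is that a term appearing in every document carries no weight, while a term concentrated in few documents is weighted up.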

Step (1) corresponds to line 253 of SparseVectorsFromSequenceFiles:

```java
DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);
```

Stepping into this call shows that the Mapper used is SequenceFileTokenizerMapper and that no Reducer is used. The Mapper code is:

```java
protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  // Tokenize the document text; the key is passed as the field name
  TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
  CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
  StringTuple document = new StringTuple();
  stream.reset();
  while (stream.incrementToken()) {
    // Skip empty terms
    if (termAtt.length() > 0) {
      document.add(new String(termAtt.buffer(), 0, termAtt.length()));
    }
  }
  // Emit the original key together with the document's list of tokens
  context.write(key, document);
}
```

The Mapper's setup function mainly configures the Analyzer. For the Analyzer API, see http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html . The method used in map is reusableTokenStream(String fieldName, Reader reader): "Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method."
To try it out, write the following test program:

```java
package mahout.fansy.test.bayes;

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.common.ClassUtils;
import org.apache.mahout.common.StringTuple;

public class TestSequenceFileTokenizerMapper {

    // Instantiate the same default analyzer that the Mapper's setup would configure
    private static final Analyzer analyzer = ClassUtils.instantiateAs(
            "org.apache.mahout.vectorizer.DefaultAnalyzer", Analyzer.class);

    public static void main(String[] args) throws IOException {
        testMap();
    }

    /** Repeats the body of SequenceFileTokenizerMapper.map on one hard-coded record. */
    public static void testMap() throws IOException {
        Text key = new Text("4096");
        Text value = new Text("today is also late.what about tomorrow?");
        TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
        CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
        StringTuple document = new StringTuple();
        stream.reset();
        while (stream.incrementToken()) {
            if (termAtt.length() > 0) {
                document.add(new String(termAtt.buffer(), 0, termAtt.length()));
            }
        }
        // Print instead of context.write, since there is no Hadoop context here
        System.out.println("key:" + key.toString() + ",document" + document);
    }

}
```

The output is:

```text
key:4096,document[today, also, late.what, about, tomorrow]
```

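Note that the TokenStream carries a stopwords set with the value [but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of]. Whenever one of these words is encountered it is left out of the result, which is why "is" is missing from the output above.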
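The stop-word behaviour can be imitated without Lucene or Mahout on the classpath. The sketch below is a rough plain-JDK approximation, assuming that tokens are split on whitespace, lower-cased, and stripped of leading and trailing punctuation; Lucene's real tokenizer rules are more involved, and the class `StopWordSketch` and method `tokenize` are hypothetical names for this example. The stop-word list is the one quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical approximation of analyzer behaviour; not Lucene's tokenizer.
public class StopWordSketch {

    // Stop words as listed in the article
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "but", "be", "with", "such", "then", "for", "no", "will", "not", "are",
            "and", "their", "if", "this", "on", "into", "a", "or", "there", "in",
            "that", "they", "was", "is", "it", "an", "the", "as", "at", "these",
            "by", "to", "of"));

    /** Splits on whitespace, trims surrounding punctuation, and drops stop words. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Remove non-alphanumeric characters at the start and end only,
            // so an inner dot as in "late.what" is kept (an assumption of this sketch)
            String token = raw.replaceAll("^[^a-z0-9]+|[^a-z0-9]+$", "");
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("today is also late.what about tomorrow?"));
    }
}
```

For the test sentence this prints `[today, also, late.what, about, tomorrow]`, the same token list as the Mahout run above: "is" is filtered as a stop word and "late.what" stays a single token because it contains no whitespace.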

http://blog.csdn.net/fansy1990/article/details/10478515

