Comparing Document Classification Functions of Lucene and Mahout

Starting with version 4.2, Lucene provides a document classification function. In this article, we will run the document classification functions of both Lucene and Mahout on the same corpus and compare the results.

Lucene implements Naive Bayes and k-NN rule classifiers. The trunk, which will become Lucene 5, the next major release, adds a boolean (2-class) perceptron classifier to these two. We use Lucene 4.6.1, the most recent version at the time of writing, to perform document classification with Naive Bayes and the k-NN rule.

On the Mahout side, we will perform document classification with Naive Bayes and Random Forest.

Overview of Lucene Document Classification

Lucene’s classifier for document classification is defined as the Classifier interface.

public interface Classifier<T> {
 
   /**
    * Assign a class (with score) to the given text String
    * @param text a String containing text to be classified
    * @return a {@link ClassificationResult} holding assigned class of type <code>T</code> and score
    * @throws IOException If there is a low-level I/O error.
    */
   public ClassificationResult<T> assignClass(String text) throws IOException;
 
   /**
    * Train the classifier using the underlying Lucene index
    * @param atomicReader the reader to use to access the Lucene index
    * @param textFieldName the name of the field used to compare documents
    * @param classFieldName the name of the field containing the class assigned to documents
    * @param analyzer the analyzer used to tokenize / filter the unseen text
    * @param query the query to filter which documents use for training
    * @throws IOException If there is a low-level I/O error.
    */
   public void train(AtomicReader atomicReader, String textFieldName, String classFieldName, Analyzer analyzer, Query query)
       throws IOException;
}

Because Classifier uses an index as its training data, you need to open an IndexReader on a prepared index and pass it as the first argument of the train() method. The second argument is the name of the Lucene field holding the text, which must be tokenized and indexed; the third is the field holding the document category. Likewise, the fourth argument takes a Lucene Analyzer and the fifth a Query. The Analyzer is the one used to tokenize the unknown text being classified (in my personal opinion, this is a bit confusing, and it would be better to pass it to the assignClass() method described below instead). The Query narrows down the documents used for training; pass null if you don't need it. The train() method has two more overloads with different arguments, but I will skip them here.

After calling train() on the Classifier, pass the unknown document as a String to assignClass() to obtain the classification result. Classifier is a generic interface, and assignClass() returns a ClassificationResult parameterized with the type variable T.

public class ClassificationResult<T> {
 
   private final T assignedClass;
   private final double score;
 
   /**
    * Constructor
    * @param assignedClass the class <code>T</code> assigned by a {@link Classifier}
    * @param score the score for the assignedClass as a <code>double</code>
    */
   public ClassificationResult(T assignedClass, double score) {
     this.assignedClass = assignedClass;
     this.score = score;
   }
 
   /**
    * retrieve the result class
    * @return a <code>T</code> representing an assigned class
    */
   public T getAssignedClass() {
     return assignedClass;
   }
 
   /**
    * retrieve the result score
    * @return a <code>double</code> representing a result score
    */
   public double getScore() {
     return score;
   }
}

Calling the getAssignedClass() method of ClassificationResult gives you the classification result as type T.

Note that Lucene’s classifier is unusual in that the train() method does little work while assignClass() does most of it. This is very different from most machine learning software. In the usual learning phase, a model file is built from the corpus according to the selected machine learning algorithm (this is where most of the time and effort goes; since Mahout is based on Hadoop, it uses MapReduce to shorten this step). In the classification phase, an unknown document is classified by consulting the previously built model file, which usually requires few resources.

Because Lucene uses its index as the model file, the train() method, the learning phase, does almost nothing (learning is effectively complete as soon as the index is built). A Lucene index, however, is optimized for fast keyword search and is not an ideal format for a document classification model. Document classification is therefore done by searching the index inside assignClass(), the classification phase. Contrary to typical machine learning software, Lucene's classifier demands the most computing power at classification time. For sites focused mainly on search, this function should be appealing: they already create indexes, so they get document classification at no additional cost.

Now, let's quickly go through how the two implementation classes of the Classifier interface perform document classification, and then call them from a program.

Using Lucene SimpleNaiveBayesClassifier

SimpleNaiveBayesClassifier is the first implementation class of the Classifier interface. As the name suggests, it is a Naive Bayes classifier. Naive Bayes classification finds the class c that maximizes the conditional probability P(c|d), the probability that document d belongs to class c. Applying Bayes' theorem to P(c|d) reduces the problem to finding the class c that maximizes P(c)P(d|c). The calculation is usually done in logarithms to avoid underflow, and the assignClass() method of SimpleNaiveBayesClassifier repeats it once per class to pick the maximum-likelihood class.
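The decision rule described above can be written out as follows (this is the standard multinomial Naive Bayes formulation, not code taken from Lucene itself):

```latex
\hat{c} = \arg\max_{c} P(c \mid d)
        = \arg\max_{c} P(c)\,P(d \mid c)
        = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right]
```

where $w_1, \dots, w_n$ are the tokens of document $d$. Taking logarithms turns the product of many small probabilities into a sum, which is what avoids the underflow mentioned above.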

We now use SimpleNaiveBayesClassifier, but before that, we need to prepare the training data in an index. Here we use the livedoor news corpus as our corpus. Let's add the livedoor news corpus to a Solr index with the following schema definition.

<?xml version="1.0" encoding="UTF-8"?>
<schema name="example" version="1.5">
  <fields>
    <field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="cat" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="title" type="text_ja" indexed="true" stored="true" multiValued="false"/>
    <field name="body" type="text_ja" indexed="true" stored="true" multiValued="true"/>
    <field name="date" type="date" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>url</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
        <filter class="solr.JapaneseBaseFormFilterFactory"/>
        <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
        <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>

Note that the cat field holds the classification class, while the body field is the field used for learning. First, start Solr with the above schema.xml and add the livedoor news corpus. You can stop Solr as soon as you finish adding the corpus.

Next, we need a Java program that uses SimpleNaiveBayesClassifier. To keep things simple, we classify the very same documents we used for training. The program looks as follows.

public final class TestLuceneIndexClassifier {

  public static final String INDEX = "solr2/collection1/data/index";
  public static final String[] CATEGORIES = {
    "dokujo-tsushin",
    "it-life-hack",
    "kaden-channel",
    "livedoor-homme",
    "movie-enter",
    "peachy",
    "smax",
    "sports-watch",
    "topic-news"
  };
  private static int[][] counts;
  private static Map<String, Integer> catindex;

  public static void main(String[] args) throws Exception {
    init();

    final long startTime = System.currentTimeMillis();
    SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
    IndexReader reader = DirectoryReader.open(dir());
    AtomicReader ar = SlowCompositeReaderWrapper.wrap(reader);

    classifier.train(ar, "body", "cat", new JapaneseAnalyzer(Version.LUCENE_46));
    final int maxdoc = reader.maxDoc();
    for (int i = 0; i < maxdoc; i++) {
      Document doc = ar.document(i);
      String correctAnswer = doc.get("cat");
      final int cai = idx(correctAnswer);
      ClassificationResult<BytesRef> result = classifier.assignClass(doc.get("body"));
      String classified = result.getAssignedClass().utf8ToString();
      final int cli = idx(classified);
      counts[cai][cli]++;
    }
    final long endTime = System.currentTimeMillis();
    final int elapse = (int)(endTime - startTime) / 1000;

    // print results
    int fc = 0, tc = 0;
    for (int i = 0; i < CATEGORIES.length; i++) {
      for (int j = 0; j < CATEGORIES.length; j++) {
        System.out.printf(" %3d ", counts[i][j]);
        if (i == j) {
          tc += counts[i][j];
        } else {
          fc += counts[i][j];
        }
      }
      System.out.println();
    }
    float accrate = (float)tc / (float)(tc + fc);
    float errrate = (float)fc / (float)(tc + fc);
    System.out.printf("\n\n*** accuracy rate = %f, error rate = %f; time = %d (sec); %d docs\n",
        accrate, errrate, elapse, maxdoc);

    reader.close();
  }

  static Directory dir() throws IOException {
    return FSDirectory.open(new File(INDEX));
  }

  static void init() {
    counts = new int[CATEGORIES.length][CATEGORIES.length];
    catindex = new HashMap<String, Integer>();
    for (int i = 0; i < CATEGORIES.length; i++) {
      catindex.put(CATEGORIES[i], i);
    }
  }

  static int idx(String cat) {
    return catindex.get(cat);
  }
}

Here we specified JapaneseAnalyzer as the Analyzer (there is a slight difference from index creation, where JapaneseTokenizer and the related TokenFilters were configured through Solr). The String array CATEGORIES hard-codes the document categories. Running the program prints a confusion matrix similar to Mahout's, with rows and columns in the same order as the hard-coded category array.

Executing this program displays the following.

760    0    4   23   37   37    2    2    5
   40  656    7   44   25    4   90    1    3
   87   57  392  102   68   24  113    5   16
   40   15    6  391   33    8   16    2    0
   14    2    0    5  845    2    0    1    1
  134    2    2   26  107  549   19    3    0
   43   36   13   17   26   36  693    5    1
    6    0    0   23   35    0    1  829    6
   10    9    9   25   66    6    5   45  595
 
*** accuracy rate = 0.775078, error rate = 0.224922; time = 67 (sec); 7367 docs

The classification accuracy rate came to about 77%.

Using Lucene KNearestNeighborClassifier

The other implementation class of Classifier is KNearestNeighborClassifier. Its constructor takes k, which must be at least 1, as an argument. You can reuse the SimpleNaiveBayesClassifier program as is; all you need to change is the line that creates the classifier instance.

As before, the assignClass() method does all the work for KNearestNeighborClassifier, and one interesting point is that it uses Lucene's MoreLikeThis. MoreLikeThis turns a reference document into a query and runs a search, finding documents similar to the reference. KNearestNeighborClassifier uses MoreLikeThis to retrieve the k documents most similar to the unknown document passed to assignClass(), and then takes a majority vote over those k documents to decide the category of the unknown document.
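The majority-vote step described above can be sketched as follows. This is a minimal, standalone illustration of the voting logic; the class and method names are our own, not Lucene's internals.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the majority-vote step: given the categories of the k most
// similar documents (as MoreLikeThis would return them), pick the most
// frequent category. Names here are hypothetical, not Lucene's.
public class MajorityVote {

  public static String vote(List<String> neighborCategories) {
    Map<String, Integer> freq = new HashMap<String, Integer>();
    for (String cat : neighborCategories) {
      Integer c = freq.get(cat);
      freq.put(cat, c == null ? 1 : c + 1);  // count one vote per neighbor
    }
    String best = null;
    int bestCount = -1;
    for (Map.Entry<String, Integer> e : freq.entrySet()) {
      if (e.getValue() > bestCount) {        // keep the category with most votes
        best = e.getKey();
        bestCount = e.getValue();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // With k = 3, two "sports-watch" neighbors outvote one "topic-news" neighbor.
    System.out.println(vote(Arrays.asList("sports-watch", "topic-news", "sports-watch")));
    // prints "sports-watch"
  }
}
```

Note that with k=1 this reduces to simply taking the category of the single most similar document, which explains why the quality of the similarity search dominates the result in that case.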

Running the same program with KNearestNeighborClassifier and k=1 displays the following.

724   14   28   22    6   30    8   18   20
  121  630   41   13    2    9   35    6   13
  165   28  582   10    5   16   26    7   25
  229   15   15  213    6   14    6    2   11
  134   37   15    8  603   12   19    7   35
  266   38   39   24   14  412   22    9   18
  810   16    1    3    2    3   32    1    2
  316   18   14   12    5    7    8  439   81
  362   17   29   10    1    7    7   16  321
 
*** accuracy rate = 0.536989, error rate = 0.463011; time = 13 (sec); 7367 docs

Now the accuracy rate is about 53%. With k=3, it drops further, to about 48%.

652    5   78    3    7   40   13   38   34
  127  540   82   15    1   10   58   23   14
  169   34  553    3    7   16   38   15   29
  242   10   32  156   12   13   15   10   21
  136   30   21    9  592   11   19   15   37
  309   34   58    5   23  318   40   28   27
  810    8    3    1    0   10   37    1    0
  312    8   44    7    5    2   13  442   67
  362   11   45    5    6   10   16   34  281
 
*** accuracy rate = 0.484729, error rate = 0.515271; time = 9 (sec); 7367 docs

 

Document Classification by NLP4L and Mahout

If you want to use a Lucene index as input data for Mahout, there is a handy command available. However, since our goal is supervised document classification, we also need to output the field that specifies the class, in addition to the document vectors.

The tools that make this easy are MSDDumper and TermsDumper from NLP4L, which we developed. NLP4L stands for Natural Language Processing for Lucene; it is a natural language processing toolset that treats Lucene indexes as corpora.

Depending on their settings, MSDDumper and TermsDumper select and extract important words from a Lucene field using measures such as tf*idf, and output them in a format that the Mahout commands can easily read. Let's use this to select the 2,000 most important words from the body field of the index and run the Mahout classifiers.
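To make the selection step concrete, here is a minimal sketch of tf*idf-based term selection of the kind such a dumper performs. This is our own toy illustration; the class, method names, and toy numbers are hypothetical and not taken from NLP4L.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of tf*idf scoring and top-N term selection.
// All names and numbers here are our own, not NLP4L's implementation.
public class TfIdfSelect {

  // tf*idf for a term: raw term frequency times log(N / document frequency).
  public static double tfidf(int tf, int df, int numDocs) {
    return tf * Math.log((double) numDocs / df);
  }

  // Return the top-n terms by descending score from a term -> score map.
  public static List<String> topTerms(final Map<String, Double> scores, int n) {
    List<String> terms = new ArrayList<String>(scores.keySet());
    Collections.sort(terms, new Comparator<String>() {
      public int compare(String a, String b) {
        return Double.compare(scores.get(b), scores.get(a)); // descending
      }
    });
    return terms.subList(0, Math.min(n, terms.size()));
  }

  public static void main(String[] args) {
    Map<String, Double> scores = new HashMap<String, Double>();
    // toy counts over an imaginary 1000-document corpus
    scores.put("the", tfidf(500, 990, 1000));   // appears everywhere -> low score
    scores.put("lucene", tfidf(50, 20, 1000));  // concentrated -> high score
    scores.put("mahout", tfidf(30, 10, 1000));
    System.out.println(topTerms(scores, 2));    // prints "[lucene, mahout]"
  }
}
```

In the actual experiment the dumpers score all terms of the body field this way and keep only the top 2,000, which is exactly the feature selection that the Lucene classifiers in the previous sections did not get.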

Looking only at the result, Mahout Naive Bayes shows an accuracy rate of about 96%.

=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       7128       96.7689%
Incorrectly Classified Instances        :        238        3.2311%
Total Classified Instances              :       7366
 
=======================================================
Confusion Matrix
-------------------------------------------------------
a       b       c       d       e       f       g       h       i       <--Classified as
823     1       1       6       12      19      2       4       2        |  870     a     = dokujo-tsushin
1       848     2       1       0       1       11      4       2        |  870     b     = it-life-hack
5       6       830     1       1       0       3       1       17       |  864     c     = kaden-channel
2       6       6       486     3       1       6       0       0        |  510     d     = livedoor-homme
0       0       1       1       865     1       0       1       1        |  870     e     = movie-enter
31      3       6       12      14      762     6       4       4        |  842     f     = peachy
0       0       2       0       0       1       867     0       0        |  870     g     = smax
0       0       0       1       0       0       0       897     2        |  900     h     = sports-watch
2       4       1       1       0       0       0       12      750      |  770     i     = topic-news
 
=======================================================
Statistics
-------------------------------------------------------
Kappa                                        0.955
Accuracy                                   96.7689%
Reliability                                87.0076%
Reliability (standard deviation)             0.307

Also, Mahout Random Forest shows an accuracy rate of about 97%.

=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       7156       97.1359%
Incorrectly Classified Instances        :        211        2.8641%
Total Classified Instances              :       7367
 
=======================================================
Confusion Matrix
-------------------------------------------------------
a       b       c       d       e       f       g       h       i       <--Classified as
838     5       2       6       3       7       2       0       1        |  864     a     = kaden-channel
0       895     0       1       4       0       0       0       0        |  900     b     = sports-watch
0       0       869     0       0       1       0       0       0        |  870     c     = smax
0       2       0       839     1       0       14      2       12       |  870     d     = dokujo-tsushin
1       17      0       0       748     0       2       0       2        |  770     e     = topic-news
1       5       0       1       5       855     2       0       1        |  870     f     = it-life-hack
0       1       0       23      0       0       793     1       24       |  842     g     = peachy
0       11      0       14      1       2       18      454     11       |  511     h     = livedoor-homme
0       1       0       2       0       0       2       0       865      |  870     i     = movie-enter
 
=======================================================
Statistics
-------------------------------------------------------
Kappa                                       0.9608
Accuracy                                   97.1359%
Reliability                                87.0627%
Reliability (standard deviation)            0.3076

 

Summary

In this article, we used the same corpus to run document classification in both Lucene and Mahout and compared the results. Mahout's accuracy appears higher, but as already noted, its training data used only the top 2,000 important words from the body field rather than all of them. Lucene's classifier, whose accuracy rate was only in the 70% range, used all the words in the body field. Lucene should be able to exceed 90% accuracy if you prepare a field containing only words selected specifically for document classification. It may also be a good idea to write another Classifier implementation whose train() method performs such feature selection.

I should add that the accuracy rate drops to around 80% when the test data is not also used for training but is treated as truly unknown data.

I hope this article will help you all in some way.

http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

 