Text Classification with libsvm

小飞侠-2

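Text classification is first of all a classification problem, so it involves the two key steps of any classification task: training a classifier on a training dataset, and evaluating the classifier's accuracy on a test dataset. Because the input is text, however, extra preprocessing is needed. With libsvm as the tool, the overall process of text classification can be summarized as follows:

Choose the training and test datasets: the class labels are known for both.
Preprocess the training set: this mainly covers word segmentation, stop-word removal, and building a bag-of-words model (inverted table).
Select the feature vector (term vector) used for classification: the goal is for the selected features to discriminate between the classes. Any effective selection technique can be used. Segmentation produces a very large number of words, so reducing the dimensionality cuts the amount of computation considerably while preserving classification accuracy.
Write the quantized training file in the format libsvm supports: map the class names and each word in the feature vector to numeric ids, then quantize the training documents with those ids so that the output matches the data format libsvm expects for training.
Preprocess the test set: again segmentation (with the same segmenter used for training), stop-word removal, and building a bag-of-words model (inverted table). This time, load the feature vector produced during training and use it to discard words that are not in it (also a form of dimensionality reduction).
Write the quantized test file in the format libsvm supports: the format is the same as the output of the training-set preprocessing stage.
Train the text classifier with libsvm: use the quantized file produced by training-set preprocessing. This stage also involves a fair amount of work (described in detail later) and finally outputs a classification model file.
Validate the model's accuracy with libsvm: use the quantized file produced by test-set preprocessing together with the model file to measure classification accuracy.
Tune the model parameters: if the model trained by libsvm performs poorly, use libsvm's built-in cross validation to search the parameter space for the best values, so that the model's accuracy meets practical needs.

Based on the analysis above, we implement each of these steps in turn and complete a classification task.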

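Choosing the Dataset

We chose the Sogou corpus; see the reference links at the end for downloading the corpus files. Prepare a training set and a separate test set, and make sure the two do not overlap. For example, with C classes, take N documents from each class as the training set, for C*N training documents in total, and take M of the remaining documents in each class as the test set, for C*M test documents in total.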
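
The split described above can be sketched as follows. This is only an illustration under assumptions: the class name DatasetSplitter, the Split holder, and the idea of taking the first N documents for training and the next M for testing are hypothetical and not part of the original project.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits each class into disjoint training and test lists.
final class DatasetSplitter {

    /** Holds the two disjoint document lists per class. */
    static final class Split {
        final Map<String, List<String>> train = new LinkedHashMap<String, List<String>>();
        final Map<String, List<String>> test = new LinkedHashMap<String, List<String>>();
    }

    /** Takes the first n documents of every class for training and the next m for testing. */
    static Split split(Map<String, List<String>> docsByLabel, int n, int m) {
        Split result = new Split();
        for (Map.Entry<String, List<String>> e : docsByLabel.entrySet()) {
            List<String> docs = e.getValue();
            if (docs.size() < n + m) {
                throw new IllegalArgumentException("Not enough documents for label " + e.getKey());
            }
            // subList ranges [0, n) and [n, n + m) never overlap
            result.train.put(e.getKey(), new ArrayList<String>(docs.subList(0, n)));
            result.test.put(e.getKey(), new ArrayList<String>(docs.subList(n, n + m)));
        }
        return result;
    }
}
```

With C classes this yields C*N training documents and C*M test documents, and no document appears in both sets.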

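Text Preprocessing

We use the ICTCLAS segmenter. It does not require building your own dictionary in advance, and its output is already tagged with parts of speech, so words can be filtered by part of speech to some degree (for example, keeping nouns and dropping measure words, interjections, and other words that do not help classification).
Download the ICTCLAS packages. To do segmentation from Java on 64-bit Windows 7, choose these two packages: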

20131115123549_nlpir_ictclas2013_u20131115_release.zip
20130416090323_Win-64bit-JNI-lib.zip

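Copy NLPIR_JNI.dll from the second package into C:\Windows\System32, and copy the Data directory together with NLPIR.dll, NLPIR.lib, and NLPIR.h from the first package into the root directory of the Java project.
For other operating systems, download the matching package from the ICTCLAS website ( http://ictclas.nlpir.org/downloads ).
Next we implement segmentation in Java. We define a segmenter interface so that switching to another segmenter implementation is easy: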

package org.shirdrn.document.processor.common;

import java.io.File;
import java.util.Map;

public interface DocumentAnalyzer {
    Map<String, Term> analyze(File file);
}

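We also add an external stop-word list, which is encapsulated in the abstract class AbstractDocumentAnalyzer. That class reads stop-word files from a given file or directory, loads the stop words into memory, and uses them to filter words during segmentation. Building on it, the implementation that wraps the ICTCLAS segmenter is as follows: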

package org.shirdrn.document.processor.analyzer;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import kevin.zhang.NLPIR;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.shirdrn.document.processor.common.DocumentAnalyzer;
import org.shirdrn.document.processor.common.Term;
import org.shirdrn.document.processor.config.Configuration;

public class IctclasAnalyzer extends AbstractDocumentAnalyzer implements DocumentAnalyzer {

    private static final Log LOG = LogFactory.getLog(IctclasAnalyzer.class);
    private final NLPIR analyzer;

    public IctclasAnalyzer(Configuration configuration) {
        super(configuration);
        analyzer = new NLPIR();
        try {
            boolean initialized = NLPIR.NLPIR_Init(".".getBytes(charSet), 1);
            if(!initialized) {
                throw new RuntimeException("Fail to initialize!");
            }
        } catch (Exception e) {
            throw new RuntimeException("", e);
        }
    }

    @Override
    public Map<String, Term> analyze(File file) {
        String doc = file.getAbsolutePath();
        LOG.info("Process document: file=" + doc);
        Map<String, Term> terms = new HashMap<String, Term>(0);
        BufferedReader br = null;
        try {
            br = new BufferedReader(new InputStreamReader(new FileInputStream(file), charSet));
            String line = null;
            while((line = br.readLine()) != null) {
                line = line.trim();
                if(!line.isEmpty()) {
                    byte nativeBytes[] = analyzer.NLPIR_ParagraphProcess(line.getBytes(charSet), 1);
                    String content = new String(nativeBytes, 0, nativeBytes.length, charSet);
                    // each token has the form word/lexicalCategory
                    String[] rawWords = content.split("\\s+");
                    for(String rawWord : rawWords) {
                        String[] words = rawWord.split("/");
                        if(words.length == 2) {
                            String word = words[0];
                            String lexicalCategory = words[1];
                            Term term = terms.get(word);
                            if(term == null) {
                                term = new Term(word);
                                // TODO set lexical category
                                term.setLexicalCategory(lexicalCategory);
                                terms.put(word, term);
                            }
                            term.incrFreq();
                            LOG.debug("Got word: word=" + rawWord);
                        }
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if(br != null) {
                    br.close();
                }
            } catch (IOException e) {
                LOG.warn(e);
            }
        }
        return terms;
    }
}

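It reads a file, segments it, removes stop words, and returns a Map of <word text, word attributes> entries, where the attributes include the lexical category (part of speech), the term frequency, TF, and so on.
By walking the dataset directories and files in this way, all documents can be segmented and the bag-of-words model built. We use Java collections to store the relationships among documents, words, and classes: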

private int totalDocCount;
private final List<String> labels = new ArrayList<String>();
// Map<label, document count>
private final Map<String, Integer> labelledTotalDocCountMap = new HashMap<String, Integer>();
// Map<label, Map<document, Map<word, term info>>>
private final Map<String, Map<String, Map<String, Term>>> termTable =
        new HashMap<String, Map<String, Map<String, Term>>>();
// Map<word, Map<label, Set<document>>>
private final Map<String, Map<String, Set<String>>> invertedTable =
        new HashMap<String, Map<String, Set<String>>>();
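
To make the shape of the inverted table concrete, here is a minimal sketch of how such a Map<word, Map<label, Set<document>>> could be filled and queried. The class InvertedTableBuilder and its methods are hypothetical names for illustration; the original project fills its tables through its own metadata classes, which are not shown in the article.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an inverted table: word -> label -> documents containing the word.
final class InvertedTableBuilder {

    private final Map<String, Map<String, Set<String>>> invertedTable =
            new HashMap<String, Map<String, Set<String>>>();

    /** Registers every distinct word of one document under the document's label. */
    void addDocument(String label, String doc, Set<String> words) {
        for (String word : words) {
            invertedTable
                    .computeIfAbsent(word, w -> new HashMap<String, Set<String>>())
                    .computeIfAbsent(label, l -> new HashSet<String>())
                    .add(doc);
        }
    }

    /** Number of documents with the given label that contain the word. */
    int docCount(String word, String label) {
        Map<String, Set<String>> byLabel = invertedTable.get(word);
        if (byLabel == null) {
            return 0;
        }
        Set<String> docs = byLabel.get(label);
        return docs == null ? 0 : docs.size();
    }
}
```

A lookup such as docCount(word, label) is the kind of count that feature selection needs for every word and class.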

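Selecting the Feature Vector from the Training Set

The bag-of-words model is now built, including the relationships among documents and words. Next we select the feature term vector used to build the classification model, which first requires a metric for picking feature terms effectively. Following the paper "A comparative study on feature selection in text categorization", we use the chi-square statistic (CHI), computed for a word t and a class c with the formula:
$$\chi^2(t, c) = \frac{N \times (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$
The parameters in the formula have the following meanings: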

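N: the total number of documents in the training set
A: the number of documents in the class that contain the word
B: the number of documents outside the class that contain the word
C: the number of documents in the class that do not contain the word
D: the number of documents outside the class that do not contain the word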

For more detail, see that paper.
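
As a worked example of the statistic, the sketch below computes CHI from the four counts, taking N = A + B + C + D. The class name ChiSquare and the choice to return 0 when the denominator is zero are assumptions made for this illustration.

```java
// Hypothetical helper computing the chi-square statistic from the four document counts.
final class ChiSquare {

    /**
     * a: docs in the class containing the word, b: docs outside the class containing it,
     * c: docs in the class without the word,    d: docs outside the class without it.
     */
    static double chi(long a, long b, long c, long d) {
        long n = a + b + c + d;
        // use double arithmetic so large counts cannot overflow
        double denominator = (double) (a + c) * (b + d) * (a + b) * (c + d);
        if (denominator == 0.0) {
            return 0.0; // assumption: treat a degenerate table as "no association"
        }
        double diff = (double) a * d - (double) c * b;
        return n * diff * diff / denominator;
    }
}
```

For A=40, B=10, C=60, D=90 we get N=200, AD-CB=3000, and a denominator of 100*100*50*150=75,000,000, so CHI = 200*9,000,000/75,000,000 = 24.
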
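Using the chi-square statistic, compute a CHI value for every word under every class, then rank the words of each class by CHI value and keep the topN largest (a heap-based selection is clearly a good fit here). Finally, merge the topN groups from all classes to obtain the final feature vector.
There is room for optimization here. Each class contributes topN words; during the merge, one could measure in what proportion of the classes each word appears and drop the words above a given threshold, which would make the feature vector more discriminative across classes. Here we just do a simple merge.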
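
The top-N selection can be sketched with a bounded min-heap, which avoids sorting every word. This uses java.util.PriorityQueue rather than the project's own SortUtils.heapSort; the class name TopNSelector and the Map<String, Double> input are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: keep the n words with the largest CHI values using a min-heap of size n.
final class TopNSelector {

    static List<String> topN(Map<String, Double> chiByWord, int n) {
        List<String> words = new ArrayList<String>();
        if (n <= 0) {
            return words;
        }
        Comparator<Map.Entry<String, Double>> byChi =
                (x, y) -> Double.compare(x.getValue(), y.getValue());
        // the heap root is always the smallest CHI value kept so far
        PriorityQueue<Map.Entry<String, Double>> heap =
                new PriorityQueue<Map.Entry<String, Double>>(n, byChi);
        for (Map.Entry<String, Double> e : chiByWord.entrySet()) {
            if (heap.size() < n) {
                heap.add(e);
            } else if (e.getValue() > heap.peek().getValue()) {
                heap.poll();
                heap.add(e);
            }
        }
        // polling yields ascending CHI order; reverse to get the largest first
        while (!heap.isEmpty()) {
            words.add(heap.poll().getKey());
        }
        Collections.reverse(words);
        return words;
    }
}
```

Merging then amounts to putting the selected words of every class into one map or set, as the simple merge in this article does.
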
The storage structures used, again Java collections, are:

// Map<label, Map<word, term>>
private final Map<String, Map<String, Term>> chiLabelToWordsVectorsMap = new HashMap<String, Map<String, Term>>(0);
// Map<word, term>, finally merged vector
private final Map<String, Term> chiMergedTermVectorMap = new HashMap<String, Term>(0);

The feature selection computation is implemented as follows:

package org.shirdrn.document.processor.component.train;

import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.shirdrn.document.processor.common.AbstractComponent;
import org.shirdrn.document.processor.common.Context;
import org.shirdrn.document.processor.common.Term;
import org.shirdrn.document.processor.utils.SortUtils;

public class FeatureTermVectorSelector extends AbstractComponent {

    private static final Log LOG = LogFactory.getLog(FeatureTermVectorSelector.class);
    private final int keptTermCountEachLabel;

    public FeatureTermVectorSelector(Context context) {
        super(context);
        keptTermCountEachLabel = context.getConfiguration().getInt("processor.each.label.kept.term.count", 3000);
    }

    @Override
    public void fire() {
        // compute CHI values; feature terms are selected
        // after sorting by CHI value
        for(String label : context.getVectorMetadata().getLabels()) {
            // for each label, compute CHI vector
            LOG.info("Compute CHI for: label=" + label);
            processOneLabel(label);
        }

        // sort and select CHI vectors
        Iterator<Entry<String, Map<String, Term>>> chiIter =
                context.getVectorMetadata().chiLabelToWordsVectorsIterator();
        while(chiIter.hasNext()) {
            Entry<String, Map<String, Term>> entry = chiIter.next();
            String label = entry.getKey();
            LOG.info("Sort CHI terms for: label=" + label + ", termCount=" + entry.getValue().size());
            Entry<String, Term>[] a = sort(entry.getValue());
            for (int i = 0; i < Math.min(a.length, keptTermCountEachLabel); i++) {
                Entry<String, Term> termEntry = a[i];
                // merge CHI terms for all labels
                context.getVectorMetadata().addChiMergedTerm(termEntry.getKey(), termEntry.getValue());
            }
        }
    }

    @SuppressWarnings("unchecked")
    private Entry<String, Term>[] sort(Map<String, Term> terms) {
        Entry<String, Term>[] a = new Entry[terms.size()];
        a = terms.entrySet().toArray(a);
        SortUtils.heapSort(a, true, keptTermCountEachLabel);
        return a;
    }

    private void processOneLabel(String label) {
        Iterator<Entry<String, Map<String, Set<String>>>> iter =
                context.getVectorMetadata().invertedTableIterator();
        while(iter.hasNext()) {
            Entry<String, Map<String, Set<String>>> entry = iter.next();
            String word = entry.getKey();
            Map<String, Set<String>> labelledDocs = entry.getValue();

            // A: doc count containing the word in this label
            int docCountContainingWordInLabel = 0;
            if(labelledDocs.get(label) != null) {
                docCountContainingWordInLabel = labelledDocs.get(label).size();
            }

            // B: doc count containing the word not in this label
            int docCountContainingWordNotInLabel = 0;
            Iterator<Entry<String, Set<String>>> labelledIter =
                    labelledDocs.entrySet().iterator();
            while(labelledIter.hasNext()) {
                Entry<String, Set<String>> labelledEntry = labelledIter.next();
                String tmpLabel = labelledEntry.getKey();
                if(!label.equals(tmpLabel)) {
                    // count the documents of the other label, not the number of labels
                    docCountContainingWordNotInLabel += labelledEntry.getValue().size();
                }
            }

            // C: doc count not containing the word in this label
            int docCountNotContainingWordInLabel =
                    getDocCountNotContainingWordInLabel(word, label);

            // D: doc count not containing the word not in this label
            int docCountNotContainingWordNotInLabel =
                    getDocCountNotContainingWordNotInLabel(word, label);

            // compute CHI value
            int N = context.getVectorMetadata().getTotalDocCount();
            int A = docCountContainingWordInLabel;
            int B = docCountContainingWordNotInLabel;
            int C = docCountNotContainingWordInLabel;
            int D = docCountNotContainingWordNotInLabel;
            // use double arithmetic to avoid int overflow, and parenthesize the
            // whole denominator so that it is not divided by (A+C) alone
            double temp = (double) A * D - (double) B * C;
            double denominator = (double) (A + C) * (A + B) * (B + D) * (C + D);
            double chi = denominator == 0 ? 0 : N * temp * temp / denominator;
            Term term = new Term(word);
            term.setChi(chi);
            context.getVectorMetadata().addChiTerm(label, word, term);
        }
    }

    private int getDocCountNotContainingWordInLabel(String word, String label) {
        int count = 0;
        Iterator<Entry<String,Map<String,Map<String,Term>>>> iter =
                context.getVectorMetadata().termTableIterator();
        while(iter.hasNext()) {
            Entry<String,Map<String,Map<String,Term>>> entry = iter.next();
            String tmpLabel = entry.getKey();
            // in this label
            if(tmpLabel.equals(label)) {
                Map<String, Map<String, Term>> labelledDocs = entry.getValue();
                for(Entry<String, Map<String, Term>> docEntry : labelledDocs.entrySet()) {
                    // not containing this word
                    if(!docEntry.getValue().containsKey(word)) {
                        ++count;
                    }
                }
                break;
            }
        }
        return count;
    }

    private int getDocCountNotContainingWordNotInLabel(String word, String label) {
        int count = 0;
        Iterator<Entry<String,Map<String,Map<String,Term>>>> iter =
                context.getVectorMetadata().termTableIterator();
        while(iter.hasNext()) {
            Entry<String,Map<String,Map<String,Term>>> entry = iter.next();
            String tmpLabel = entry.getKey();
            // not in this label
            if(!tmpLabel.equals(label)) {
                Map<String, Map<String, Term>> labelledDocs = entry.getValue();
                for(Entry<String, Map<String, Term>> docEntry : labelledDocs.entrySet()) {
                    // not containing this word
                    if(!docEntry.getValue().containsKey(word)) {
                        ++count;
                    }
                }
            }
        }
        return count;
    }
}

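Writing the Quantized Data Files

The feature vector has now been computed from the training set. Next, each word needs a unique id, starting from 1. This is straightforward. The feature vector is written to a file, to be used later by the test dataset, in the following format: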

认识     1
代理权     2
病理     3
死者     4
影子     5
生产国     6
容量     7
螺丝扣     8
大钱     9
壮志     10
生态圈     11
好事     12
全人类     13

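libsvm's training data format is purely numeric, so the training documents must be quantized. We use the TF-IDF measure, which indicates how relevant a word is to a document.
We then traverse the bag-of-words model and, using the numbered classes and feature vector, compute the TF-IDF values for each document. Each document becomes one record. A sample record looks like this: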

8 9219:0.24673737883635047 453:0.09884635754820137 10322:0.21501394457319623 11947:0.27282495932970074 6459:0.41385272697452935 46:0.24041607991272138 8987:0.14897255497578704 4719:0.22296154731520754 10094:0.13116443653818177 5162:0.17050804524212404 2419:0.11831944042647048 11484:0.3501901869096251 12040:0.13267440708284894 8745:0.5320327758892881 9048:0.11445287153209653 1989:0.04677087098649205 7102:0.11308242956243426 3862:0.12007217405755069 10417:0.09796211412332205 5729:0.148037967054332 11796:0.08409157900442304 9094:0.17368658217203461 3452:0.1513474608736807 3955:0.0656773581702849 6228:0.4356889927309336 5299:0.15060439516792662 3505:0.14379243687841153 10732:0.9593462052245719 9659:0.1960034406311122 8545:0.22597403804274924 6767:0.13871522631066047 8566:0.20352452713417019 3546:0.1136541497082903 6309:0.10475466997804883 10256:0.26416957780238604 10288:0.22549409383630933

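The leading 8 is the class id. Every other column is a word id and its weight, separated by a colon; for example, "9219:0.24673737883635047" means that the word with id 9219 has a TF-IDF value of 0.24673737883635047. If the feature vector contains N words, each record corresponds to an N-dimensional (sparse) vector. Note that the libsvm README asks for the indices within a line to be in ascending order, so it is safer to sort them before writing.
Documents in the test set are handled the same way, except that the previously written feature vector file must be loaded first, so that a test file in libsvm format can be produced.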
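
The quantization step can be sketched as below. The article does not spell out its exact TF-IDF variant, so the formula used here (term count divided by document length, times the natural log of total documents over documents containing the term) is an assumption, as are the class and method names. The sketch writes indices in ascending order.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: compute a TF-IDF weight and format one libsvm record.
final class LibsvmLineWriter {

    /** Assumed variant: tf = termFreq / docLength, idf = ln(totalDocs / docsWithTerm). */
    static double tfIdf(int termFreq, int docLength, int totalDocs, int docsWithTerm) {
        double tf = (double) termFreq / docLength;
        double idf = Math.log((double) totalDocs / docsWithTerm);
        return tf * idf;
    }

    /** Formats "label index:value index:value ..." with indices sorted ascending. */
    static String formatLine(int labelId, Map<Integer, Double> weights) {
        StringBuilder sb = new StringBuilder().append(labelId);
        // TreeMap iterates its keys in ascending order
        for (Map.Entry<Integer, Double> e : new TreeMap<Integer, Double>(weights).entrySet()) {
            if (e.getValue() != 0.0) { // zero weights are omitted in the sparse format
                sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
            }
        }
        return sb.toString();
    }
}
```

For a test document, only words present in the loaded feature vector file would be given an index and written; all other words are dropped.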

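Training the Text Classifier with libsvm

With the preparation done, we can train the text classifier with the libsvm toolkit. Before training, the data should be scaled (sometimes called normalization), which helps libsvm train a better model. Every dimension of the vectors written above holds a TF-IDF value, but those values may fall in an irregular range (since they depend on both TF and IDF), for example 0.19872 to 8.3233. libsvm can transform all values into a common range such as 0 to 1.0 or -1.0 to 1.0, whichever suits your needs. The command is: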

F:\libsvm-3.0\windows>svm-scale.exe -l 0 -u 1 C:\\Users\\thinkpad\\Desktop\\vector\\train.txt > C:\\Users\\thinkpad\\Desktop\\vector\\train-scale.txt
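
What this scaling does to each feature can be sketched as a linear min-max mapping. The code below is an illustration, not svm-scale's source: the class name is hypothetical, and returning the lower bound for a constant feature is a choice made for this sketch.

```java
// Hypothetical sketch of per-feature min-max scaling into [lower, upper].
final class MinMaxScaler {

    /** Maps value linearly so that min -> lower and max -> upper. */
    static double scale(double value, double min, double max, double lower, double upper) {
        if (max == min) {
            return lower; // sketch choice for a feature that never varies
        }
        return lower + (upper - lower) * (value - min) / (max - min);
    }

    /** Scales one feature column using the minimum and maximum found in that column. */
    static double[] scaleColumn(double[] column, double lower, double upper) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : column) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[] scaled = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            scaled[i] = scale(column[i], min, max, lower, upper);
        }
        return scaled;
    }
}
```

When data from another file is scaled with the minimum and maximum taken from the training set, a value outside the training range maps outside [lower, upper]; for example, scale(8, 2, 6, 0, 1) gives 1.5.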

The scaled output is written to train-scale.txt, which can be used directly as libsvm's training data file. I use the Java version of libsvm, with the following arguments:

train -h 0 -t 0 C:\\Users\\thinkpad\\Desktop\\vector\\train-scale.txt C:\\Users\\thinkpad\\Desktop\\vector\\model.txt

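Here -t 0 selects the linear kernel. I found that for text classification the linear kernel works somewhat better than libsvm's default nonlinear kernel, -t 2 (RBF). The output is the model file model.txt, for example: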

svm_type c_svc
kernel_type linear
nr_class 10
total_sv 54855
rho -0.26562545584492675 -0.19596934447720876 0.24937032535471693 0.3391566771481882 -0.19541394291523667 -0.20017990510840347 -0.27349052681332664 -0.08694672836814998 -0.33057155365157015 0.06861675551386985 0.5815821822995312 0.7781870137763507 0.054722797451472065 0.07912846180263113 -0.01843419889020123 0.15110176721612528 -0.08484865489154271 0.46608205351462983 0.6643550487438468 -0.003914533674948038 -0.014576392246426623 -0.11384567944039309 0.09257404411884447 -0.16845445862600575 0.18053514069700813 -0.5510915276095857 -0.4885382860289285 -0.6057167948571457 -0.34910272249526764 -0.7066730463805829 -0.6980796972363181 -0.639435517196082 -0.8148772080348755 -0.5201121512955246 -0.9186975203736724 -0.008882360255733836 -0.0739010940085453 0.10314117392946448 -0.15342997221636115 -0.10028736061509444 0.09500443080371801 -0.16197536915675026 0.19553010464320583 -0.030005330377757263 -0.24521471309904422
label 8 4 7 5 10 9 3 2 6 1
nr_sv 6542 5926 5583 4058 5347 6509 5932 4050 6058 4850
SV
0.16456599916886336 0.22928285261208994 0.921277302054534 0.39377902901901013 0.4041207410447258 0.2561997963212561 0.0 0.0819993502684317 0.12652009525418703 9219:0.459459 453:0.031941 10322:0.27027 11947:0.0600601 6459:0.168521 46:0.0608108 8987:0.183784 4719:0.103604 10094:0.0945946 5162:0.0743243 2419:0.059744 11484:0.441441 12040:0.135135 8745:0.108108 9048:0.0440154 1989:0.036036 7102:0.0793919 3862:0.0577064 10417:0.0569106 5729:0.0972222 11796:0.0178571 9094:0.0310078 3452:0.0656566 3955:0.0248843 6228:0.333333 5299:0.031893 3505:0.0797101 10732:0.0921659 9659:0.0987654 8545:0.333333 6767:0.0555556 8566:0.375 3546:0.0853659 6309:0.0277778 10256:0.0448718 10288:0.388889
... ...

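Above we only chose a non-default kernel. There are other parameters to tune, such as the cost parameter c (default 1), which sets the penalty for misclassified training points: the larger the value, the fewer training errors the separating hyperplane tolerates. Cross validation can be used to refine such choices step by step and pick the most suitable parameters.
When the cross validation option is given, libsvm prints only the accuracy obtained through cross validation and does not write a model file. Example arguments for running in cross validation mode: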

-h 0 -t 0 -c 32 -v 5 C:\\Users\\thinkpad\\Desktop\\vector\\train-scale.txt C:\\Users\\thinkpad\\Desktop\\vector\\model.txt

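-v enables cross validation mode. With -v 5 the training data is split into 5 folds; each fold in turn is held out for validation while a model is trained on the other 4, and the predictions over all 5 folds give the cross-validation accuracy. For example, this is the cross-validation result for our 10 classes: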

Cross Validation Accuracy = 71.10428571428571%

Once the parameters have been chosen, the best ones can be used to train and write the model file.
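
The fold mechanics behind k-fold cross validation can be sketched as follows. This is a plain shuffled split written for illustration; libsvm's own fold assignment may differ (for instance in how it balances classes), and the class name KFold and the seed parameter are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: assign sample indices to k folds after shuffling.
final class KFold {

    /** Returns k lists of sample indices; fold i is the validation set of round i. */
    static List<List<Integer>> folds(int sampleCount, int k, long seed) {
        List<Integer> indices = new ArrayList<Integer>();
        for (int i = 0; i < sampleCount; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<List<Integer>>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<Integer>());
        }
        // deal the shuffled indices out round-robin so fold sizes differ by at most one
        for (int i = 0; i < sampleCount; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }
}
```

In round i, the model is trained on every fold except fold i and evaluated on fold i; the accuracy is the share of all samples predicted correctly across the k rounds.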

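Validating the Classifier's Accuracy with libsvm

We now have a trained classification model, namely the model file written above. The test dataset is now used to validate the prediction accuracy of that model. The test data needs to be scaled as well, for example: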

F:\libsvm-3.0\windows>svm-scale.exe -l 0 -u 1 C:\\Users\\thinkpad\\Desktop\\vector\\test.txt > C:\\Users\\thinkpad\\Desktop\\vector\\test-scale.txt

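Note that the test set must be scaled with the same scaling parameters as the training set. Besides passing the same -l and -u values, svm-scale can save the per-feature ranges computed on the training set with -s and reapply them to the test set with -r, so that both files go through the same transformation.
I again use the Java version of libsvm for prediction to validate the classifier's accuracy. The arguments to the svm_predict class are: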

C:\\Users\\thinkpad\\Desktop\\vector\\test-scale.txt C:\\Users\\thinkpad\\Desktop\\vector\\model.txt C:\\Users\\thinkpad\\Desktop\\vector\\predict.txt

The predictions are written to predict.txt, and the classification accuracy is printed, for example:

Accuracy = 66.81% (6681/10000) (classification)

If the accuracy is not good enough, use cross validation to find better parameters, then retrain and write a new model file. Here are a few sets of results:

train -h 0 -t 0 C:\\Users\\thinkpad\\Desktop\\vector\\train-scale.txt C:\\Users\\thinkpad\\Desktop\\vector\\model.txt
Accuracy = 67.10000000000001% (6710/10000) (classification)

train -h 0 -t 0 -c 32 -v 5 C:\\Users\\thinkpad\\Desktop\\vector\\train-scale.txt C:\\Users\\thinkpad\\Desktop\\vector\\model.txt
Cross Validation Accuracy = 71.10428571428571%
Accuracy = 66.81% (6681/10000) (classification)

train -h 0 -t 0 -c 8 -m 1024 C:\\Users\\thinkpad\\Desktop\\vector\\train-scale.txt C:\\Users\\thinkpad\\Desktop\\vector\\model.txt
Cross Validation Accuracy = 74.3240057320121%
Accuracy = 67.88% (6788/10000) (classification)

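The first run uses the default c=1 and reaches an accuracy of 67.10000000000001%.
The second run uses cross validation mode with c=32, giving a cross-validation accuracy of 71.10428571428571%. Training a model with that parameter and predicting with it yields a final accuracy of 66.81%, which is lower than with the default c value.
The third run, with c=8, gives a somewhat higher cross-validation accuracy than the second; after writing the model and predicting, the final accuracy is 67.88%, a slight improvement.
This process can be repeated to search for better parameters and improve the classifier's accuracy.

Summary

Text classification has its own peculiarities. Whether you use libsvm or another tool, keep the following points in mind:

Many of the steps in text preprocessing affect the final result of whatever classification tool is used.
The most important one is how the feature vector is selected. If the feature vector is chosen poorly, even the best classification tool may not produce a good classifier.
With a poorly chosen feature vector, tuning may find better parameters, yet the result may still be a poor classifier that misclassifies in real predictions.
The choice of training and test sets affects both the preprocessing process and training with the classification tool.

Reference Links

Reposted from: 简单之美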

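Using OpenCV Java with Eclipse – DreamTea

December 5, 2013 | Machine Learning | Eclipse, Java, OpenCV | smallroof

I have been using OpenCV in a computer vision course recently and came across the blog post "Using OpenCV Java with Eclipse". It is in English, so I translated it to share with everyone.

Since version 2.4.4, OpenCV supports Java. In this tutorial I will show you how to set up Eclipse to use OpenCV Java (on Windows), so that you can take full advantage of Java's garbage collection and other conveniences, greatly reducing the amount of code you write and the bugs in it. Let's get started.

Configuring Eclipse:

First, get the latest OpenCV release from the official download site, then extract it to a simple location (translator's note: preferably a path without Chinese characters), such as C:\OpenCV-2.4.6\. I am using version 2.4.6, but the steps are essentially the same for other OpenCV versions.

Now we will define OpenCV as a user library in Eclipse, so that the configuration does not have to be repeated for every project. Launch Eclipse and select Window->Preferences from the menu: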

Eclipse preferences

Navigate to Java->Build Path->User Libraries and click New…

Creating a new library

Enter a name for the new library, for example OpenCV-2.4.6.

Naming the new library

Now select the user library you just created and click Add External JARs…

Adding external jar

Browse to C:\OpenCV-2.4.6\build\java\ and select opencv-246.jar (translator's note: the author assumes OpenCV was extracted to the root of drive C). After adding the jar, expand opencv-246.jar, select Native library location, and click Edit…

Selecting native library location 1

Select External Folder… and browse to the folder C:\OpenCV-2.4.6\build\java\x64 (on a 32-bit system, select x86 instead).

Selecting native library location 2

Your user library configuration should look like this:

Selecting native library location 2

Now test the configuration in a new Java project:

Creating new Java project

On the Java Settings step, under the Libraries tab, select Add Library…, then select OpenCV-2.4.6 and click Finish.

Adding user defined library 1

Adding user defined library 2

The libraries should look like this:

Adding user defined library

Now that you have created and configured a new Java project, let's test it. Create a new Java file. Here is some starter code to try:

import org.opencv.core.Core;
import org.opencv.core.CvType;
import org.opencv.core.Mat;

public class Hello
{
   public static void main( String[] args )
   {
      // load the native OpenCV library before calling any OpenCV API
      System.loadLibrary( Core.NATIVE_LIBRARY_NAME );
      Mat mat = Mat.eye( 3, 3, CvType.CV_8UC1 );
      System.out.println( "mat = " + mat.dump() );
   }
}

When you run the code, you should see a 3×3 identity matrix printed.

Adding user defined library

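That's it. Whenever you start a new project, just add the OpenCV user library you defined to it and it will run. Enjoy your development journey!

Translator's note: my Eclipse is the English version and I am not familiar with the Chinese one, so I kept the button names from the original post. The screenshots should make them easy to find.

Reposted from: 博客园-原创精华区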

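Logstash and ElasticSearch: Powerful Tools for Log Search

December 4, 2013 | Search Engines | Elastic Search, Java | smallroof

1. Logs are scattered across different machines. In a distributed system in particular, finding a given log can be very hard.

2. Log search is tedious. Engineers usually resort to Linux commands such as grep, but needs like querying across time ranges or counting require combining several system-level commands, which adds to the difficulty.

3. A search may return thousands of lines, which the engineer must drill into further to reach the final result.

1. Logstash is a fully open-source tool that collects and parses your logs and stores them for later use (for example, searching).

2. Elasticsearch is an open-source distributed search engine. Its features include: distributed operation, zero configuration, auto-discovery, automatic index sharding, index replication, a RESTful interface, multiple data sources, and automatic search load balancing.

http://es-cn.medcl.net/

3. Kibana is also a free, open-source tool. It provides a friendly web interface for aggregating, analyzing, and searching important log data.

Together, these three components make a very good query engine.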

[Figure: Elasticsearch]

Servers: 192.168.1.1 (Nginx, where the logs are produced) and 192.168.1.2 (central server)

1. On 192.168.1.1, configure the Nginx log format and restart

log_format logstash_json '{ "@timestamp": "$time_iso8601", '
                         '"remote_addr": "$remote_addr", '
                         '"remote_user": "$remote_user", '
                         '"body_bytes_sent": "$body_bytes_sent", '
                         '"request_time": "$request_time", '
                         '"status": "$status", '
                         '"request": "$request", '
                         '"request_method": "$request_method", '
                         '"http_referrer": "$http_referer", '
                         '"http_user_agent": "$http_user_agent" }';

access_log logs/access.log logstash_json;

2. On 192.168.1.1, deploy Logstash (agent)

Download Logstash (https://download.elasticsearch.org/logstash/logstash/logstash-1.2.2-flatjar.jar) and upload it to the server.

input {
  file {
    path => "/usr/local/nginx/logs/access.log"
    type => nginx
    # This format tells logstash to expect 'logstash' json events from the file.
    format => json_event
  }
}

output {
  redis {
    host => "192.168.1.2"
    port => 6371
    data_type => "list"
    key => "logstash"
  }
}

nohup java -jar logstash-1.2.2-flatjar.jar agent -f logstash.conf > nohup &

3. On 192.168.1.2, deploy Redis (not covered here; see the wiki)

4. On 192.168.1.2, deploy Logstash (indexer)

input {
  redis {
    host => "127.0.0.1"
    # these settings should match the output of the agent
    data_type => "list"
    key => "logstash"
    port => 6371
    # We use the 'json' codec here because we expect to read
    # json events from redis.
    codec => json
  }
}

output {
  stdout {
    debug => true
    debug_format => "json"
  }
  elasticsearch {
    host => "127.0.0.1"
  }
}

nohup java -jar logstash-1.2.2-flatjar.jar agent -f logstash.conf > nohup &

5. On 192.168.1.2, deploy ElasticSearch

Download: http://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.20.5.tar.gz

6. On 192.168.1.2, deploy Kibana

nohup java -jar logstash-1.2.2-flatjar.jar web > web.log &

[Figure: Logstash Search]

The author of this article is from TalkingData (北京腾云天下 technical team).

Reposted from: 把爱投资给希望-51CTO技术博客

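WebMagic 0.4.1 Released: a Java Crawler Framework

November 28, 2013 | Search Engines | Java, Web Crawler | smallroof

This release strengthens Ajax crawling and makes several functional improvements. It also introduces an important scripting feature, "webmagic-script", in preparation for the future WebMagic-Avalon plan.

Enhancements:

Fixed an issue where the Spider occasionally failed to exit after it had finished crawling. A detailed analysis of the problem is available for those interested.
Changed the algorithm in SmartContentSelector, which extracts the main text, to the content extraction algorithm from Harbin Institute of Technology, https://code.google.com/p/cx-extractor/ ; testing shows good results. Usage: Html.getSmartContent().
Added more HTTP information to Page, including the HTTP status code, "Page.getStatusCode()", and the unparsed body, "Page.getRawText()".
Added monitoring information to Spider, including the number of pages crawled, "Spider.getPageCount()", the running status, "Spider.getStatus()", and the number of live threads, "Spider.getThreadAlive()".

For Ajax, annotation mode now supports JsonPath expressions for extraction. Sample code: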

public class AppStore {

    @ExtractBy(type = ExtractBy.Type.JsonPath, value = "$..trackName")
    private String trackName;

    @ExtractBy(type = ExtractBy.Type.JsonPath, value = "$..description")
    private String description;

    @ExtractBy(type = ExtractBy.Type.JsonPath, value = "$..userRatingCount")
    private int userRatingCount;

    @ExtractBy(type = ExtractBy.Type.JsonPath, value = "$..screenshotUrls", multi = true)
    private List<String> screenshotUrls;

    public static void main(String[] args) {
        AppStore appStore = OOSpider.create(Site.me(), AppStore.class).<AppStore>get("http://itunes.apple.com/lookup?id=653350791&country=cn&entity=software");
        System.out.println(appStore.trackName);
        System.out.println(appStore.description);
        System.out.println(appStore.userRatingCount);
        System.out.println(appStore.screenshotUrls);
    }
}

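For the meaning and usage of JsonPath expressions, see: http://www.oschina.net/p/jsonpath

WebMagic's long-term goal is to be a complete product that lets even people who cannot code do basic crawler development with simple scripts, and to encourage script sharing. This is the WebMagic-Avalon plan. The discussion is at https://github.com/code4craft/webmagic/issues/43 ; suggestions are welcome.

The first phase is scripting support. Documentation:

https://github.com/code4craft/webmagic/tree/master/webmagic-scripts

WebMagic mailing list: https://groups.google.com/forum/#!forum/webmagic-java

Reposted from: 开源中国新闻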

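Practical Java NIO (Part 4)

November 26, 2013 | Search Engines | Java, Lucene | smallroof

I just looked at my iteye blog and several of my posts seem to be sitting right next to each other, so I am not sure whether publishing this one will flood the page. Sorry about that. I have no time during work hours, so I stay at the office after work to share some useful skills. Call it laziness: I really do not want to go back to the dorm at night and study technology all over again, or I would turn into a true code monkey.

Enough chit-chat. This post is about file locks in NIO. File locks are very useful in certain scenarios, so before getting into the details, let me use the topic to show the big role file locks play in Lucene. If you want to learn Lucene 4.x, you can follow along on my blog.

We all know that a Lucene index can be stored on the operating system's file system, and that Lucene never allows more than one writer (IndexWriter) to write to the same index at the same time. Unlike a relational database, Lucene cannot rely on table locks, row locks, and similar strategies to synchronize writes. So how does it do it? The answer is file-system locking. If you have used Lucene, Solr, or ES, you may have noticed a lock file ending in .lock showing up in the index directory. When that file appears, some write operation has certainly taken place.
Some readers may then ask: why is my lock file still there? My service has stopped, so why does it remain, and can I delete it? In most cases the lock file is removed automatically once the write finishes. If it is not, one of the following probably happened: first, the program threw an exception and was interrupted; second, the service was forcibly shut down while a write was still in progress; third, a power failure or system crash; fourth, the program forgot to close its stream resources, leaving references open. Any of these can leave behind a .lock file that Lucene did not remove, and it can then be deleted by hand. Whether you delete it or not generally does not affect normal use of the program. If you find it unsightly, go ahead and delete it, provided no write operation is in progress.

That was just one lock example from Lucene. File locks are useful in many scenarios, because to some extent they offer a simple way to prevent concurrent access.

First, the directory containing the lock file used for the test:

The test code is as follows: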

package com.filelock;

import java.io.File;
import java.io.FileOutputStream;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/**
 * @author 秦东亮
 * Tests the file lock feature
 */
public class MyLock {

    public static void main2(String[] args) throws Exception {
        File f = new File("D://6//mylock.lock");
        f.delete();
    }

    public static void main(String[] args) throws Exception {
        // List<String> s = Files.readAllLines(Paths.get("D://6//my.txt"), StandardCharsets.UTF_8);
        // System.out.println(s.size());
        // obtain the channel from a FileOutputStream
        FileOutputStream f = new FileOutputStream(new File("D://6//mylock.lock"));
        FileChannel channel = f.getChannel();

        // non-blocking lock attempt
        FileLock lock = channel.tryLock();
        // blocking lock
        // FileLock lock = channel.lock();
        if (lock == null) {
            System.out.println("该程序已经被占用.....");
            System.out.println("阻塞中.....");
        } else {
            System.out.println("开始访问.......");
            Thread.sleep(5000); // access the file after 5 seconds
            if (lock != null) {
                List<String> s = Files.readAllLines(Paths.get("D://6//my.txt"), StandardCharsets.UTF_8);
                // read the file and print its content
                for (String ss : s) {
                    System.out.println(ss);
                }

                lock.release(); // release the lock
                lock.close();   // close the resource
                f.close();      // close the stream
                Files.delete(Paths.get("D://6//mylock.lock"));
                System.out.println("访问完毕删除锁文件");
            }
        }
    }
}

The state of the directory while the program is running (screenshot):

Console output:

开始访问.......
有经验有关系有技术有资本oh了，你可以去创业了！
访问完毕删除锁文件

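Finally, you can try this yourself: make the sleep longer, then start a second instance of the program while the first one is still holding the lock. The second instance cannot get the lock and reports that the file is already in use. In this way file locks can be used to prevent concurrent writes; which scenarios call for them depends on your business. For high-concurrency scenarios, though, I would still suggest a relational database or a NoSQL store, together with caching, load balancing, and so on. In short, analyze each scenario on its own terms. Bosses look at results and do not care about your process.

That's all for today. Thanks to everyone who read to the end.

Reposted from: 三劫散仙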
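
The same locking pattern can be sketched with try-with-resources so that the channel is closed even when the task fails. The class and method names below are hypothetical. The sketch relies on two behaviors of FileChannel.tryLock: it returns null when another process holds the lock, and it throws OverlappingFileLockException when the lock is already held inside the same JVM.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: run a task only if an exclusive file lock can be acquired.
final class LockDemo {

    /** Returns true if the task ran while holding the lock, false if the lock was taken. */
    static boolean runExclusively(Path lockFile, Runnable task) throws IOException {
        try (FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock;
            try {
                lock = channel.tryLock(); // non-blocking attempt
            } catch (OverlappingFileLockException e) {
                return false; // already locked inside this JVM
            }
            if (lock == null) {
                return false; // held by another process
            }
            try {
                task.run();
            } finally {
                lock.release();
            }
            return true;
        }
    }
}
```

A caller that gets false back can report that the resource is busy, retry later, or fall back to the blocking lock() call.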

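Apache Solr Beginner Tutorial (Introduction, Installation and Deployment, Java Interface, Chinese Word Segmentation) – 欧阳妙晴

November 22, 2013 | Search Engines | Java, Solr, Word Segmentation | smallroof

Introduction to Apache Solr

What is Solr?

Solr is an open-source enterprise search server implemented in Java, which makes it easy to extend and modify. It communicates over standard HTTP and XML, so knowing Java is helpful when using Solr but not required.

Solr's main features include powerful full-text search, hit highlighting, dynamic clustering, database integration, and handling of rich documents (Word, PDF, etc.). Solr is also highly scalable, supporting distributed search and index replication.

What is Lucene?

Lucene is a Java-based full-text information retrieval toolkit. It is not a complete search application; rather, it provides indexing and search capabilities for your application. Lucene is an open-source project of the Apache Software Foundation (it started out in the Apache Jakarta family) and is currently the most popular open-source full-text retrieval toolkit for Java. Reference: http://www.codesocang.com/jiaocheng/javajiaocheng/2013/0427/2358.html

Many applications already base their search on Lucene, for example the search in the Eclipse help system. Lucene indexes textual data, so as long as you can convert your data to text, Lucene can index and search your documents.

Solr vs. Lucene

Solr and Lucene are not competitors. On the contrary, Solr depends on Lucene: Solr's core technology is implemented with Apache Lucene, and put simply, Solr is Lucene turned into a server. Note, though, that Solr is not a thin wrapper around Lucene; most of the features it provides go beyond what Lucene itself offers.

Installing and Setting Up Solr

Installing the Java Virtual Machine

Solr must run on a Java virtual machine of version 1.5 or later. Running the standard Solr service only requires a JRE, but extending it or compiling the source requires a JDK. The JDK or JRE can be downloaded from the following address:

Installing the Middleware

Solr can run in any Java servlet container. Below, the open-source Apache Tomcat is used to explain how to install, configure, and use Solr. This article uses the unpacked (zip) distribution of Tomcat 5.5; the latest version can be downloaded from the following address:

Installing Apache Solr

Downloading the Latest Solr

Solr 1.4 was the latest version when this article was published, and everything below refers to it. If anything differs from the latest Solr version, the official website takes precedence. Download address on the Solr website: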

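Directory Structure of the Solr Package

build: the directory where compiled files are placed during the Solr build.
client: client APIs for calling Solr from specific languages. Currently only Ruby is available here; the Java client, SolrJ, is in src/solrj.
dist: the JAR and WAR files produced by the Solr build, plus the JARs Solr depends on.
example: a ready-to-run Jetty installation that includes sample data and Solr configuration.
example/etc: Jetty configuration files.
example/multicore: holds multiple Solr home directories when Solr multicore is set up.
example/solr: the Solr home directory of the default installation.
example/webapps: where the Solr WAR file is deployed.
src/java: Solr's Java source code.
src/scripts: Unix bash shell scripts that are useful in large production deployments.
src/solrj: Solr's Java client.
src/test: Solr's test sources and test files.
src/webapp: the Solr web admin interface. The admin JSP files are all under web/admin/ and can be modified as needed.

Solr's source code is not all in one directory: src/java holds most of the files, src/common holds code shared by the server and the client, src/test holds Solr's tests, and the servlet code is in src/webapp/src.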

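Solr Home Directory Structure

The home directory of a running Solr service contains Solr's configuration files and data (the Lucene index files).

Expanded, the Solr home directory has the following structure:

bin: the recommended place for cluster replication scripts.
conf/schema.xml: the index schema, containing field type definitions and their associated analyzers.
conf/solrconfig.xml: Solr's main configuration file.
conf/xslt: XSLT files that transform Solr's XML query results into specific formats such as Atom/RSS.
data: the index data produced by Lucene.
lib: optional JAR files, such as plugins that extend Solr; these JARs are loaded when Solr starts.

How to Set the Home Directory

Through a Java system property named solr.solr.home.
Through JNDI, by binding the home directory path to java:comp/env/solr/home.
By editing web.xml, located in src/webapp/web/WEB-INF.

If the Solr home directory is not specified, it defaults to solr/.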

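Deploying and Running Solr

Copy apache-solr-1.4.0/dist/apache-solr-1.4.0.war from the installation package into <tomcat home>/webapps. The WAR is a complete web application, including Solr's JAR files, all the JARs Solr depends on, JSPs, and many configuration and resource files. Note that the WAR does not include a Solr home directory, so the Solr home must be specified before starting Tomcat.

Copy the apache-solr-1.4.0/example/solr folder from the installation package into <tomcat home>/, then add the following as the first line of <tomcat home>/bin/catalina.bat:

set JAVA_OPTS=%JAVA_OPTS% -Dsolr.solr.home=<tomcat home>/solr

Note: on operating systems other than Windows, edit catalina.sh instead.

Start Tomcat; apache-solr-1.4.0.war is deployed automatically as a web application.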

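Accessing the Solr service through the Java interface

SolrJ is a Java interface to the Solr server. With it you no longer have to worry about parsing and converting formats when the client and server communicate; you work with familiar objects instead, and as Solr evolves SolrJ picks up the newly added features as well.

JAR dependencies of SolrJ (Solr 1.4)

Creating a SolrServer

SolrJ provides two SolrServer implementations, CommonsHttpSolrServer and EmbeddedSolrServer. Both are thread-safe, and using them as singletons is recommended, because creating instances dynamically can leak connections.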
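One common way to get the single shared instance recommended above is the initialization-on-demand holder idiom. The sketch below uses a stand-in class, SearchClient, in place of a real SolrServer so that it compiles without SolrJ on the classpath; both class names and the URL are assumptions made for illustration.

```java
/** Stand-in for a thread-safe client such as CommonsHttpSolrServer (hypothetical type). */
final class SearchClient {
    private final String baseUrl;

    SearchClient(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    String baseUrl() {
        return baseUrl;
    }
}

/** Initialization-on-demand holder: the JVM creates INSTANCE once, lazily and thread-safely. */
final class SearchClientHolder {
    private SearchClientHolder() {
    }

    private static final class Lazy {
        // The URL is an assumed example value, not one taken from the article.
        static final SearchClient INSTANCE = new SearchClient("http://localhost:8080/solr");
    }

    static SearchClient get() {
        return Lazy.INSTANCE;
    }

    public static void main(String[] args) {
        // Both calls return the same object, so no extra connections are created.
        System.out.println(get() == get());
    }
}
```

The nested class is only loaded on the first call to get(), which gives lazy creation without explicit locking.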

Create CommonsHttpSolrServer

Create EmbeddedSolrServer

[Listing garbled in the source; the surviving tokens show a CoreContainer being obtained and passed to a new EmbeddedSolrServer.]

Adding documents

[Listing garbled in the source; the surviving tokens show a SolrServer backed by CommonsHttpSolrServer, two SolrInputDocument objects doc1 and doc2 whose fields carry the values "id1", "doc1", "id2" and "doc2", each followed by 1.0f, and a Collection named docs that is filled and then used in calls on server.]

Querying

[Listing garbled in the source; the surviving tokens show a SolrServer instance (created as described above), a SolrQuery named query, and a SolrDocumentList named docs that receives the results.]

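Chinese word segmentation

Segmentation products

The main Chinese word segmenters currently available for Lucene include:

paoding: the "庖丁解牛" (Paoding Analysis) Chinese segmenter for Lucene.

Segmentation throughput

The figures below are the ones published by each product:

paoding: accurately segments 1 million Chinese characters per second on a PIII personal machine with 1 GB of memory.
imdict: 483.64 (bytes/second), 259517 (Chinese characters/second).
mmseg4j: about 1200 KB/s in complex mode and about 1900 KB/s in simple mode.
ik: a processing speed of 500,000 characters per second.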
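For readers new to dictionary-based segmentation, here is a minimal sketch of forward maximum matching, the basic idea behind "simple" modes such as the one mmseg4j reports above. It is not the code of any product listed here; the class name ForwardMaxMatch and the four-word dictionary are made up for the example. The string literals use Unicode escapes so the file compiles under any source encoding: the input is 研究生命起源 and the dictionary holds 研究, 研究生, 生命 and 起源.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy forward-maximum-matching segmenter (illustrative only). */
class ForwardMaxMatch {
    private final Set<String> dictionary;
    private final int maxWordLength;

    ForwardMaxMatch(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    /** At each position, takes the longest dictionary word; falls back to one character. */
    List<String> segment(String text) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            int len = Math.min(maxWordLength, text.length() - i);
            // Shrink the window until it matches a dictionary word or is a single character.
            while (len > 1 && !dictionary.contains(text.substring(i, i + len))) {
                len--;
            }
            out.add(text.substring(i, i + len));
            i += len;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(Arrays.asList(
                "\u7814\u7a76",        // "research"
                "\u7814\u7a76\u751f",  // "graduate student"
                "\u751f\u547d",        // "life"
                "\u8d77\u6e90"));      // "origin"
        System.out.println(new ForwardMaxMatch(dict).segment("\u7814\u7a76\u751f\u547d\u8d77\u6e90"));
    }
}
```

On this input the greedy strategy picks 研究生 first and is then left with the single character 命, which shows why the more elaborate modes look at several candidate words before committing.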

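Custom dictionaries

Integrating ik with Solr

Of the products above, only ik provides a segmentation interface for Solr (1.3 and 1.4). Chinese word segmentation is enabled simply by editing the configuration file, as follows:

Configuration using IKAnalyzer

[schema.xml listing garbled in the source; only the attribute value "1.1" and the closing </schema> tag survive.]

Configuration using IKTokenizerFactory

[schema.xml listing garbled in the source; only the closing </analyzer> and </fieldType> tags survive.]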

Reposted from: 博客园 (cnblogs), Java section


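WebMagic 0.4.0 released: a Java crawler framework

November 7, 2013 | Search engines | Java, web crawlers | smallroof

This release mainly optimizes the download module, adds an API for synchronous downloads, and refactors parts of the code.

1. Downloader updates:

Upgraded HttpClient to 4.3.1 and rewrote HttpClientDownloader (#32).
Gzip is now requested proactively in HTTP requests to reduce transfer overhead (#31).
Fixed the connection pool having no effect in 0.3.2 and earlier (#30); connection reuse now relies on the new connection pool mechanism in HttpClient 4.3.1.

In testing, download speed improved by roughly 90%. Test code: Kr36NewsModel.java.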
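To give a feel for why requesting gzip lowers transfer overhead, the sketch below compresses a repetitive HTML-like payload with the JDK's java.util.zip classes and restores it. It does not use WebMagic or HttpClient; the class name GzipDemo and the sample page are invented for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Round-trips a payload through gzip to compare sizes (illustrative only). */
class GzipDemo {
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed here, so the gzip trailer has been written.
        return buf.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Builds a repetitive HTML-like page (made-up sample data). */
    static byte[] samplePage() {
        StringBuilder sb = new StringBuilder("<html><body>");
        for (int i = 0; i < 200; i++) {
            sb.append("<div class=\"item\">row ").append(i).append("</div>");
        }
        return sb.append("</body></html>").toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] page = samplePage();
        System.out.println("raw bytes: " + page.length + ", gzipped bytes: " + gzip(page).length);
    }
}
```

Markup-heavy pages compress well because tags repeat; already-compressed content such as images gains little.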

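2. A new API for synchronous crawling, which is more convenient for small crawling tasks:

OOSpider ooSpider = OOSpider.create(Site.me().setSleepTime(100), BaiduBaike.class);
BaiduBaike baike = ooSpider.<BaiduBaike>get("http://baike.baidu.com/search/word?word=httpclient&pic=1&sug=1&enc=utf8");
System.out.println(baike);

3. More options on Site (the configuration class):

HTTP proxy support: Site.setHttpProxy (#22).
Support for setting arbitrary HTTP headers: Site.addHeader (#27).
Gzip can be switched on and off: Site.setUseGzip(false).
Site.addStartUrl has moved to Spider.addUrl, because the author considers the start URL a property of the Spider rather than of the Site.

4. Spider (the main logic) was refactored:

The multithreading logic was rewritten; the code is easier to follow and several thread-safety issues were fixed.
The Google Guava API was introduced to make the code more concise.
Added the option Spider.setSpawnUrl(false); when it is false, only the given URLs are downloaded and newly discovered URLs are ignored.
Initial URLs can carry extra information: Spider.addRequest (#29).

A webmagic mailing list has also been set up, and you are welcome to join: https://groups.google.com/forum/#!forum/webmagic-java

Reposted from: 开源中国新闻 (OSChina News)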


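[NLP] Reading UTF-8 characters, recognizing Chinese characters and English words, and computing entropy and KL distance – McQueen1987

November 4, 2013 | nlp | Java, natural language processing | smallroof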

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NLPFileUnit {
    // Occurrence count of each single Chinese character or English word in the file
    public HashMap<String, Integer> WordOccurrenceNumber;
    // Probability of each single Chinese character or English word
    public HashMap<String, Float> WordProbability;
    // Punctuation screened out of the file
    public HashMap<String, Integer> Punctuations;
    // Entropy, computed here over single Chinese characters and single English words
    public float entropy;
    private String filePath;

    // Constructor
    public NLPFileUnit(String filePath) throws Exception {
        this.filePath = filePath;
        WordOccurrenceNumber = createHash(createReader(filePath));
        Punctuations = filterPunctuation(WordOccurrenceNumber);
        WordProbability = calProbability(WordOccurrenceNumber);
        this.entropy = calEntropy(this.WordProbability);

        System.out.println("all punctuations were saved at " + filePath.replace(".", "_punctuation.") + "!");
        this.saveFile(Punctuations, filePath.replace(".", "_punctuation."));
        System.out.println("all words(En & Ch) were saved at " + filePath.replace(".", "_AllWords.") + "!");
        this.saveFile(this.WordOccurrenceNumber, filePath.replace(".", "_AllWords."));
    }

    /** Collects the English words in the file into the given HashMap. */
    public void getEnWords(HashMap<String, Integer> hash, String path) throws Exception {
        BufferedReader br = new BufferedReader(new FileReader(path));

        // Read all lines into content; the appended space keeps words on adjacent lines apart
        StringBuilder content = new StringBuilder();
        String line = null;
        while ((line = br.readLine()) != null) {
            content.append(line).append(' ');
        }
        br.close();

        // Extract words with a regular expression
        Pattern enWordsPattern = Pattern.compile("([A-Za-z]+)");
        Matcher matcher = enWordsPattern.matcher(content);
        while (matcher.find()) {
            String word = matcher.group();
            if (hash.containsKey(word)) {
                hash.put(word, 1 + hash.get(word));
            } else {
                hash.put(word, 1);
            }
        }
    }

    private boolean isPunctuation(String tmp) {
        // Anything that is neither a Chinese character nor an English word counts as punctuation
        final String cnregex = "\\p{InCJK_Unified_Ideographs}";
        final String enregex = "[A-Za-z]+";
        return !(tmp.matches(cnregex) || tmp.matches(enregex));
    }

    /**
     * Checks whether the stream starts with the UTF-8 byte order mark 0xEF 0xBB 0xBF.
     * UTF-8 files saved without a BOM are therefore reported as not UTF-8.
     */
    private boolean isUTF8(FileInputStream fs) throws Exception {
        if (fs.read() == 0xEF && fs.read() == 0xBB && fs.read() == 0xBF)
            return true;
        return false;
    }

    /**
     * The first byte of a UTF-8 encoded character determines how many bytes it occupies
     * (Chinese characters usually take three bytes, ASCII takes one).
     */
    private int getlength(byte b) {
        int v = b & 0xff; // unsigned value of the byte
        if (v >= 0xF0) {        // 11110xxx
            return 4;
        } else if (v >= 0xE0) { // 1110xxxx
            return 3;
        } else if (v >= 0xC0) { // 110xxxxx
            return 2;
        }
        return 1;
    }

    /**
     * Reads one character: the first byte gives the byte count (for example 1110xxxx
     * means three bytes), then the remaining bytes are read.
     */
    private String readUnit(FileInputStream fs) throws Exception {
        int first = fs.read();
        if (first == -1) // end of stream
            return null;
        byte b = (byte) first;
        int len = getlength(b);
        byte[] units = new byte[len];
        units[0] = b;
        for (int i = 1; i < len; i++) {
            units[i] = (byte) fs.read();
        }
        return new String(units, "UTF-8");
    }

    /** Reads every word, punctuation mark and Chinese character into a HashMap. */
    private HashMap<String, Integer> createHash(FileInputStream inputStream) throws Exception {
        HashMap<String, Integer> hash = new HashMap<String, Integer>();
        String key = null;
        while ((key = readUnit(inputStream)) != null) {
            if (hash.containsKey(key)) {
                hash.put(key, 1 + (int) hash.get(key));
            } else {
                hash.put(key, 1);
            }
        }
        inputStream.close();
        getEnWords(hash, this.filePath);
        return hash;
    }

    /** Opens the file with a FileInputStream; returns null if it does not start with a UTF-8 BOM. */
    private FileInputStream createReader(String path) throws Exception {
        FileInputStream br = new FileInputStream(path);
        if (!isUTF8(br))
            return null;
        return br;
    }

    /** Moves the punctuation entries out of hash into a new map and returns that map. */
    private HashMap<String, Integer> filterPunctuation(HashMap<String, Integer> hash) {
        HashMap<String, Integer> puncs = new HashMap<String, Integer>();
        Iterator<?> iterator = hash.entrySet().iterator();

        while (iterator.hasNext()) {
            Entry<?, ?> entry = (Entry<?, ?>) iterator.next();
            String key = entry.getKey().toString();
            if (isPunctuation(key)) {
                puncs.put(key, hash.get(key));
                iterator.remove();
            }
        }
        return puncs;
    }

    /** Calculates the probability of each word in hash. */
    private HashMap<String, Float> calProbability(HashMap<String, Integer> hash) {
        float count = countWords(hash);
        HashMap<String, Float> prob = new HashMap<String, Float>();
        Iterator<?> iterator = hash.entrySet().iterator();
        while (iterator.hasNext()) {
            Entry<?, ?> entry = (Entry<?, ?>) iterator.next();
            String key = entry.getKey().toString();
            prob.put(key, hash.get(key) / count);
        }
        return prob;
    }

    /** Saves the content of hash to the file at path. */
    private void saveFile(HashMap<String, Integer> hash, String path) throws Exception {
        FileWriter fw = new FileWriter(path);
        fw.write(hash.toString());
        fw.close();
    }

    /** Calculates the total number of words in hash. */
    private int countWords(HashMap<String, Integer> hash) {
        int count = 0;
        for (Entry<String, Integer> entry : hash.entrySet()) {
            count += entry.getValue();
        }
        return count;
    }

    /** Calculates the entropy of the characters (natural logarithm). */
    private float calEntropy(HashMap<String, Float> hash) {
        float entropy = 0;
        Iterator<Entry<String, Float>> iterator = hash.entrySet().iterator();
        while (iterator.hasNext()) {
            Entry<String, Float> entry = iterator.next();
            Float prob = entry.getValue(); // probability of this character or word
            entropy += 0 - (prob * Math.log(prob));
        }
        return entropy;
    }
}

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map.Entry;

public class NLPWork {

    /** Calculates the KL distance from file u1 to file u2 over the keys both files share. */
    public static float calKL(NLPFileUnit u1, NLPFileUnit u2) {
        HashMap<String, Float> hash1 = u1.WordProbability;
        HashMap<String, Float> hash2 = u2.WordProbability;
        float KLdistance = 0;
        Iterator<Entry<String, Float>> iterator = hash1.entrySet().iterator();
        while (iterator.hasNext()) {
            Entry<String, Float> entry = iterator.next();
            String key = entry.getKey().toString();

            if (hash2.containsKey(key)) {
                Float value1 = entry.getValue();
                Float value2 = hash2.get(key);
                KLdistance += value1 * Math.log(value1 / value2);
            }
        }
        return KLdistance;
    }

    public static void main(String[] args) throws Exception {
        // All punctuation is saved next to the input file
        System.out.println("Now only UTF8 encoded file is supported!!!");
        System.out.println("PLS input file 1 path:");
        BufferedReader cin = new BufferedReader(new InputStreamReader(System.in));
        String file1 = cin.readLine();
        System.out.println("PLS input file 2 path:");
        String file2 = cin.readLine();
        NLPFileUnit u1 = null;
        NLPFileUnit u2 = null;
        try {
            u1 = new NLPFileUnit(file1); // NLP: Natural Language Processing
            u2 = new NLPFileUnit(file2);
        } catch (FileNotFoundException e) {
            System.out.println("File Not Found!!");
            e.printStackTrace();
            return;
        }
        float KLdistance = calKL(u1, u2);
        System.out.println("KLdistance is :" + KLdistance);
        System.out.println("File 1 Entropy: " + u1.entropy);
        System.out.println("File 2 Entropy: " + u2.entropy);
    }
}
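The two quantities computed above are the Shannon entropy H(p) = -Σ p(x) ln p(x) and the Kullback-Leibler divergence KL(p‖q) = Σ p(x) ln(p(x)/q(x)). The sketch below reproduces both on tiny hand-made distributions so the formulas can be checked in isolation. The class name DistributionStats is hypothetical, and like calKL it skips keys that are missing from q, so on such inputs the result is only a partial sum rather than a true KL divergence.

```java
import java.util.HashMap;
import java.util.Map;

/** Entropy and KL divergence over small probability maps (illustrative only). */
class DistributionStats {
    /** Shannon entropy in nats: H(p) = -sum of p(x) * ln p(x). */
    static double entropy(Map<String, Double> p) {
        double h = 0.0;
        for (double px : p.values()) {
            if (px > 0.0) {
                h -= px * Math.log(px);
            }
        }
        return h;
    }

    /** Sum of p(x) * ln(p(x) / q(x)) over the keys present in both maps. */
    static double kl(Map<String, Double> p, Map<String, Double> q) {
        double d = 0.0;
        for (Map.Entry<String, Double> e : p.entrySet()) {
            Double qx = q.get(e.getKey());
            double px = e.getValue();
            if (qx != null && qx > 0.0 && px > 0.0) {
                d += px * Math.log(px / qx);
            }
        }
        return d;
    }

    /** Convenience builder for a two-outcome distribution over "a" and "b". */
    static Map<String, Double> twoOutcomes(double pa, double pb) {
        Map<String, Double> m = new HashMap<String, Double>();
        m.put("a", pa);
        m.put("b", pb);
        return m;
    }

    public static void main(String[] args) {
        Map<String, Double> p = twoOutcomes(0.5, 0.5);
        Map<String, Double> q = twoOutcomes(0.25, 0.75);
        System.out.println("H(p) = " + entropy(p));
        System.out.println("KL(p||q) = " + kl(p, q));
    }
}
```

For the uniform two-outcome distribution the entropy equals ln 2, and the divergence of a distribution from itself is 0, which makes both handy sanity checks.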

Reposted from: 博客园 (cnblogs), featured original posts


Distributed search engine ElasticSearch (3): Java API CRUD

October 18, 2013 | Search engines | Elastic Search, Java, distributed systems | smallroof

Jason:

For details, see the official ElasticSearch Java API documentation; the code below is available on my GitHub.

Client connections (client and node)

// node client
Node node = NodeBuilder.nodeBuilder().node();
Client client = node.client();

// alternative: join the named cluster as a client-only node
Node node = NodeBuilder.nodeBuilder().clusterName("").client(true).node();

// alternative: a node local to this JVM
Node node = NodeBuilder.nodeBuilder().local(true).node();
node.close();

// transport client
Settings settings = ImmutableSettings.settingsBuilder().put("cluster.name", "elasticsearch").build();
Client client = new TransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress("192.168.1.108", 9300));

// alternative: transport client with default settings
Client client = new TransportClient()
        .addTransportAddress(new InetSocketTransportAddress("192.168.1.108", 9300));
client.close();

Index

XContentBuilder builder = XContentFactory.jsonBuilder()
        .startObject()
        .field("user", "tom")
        .field("postDate", new Date())
        .field("message", "hello tom!")
        .field("age", 24)
        .endObject();
IndexResponse response = client.prepareIndex("twitter", "tweet", "13")
        .setSource(builder)
        .execute()
        .actionGet();

Get

GetResponse getResponse = client.prepareGet("twitter", "tweet", "13")
        .execute()
        .actionGet();
Map<String, Object> map = getResponse.getSource();
Iterator<Entry<String, Object>> it = map.entrySet().iterator();
while (it.hasNext()) {
    Entry<String, Object> e = it.next();
    logger.info("Key:{}; Value:{}", e.getKey(), e.getValue());
}

Delete

DeleteResponse deleteResponse = client.prepareDelete("twitter", "tweet", "1")
        .execute()
        .actionGet();

Bulk

XContentBuilder builder0 = XContentFactory.jsonBuilder()
        .startObject()
        .field("user", "jason ling")
        .field("postDate", new Date())
        .field("message", "hello jason ling!")
        .field("age", 30)
        .endObject();
XContentBuilder builder1 = XContentFactory.jsonBuilder()
        .startObject()
        .field("user", "kaven chen")
        .field("postDate", new Date())
        .field("message", "hello kaven chen!")
        .field("age", 26)
        .endObject();

BulkRequestBuilder bulkRequest = client.prepareBulk();
bulkRequest.add(client.prepareIndex("twitter", "tweet", "7").setSource(builder0))
        .add(client.prepareIndex("twitter", "tweet", "8").setSource(builder1));

BulkResponse bulkResponse = bulkRequest.execute().actionGet();

if (bulkResponse.hasFailures()) {
    // left empty in the original; failures would be handled here
}

Search

SearchResponse searchResponse = client.prepareSearch("twitter")
        .setTypes("tweet")
        .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
        //.setQuery(QueryBuilders.termQuery("user", "jason"))            // Query
        //.setQuery(QueryBuilders.fieldQuery("user", "jason"))
        .setQuery(QueryBuilders.queryString("jason"))
        .setFilter(FilterBuilders.rangeFilter("age").from(24).to(30))   // Filter
        .setFrom(0).setSize(10).setExplain(true)                        // Page
        .execute()
        .actionGet();

//SearchResponse searchResponse = client.prepareSearch().execute().actionGet();

SearchHits hits = searchResponse.getHits();
long total = hits.getTotalHits();
logger.info("search result total:{}", total);

for (SearchHit hit : hits) {
    Object user = hit.getSource().get("user");
    Object postDate = hit.getSource().get("postDate");
    Object message = hit.getSource().get("message");
    Object age = hit.getSource().get("age");
    logger.info("user:{},postDate:{},message:{},age:{}", user, postDate, message, age);
}
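The setFrom(0).setSize(10) pair in the search request asks for the first ten hits. When a user interface works with 1-based page numbers, the offset has to be derived from the page; a minimal sketch of that arithmetic follows. PageWindow is a hypothetical helper, not part of the ElasticSearch API.

```java
/** Converts a 1-based page number into the offset passed to setFrom (hypothetical helper). */
class PageWindow {
    /** Zero-based offset of the first hit on the given page. */
    static int from(int page, int size) {
        if (page < 1 || size < 1) {
            throw new IllegalArgumentException("page and size must be >= 1");
        }
        return (page - 1) * size;
    }

    public static void main(String[] args) {
        int size = 10;
        for (int page = 1; page <= 3; page++) {
            System.out.println("page " + page + " -> from=" + from(page, size) + ", size=" + size);
        }
    }
}
```

With size 10, page 1 maps to from=0, which is the value used in the request above.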

facets

SearchResponse sr = client.prepareSearch()
        .setQuery(QueryBuilders.matchAllQuery())
        .addFacet(FacetBuilders.termsFacet("f1").field("age").size(10))
        .addFacet(FacetBuilders.dateHistogramFacet("f2").field("postDate").interval("year"))
        .execute().actionGet();

TermsFacet f1 = (TermsFacet) sr.getFacets().facetsAsMap().get("f1");

long totalCount = f1.getTotalCount();      // Total terms doc count
long otherCount = f1.getOtherCount();      // Not shown terms doc count
long missingCount = f1.getMissingCount();  // Without term doc count

logger.info("totalCount is {}; otherCount is {}; missingCount is {}", totalCount, otherCount, missingCount);
// For each entry
for (TermsFacet.Entry entry : f1) {
    entry.getTerm();    // Term
    entry.getCount();   // Doc count
    logger.info("Term is {} ; Doc count is {}", entry.getTerm(), entry.getCount());
}

DateHistogramFacet f2 = (DateHistogramFacet) sr.getFacets().facetsAsMap().get("f2");
String name = f2.getName();
String type = f2.getType();
logger.info("name is {} ; type is {}", name, type);
for (DateHistogramFacet.Entry entry : f2.getEntries()) {
    logger.info("Count is {} ", entry.getCount());
}
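To make the three counters in the terms facet concrete, the sketch below computes them over an in-memory list of ages. It follows my reading of the comments in the listing (total = values counted, missing = documents without a value, other = values whose term did not make the top-N list) and is not ElasticSearch code; the class name TermsFacetSketch and the sample data are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory imitation of a terms facet over integer values (illustrative only). */
class TermsFacetSketch {
    final long totalCount;   // values seen across all documents
    final long missingCount; // documents that carry no value (null here)
    final long otherCount;   // values belonging to terms outside the top-N list
    final List<Map.Entry<Integer, Long>> top;

    TermsFacetSketch(List<Integer> values, int size) {
        Map<Integer, Long> counts = new HashMap<Integer, Long>();
        long total = 0;
        long missing = 0;
        for (Integer v : values) {
            if (v == null) {
                missing++;
                continue;
            }
            Long old = counts.get(v);
            counts.put(v, old == null ? 1L : old + 1L);
            total++;
        }
        List<Map.Entry<Integer, Long>> sorted =
                new ArrayList<Map.Entry<Integer, Long>>(counts.entrySet());
        // Highest count first; ties broken by the smaller term for a stable order.
        sorted.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : Integer.compare(a.getKey(), b.getKey());
        });
        top = new ArrayList<Map.Entry<Integer, Long>>(
                sorted.subList(0, Math.min(size, sorted.size())));
        long shown = 0;
        for (Map.Entry<Integer, Long> e : top) {
            shown += e.getValue();
        }
        totalCount = total;
        missingCount = missing;
        otherCount = total - shown;
    }

    public static void main(String[] args) {
        TermsFacetSketch f = new TermsFacetSketch(Arrays.asList(24, 30, 24, null, 26, 24, 30), 2);
        System.out.println("top=" + f.top + " total=" + f.totalCount
                + " other=" + f.otherCount + " missing=" + f.missingCount);
    }
}
```

In the sample data, age 24 appears three times and 30 twice, so with size 2 the single value 26 ends up in otherCount.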

GeoDistanceFacets

GeoDistanceFacetBuilder gdfb = FacetBuilders.geoDistanceFacet("f")
        .field("location")                  // Field containing coordinates we want to compare with
        .point(40, -70)                     // Point from where we start (0)
        .addUnboundedFrom(10)               // 0 to 10 km (excluded)
        .addRange(10, 20)                   // 10 to 20 km (excluded)
        .addRange(20, 100)                  // 20 to 100 km (excluded)
        .addUnboundedTo(100)                // from 100 km to infinity (and beyond)
        .unit(DistanceUnit.KILOMETERS);     // All distances are in kilometers. Can be MILES
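The four ranges in the builder partition distances into buckets whose upper bounds are exclusive, as the comments note. The sketch below shows that bucketing on plain numbers; DistanceBuckets is a hypothetical class and the distances are assumed to be already computed, so no geo math is involved.

```java
/** Assigns a distance to one of the half-open ranges used above (hypothetical helper). */
class DistanceBuckets {
    private final double[] upperBounds; // ascending, each bound is exclusive

    DistanceBuckets(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
    }

    /** Returns 0 for values below the first bound, up to upperBounds.length for the open-ended tail. */
    int bucketOf(double distanceKm) {
        int i = 0;
        while (i < upperBounds.length && distanceKm >= upperBounds[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        DistanceBuckets buckets = new DistanceBuckets(10, 20, 100); // same bounds as the facet
        double[] samples = {5, 10, 19.9, 20, 100, 250};
        for (double km : samples) {
            System.out.println(km + " km -> bucket " + buckets.bucketOf(km));
        }
    }
}
```

A distance of exactly 10 km lands in the second bucket (10 to 20 km), because the first range excludes its upper bound.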

Original author: Jason

Reposted from: I'm Linxs


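Step 2: Configuring the Jena environment on the D2RQ platform – aniuer

October 17, 2013 | Machine learning | Java, RDF | smallroof

October 16, 2013, 9:48:53

I have worked on semantics for a long time but only ever used tools such as Protege, and never got down to actual code. I looked at Jena a while ago and assumed that, being an HP product, it could not count as a standard. It now looks as though Jena has become the most mainstream tool for developing semantic applications, so I have chosen Jena for my semantic application development.

Getting started with Apache Jena

Apache Jena (Jena for short) is a free, open-source Java framework for building Semantic Web and Linked Data applications. The framework includes several APIs for processing RDF data. If you are a newcomer, you can start with the tutorials below, or browse the documentation for the topics that interest you.

Tutorials

It turns out an expert has already translated this, so I am reposting it here, and I formally acknowledge that the translation is the work of "april 1019":

A translation of the Jena document "An Introduction to RDF and the Jena RDF API"

That document covers a great deal, so let us return to the concrete configuration.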

Using the D2RQ Engine with Jena

1. Jena Versions

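Because D2RQ embeds Jena and the SPARQL query engine (ARQ), it is sensitive to the Jena and ARQ versions and only works with the matching Jena version. Check the versions of the packages under /lib/arq-X.Y to decide which Jena package to download.

I downloaded D2RQ 0.8.1, whose /lib contains the Jena 2.7.0 JARs, so I downloaded apache-jena-2.7.0-incubating.tar.gz from http://archive.apache.org/dist/jena/binaries/ and again unpacked it under /opt.

2. Configuring the build path

Create a new Eclipse project named jena_test, then right-click it and choose Build Path, Add Library, User Library, and add d2rq-0.8.1.jar from /opt/D2RServer/d2rq-0.8.1/lib to the project. Add commons-logging-1.1.jar and slf4j-api-1.6.4.jar in the same way, then add the JDBC drivers from /lib/db-drivers.

The D2RQ download leaves out some Jena/ARQ JARs that may still be needed. The ones you downloaded separately can be placed in this lib directory.

3. Logging

D2RQ logs through the Apache Commons Logging API. It ships with Apache log4j (see Note 1), but a different logging backend can be used as well.

To get D2RQ debugging output, set the level of the logger de.fuberlin.wiwiss.d2rq to ALL. A simple way is to add the /lib/logging directory to the build path and create a file log4j.properties there with the following content: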

log4j.rootLogger=INFO, stdout

log4j.appender.stdout=org.apache.log4j.ConsoleAppender

log4j.appender.stdout.layout=org.apache.log4j.PatternLayout

log4j.appender.stdout.layout.ConversionPattern=%d{HH:mm:ss} %-5p %-20c{1} :: %m%n

log4j.logger.de.fuberlin.wiwiss.d2rq=ALL

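I created log4j.properties under /lib/logging, copied the content above into it, and added it to the project as well.

4. Using D2RQ through the Jena Model API

The ModelD2RQ class provides a Jena Model view of the data in a D2RQ-mapped database.

The following example shows how a ModelD2RQ is set up from a previously created mapping file, and how the Jena API is used to extract papers and their authors from the model.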

// Set up the ModelD2RQ using a mapping file

Model m = new ModelD2RQ("file:doc/example/mapping-iswc.ttl");

// Find anything with an rdf:type of iswc:InProceedings

StmtIterator paperIt = m.listStatements(null, RDF.type, ISWC.InProceedings);

// List found papers and print their titles

while (paperIt.hasNext()) {

Resource paper = paperIt.nextStatement().getSubject();

System.out.println("Paper: " + paper.getProperty(DC.title).getString());

// List authors of the paper and print their names

StmtIterator authorIt = paper.listProperties(DC.creator);

while (authorIt.hasNext()) {

Resource author = authorIt.nextStatement().getResource();

System.out.println("Author: " + author.getProperty(FOAF.name).getString());

}

System.out.println();

}

m.close();

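The ISWC and FOAF classes were generated with Jena's schemagen tool; the DC and RDF classes are part of Jena.

5. Using D2RQ through the Jena Graph API

In some situations it is better to use Jena's low-level Graph API instead of the Model API. D2RQ provides an implementation of the Graph interface, GraphD2RQ.

The following example uses the Graph API to find all papers published in 2003.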

// Load mapping file

Model mapModel = FileManager.get().loadModel("doc/example/mapping-iswc.ttl");

// Parse mapping file

MapParser parser = new MapParser(mapModel, "http://localhost:2020/");

Mapping mapping = parser.parse();

// Set up the GraphD2RQ

GraphD2RQ g = new GraphD2RQ(mapping);

// Create a find(spo) pattern

Node subject = Node.ANY;

Node predicate = DC.date.asNode();

Node object = Node.createLiteral("2003", null, XSDDatatype.XSDgYear);

Triple pattern = new Triple(subject, predicate, object);

// Query the graph

Iterator<Triple> it = g.find(pattern);

// Output query results

while (it.hasNext()) {

Triple t = (Triple) it.next();

System.out.println("Published in 2003: " + t.getSubject());

};

g.close();

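5.1 CachingGraphD2RQ

Besides GraphD2RQ there is also a CachingGraphD2RQ, which offers the same API and uses an LRU (least recently used) cache to remember the results of the most recent queries. This improves performance for repeated queries, but results may be inconsistent if the database changes during the lifetime of the CachingGraphD2RQ.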
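For readers unfamiliar with LRU eviction, a common way to build such a cache in plain Java is a LinkedHashMap in access order. This is a generic sketch, not D2RQ's implementation; the class name LruCache and the string keys standing in for queries are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Small LRU cache built on LinkedHashMap's access-order mode (illustrative only). */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // true = iterate from least to most recently accessed
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; evicts the least recently used entry when over capacity.
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<String, String>(2);
        cache.put("query-1", "result-1");
        cache.put("query-2", "result-2");
        cache.get("query-1");             // touch query-1 so query-2 becomes the eldest
        cache.put("query-3", "result-3"); // evicts query-2
        System.out.println(cache.keySet());
    }
}
```

The staleness caveat above applies to any cache of this kind: nothing here notices when the underlying data changes.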

6. Executing SPARQL queries against a ModelD2RQ

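D2RQ can answer SPARQL queries against a D2RQ model. The following example shows how a D2RQ model is set up, how a SPARQL query is executed, and how the results are written to the console.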

ModelD2RQ m = new ModelD2RQ("file:doc/example/mapping-iswc.ttl");

String sparql =

"PREFIX dc: <http://purl.org/dc/elements/1.1/>" +

"PREFIX foaf: <http://xmlns.com/foaf/0.1/>" +

"SELECT ?paperTitle ?authorName WHERE {" +

" ?paper dc:title ?paperTitle . " +

" ?paper dc:creator ?author ." +

" ?author foaf:name ?authorName ." +

"}";

Query q = QueryFactory.create(sparql);

ResultSet rs = QueryExecutionFactory.create(q, m).execSelect();

while (rs.hasNext()) {

QuerySolution row = rs.nextSolution();

System.out.println("Title: " + row.getLiteral("paperTitle").getString());

System.out.println("Author: " + row.getLiteral("authorName").getString());

};

m.close();

7. The D2RQ Assembler

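D2RQ comes with a Jena assembler. Jena assembler specifications are RDF configuration files that describe how to construct a Jena model. For more information on Jena assemblers, see the Jena Assembler quickstart page.

The following example shows an assembler specification for a D2RQ model: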

@prefix : <#> .

@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .

@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .

<> ja:imports d2rq: .

:myModel

a d2rq:D2RQModel;

d2rq:mappingFile <mapping-iswc.ttl>;

d2rq:resourceBaseURI <http://localhost:2020/> .

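A D2RQ model specification supports the following two properties:

d2rq:mappingFile: required. The URI of the D2RQ mapping file used to set up the model.

d2rq:resourceBaseURI: the base URI used to turn relative URI patterns into full URIs. If it is not specified, D2RQ picks an appropriate base URI.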
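What "turning a relative URI into a full URI" means can be shown with the JDK's java.net.URI, which implements standard relative-reference resolution. This is a generic illustration rather than D2RQ's internal logic; the class name BaseUriDemo and the relative path papers/1 are made up for the example.

```java
import java.net.URI;

/** Resolves a relative reference against a base URI (illustrative only). */
class BaseUriDemo {
    static String expand(String baseUri, String reference) {
        // resolve() returns the reference unchanged when it is already absolute.
        return URI.create(baseUri).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "http://localhost:2020/"; // base URI from the assembler example
        System.out.println(expand(base, "papers/1"));
        System.out.println(expand(base, "http://example.org/already/absolute"));
    }
}
```

A base URI that ends in a slash, as here, keeps the whole base path; without the trailing slash the last path segment would be replaced during resolution.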

The following usage example creates a D2RQ model from a model specification and writes it to the console.

// Load assembler specification from file

Model assemblerSpec = FileManager.get().loadModel("doc/example/assembler.ttl");

// Get the model resource

Resource modelSpec = assemblerSpec.createResource(assemblerSpec.expandPrefix(":myModel"));

// Assemble a model

Model m = Assembler.general.openModel(modelSpec);

// Write it to System.out

m.write(System.out);

m.close();

8. Javadoc API documentation

Javadoc API documentation for the latest release is available.

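Note 1: Log4j is an open-source Apache project. With Log4j we can direct log output to the console, files, GUI components, and even socket servers, the NT event log, or the UNIX Syslog daemon; we can control the output format of every log entry; and by assigning a level to each log message we can control log generation in finer detail. Best of all, this can be configured flexibly through a configuration file, without changing the application code.

That is all for Step 2; I hope it is a useful reference. Anyone who wants to discuss it can reach me on QQ: 1q7q1q5q3q6q0q1q8 (remove the q characters).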