Lucene入门学习

最新推荐文章于 2024-07-13 13:55:40 发布

weixin_30873847

最新推荐文章于 2024-07-13 13:55:40 发布

阅读量52

点赞数

文章标签： java

原文链接：http://www.cnblogs.com/xiehongwei/p/7010367.html

版权

参考博客： http://blog.csdn.net/ayi_5788/article/category/6348409

分页： http://blog.csdn.net/hu948162999/article/details/41209699

1、什么是中文分词

　　学过英文的都知道，英文是以单词为单位的，单词与单词之间以空格或者逗号句号隔开。而中文则以字为单位，字又组成词，字和词再组成句子。所以对于英文，我们可以简单以空格判断某个字符串是否为一个单词，比如I love China，love 和 China很容易被程序区分开来；但中文“我爱中国”就不一样了，电脑不知道“中国”是一个词语还是“爱中”是一个词语。把中文的句子切分成有意义的词，就是中文分词，也称切词。我爱中国，分词的结果是：我爱中国。

　　目前中文分词还是一个难题———对于需要上下文区别的词以及新词（人名、地名等）很难完美的区分。国际上将同样存在分词问题的韩国、日本和中国并称为CJK(Chinese Japanese Korean)，对于CJK这个代称可能包含其他问题，分词只是其中之一。

2、中文分词的实现

　　Lucene中对中文的处理是基于自动切分的单字切分，或者二元切分。除此之外，还有最大切分（包括向前、向后、以及前后相结合）、最少切分、全切分等等。

　　Lucene自带了几个分词器WhitespaceAnalyzer， SimpleAnalyzer， StopAnalyzer， StandardAnalyzer， ChineseAnalyzer， CJKAnalyzer等。前面三个只适用于英文分词，StandardAnalyzer对可最简单地实现中文分词，即二分法，每个字都作为一个词，比如：”北京天安门” ==> “北京京天天安安门”。这样分出来虽然全面，但有很多缺点，比如，索引文件过大，检索时速度慢等。ChineseAnalyzer是按字分的,与StandardAnalyzer对中文的分词没有大的区别。 CJKAnalyzer是按两字切分的, 比较武断,并且会产生垃圾Token，影响索引大小。以上分词器过于简单，无法满足现实的需求，所以我们需要实现自己的分词算法。

　　这样，在查询的时候，无论是查询”北京” 还是查询”天安门”，将查询词组按同样的规则进行切分：”北京”，”天安安门”，多个关键词之间按与”and”的关系组合，同样能够正确地映射到相应的索引中。这种方式对于其他亚洲语言：韩文，日文都是通用的。

　　基于自动切分的最大优点是没有词表维护成本，实现简单，缺点是索引效率低，但对于中小型应用来说，基于2元语法的切分还是够用的。基于2元切分后的索引一般大小和源文件差不多，而对于英文，索引文件一般只有原文件的30%-40%不同。

　　目前比较大的搜索引擎的语言分析算法一般是基于以上2个机制的结合。关于中文的语言分析算法，大家可以在Google查关键词”wordsegment search”能找到更多相关的资料。

　　 =============================分割线===============================

　　由于Lucene不同的版本差距较大，，此系列教程打算把3.5版本，4.5版本，5.0版本都给出个例子，方便大家学习，也方便自己复习。

　　注：由于Lucene5.0版本是基于JDK1.7开发的，所以想学习的同学请配置1.7及以上的版本。故测试Lucene 6.1.0也适用Lucene 5.0中的代码。Lucene 6.1.0最低要求也是JDK1.7.

　　创建索引可分为主要的几步，我自己试验过，不同的版本间会有些不同，但是跟着如下的几大步骤一步一步写，问题不会太大。

　　1、创建Directory

　　2、创建IndexWriter

　　3、创建Document对象

　　4、为Document添加Field

　　5、通过IndexWriter添加文档到索引中

　　搜索可分为如下几步：

　　1、创建Directory

　　2、创建IndexReader

　　3、根据IndexReader创建IndexSearch

　　4、创建搜索的Query

　　5、根据searcher搜索并且返回TopDocs

　　6、根据TopDocs获取ScoreDoc对象

　　7、根据searcher和ScoreDoc对象获取具体的Document对象

　　8、根据Document对象获取需要的值

　　我们向Document添加Field可以有更多的设置，那么都是什么意思呢？

　　name：字段名，很容易理解

　　value：字段值，也很容易理解

　　store和index怎么解释，下面就来看一下这两个选项的可选值：

　　Field.Store.YES或者NO(存储域选项)

　　设置为YES表示或把这个域中的内容完全存储到文件中，方便进行文本的还原

　　设置为NO表示把这个域的内容不存储到文件中，但是可以被索引，此时内容无法完全还原

　　Field.Index(索引选项)

　　Index.ANALYZED:进行分词和索引，适用于标题、内容等

　　Index.NOT_ANALYZED:进行索引，但是不进行分词，如果身份证号，姓名，ID等，适用于精确搜索

　　Index.ANALYZED_NOT_NORMS:进行分词但是不存储norms信息，这个norms中包括了创建索引的时间和权值等信息

　　Index.NOT_ANALYZED_NOT_NORMS:即不进行分词也不存储norms信息

　　Index.NO:不进行索引

　　lucene4.5例子：

import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexUtil {
    private static final String[] ids = { "1", "2", "3" };
    private static final String[] authors = { "Darren", "Tony", "Grylls" };
    private static final String[] titles = { "Hello World", "Hello Lucene", "Hello Java" };
    private static final String[] contents = { "Hello World, I am on my way", "Today is my first day to study Lucene",
            "I like Java" };

    /**
     * 建立索引
     */
    public static void index() {
        IndexWriter indexWriter = null;
        try {
            // 1、创建Directory
            Directory directory = FSDirectory.open(new File("F:/test/lucene/index"));

            // 2、创建IndexWriter
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_45, analyzer);
            indexWriter = new IndexWriter(directory, config);

            int size = ids.length;
            for (int i = 0; i < size; i++) {
                // 3、创建Document对象
                Document document = new Document();
                // 看看四个参数的意思

                // 4、为Document添加Field
                /**
                 * Create field with String value.
                 * 
                 * @param name
                 *            field name
                 * @param value
                 *            string value
                 * @param type
                 *            field type
                 * @throws IllegalArgumentException
                 *             if either the name or value is null, or if the field's type is neither indexed() nor
                 *             stored(), or if indexed() is false but storeTermVectors() is true.
                 * @throws NullPointerException
                 *             if the type is null
                 * 
                 *             public Field(String name, String value, FieldType type)
                 */

                /**
                 * 注意：这里与3.5版本不同，原来的构造函数已过时
                 */

                /**
                 * 注：这里4.5版本使用FieldType代替了原来的Store和Index，不同的Field预定义了一些FieldType
                 * 
                 */
                // 对ID存储，但是不分词也不存储norms信息
                FieldType idType = TextField.TYPE_STORED;
                idType.setIndexed(false);
                idType.setOmitNorms(false);
                document.add(new Field("id", ids[i], idType));

                // 对Author存储，但是不分词
                FieldType authorType = TextField.TYPE_STORED;
                authorType.setIndexed(false);
                document.add(new Field("author", authors[i], authorType));

                // 对Title存储，分词
                document.add(new Field("title", titles[i], StringField.TYPE_STORED));

                // 对Content不存储，但是分词
                document.add(new Field("content", contents[i], TextField.TYPE_NOT_STORED));

                // 5、通过IndexWriter添加文档到索引中
                indexWriter.addDocument(document);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (indexWriter != null) {
                    indexWriter.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * 搜索
     */
    public static void search() {
        DirectoryReader indexReader = null;
        try {
            // 1、创建Directory
            Directory directory = FSDirectory.open(new File("F:/test/lucene/index"));
            // 2、创建IndexReader
            /**
             * 注意Reader与3.5版本不同：
             * 
             * 所以使用DirectoryReader
             * 
             * @Deprecated public static DirectoryReader open(final Directory directory) throws IOException { return
             *             DirectoryReader.open(directory); }
             */
            indexReader = DirectoryReader.open(directory);
            // 3、根据IndexReader创建IndexSearch
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            // 4、创建搜索的Query
            // 使用默认的标准分词器
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);

            // 在content中搜索Lucene
            // 创建parser来确定要搜索文件的内容，第二个参数为搜索的域
            QueryParser queryParser = new QueryParser(Version.LUCENE_45, "content", analyzer);
            // 创建Query表示搜索域为content包含Lucene的文档
            Query query = queryParser.parse("Lucene");

            // 5、根据searcher搜索并且返回TopDocs
            TopDocs topDocs = indexSearcher.search(query, 10);
            // 6、根据TopDocs获取ScoreDoc对象
            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            for (ScoreDoc scoreDoc : scoreDocs) {
                // 7、根据searcher和ScoreDoc对象获取具体的Document对象
                Document document = indexSearcher.doc(scoreDoc.doc);
                // 8、根据Document对象获取需要的值
                System.out.println("id : " + document.get("id"));
                System.out.println("author : " + document.get("author"));
                System.out.println("title : " + document.get("title"));
                /**
                 * 看看content能不能打印出来，为什么？
                 */
                System.out.println("content : " + document.get("content"));
            }

        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (indexReader != null) {
                    indexReader.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

    }
}

View Code

转载于:https://www.cnblogs.com/xiehongwei/p/7010367.html

weixin_30873847

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Lucene入门学习

参考博客： http://blog.csdn.net/ayi_5788/article/category/6348409分页： http://blog.csdn.net/hu948162999/article/details/412096991、什么是中文分词　　学过英文的都知道，英文是以单词为单位的，单词与单词之间以空格或者逗号句号隔开。而中文则以字为单位，字又...
复制链接

扫一扫