Lucene Study Notes (2): Real-Time Indexing and Chinese Word Segmentation

Latest recommended article published on 2021-02-16 06:04:34

Sidyhe

Category: Java

Article link: https://blog.csdn.net/Sidyhe/article/details/51817473

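Real-Time Indexing

In Lucene, Directory and IndexWriter are both thread-safe, and so is IndexReader.

However, a reader is a point-in-time view: it does not reflect changes the writer makes after the reader was opened, so a new reader has to be opened to see them.

For now I simply open a new reader for every search, even though this costs some efficiency.

The data set is currently under 50,000 records and the index lives in memory, so the cost is acceptable.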
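If reopening on every search ever becomes too expensive, the usual alternative is to keep one reader and replace it only when the writer has actually changed. Lucene provides DirectoryReader.openIfChanged (which returns null when nothing has changed) and SearcherManager for this. The sketch below is only a plain-Java model of the idea and uses no Lucene classes; ToyWriter, ToyReader and refreshIfChanged are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Plain-Java model of "reopen the reader only when the writer has changed". */
final class RefreshOnChangeDemo {

    /** Stands in for the IndexWriter: every write bumps a version counter. */
    static final class ToyWriter {
        private final List<String> docs = new ArrayList<>();
        private long version = 0;

        synchronized void add(String doc) {
            docs.add(doc);
            version++;
        }

        synchronized long version() {
            return version;
        }

        /** Takes an immutable point-in-time copy, like opening a reader. */
        synchronized ToyReader openReader() {
            return new ToyReader(version, Collections.unmodifiableList(new ArrayList<>(docs)));
        }
    }

    /** Stands in for the IndexReader: a snapshot that never changes. */
    static final class ToyReader {
        final long version;
        final List<String> docs;

        ToyReader(long version, List<String> docs) {
            this.version = version;
            this.docs = docs;
        }
    }

    /** Returns the current reader if nothing changed, otherwise a fresh one. */
    static ToyReader refreshIfChanged(ToyReader current, ToyWriter writer) {
        return current.version == writer.version() ? current : writer.openReader();
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        writer.add("doc-1");
        ToyReader reader = writer.openReader();

        writer.add("doc-2");
        System.out.println(reader.docs.size());   // the old snapshot still sees 1 document
        reader = refreshIfChanged(reader, writer);
        System.out.println(reader.docs.size());   // the refreshed snapshot sees 2 documents
        System.out.println(refreshIfChanged(reader, writer) == reader); // true: reader reused
    }
}
```

With fewer than 50,000 in-memory records, opening a reader per search is probably fine; this pattern mainly pays off when searches are far more frequent than writes.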

Chinese Word Segmentation

The bundled StandardAnalyzer splits Chinese text into single characters, which does not meet my requirements.
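To see why per-character splitting falls short, compare it with a dictionary-based segmenter. The sketch below is a deliberately simplified forward-maximum-matching segmenter in plain Java. It is not how StandardAnalyzer or Jcseg is implemented, and the dictionary, class name and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SegmentationDemo {

    /** Per-character splitting: every code point becomes its own token. */
    static List<String> splitByCharacter(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints().forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    /**
     * Forward maximum matching: at each position take the longest dictionary
     * word, falling back to a single character. Indexes are char-based, so
     * this toy version assumes text within the Basic Multilingual Plane.
     */
    static List<String> forwardMaxMatch(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is a dictionary word or one character.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("北京", "上海", "东京", "京东", "飞机票"));
        System.out.println(splitByCharacter("来京东"));
        System.out.println(forwardMaxMatch("从北京到上海飞机票来一张", dictionary, 3));
    }
}
```

The first call yields the tokens 来, 京, 东, while the second yields 从, 北京, 到, 上海, 飞机票, 来, 一, 张. With per-character tokens, 东京 and 京东 consist of the same two tokens in a different order, whereas a dictionary-based segmenter keeps each as a distinct word.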

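The latest Lucene version at the time of writing is 6.1.0, while most of the articles a Baidu search turns up date from 2013 or earlier.

MM, IK, Paoding and the like could not be made to work with the latest Lucene (or perhaps my skills are just limited).

I eventually found an article through Google. Thanks to its author (original link).

So far I have only tested Jcseg, and the results are good.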

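Building and Using Jcseg

Download the source from git, switch to the latest branch (6.0.0), and build it with mvn.

Required JARs:

jcseg-core-xxx.jar

jcseg-analyzer-xxx.jar

plus the lexicon directory, placed alongside the JARs.

I did not include the properties file in my tests and it made no difference (the file is only needed for configuration when the lexicon directory is not next to the JARs).

Usage is simple: just create a new JcsegAnalyzer5X.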

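Test Code

First, my current requirement.

Each record is a piece of text, searched by one or more keywords; a record matches if it contains at least one of them.

So the index only needs to hold the database ID and the content, where the content is indexed but not stored (to save space).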
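To make "indexed but not stored" and "at least one keyword" concrete, here is a toy inverted index in plain Java. It is only a conceptual sketch with hypothetical names, not Lucene's data structure: it keeps token-to-ID postings, never the original text, and it returns IDs in ascending order, whereas Lucene orders hits by score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ToyInvertedIndex {
    // token -> IDs of the records containing it; the original text is never kept.
    private final Map<String, Set<Long>> postings = new HashMap<>();

    /** Indexes an already segmented record under its database ID. */
    void add(long id, List<String> tokens) {
        for (String token : tokens) {
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(id);
        }
    }

    /** Returns the IDs of records containing at least one keyword, ascending. */
    List<Long> searchAny(String... keywords) {
        Set<Long> hits = new TreeSet<>();
        for (String keyword : keywords) {
            Set<Long> ids = postings.get(keyword);
            if (ids != null) {
                hits.addAll(ids);
            }
        }
        return new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, Arrays.asList("来", "东京", "啦啦啦"));
        index.add(4, Arrays.asList("来", "北京", "啦啦啦"));
        index.add(5, Arrays.asList("来", "上海", "啦啦啦"));
        System.out.println(index.searchAny("北京", "上海")); // prints [4, 5]
    }
}
```

Because only the postings are kept, a search can return IDs but cannot return the matched text; the caller looks the text up in the database by ID.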

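import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
// The Jcseg package names are assumed from the Jcseg 2.x source layout; adjust them to your build.
import org.lionsoul.jcseg.analyzer.v5x.JcsegAnalyzer5X;
import org.lionsoul.jcseg.tokenizer.core.JcsegTaskConfig;

public class LuceneManager {

    // Field names: the database ID and the searchable text
    final static String ID_COLUMN = "id";
    final static String ITEM_COLUMN = "item";

    // One shared analyzer, used for both indexing and query parsing
    final static Analyzer analyzer = new JcsegAnalyzer5X(JcsegTaskConfig.SIMPLE_MODE);

    Directory dir;
    IndexWriter writer;

    public LuceneManager(Directory dir) throws IOException {
        this.dir = dir;
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        writer = new IndexWriter(dir, config);
    }

    // Safety net only; callers should call close() explicitly.
    @Override
    protected void finalize() throws Throwable {
        this.close();
        super.finalize();
    }

    // Closes the writer; safe to call more than once.
    public void close() throws IOException {
        if (writer != null) {
            writer.close();
            writer = null;
        }
    }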

    // Returns the tokens the analyzer produces for str (handy for checking segmentation).
    public static List<String> analyse(String str) throws IOException {
        List<String> result = new ArrayList<>();

        // try-with-resources closes the stream even if tokenization fails
        try (TokenStream ts = analyzer.tokenStream("", str)) {
            // Register the attribute before reset(), as the TokenStream workflow expects
            CharTermAttribute cta = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                result.add(cta.toString());
            }
            ts.end();
        }
        return result;
    }

    public Directory getDirectory() {
        return dir;
    }

    Document buildDocument(long id, String item) {
        Document doc = new Document();

        // The ID is indexed as a single un-analyzed term and stored, so the exact-match
        // Term used by delete() and update() finds it regardless of the analyzer.
        doc.add(new StringField(ID_COLUMN, Long.toString(id), Field.Store.YES));
        // The content is analyzed and indexed but not stored, to save space.
        doc.add(new TextField(ITEM_COLUMN, item, Field.Store.NO));
        return doc;
    }

    public void append(long id, String item) throws IOException {
        writer.addDocument(buildDocument(id, item));
    }

    public void delete(long id) throws IOException {
        Term term = new Term(ID_COLUMN, Long.toString(id));

        writer.deleteDocuments(term);
    }

    public void update(long id, String item) throws IOException {
        Term term = new Term(ID_COLUMN, Long.toString(id));

        // updateDocument deletes every document matching the term, then adds the new one
        writer.updateDocument(term, buildDocument(id, item));
    }

    List<Long> buildSearchResult(IndexSearcher searcher, TopDocs topDocs) throws IOException {
        List<Long> result = new ArrayList<>();

        for (ScoreDoc sd : topDocs.scoreDocs) {
            Document doc = searcher.doc(sd.doc);
            String id = doc.getField(ID_COLUMN).stringValue();

            result.add(Long.parseLong(id));
        }
        return result;
    }

    public List<Long> search(String[] keyWords) throws IOException {
        // Opening the reader from the writer gives a near-real-time view that also
        // sees uncommitted changes; try-with-resources closes it after the search.
        try (IndexReader reader = DirectoryReader.open(writer)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            String[] queries = new String[keyWords.length];
            String[] fields = new String[keyWords.length];
            BooleanClause.Occur[] occurs = new BooleanClause.Occur[keyWords.length];

            for (int i = 0; i < keyWords.length; i++) {
                // Escape query-syntax characters so keywords are matched literally
                queries[i] = QueryParser.escape(keyWords[i]);
                fields[i] = ITEM_COLUMN;
                // SHOULD on every clause: a document matches if any keyword matches
                occurs[i] = BooleanClause.Occur.SHOULD;
            }
            Query query = MultiFieldQueryParser.parse(queries, fields, occurs, analyzer);
            TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
            return buildSearchResult(searcher, topDocs);
        } catch (ParseException e) {
            throw new IOException(e);
        }
    }

    public List<Long> search(String keyWord) throws IOException {
        return search(new String[]{keyWord});
    }
}

Calling Code


import java.io.IOException;
import java.util.List;

import org.apache.lucene.store.RAMDirectory;

public class AppInst {
    private static final AppInst ourInstance = new AppInst();

    public static AppInst getInstance() {
        return ourInstance;
    }

    private AppInst() {
    }

    public static void main(String[] argv) throws Exception {
        AppInst.getInstance().main();
    }

    public void main() throws Exception {
        // The index lives entirely in memory
        LuceneManager lm = new LuceneManager(new RAMDirectory());

        // Write the test data
        writeRecords(lm);

        System.out.println("Search test");
        test(lm);

        System.out.println("Delete test");
        lm.delete(4);
        test(lm);

        System.out.println("Update test");
        lm.update(1, "从北京到上海飞机票来一张");
        test(lm);

        lm.close();
        lm.getDirectory().close();
    }

    void writeRecords(LuceneManager lm) throws IOException {
        lm.append(1, "来东京啦啦啦");
        lm.append(2, "来南京啦啦啦");
        lm.append(3, "来西京啦啦啦");
        lm.append(4, "来北京啦啦啦");
        lm.append(5, "来上海啦啦啦");
        lm.append(6, "来广州啦啦啦");
        lm.append(7, "来京东啦啦啦");
    }

    // Searches for records containing "北京" or "上海" and prints their IDs
    void test(LuceneManager lm) throws IOException {
        List<Long> ids;

        ids = lm.search(new String[]{"北京", "上海"});
        for (Long id : ids) {
            System.out.println("id: " + id);
        }
    }
}

Output

Search test
id: 4
id: 5
Delete test
id: 5
Update test
id: 1
id: 5

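Closing Remarks

As for word segmentation, this is only basic usage; the lexicon can also be modified to customize the segmentation.

My understanding of Lucene is still shallow; if there are any mistakes, please point them out.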
