Implementing a Search Engine with Lucene

Muroidea

Category: java web

Article link: https://blog.csdn.net/u014087707/article/details/48938449

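1. Overall structure of Lucene

1.1 Block diagram of internet search

Notes:

1) When a user searches for something on www.baidu.com, the query does not go to the web pages themselves; it goes to Baidu's index. The index holds index numbers and summaries, and the results page shown on www.baidu.com displays those summaries.

2) Each entry in Baidu's index corresponds to a site on the internet.

3) When the user enters a keyword, the results page is built first from the index.

4) Only when the user clicks one of the results is the actual web page on the internet fetched.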
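The four points above can be sketched in a few lines of plain Java. This is only a toy model, not how Baidu or Lucene is implemented: the class IndexLookupSketch, its Entry type, and the methods addPage, search, and open are all hypothetical names made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: a search consults the index first; the real page is fetched only on click.
class IndexLookupSketch {
    // Hypothetical index entry: an index number plus the summary shown in the results.
    static final class Entry {
        final int id;
        final String summary;
        Entry(int id, String summary) { this.id = id; this.summary = summary; }
    }

    private final Map<String, List<Entry>> index = new HashMap<>(); // keyword -> entries
    private final Map<Integer, String> pages = new HashMap<>();     // id -> full page

    void addPage(int id, String summary, String fullPage, String... keywords) {
        pages.put(id, fullPage);
        for (String k : keywords) {
            index.computeIfAbsent(k, x -> new ArrayList<>()).add(new Entry(id, summary));
        }
    }

    // Points 1-3: the search only reads the index and returns summaries.
    List<String> search(String keyword) {
        List<String> summaries = new ArrayList<>();
        for (Entry e : index.getOrDefault(keyword, Collections.<Entry>emptyList())) {
            summaries.add(e.id + ": " + e.summary);
        }
        return summaries;
    }

    // Point 4: clicking a result is what finally loads the page itself.
    String open(int id) { return pages.get(id); }

    public static void main(String[] args) {
        IndexLookupSketch s = new IndexLookupSketch();
        s.addPage(1, "Lucene is a search library", "<html>full page</html>", "lucene", "search");
        System.out.println(s.search("lucene"));
        System.out.println(s.open(1));
    }
}
```

The point of the split is that search never touches the pages map; only open does.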

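1.2 Block diagram of Lucene

Notes:

1) In a database, the data files are stored on disk. An index works the same way: its data also lives on disk, and Lucene describes that location with the Directory class.

2) The API supports add, delete, update, and query operations on the index.

3) In a database, every kind of data can be reduced to one form: the table. In an index, every kind of data is likewise abstracted into one format: the Document.

4) The structure of a Document is: Document(List<Field>)

5) A Field holds one key-value pair, and both the key and the value are strings.

6) Operating on the index therefore comes down to operating on Documents.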
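To make points 4 and 5 concrete, here is a minimal sketch of the Document(List<Field>) shape in plain Java. SimpleDocument and SimpleField are hypothetical stand-ins written for this note; they are not Lucene's own Document and Field classes, which carry more options (storing, indexing mode and so on).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of "a Document is a list of Fields, a Field is a string key-value pair".
class DocumentModelSketch {
    static final class SimpleField {
        final String name;
        final String value;
        SimpleField(String name, String value) { this.name = name; this.value = value; }
    }

    static final class SimpleDocument {
        private final List<SimpleField> fields = new ArrayList<>();

        void add(SimpleField field) { fields.add(field); }

        // Returns the value of the first field with this name, or null if there is none.
        String get(String name) {
            for (SimpleField f : fields) {
                if (f.name.equals(name)) {
                    return f.value;
                }
            }
            return null;
        }

        int size() { return fields.size(); }
    }

    public static void main(String[] args) {
        SimpleDocument doc = new SimpleDocument();
        doc.add(new SimpleField("id", "30"));        // even numeric values are stored as strings
        doc.add(new SimpleField("title", "Beijing"));
        System.out.println(doc.get("title"));
    }
}
```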

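1. A first Lucene program

1.1 Setting up the Lucene development environment

Lucene development needs the Lucene jar files on the classpath. At a minimum, add:

1) lucene-core-3.1.0.jar (core)

2) lucene-analyzers-3.1.0.jar (analyzers)

3) lucene-highlighter-3.1.0.jar (highlighter)

4) lucene-memory-3.1.0.jar (in-memory index, needed by the highlighter)

1.2 Building the index

Create the Article JavaBean: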

public class Article {
	private Integer id;
	private String title;
	private String context;
	
	public Integer getId() {
		return id;
	}
	public void setId(Integer id) {
		this.id = id;
	}
	public String getTitle() {
		return title;
	}
	public void setTitle(String title) {
		this.title = title;
	}
	public String getContext() {
		return context;
	}
	public void setContext(String context) {
		this.context = context;
	}
	
}

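Indexing code:

import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class LuceneTest {

	@Test
	public void show() throws Exception {
		Article article = new Article();
		article.setId(30);
		article.setTitle("Beijing");
		article.setContext("This is a demo");

		// Write the data into an index stored on disk
		Directory d = FSDirectory.open(new File("./indexDir"));
		// Analyzer that tokenizes the text
		Analyzer al = new StandardAnalyzer(Version.LUCENE_30);
		// Create the IndexWriter
		IndexWriter iw = new IndexWriter(d, al, MaxFieldLength.LIMITED);

		// Create the Document; the id is kept as one term, title and context are tokenized
		Document doc = new Document();
		Field idField = new Field("id", article.getId().toString(), Store.YES, Index.NOT_ANALYZED);
		Field titleField = new Field("title", article.getTitle(), Store.YES, Index.ANALYZED);
		Field contextField = new Field("context", article.getContext(), Store.YES, Index.ANALYZED);
		doc.add(idField);
		doc.add(titleField);
		doc.add(contextField);
		iw.addDocument(doc);
		iw.close();
	}
}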

Search implementation:

	@Test
	public void search() throws Exception {
		// Open the same index directory that the indexing code wrote to
		Directory directory = FSDirectory.open(new File("./indexDir"));
		// Create the IndexSearcher
		IndexSearcher is = new IndexSearcher(directory);
		// Use the same analyzer for searching as for indexing
		Analyzer aa = new StandardAnalyzer(Version.LUCENE_30);
		// Create the QueryParser; "title" is the field that is searched
		QueryParser queryParser = new QueryParser(Version.LUCENE_30, "title", aa);
		// The keyword to search for
		Query query = queryParser.parse("lucene");
		// search returns a TopDocs holding at most the top 2 hits
		TopDocs topDocs = is.search(query, 2);
		int count = topDocs.totalHits; // total number of documents that match the keyword
		// Document numbers and scores of the top hits
		ScoreDoc[] scoreDocs = topDocs.scoreDocs;

		List<Article> articleList = new ArrayList<Article>();
		// Load the Document for each hit and convert it to an Article
		for (ScoreDoc sd : scoreDocs) {
			float score = sd.score; // relevance score of this hit
			int index = sd.doc;     // internal document number
			Document document = is.doc(index);
			// Convert the Document to an Article
			Article article = new Article();
			article.setId(Integer.parseInt(document.get("id")));
			article.setTitle(document.get("title"));
			article.setContext(document.get("context"));
			articleList.add(article);
		}
		is.close();

		for (Article a : articleList) {
			System.out.println(a.getId());
			System.out.println(a.getTitle());
			System.out.println(a.getContext());
		}
	}

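1. Operating on the index

1.1 Keeping the database and the index in sync

Notes:

If a system offers search, the database and the index normally exist side by side, and the data in the index has to stay consistent with the data in the database. The way to achieve this is to apply every insert, delete, and update to the index at the same time as it is applied to the database.

1.2 The DocumentUtils utility class

When to use Index.NOT_ANALYZED

When the field value is an indivisible whole, for example an ID.

When to use Index.ANALYZED

When the field value can be split into separate terms, for example a title or body text.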
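A rough way to picture the difference, in plain Java. The method indexTerms is a hypothetical helper written for this note, and its "analysis" (lower-casing and splitting on non-word characters) is only a crude stand-in for what a real Lucene analyzer does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy model: which terms end up in the index for one field value.
class FieldIndexingSketch {
    static List<String> indexTerms(String value, boolean analyzed) {
        if (!analyzed) {
            // NOT_ANALYZED: the whole value is indexed as one term
            return Collections.singletonList(value);
        }
        // ANALYZED (simplified): lower-case, then split on non-word characters
        List<String> terms = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(indexTerms("A-30", false)); // one term, usable for exact lookup
        System.out.println(indexTerms("A-30", true));  // broken into pieces
        System.out.println(indexTerms("Lucene powers search engines", true));
    }
}
```

An ID that has been split into pieces can no longer be matched as a whole, which is why identifier fields are left unanalyzed.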

1.3 LuceneConfig

The LuceneConfig class wraps the Directory and the Analyzer. Both are needed to create an IndexWriter, and every index-management operation goes through an IndexWriter, so it is convenient to wrap them in one place.

1.4 Managing the index

1.4.1 Adding

	@Test
	public void create() throws Exception {
		Article article = new Article();
		article.setId(1);
		article.setTitle("lucene can be used to build a search engine");
		article.setContext("baidu and google are both good search engines");

		IndexWriter indexWriter = new IndexWriter(LuceneUtil.directory, LuceneUtil.analyzer, MaxFieldLength.LIMITED);
		indexWriter.addDocument(DocumentUtil.Article2docuemnt(article));

		indexWriter.close();
	}

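1.4.2 Deleting

	@Test
	public void delete() throws CorruptIndexException, LockObtainFailedException, IOException {
		IndexWriter indexWriter = new IndexWriter(LuceneUtil.directory, LuceneUtil.analyzer, MaxFieldLength.LIMITED);
		// Term takes (field name, term text): delete every document whose title contains the term "lucene"
		Term term = new Term("title", "lucene");
		indexWriter.deleteDocuments(term);
		indexWriter.close();
	}

Note: indexWriter.deleteDocuments takes a Term, that is, a field name plus a keyword. To delete a single article, match on the id field instead: because id is indexed with Index.NOT_ANALYZED, the ID value can be passed as the term text unchanged.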

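1.4.3 Updating

Note: an update in Lucene does not work like an update in a database. An update may change where keywords occur, so the text would have to be analyzed again and both the term dictionary and the stored content patched in place, which is inefficient. Lucene therefore implements an update as two steps: delete, then add.

	// A helper rather than a @Test: JUnit test methods cannot take parameters
	public void update2(Article article) throws CorruptIndexException, LockObtainFailedException, IOException {
		IndexWriter indexWriter = null;
		try {
			indexWriter = new IndexWriter(LuceneUtil.directory, LuceneUtil.analyzer, MaxFieldLength.LIMITED);
			// Match the existing document by its id field
			Term term = new Term("id", article.getId().toString());
			Document document = DocumentUtil.Article2docuemnt(article);
			// Deletes every document that matches the term, then adds the new one
			indexWriter.updateDocument(term, document);
		} finally {
			if (indexWriter != null) {
				indexWriter.close();
			}
		}
	}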
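The "delete, then add" idea can be shown without Lucene at all. The sketch below is a toy index kept in a list of maps; the class UpdateAsDeleteAddSketch and its methods are hypothetical and only mimic the behaviour described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy index: each document is a map of field name -> value.
class UpdateAsDeleteAddSketch {
    private final List<Map<String, String>> docs = new ArrayList<>();

    void add(Map<String, String> doc) { docs.add(doc); }

    // Removes every document whose field equals the given text; returns how many were removed.
    int delete(String field, String text) {
        int before = docs.size();
        docs.removeIf(d -> text.equals(d.get(field)));
        return before - docs.size();
    }

    // An update is nothing more than a delete followed by an add.
    void update(String field, String text, Map<String, String> doc) {
        delete(field, text);
        add(doc);
    }

    List<Map<String, String>> all() { return docs; }

    static Map<String, String> doc(String id, String title) {
        Map<String, String> m = new HashMap<>();
        m.put("id", id);
        m.put("title", title);
        return m;
    }

    public static void main(String[] args) {
        UpdateAsDeleteAddSketch index = new UpdateAsDeleteAddSketch();
        index.add(doc("1", "old title"));
        index.update("id", "1", doc("1", "new title"));
        System.out.println(index.all());
    }
}
```

One consequence of this design: updating with a term that matches nothing simply adds the new document.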

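5. Lucene's IndexWriter

Creating two IndexWriters on the same index at the same time throws an exception.

A write.lock file appears in the index directory. While an IndexWriter is open on an index, Lucene locks the index so that no other IndexWriter can modify it and leave the data inconsistent; the lock is held until the IndexWriter is closed.

Conclusion: only one IndexWriter may operate on a given index at a time.

LuceneUtils, a class that wraps the IndexWriter

Note: the singleton pattern is a good fit here.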
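Why a singleton helps can be seen in a small plain-Java model. ToyWriter below is a hypothetical stand-in for an index writer that takes a write.lock file when it is created; this is a simplification for illustration and not how Lucene's own locking is implemented. The accessor getWriter plays the role that a LuceneUtils class could play.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

class WriteLockSketch {
    // Stand-in for a writer: holds <dir>/write.lock for as long as it is open.
    static final class ToyWriter implements AutoCloseable {
        private final Path lock;

        ToyWriter(Path dir) throws IOException {
            lock = dir.resolve("write.lock");
            // createFile fails with FileAlreadyExistsException if the lock is already held
            Files.createFile(lock);
        }

        @Override
        public void close() throws IOException {
            Files.deleteIfExists(lock);
        }
    }

    private static ToyWriter shared;

    // Singleton accessor: all callers share one writer, so the lock is taken only once.
    // (Simplification: the directory passed on later calls is ignored.)
    static synchronized ToyWriter getWriter(Path dir) throws IOException {
        if (shared == null) {
            shared = new ToyWriter(dir);
        }
        return shared;
    }

    static synchronized void closeWriter() throws IOException {
        if (shared != null) {
            shared.close();
            shared = null;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("toy-index");
        ToyWriter first = getWriter(dir);
        System.out.println(first == getWriter(dir)); // both calls return the same writer
        try {
            new ToyWriter(dir); // a second, independent writer on the same directory
        } catch (FileAlreadyExistsException e) {
            System.out.println("second writer rejected: lock is held");
        }
        closeWriter();
    }
}
```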

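1. Optimizing the index

After the indexing code has run several times, the index directory looks as shown in the figure (each run indexed the same content):

Conclusion: many repeated adds and deletes make the number of files grow, which also slows searching down. It is therefore worth optimizing the index, reorganizing its files to improve efficiency.

1.1 Merging files manually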

public void optimize() {
		try {
			LuceneUtil.getIndexWrite().optimize();
			LuceneUtil.getIndexWrite().close();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

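After this code has run, the index directory looks like this:

Everything that could be merged has been merged, and the .del files are gone completely.

1.2 Merging files automatically

LuceneUtils.getIndexWriter().setMergeFactor(3) means that once the number of segment files reaches 3, they are merged into one. If the value is not set, the default of 10 applies.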
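The effect of the merge factor can be simulated with a list of segment sizes. The sketch below is a simplified model, loosely in the spirit of a logarithmic merge policy, and not Lucene's actual merge code: every flush adds a segment of size 1, and whenever the last mergeFactor segments have the same size they are merged into one. MergeFactorSketch and flushAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class MergeFactorSketch {
    // Returns the segment sizes left after the given number of flushes.
    static List<Integer> flushAll(int flushes, int mergeFactor) {
        List<Integer> segments = new ArrayList<>();
        for (int f = 0; f < flushes; f++) {
            segments.add(1); // each flush writes one new small segment
            boolean merged = true;
            while (merged) {
                merged = false;
                int n = segments.size();
                if (n < mergeFactor) {
                    break;
                }
                int last = segments.get(n - 1);
                boolean sameSize = true;
                for (int i = n - mergeFactor; i < n; i++) {
                    if (segments.get(i) != last) {
                        sameSize = false;
                    }
                }
                if (sameSize) {
                    // replace the last mergeFactor segments with one merged segment
                    for (int k = 0; k < mergeFactor; k++) {
                        segments.remove(segments.size() - 1);
                    }
                    segments.add(last * mergeFactor);
                    merged = true; // the merged segment may trigger another merge
                }
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        for (int flushes = 1; flushes <= 9; flushes++) {
            System.out.println(flushes + " flushes -> " + flushAll(flushes, 3));
        }
    }
}
```

A small factor keeps the number of files low at the cost of merging more often; a large factor does the opposite.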

1.3 In-memory index

	@Test
	public void show() throws Exception {
		Article article = new Article();
		article.setId(30);
		article.setTitle("Beijing");
		article.setContext("This is a demo");

		// Copy the on-disk index into memory; the document below is written to the in-memory copy
		Directory d = new RAMDirectory(LuceneUtil.directory);
		Analyzer al = new StandardAnalyzer(Version.LUCENE_30);
		IndexWriter iw = new IndexWriter(d, al, MaxFieldLength.LIMITED);

		Document doc = new Document();
		Field idField = new Field("id", article.getId().toString(), Store.YES, Index.NOT_ANALYZED);
		Field titleField = new Field("title", article.getTitle(), Store.YES, Index.ANALYZED);
		Field contextField = new Field("context", article.getContext(), Store.YES, Index.ANALYZED);
		doc.add(idField);
		doc.add(titleField);
		doc.add(contextField);
		iw.addDocument(doc);
		// Close the index writer
		iw.close();
	}

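1. Analyzers

English analyzers

Chinese analyzers

1.2.1 Single-character tokenization

Analyzer analyzer2 = new ChineseAnalyzer();

Splits the text into individual Chinese characters. Efficiency is relatively low.

1.2.2 Bigram tokenization

Analyzer analyzer3 = new CJKAnalyzer(Version.LUCENE_30);

Emits every pair of adjacent characters as a word. Efficiency is also relatively low, and in many cases the resulting words are wrong.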
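The two strategies are easy to reproduce by hand. The sketch below is plain Java and does not call ChineseAnalyzer or CJKAnalyzer; unigrams and bigrams are hypothetical helpers. The sample input is the four-character string 中华人民, written with Unicode escapes so that the source file stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;

class NGramSketch {
    // Single-character tokenization: one token per character (assumes BMP characters).
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    // Bigram tokenization: every pair of adjacent characters becomes a token.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text); // a lone character has no neighbour to pair with
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u534e\u4eba\u6c11"; // the four characters quoted in the lead-in
        System.out.println(unigrams(text)); // four single-character tokens
        System.out.println(bigrams(text));  // three overlapping two-character tokens
    }
}
```

For this input the bigrams are 中华, 华人, and 人民. The middle one straddles two real words, which is the kind of wrong word the bigram approach produces.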

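1.2.3 Dictionary-based tokenization (IKAnalyzer)

Analyzer analyzer4 = new IKAnalyzer();

Splits the text into real words in most cases (the analyzer used most often).

1.2.4 Extending the dictionary

Create the file on the classpath.

IKAnalyzer.cfg.xml is the configuration file for IKAnalyzer.

The ext_stopwords key gives the location of the stop-word file.

The ext_dict key gives the location of your own extension dictionary. As shown in the figure, you can add the words you need to mydict.dic, for example:

中华人民共和国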
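To see why adding a word to the dictionary changes the result, here is a small forward-maximum-matching segmenter in plain Java. This is a classic textbook technique chosen for illustration; it is an assumption of this note and not IKAnalyzer's actual algorithm. DictionarySketch and segment are hypothetical names. The input is the seven-character entry shown above, and the base dictionary holds only 人民 and 共和国 (all written as Unicode escapes to keep the source ASCII-only).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DictionarySketch {
    // At each position take the longest dictionary word; fall back to a single character.
    static List<String> segment(String text, Set<String> dict, int maxWordLength) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLength);
            String word = text.substring(i, i + 1); // fallback: one character
            for (int j = end; j > i + 1; j--) {
                String candidate = text.substring(i, j);
                if (dict.contains(candidate)) {
                    word = candidate;
                    break; // longest match wins
                }
            }
            words.add(word);
            i += word.length();
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u534e\u4eba\u6c11\u5171\u548c\u56fd";
        Set<String> dict = new HashSet<>(Arrays.asList("\u4eba\u6c11", "\u5171\u548c\u56fd"));
        System.out.println(segment(text, dict, 7)); // base dictionary: four tokens

        dict.add(text); // the extension word, as if it had been added to mydict.dic
        System.out.println(segment(text, dict, 7)); // now a single token
    }
}
```

With the base dictionary the result is 中, 华, 人民, 共和国; once the full entry is in the dictionary the whole string comes out as one word.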

1.2.5 Updating the LuceneConfig class

analyzer = new IKAnalyzer();

From here on, IKAnalyzer is the Chinese analyzer in use.

1. Querying the index

MultiFieldQueryParser: the advantage of this class is that it can query several fields at once, whereas QueryParser searches only one default field.

1.1.1 Paging

	/**
	 * @param firstResult
	 *            index of the first hit to return
	 * @param maxResult
	 *            number of hits to return
	 * @throws Exception
	 */
	public void dispage(int firstResult, int maxResult) throws Exception {
		IndexSearcher indexSearcher = new IndexSearcher(LuceneUtil.directory);
		QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_30, new String[] {"title", "context"}, LuceneUtil.analyzer);
		Query query = queryParser.parse("lucene");
		// Fetch enough hits to cover the requested page; a fixed limit of 25 could not serve hits 20 to 29
		TopDocs topdocs = indexSearcher.search(query, firstResult + maxResult);
		ScoreDoc[] scoreDocs = topdocs.scoreDocs;
		List<Article> articles = new ArrayList<Article>();
		// Stop at the number of hits actually returned, so a short last page cannot run past the end of the array
		int count = Math.min(scoreDocs.length, firstResult + maxResult);
		// The bound is firstResult + maxResult, not maxResult alone; otherwise any page with firstResult >= maxResult would be empty
		for (int i = firstResult; i < count; i++) {
			Article article = DocumentUtil.document2Article(indexSearcher.doc(scoreDocs[i].doc));
			articles.add(article);
		}
		indexSearcher.close();
		for (Article article : articles) {
			System.out.println(article.getId());
			System.out.println(article.getTitle());
			System.out.println(article.getContext());
		}
	}

	@Test
	public void findpage() throws Exception {
		this.dispage(20, 10);
	}
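The paging arithmetic on its own, separated from Lucene, looks like this. PageWindowSketch and page are hypothetical names; the list of integers simply stands in for the array of hits, and both arguments are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.List;

class PageWindowSketch {
    // Returns hits [firstResult, firstResult + maxResult), clipped to what is available.
    static <T> List<T> page(List<T> hits, int firstResult, int maxResult) {
        int from = Math.min(firstResult, hits.size());
        int to = Math.min(hits.size(), firstResult + maxResult);
        return new ArrayList<>(hits.subList(from, to));
    }

    public static void main(String[] args) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            hits.add(i); // 25 hits, numbered 0..24
        }
        System.out.println(page(hits, 0, 10));  // a full first page
        System.out.println(page(hits, 20, 10)); // a short last page: only 5 hits remain
        System.out.println(page(hits, 30, 10)); // past the end: empty
    }
}
```

Clipping both ends is what keeps a short or out-of-range page from raising an index error.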

1.1.2 Highlighting

	public void search(Query query) throws Exception {
		IndexSearcher indexSearcher = new IndexSearcher(LuceneUtil.directory);
		TopDocs topdocs = indexSearcher.search(query, 25);
		ScoreDoc[] scoreDocs = topdocs.scoreDocs;
		/************************************************************************/
		// Prefix and suffix wrapped around each matched keyword
		Formatter formatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");

		Scorer fragmentScorer = new QueryScorer(query);
		Highlighter highlighter = new Highlighter(formatter, fragmentScorer);
		// Build the summary from fragments of about 20 characters
		Fragmenter fragmenter = new SimpleFragmenter(20);
		highlighter.setTextFragmenter(fragmenter);
		/************************************************************************/
		List<Article> articleList = new ArrayList<Article>();
		for (ScoreDoc sc : scoreDocs) {
			Document document = indexSearcher.doc(sc.doc);
			Article article = DocumentUtil.document2Article(document);
			String title = highlighter.getBestFragment(LuceneUtil.analyzer, "title", document.get("title"));
			String context = highlighter.getBestFragment(LuceneUtil.analyzer, "context", document.get("context"));
			// getBestFragment returns null when the field does not contain the keyword; keep the original text in that case
			if (title != null) {
				article.setTitle(title);
			}
			if (context != null) {
				article.setContext(context);
			}
			articleList.add(article);
		}
		indexSearcher.close();

		for (Article article : articleList) {
			System.out.println(article.getId());
			System.out.println(article.getTitle());
			System.out.println(article.getContext());
		}
	}
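What the formatter and the fragmenter contribute can be imitated with plain string operations. The sketch below is a rough stand-in and not the Highlighter's real algorithm: bestFragment is a hypothetical helper that cuts a window of fragmentSize characters starting at the first match and wraps each occurrence of the keyword in the given prefix and suffix. It assumes a non-empty, case-sensitive keyword.

```java
class HighlightSketch {
    // Returns a highlighted fragment, or null if the keyword does not occur in the text.
    static String bestFragment(String text, String keyword, String pre, String post, int fragmentSize) {
        int pos = text.indexOf(keyword);
        if (pos < 0) {
            return null; // nothing to highlight
        }
        // The window starts at the match and is at least as long as the keyword itself
        int end = Math.min(text.length(), Math.max(pos + keyword.length(), pos + fragmentSize));
        String fragment = text.substring(pos, end);
        return fragment.replace(keyword, pre + keyword + post);
    }

    public static void main(String[] args) {
        String text = "baidu and google are both good search engines";
        System.out.println(bestFragment(text, "google", "<font color='red'>", "</font>", 20));
        System.out.println(bestFragment(text, "lucene", "<font color='red'>", "</font>", 20));
    }
}
```

The null result for a missing keyword is the case callers have to handle before overwriting a title or body with the highlighted version.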

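1. Introduction to Lucene's core API

1.1.1 IndexWriter

1) This class performs add, delete, and update operations on the index.

2) The constructor call IndexWriter indexWriter = new IndexWriter(directory, LuceneConfig.analyzer, MaxFieldLength.LIMITED) creates an IndexWriter.

3) addDocument adds a Document to the index.

4) updateDocument updates a Document.

5) deleteDocuments deletes the Documents that match a Term.

1.1.2 Directory

Points to the location of the index. There are two kinds of Directory: file-based (FSDirectory) and in-memory (RAMDirectory).

1.1.2.1 FSDirectory

1) FSDirectory.open(new File("./indexDir")) opens a folder named indexDir, and that folder is where the index is stored.

2) If the indexDir folder does not exist when the index is built this way, the program creates it automatically; if it exists, the existing one is used.

3) An index built with this class lives on disk, so the data is stored permanently. That is its advantage.

4) Its drawback is that the program has to access data on disk, which can cause a lot of I/O and reduce performance.

1.1.2.2 RAMDirectory

1) The constructor call Directory ramdirectory = new RAMDirectory(fsdirectory) creates a RAMDirectory.

2) An index built this way occupies space in memory; the constructor copies the contents of fsdirectory into memory.

3) The data in such an index is temporary: once it is gone from memory, the index is gone with it.

4) Because the program works with the index in memory, the benefit of this approach is higher efficiency and faster access.

1.1.3 Document

1) The no-argument constructor creates a Document object: Document doc = new Document();

2) A Directory is made up of many Documents. Content that the user submits from the client is wrapped into a JavaBean on the server side and then converted into a Document.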

	// Convert an Article into a Document (the method name matches the calls to DocumentUtil.Article2docuemnt above)
	public static Document Article2docuemnt(Article article) {
		Document document = new Document();
		Field idField = new Field("id", article.getId().toString(), Store.YES, Index.NOT_ANALYZED);
		Field titleField = new Field("title", article.getTitle(), Store.YES, Index.ANALYZED);
		Field contextField = new Field("context", article.getContext(), Store.YES, Index.ANALYZED);
		document.add(idField);
		document.add(titleField);
		document.add(contextField);

		return document;
	}