Learning How Lucene Works

IsWuqiongqiong

Category: Tools

Article link: https://blog.csdn.net/Emma_Joans/article/details/79709695

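  Lucene is a full-text search engine toolkit from the Apache Software Foundation. It is a mature, free, open-source library that provides the architecture of a full-text search engine, including a complete query engine and a complete indexing engine.

  Example product: the search feature in Eclipse.

  Why use Lucene rather than searching through the database:

  1) With large volumes of data, running searches against the database puts pressure on it and slows queries down.

  2) Lucene is fully decoupled from the database, which reduces database load and improves search efficiency.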

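  Ways to search data:

  1) Sequential scan: every record is examined in turn, which is very slow when the data set is large.

  2) Inverted index: the key information (keywords) is extracted from the resources ahead of time and an index is built from it.

  At search time, the keyword is looked up in the index to find where the matching resources are.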
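  The difference between the two approaches can be sketched in plain Java. This is only a toy illustration, not Lucene code: the class name `ScanVersusIndex`, the sample titles, and the whitespace-based word splitting are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy class, not part of Lucene.
class ScanVersusIndex {

    // Sequential scan: every document is inspected on every query.
    static List<Integer> scan(List<String> docs, String keyword) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            String[] words = docs.get(id).toLowerCase().split("\\s+");
            if (Arrays.asList(words).contains(keyword)) {
                hits.add(id);
            }
        }
        return hits;
    }

    // Inverted index: built once, maps each word to the ids of the documents containing it.
    static Map<String, List<Integer>> buildIndex(List<String> docs) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String word : docs.get(id).toLowerCase().split("\\s+")) {
                List<Integer> ids = index.computeIfAbsent(word, k -> new ArrayList<>());
                // Record each document only once per word.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != id) {
                    ids.add(id);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Lucene in Action",
                "Java programming basics",
                "Search with Lucene and Java");
        // Both approaches find the same documents; only the cost per query differs.
        System.out.println(scan(docs, "lucene"));              // [0, 2]
        System.out.println(buildIndex(docs).get("lucene"));    // [0, 2]
    }
}
```

  The scan touches every document for each query, while the index pays that cost once at build time and then answers each keyword with a single map lookup.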

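  Application scenarios: a. Search inside standalone (desktop) software

  b. On-site search (Baidu Tieba, forums, JD.com, Zhaopin)

  c. Vertical search within a single domain (Zhaopin)

  d. Dedicated search engine companies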

 
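 a) Building the index

 
 Identify the raw content ----> acquire the content ----> create documents ----> analyze documents ----> index documents

 
 b) Searching

 
 User enters a query in the search interface ----> create the query ----> execute the search ----> render the results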

 
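 1) Identify the raw content

  This includes web pages on the internet, data in databases, files on disk, and so on.

 
 2) Acquire the content

  Web pages: a tool crawls the pages and saves them locally as HTML files. Commonly used tools are Nutch (a crawler) and jsoup (an HTML fetching and parsing library). Solr is often mentioned alongside them, but it is a search server built on Lucene, not a crawler.

  Database data: connect to the database directly and read the rows from its tables.

  Files on disk: read the file contents through I/O operations.

 
 3) Create documents

  Turn the raw content into documents (Document). Each document contains a number of fields (Field), and the fields hold the content.

 
 4) Analyze documents

  Analyze the content of each field, breaking it into individual words (tokens).

 
 5) Index documents

  Indexing builds an index over the tokens so that documents can be found from words. This structure is called an inverted index.

  An inverted index (also called a reverse index) has two parts: the index, which is the vocabulary of terms, and the documents. The vocabulary is relatively small, while the document collection is large.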
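  To make the two parts concrete, here is a minimal sketch of an inverted index in plain Java: a term dictionary whose entries point to the ids of the documents containing each term. It is not how Lucene stores its index on disk; the class name `InvertedIndexSketch`, its methods, and the sample terms are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy class, not part of Lucene.
class InvertedIndexSketch {
    // Term dictionary: each term maps to the sorted ids of the documents that contain it.
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    void addDocument(int docId, List<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Ids of the documents containing the term (empty if the term is unknown).
    TreeSet<Integer> lookup(String term) {
        TreeSet<Integer> found = postings.get(term);
        return found == null ? new TreeSet<Integer>() : new TreeSet<Integer>(found);
    }

    // Documents containing both terms: the intersection of the two id sets.
    TreeSet<Integer> and(String a, String b) {
        TreeSet<Integer> result = lookup(a);
        result.retainAll(lookup(b));
        return result;
    }

    // Number of distinct terms in the dictionary.
    int dictionarySize() {
        return postings.size();
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, Arrays.asList("java", "programming"));
        index.addDocument(2, Arrays.asList("lucene", "java", "search"));
        index.addDocument(3, Arrays.asList("lucene", "search"));
        System.out.println(index.lookup("lucene"));       // [2, 3]
        System.out.println(index.and("java", "lucene"));  // [2]
    }
}
```

  A query for a single word is one dictionary lookup, and a query such as "java AND lucene" reduces to intersecting two id sets, without reading any document text.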

 
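  // Indexing example. The Version constant used below implies the Lucene 4.10.3 API.
  // BookDao, BookDaoImpl and Book are the author's own classes and are not shown here.
  // The wrapper class name CreateIndex is only a placeholder.
  import java.io.File;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field.Store;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class CreateIndex {
      public static void main(String[] args) throws Exception {
          // 1. Collect the data
          BookDao bookDao = new BookDaoImpl();
          List<Book> bookList = bookDao.queryBookList();

          // 2. Create the Document objects
          List<Document> documents = new ArrayList<>();
          for (Book book : bookList) {
              Document document = new Document();
              // Add Fields to the Document.
              // Store.YES: the original value is stored with the document, so search results can display it.
              // Book id
              document.add(new TextField("id", book.getId().toString(), Store.YES));
              // Book name
              document.add(new TextField("name", book.getName().toString(), Store.YES));
              // Book price
              document.add(new TextField("price", book.getPrice().toString(), Store.YES));
              // Book image path
              document.add(new TextField("pic", book.getPic().toString(), Store.YES));
              // Book description
              document.add(new TextField("desc", book.getDesc().toString(), Store.YES));
              documents.add(document);
          }

          // 3. Create the analyzer
          Analyzer analyzer = new StandardAnalyzer();
          // 4. Create the IndexWriterConfig
          IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
          // 5. Create the Directory, which points at the location of the index
          Directory directory = FSDirectory.open(new File("G:\\JavaEE60\\index"));
          // 6. Create the IndexWriter
          IndexWriter indexWriter = new IndexWriter(directory, config);
          // 7. Write the Documents to the index
          for (Document doc : documents) {
              indexWriter.addDocument(doc);
          }
          // 8. Release resources (close() also commits pending changes)
          indexWriter.close();
      }
  }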

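  The search process involves:

  1) The user

  2) The search interface

  3) Creating the query

  4) Executing the search

  5) Rendering the results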

 
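  // Search example, written against the same Lucene 4.10.3 API as the indexing code.
  // The wrapper class name SearchIndex is only a placeholder.
  import java.io.File;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.queryparser.classic.QueryParser;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class SearchIndex {
      public static void main(String[] args) throws Exception {
          // 1. Create the Query object
          // Create the analyzer (use the same one that was used for indexing)
          Analyzer analyzer = new StandardAnalyzer();
          // Create the query parser: the first argument is the default field, the second is the analyzer
          QueryParser queryParser = new QueryParser("name", analyzer);
          // Parse the query string. A field prefix applies only to the term right after it, so with
          // "name" as the default field this is parsed as desc:java AND name:lucene.
          // Write desc:(java AND lucene) to require both terms in desc.
          Query query = queryParser.parse("desc:java AND lucene");

          // 2. Create the Directory, which points at the location of the index
          Directory directory = FSDirectory.open(new File("G:\\JavaEE60\\index"));
          // 3. Create the IndexReader
          IndexReader reader = DirectoryReader.open(directory);
          // 4. Create the IndexSearcher
          IndexSearcher search = new IndexSearcher(reader);
          // 5. Execute the search and get the top 10 hits
          TopDocs top = search.search(query, 10);
          ScoreDoc[] docs = top.scoreDocs;

          // 6. Walk through the result set
          for (ScoreDoc scoreDoc : docs) {
              // Fetch the stored document by its internal id
              int docID = scoreDoc.doc;
              Document doc = search.doc(docID);
              System.out.println("=============================");
              System.out.println("docID:" + docID);
              System.out.println("bookId:" + doc.get("id"));
              System.out.println("name:" + doc.get("name"));
              System.out.println("price:" + doc.get("price"));
              System.out.println("pic:" + doc.get("pic"));
              // System.out.println("desc:" + doc.get("desc"));
          }

          // 7. Release resources
          reader.close();
      }
  }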

  Tokenization: the collected data is stored in the Field objects of a Document. Tokenization splits the value of each Field in the Document into individual words (tokens).

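  Filtering: this includes removing punctuation, removing stop words (的, 是, a, an, the, and so on), converting uppercase to lowercase, and reducing words to their base form (plural to singular, past tense to present tense, and so on).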
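  The tokenize-then-filter pipeline can be sketched in plain Java as follows. This is a simplified stand-in for what a Lucene Analyzer does, not Lucene's own implementation: the class name `AnalyzerSketch`, the three-word English stop list, and the naive plural stripping are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy class, not part of Lucene.
class AnalyzerSketch {
    // A tiny illustrative stop word list (assumption, not Lucene's list).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Tokenize: split on anything that is not a letter or digit, which also drops punctuation.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(Locale.ROOT);   // uppercase to lowercase
            if (STOP_WORDS.contains(token)) {
                continue;                                  // stop word removal
            }
            tokens.add(stem(token));                       // reduce to a base form
        }
        return tokens;
    }

    // Very naive plural stripping; real stemmers handle far more cases.
    static String stem(String token) {
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Books: a Guide to Lucene!"));  // [book, guide, to, lucene]
    }
}
```

  The same pipeline has to run on the query text at search time, otherwise a query for "Books" would never match the indexed term "book".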

 
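 A Field is a field within a Document. Each Field has three properties:

 
 Whether it is tokenized (split into words during analysis).

 
 Whether it is indexed (searchable).

 
 Whether it is stored (its original value is kept so it can be returned with search results).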
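 How the three properties interact can be shown with a toy model in plain Java. This is not Lucene's API: the class `FieldOptionsSketch` and the per-field choices for the book example are hypothetical. In Lucene itself the common combinations are provided by classes such as StringField (indexed, not tokenized), TextField (indexed and tokenized), and StoredField (stored only).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical toy model of a field and its three properties, not part of Lucene.
class FieldOptionsSketch {
    final String name;
    final String value;
    final boolean tokenized; // split the value into words before indexing?
    final boolean indexed;   // make the value searchable?
    final boolean stored;    // keep the original value so results can display it?

    FieldOptionsSketch(String name, String value,
                       boolean tokenized, boolean indexed, boolean stored) {
        this.name = name;
        this.value = value;
        this.tokenized = tokenized;
        this.indexed = indexed;
        this.stored = stored;
    }

    // Terms that would be added to the inverted index for this field.
    List<String> indexedTerms() {
        if (!indexed) {
            return new ArrayList<>();            // not searchable at all
        }
        if (!tokenized) {
            return Arrays.asList(value);         // the whole value is one term
        }
        return Arrays.asList(value.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Value returned with a search hit, or null when the field is not stored.
    String storedValue() {
        return stored ? value : null;
    }

    public static void main(String[] args) {
        // Example choices only; the right settings depend on the application.
        FieldOptionsSketch id = new FieldOptionsSketch("id", "1001", false, true, true);
        FieldOptionsSketch name = new FieldOptionsSketch("name", "Lucene in Action", true, true, true);
        FieldOptionsSketch pic = new FieldOptionsSketch("pic", "lucene.jpg", false, false, true);
        FieldOptionsSketch desc = new FieldOptionsSketch("desc", "A guide to search", true, true, false);
        for (FieldOptionsSketch f : Arrays.asList(id, name, pic, desc)) {
            System.out.println(f.name + " terms=" + f.indexedTerms() + " stored=" + f.storedValue());
        }
    }
}
```

 In this sketch an id is searchable as one exact term, an image path is only kept for display, and a long description is searchable without being stored.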
