Building Your Own Search Engine with Lucene (2): Setting Up the Environment and Building the Index Files with Indexer

w踏雪w

Article link: https://blog.csdn.net/wen294299195/article/details/8578979

Source: http://www.wenbanana.com/?p=708

1. Downloading the Lucene package

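I am learning Lucene from the second edition of "Lucene in Action", which uses Lucene 3.x in its examples, so I downloaded version 3.6.2 (the file is lucene-3.6.2.zip). I recommend that you use a 3.x release as well: the API differs between versions, and code written for one version may fail to compile or behave differently on another.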

Official download page: http://www.apache.org/dyn/closer.cgi/lucene/java/3.6.2

2. Using lucene-core-3.6.2.jar

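After downloading, simply unzip the archive anywhere on disk; then we can start writing code. A search engine boils down to three steps: (1) crawling web pages, (2) building an index, and (3) serving user queries. Strictly speaking, crawling comes first, but this series is about retrieving documents, so we skip the crawler and concentrate on search. Before we can search anything, we need an indexing program, Indexer, that builds the index files the engine will search.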
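To make steps (2) and (3) concrete before bringing in Lucene, here is a minimal sketch of an inverted index in plain Java. It is only an illustration of the idea and not how Lucene works internally; the class name InvertedIndexSketch and its methods are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index (hypothetical, for illustration only):
// maps each lower-cased word to the ids of the documents that contain it.
class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // "Build the index" step: split the text into words and record each word.
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // "Serve the user" step: look up one word without scanning every document.
    Set<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<Integer>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Lucene is a search library");
        index.add("An index makes search fast");
        System.out.println(index.search("search")); // prints [0, 1]
        System.out.println(index.search("lucene")); // prints [0]
    }
}
```

The point of the sketch is that the work is done once, at indexing time, so that a query becomes a map lookup. Lucene's Indexer plays the role of add() here, with the index stored in files on disk.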

3. Creating the Indexer program

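step1: Put lucene-core-3.6.2.jar on the classpath. You can either add it to the CLASSPATH environment variable or, in your Java project, right-click the project, open Properties, and add the jar there. I used the second approach.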
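If you are not sure whether the jar was picked up, a quick check like the one below may help. It is a small sketch of my own (the class name ClasspathCheck is hypothetical) that only asks the JVM whether a given class can be found; org.apache.lucene.index.IndexWriter is one of the classes the Indexer imports.

```java
// Checks whether a class can be located, i.e. whether its jar is on the classpath.
class ClasspathCheck {
    static boolean isOnClasspath(String className) {
        try {
            // false = locate the class without running its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.lucene.index.IndexWriter";
        System.out.println(name
                + (isOnClasspath(name) ? " found" : " NOT found; check the classpath"));
    }
}
```

Run it with the same classpath settings you plan to use for the Indexer.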

step2: Create a Java project named LuceneInAction. The Indexer class used in this post lives in a package named Lucene.

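step3: Before writing any code, create two folders: an index folder to hold the index files and a data folder to hold the data files (such as .txt files). The folders can be placed anywhere; I created them inside the directory where I unzipped Lucene, so in the Java code they are written as "E:\\lucene-3.6.2\\index" and "E:\\lucene-3.6.2\\data".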

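Next, create a few .txt files in the data folder to be indexed. Their content should be in English: we have not added a Chinese word-segmentation analyzer yet, so for now only English text is handled properly.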
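If you would rather create the folders and sample files from code than by hand, something along these lines should work. It is a sketch under my own assumptions: the class name PrepareFolders, the default base folder lucene-demo, and the sample sentences are all made up, so pass your own base directory (for example E:\lucene-3.6.2) as the first argument.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class PrepareFolders {
    // Creates <base>/index and <base>/data, then writes a few English .txt files into data.
    static Path prepare(Path base) throws IOException {
        Files.createDirectories(base.resolve("index"));
        Path data = Files.createDirectories(base.resolve("data"));
        String[] samples = {
            "Lucene is a full-text search library written in Java.",
            "An index lets the search engine find documents quickly.",
            "This file is plain English text for the indexer."
        };
        for (int i = 0; i < samples.length; i++) {
            // Files are named 1.txt, 2.txt, 3.txt
            Files.write(data.resolve((i + 1) + ".txt"),
                    samples[i].getBytes(StandardCharsets.UTF_8));
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        // The default base folder is an assumption for this sketch.
        Path base = Paths.get(args.length > 0 ? args[0] : "lucene-demo");
        System.out.println("Sample files written to " + prepare(base).toAbsolutePath());
    }
}
```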

step4: Now we can write the code:

package Lucene;

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;


import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * Builds a Lucene index over the .txt files in a directory
 * @author Administrator
 *
 */
public class Indexer {

	/**
	 * @param args not used
	 */
	public static void main(String[] args) throws Exception{
	
		String indexDir = "E:\\lucene-3.6.2\\index";// directory in which the index files are created
		String dataDir = "E:\\lucene-3.6.2\\data";// directory whose ".txt" files are indexed
		
		long start = System.currentTimeMillis();
		Indexer indexer = new Indexer(indexDir);
		int numIndexed;
		try{
			numIndexed = indexer.index(dataDir, new TextFilesFilter());
		}finally{
			indexer.close();
		}
		long end = System.currentTimeMillis();
		
		// Prints the number of indexed files and the elapsed time in milliseconds
		System.out.println("索引 "+ numIndexed + " 文件花费 "+
		(end - start) + "ms");
		
	}
	
	
	private IndexWriter writer;
	
	// Create the Lucene IndexWriter
	public Indexer(String indexDir)throws IOException{
		Directory dir = FSDirectory.open(new File(indexDir));
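		/*
		 * Version.LUCENE_30 is a version parameter: Lucene matches its
		 * environment and behavior to the release it names.
		 * The third argument (true) creates a new index, overwriting any existing one.
		 * This constructor follows the book; it is marked deprecated in Lucene 3.6.2
		 * in favor of the IndexWriterConfig-based constructor.
		 */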
		writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), true,
				IndexWriter.MaxFieldLength.UNLIMITED);
	}
	
	// Close the IndexWriter
	public void close()throws IOException{
		writer.close();
	}
	
	
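	// Index every file in dataDir that passes the filter; returns the number of indexed documents
	public int index(String dataDir, FileFilter filter)throws Exception{
		File[] files = new File(dataDir).listFiles();
		if(files == null){
			// listFiles() returns null if dataDir does not exist or is not a directory
			throw new IOException(dataDir + " is not a readable directory");
		}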
		
		for(File f:files){
			if(!f.isDirectory() &&
					!f.isHidden()&&
					f.exists()&&
					f.canRead()&&
					(filter == null || filter.accept(f))){
				indexFile(f);
			}
		}
		return writer.numDocs();
	}
	
	// Index .txt files only, using a FileFilter
	private static class TextFilesFilter implements FileFilter{

		@Override
		public boolean accept(File pathname) {
			// Accept a file only if its name ends with .txt (case-insensitive)
			return pathname.getName().toLowerCase().endsWith(".txt");
		}
		
	}
	
	protected Document getDocument(File f) throws Exception{
		Document doc = new Document();
		doc.add(new Field("contents", new FileReader(f)));// index the file contents
		doc.add(new Field("filename", f.getName(),// index the file name
				Field.Store.YES, Field.Index.NOT_ANALYZED));
		doc.add(new Field("fullpath", f.getCanonicalPath(),// index the full path of the file
				Field.Store.YES, Field.Index.NOT_ANALYZED));
		
		return doc;
	}
	
	// Add a document to the Lucene index
	private void indexFile(File f) throws Exception{
		System.out.println("Indexing "+f.getCanonicalPath());
		Document doc = getDocument(f);
		writer.addDocument(doc);
	}

}
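
The file-selection logic in index() can be tried on its own, without Lucene. The sketch below is my own illustration (the class name TxtFilterDemo and the sample file names are made up). It combines the "regular file" check and the ".txt" check into one FileFilter and passes it straight to listFiles(FileFilter), which is an alternative to filtering inside the loop as the Indexer does.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Locale;

class TxtFilterDemo {
    // Accepts regular files whose name ends with .txt, ignoring case.
    static final FileFilter TXT_ONLY = new FileFilter() {
        @Override
        public boolean accept(File f) {
            return f.isFile() && f.getName().toLowerCase(Locale.ROOT).endsWith(".txt");
        }
    };

    // Returns the sorted names of the files in dir that the filter accepts.
    static String[] acceptedNames(File dir) {
        File[] files = dir.listFiles(TXT_ONLY);
        if (files == null) {
            return new String[0]; // dir is missing or not a directory
        }
        String[] names = new String[files.length];
        for (int i = 0; i < files.length; i++) {
            names[i] = files[i].getName();
        }
        Arrays.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("txt-filter-demo").toFile();
        new File(dir, "a.txt").createNewFile();
        new File(dir, "B.TXT").createNewFile();
        new File(dir, "notes.md").createNewFile();
        System.out.println(Arrays.toString(acceptedNames(dir))); // prints [B.TXT, a.txt]
    }
}
```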

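Now compile and run the program. If nothing goes wrong, it prints output like the following: one line per indexed file, then a summary line with the number of files indexed and the elapsed time in milliseconds.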

Indexing E:\lucene-3.6.2\data\1.txt
Indexing E:\lucene-3.6.2\data\2.txt
Indexing E:\lucene-3.6.2\data\3.txt
Indexing E:\lucene-3.6.2\data\4.txt
索引 4 文件花费 259ms

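A number of new files will also appear in the index folder, which shows that the index was built successfully. Some of the code above may still be puzzling. Don't worry: I will explain these classes one by one in later posts, and for now a general idea is enough.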