Build Your Own Search Engine (常搜吧 Journey, Part 3: #Search#) (Java, Lucene, Hadoop)

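This article shows how to build a full-text search engine with Java and the Lucene library. It walks through the search process by way of Lucene's key classes, such as IndexSearcher, Term, and Query, and then indexes the contents of a data.txt file as a worked example of how Lucene operates.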

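Commonly used Lucene search classes

1. IndexSearcher: the core search component. It performs read-only searches against an index created by IndexWriter: it takes a Query object and returns a TopDocs object whose scoreDocs array holds the ScoreDoc entries.

2. Term: the basic unit of search. It names the field to search and the value to look for; for example, Term("title", "lucene") means searching the title field for the keyword lucene.

3. Query: the abstract class that represents a query; concrete queries are built from the corresponding Term.

4. TermQuery: the most basic query type; it matches documents whose field contains the specified value.

5. TopDocs: the class that holds the query results.

6. ScoreDoc (Hits): the array of ScoreDoc entries is the container for the result queue; each entry points to one matching document.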
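To make the division of labor between these classes more concrete, here is a toy in-memory inverted index in plain Java. It is only a sketch of the idea and does not use the Lucene API: the class ToyIndex and its methods are made up for illustration, tokens are split on whitespace only, and it assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: this is NOT the Lucene API, just an illustration of the idea.
class ToyIndex {
    // "field:value" -> ids of the documents that contain that term
    private final Map<String, List<Integer>> postings = new HashMap<>();
    // Stored fields of each document, addressed by document id
    private final List<Map<String, String>> docs = new ArrayList<>();

    // Store a document and record each whitespace-separated token under its field.
    int addDocument(Map<String, String> fields) {
        int docId = docs.size();
        docs.add(fields);
        for (Map.Entry<String, String> e : fields.entrySet()) {
            for (String token : e.getValue().toLowerCase().split("\\s+")) {
                String key = e.getKey() + ":" + token;
                List<Integer> list = postings.computeIfAbsent(key, k -> new ArrayList<>());
                if (!list.contains(docId)) {
                    list.add(docId);
                }
            }
        }
        return docId;
    }

    // Rough analogue of running a TermQuery: look up one (field, value) pair.
    List<Integer> search(String field, String value) {
        return postings.getOrDefault(field + ":" + value, new ArrayList<>());
    }

    // Rough analogue of fetching a stored document by its id.
    Map<String, String> doc(int docId) {
        return docs.get(docId);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        String[] names = {"zhangsan", "lisi", "wangwu", "zhaoliu"};
        String[] addresses = {"shanghai", "beijing", "guangzhou", "nanjing"};
        for (int i = 0; i < names.length; i++) {
            Map<String, String> fields = new HashMap<>();
            fields.put("name", names[i]);
            fields.put("address", addresses[i]);
            index.addDocument(fields);
        }
        // Look up the term (address, beijing) and print the stored name of each hit.
        for (int docId : index.search("address", "beijing")) {
            System.out.println(docId + " " + index.doc(docId).get("name"));
        }
    }
}
```

The (field, value) pair plays the role of a Term, the lookup plays the role of a TermQuery, and the returned id list stands in for the ScoreDoc array. Real Lucene also scores and ranks the hits, which this sketch leaves out.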


First, let's create an indexing class:

package com.qianyan.luceneIndex;

import java.io.IOException;


import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexTest {

	public static void main(String[] args) throws IOException{
	
		String[] ids = {"1", "2", "3", "4"};
		String[] names = {"zhangsan", "lisi", "wangwu", "zhaoliu"};
		String[] addresses = {"shanghai", "beijing", "guangzhou", "nanjing"};
		String[] birthdays = {"19820720", "19840203", "19770409", "19830130"};
		Analyzer analyzer = new StandardAnalyzer();
		String indexDir = "E:/luceneindex";
		Directory dir = FSDirectory.getDirectory(indexDir);
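		// true: create the index, overwriting any existing one; false: append to the existing index
		// MaxFieldLength.LIMITED indexes at most the first 10,000 terms of each field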
		IndexWriter writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
		for(int i = 0; i < ids.length; i++){
			Document document = new Document();
			document.add(new Field("id", ids[i], Field.Store.YES, Field.Index.ANALYZED));
			document.add(new Field("name", names[i], Field.Store.YES, Field.Index.ANALYZED));
			document.add(new Field("address", addresses[i], Field.Store.YES, Field.Index.ANALYZED));
			document.add(new Field("birthday", birthdays[i], Field.Store.YES, Field.Index.ANALYZED));
			writer.addDocument(document);
		}
		writer.optimize();
		writer.close();
	}
	
}

Now let's look at a simple search class:

package com.qianyan.lucene;

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TestSeacher {

	public static void main(String[] args) throws IOException {
		String indexDir = "E:/luceneindex";
		Directory dir = FSDirectory.getDirectory(indexDir);
		IndexSearcher searcher = new IndexSearcher(dir);
		ScoreDoc[] hits = null;
		
		Term term = new Term("id", "2");
		TermQuery query = new TermQuery(term);
		TopDocs topDocs = searcher.search(query, 5);
		
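		/* Range search: 19820720 - 19830130; true means both endpoints are included.
		   Each commented-out block below is an alternative to the TermQuery block above; enable one at a time.
		Term beginTerm = new Term("birthday", "19820720");
		Term endTerm = new Term("birthday", "19830130");
		RangeQuery rangeQuery = new RangeQuery(beginTerm, endTerm, true);
		TopDocs topDocs = searcher.search(rangeQuery, 5);
		*/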
		
		/* Prefix search:
		Term term = new Term("name", "z");
		PrefixQuery preQuery = new PrefixQuery(term);
		TopDocs topDocs = searcher.search(preQuery, 5);
		*/
		
		/* Fuzzy search: e.g. when searching for name = zhangsan, records whose name is zhangsun or zhangsin are also returned
		Term term = new Term("name", "zhangsan");
		FuzzyQuery fuzzyQuery = new FuzzyQuery(term);
		TopDocs topDocs = searcher.search(fuzzyQuery, 5);
		*/
		

		/* Wildcard search: * matches any sequence of characters, ? matches exactly one character
		Term term = new Term("name", "*g??");
		WildcardQuery wildcardQuery = new WildcardQuery(term);
		TopDocs topDocs = searcher.search(wildcardQuery, 5);
		*/
		
		/* Combined query with multiple conditions
		Term nterm = new Term("name", "*g??");
		WildcardQuery wildcardQuery = new WildcardQuery(nterm);
		
		Term aterm = new Term("address", "nanjing");
		TermQuery termQuery = new TermQuery(aterm);
		
		BooleanQuery query = new BooleanQuery();
		query.add(wildcardQuery, BooleanClause.Occur.MUST); // SHOULD means "or" (optional); MUST means the clause is required
		query.add(termQuery, BooleanClause.Occur.MUST);
		
		TopDocs topDocs = searcher.search(query, 10);
		*/
		 
		hits = topDocs.scoreDocs;
		
		for(int i = 0; i < hits.length; i++){
			Document doc = searcher.doc(hits[i].doc);
			//System.out.println(hits[i].score);
			System.out.print(doc.get("id") + " ");
			System.out.print(doc.get("name") + " ");
			System.out.print(doc.get("address") + " ");
			System.out.println(doc.get("birthday") + " ");
		}
		
		searcher.close();
		dir.close();
	}
}
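To see what the range, prefix, and wildcard variants would select from our four sample records, here is a plain-Java approximation of their matching rules. The class MatchRules and its helper methods are hypothetical and are not Lucene calls; the range check relies on the birthdays being fixed-width yyyyMMdd strings, so that string order equals date order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java approximations of the matching rules; hypothetical helpers, not Lucene calls.
class MatchRules {
    // Inclusive range over values compared as strings.
    static boolean inRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    // Prefix match: the value must start with the given prefix.
    static boolean hasPrefix(String value, String prefix) {
        return value.startsWith(prefix);
    }

    // Wildcard match: '*' stands for any run of characters, '?' for exactly one character.
    static boolean wildcard(String pattern, String value) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                // Quote everything else so it is matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), value);
    }

    public static void main(String[] args) {
        String[] names = {"zhangsan", "lisi", "wangwu", "zhaoliu"};
        String[] birthdays = {"19820720", "19840203", "19770409", "19830130"};
        List<String> range = new ArrayList<>();
        List<String> prefix = new ArrayList<>();
        List<String> wild = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            if (inRange(birthdays[i], "19820720", "19830130")) range.add(names[i]);
            if (hasPrefix(names[i], "z")) prefix.add(names[i]);
            if (wildcard("*g??", names[i])) wild.add(names[i]);
        }
        System.out.println("range:    " + range);
        System.out.println("prefix:   " + prefix);
        System.out.println("wildcard: " + wild);
    }
}
```

On the sample data, the birthday range keeps zhangsan and zhaoliu, the prefix z keeps the same two names, and *g?? keeps only wangwu, since it requires a g followed by exactly two more characters at the end of the name.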

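The fuzzy variant is based on how many single-character edits separate two words. Below is a minimal sketch of that edit distance (Levenshtein distance) in plain Java. The class EditDistance is made up for illustration, and how Lucene turns the distance into a similarity threshold is not shown here.

```java
// Levenshtein edit distance; an illustrative helper, not part of Lucene.
class EditDistance {
    // Minimum number of single-character insertions, deletions, or substitutions turning a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String query = "zhangsan";
        String[] candidates = {"zhangsun", "zhangsin", "lisi"};
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + distance(query, candidate));
        }
    }
}
```

zhangsun and zhangsin are each one substitution away from zhangsan, which is why a fuzzy search for zhangsan would also return them.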

Next, let's look at a full-text indexing example; data.txt is listed at the very end of this article. First we build an index over the articles:

package com.qianyan.lucene;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TestFileReaderForIndex{

	public static void main(String[] args) throws IOException{
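		File file = new File("E:/data.txt");
		FileReader fRead = new FileReader(file);
		// A single read() call may return fewer characters than requested, so loop until end of file.
		StringBuilder sb = new StringBuilder();
		char[] chs = new char[60000];
		int len;
		while((len = fRead.read(chs)) != -1)
			sb.append(chs, 0, len);
		fRead.close();
		
		String strtemp = sb.toString();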
		String[] strs = strtemp.split("Database: Compendex");
		
		System.out.println(strs.length);
		for(int i = 0; i < strs.length; i++)
			strs[i] = strs[i].trim();
		
		Analyzer analyzer = new StandardAnalyzer();
		String indexDir = "E:/luceneindex";
		Directory dir = FSDirectory.getDirectory(indexDir);
		
		// false: append to the existing index, so run IndexTest first to create it
		IndexWriter writer = new IndexWriter(dir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED);
		
		for(int i = 0; i < strs.length; i++){
			Document document = new Document();
			document.add(new Field("contents", strs[i], Field.Store.YES, Field.Index.ANALYZED));
			writer.addDocument(document);
		}
		
		writer.optimize();
		writer.close();
		dir.close();
		System.out.println("index ok!");
	}
}
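The splitting step is worth a closer look: every record in data.txt ends with the line "Database: Compendex", so that line serves as the record separator. Here is a small stand-alone sketch of the same step that also drops blank pieces, so that no empty document gets indexed. The class RecordSplitter and the shortened sample text are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative helper for cutting the raw text into one string per record.
class RecordSplitter {
    // Split on the literal delimiter, trim each piece, and skip pieces that are blank.
    static List<String> splitRecords(String text, String delimiter) {
        List<String> records = new ArrayList<>();
        // Pattern.quote makes split() treat the delimiter as plain text, not as a regex.
        for (String part : text.split(Pattern.quote(delimiter))) {
            String record = part.trim();
            if (!record.isEmpty()) {
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the contents of data.txt.
        String text = "1. First title\nAuthor A Source: Journal X\nDatabase: Compendex\n \n"
                + " 2.  Second title\nAuthor B Source: Journal Y\nDatabase: Compendex\n \n";
        List<String> records = splitRecords(text, "Database: Compendex");
        System.out.println(records.size());
        for (String record : records) {
            System.out.println("[" + record + "]");
        }
    }
}
```

Each string in the returned list would then become the contents field of one Document.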

Now let's run a simple search against the index we just appended to:

package com.qianyan.lucene;

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TestSeacher2 {

	public static void main(String[] args) throws IOException {
		String indexDir = "E:/luceneindex";
		Directory dir = FSDirectory.getDirectory(indexDir);
		IndexSearcher searcher = new IndexSearcher(dir);
		ScoreDoc[] hits = null;
		
		Term term = new Term("contents", "ontology");
		TermQuery query = new TermQuery(term);
		TopDocs topDocs = searcher.search(query, 126);
	
		hits = topDocs.scoreDocs;
		
		for(int i = 0; i < hits.length; i++){
			Document doc = searcher.doc(hits[i].doc);
			System.out.print(hits[i].score + " ");
			System.out.println(doc.get("contents"));
		}
		
		searcher.close();
		dir.close();
	}
}

That covers the basic search approaches.


The contents of data.txt are as follows:


1. Modeling the adsorption of CD(II) onto Muloorina illite and related clay minerals
Lackovic, Kurt (La Trobe University, P.O. Box 199, Bendigo, Vic. 3552, Australia)  Angove, Michael J.  Wells, John D.  Johnson, Bruce B. Source: Journal of Colloid and Interface Science, v 257, n 1, p 31-40, 2003
Database: Compendex
 
 
 
 2.  Experimental Study of the Adsorption of an Ionic Liquid onto Bacterial and Mineral Surfaces
Gorman-Lewis, Drew J. (Civ. Eng. and Geological Sciences, University of Notre Dame, Notre Dame, IN 46556-0767, United States)  Fein, Jeremy B. Source: Environmental Science and Technology, v 38, n 8, p 2491-2495, April 15, 2004
Database: Compendex
 
 
 
 3.  Grafting of hyperbranched polymers onto ultrafine silica: Postgraft polymerization of vinyl monomers initiated by pendant initiating groups of polymer chains grafted onto the surface
Hayashi, Shinji (Grad. Sch. of Science and Technology, Niigata Univ., 8050, I., Niigata, Japan)  Fujiki, Kazuhiro  Tsubokawa, Norio Source: Reactive and Functional Polymers, v 46, n 2, p 193-201, December 2000
Database: Compendex
 
 
 
 4.  The influence of pH, electrolyte type, and surface coating on arsenic(V) adsorption onto kaolinites
Cornu, Sophie (Unité de Sciences du Sol, INRA d'Orléans, av. de la Pomme de pin, Ardon, 45166 Olivet Cedex, France)  Breeze, Dominique  Saada, Alain  Baranger, Philippe Source: Soil Science Society of America Journal, v 67, n 4, p 1127-1132, July/August 2003
Database: Compendex
 
 
 
 5.  Adsorption behavior of statherin and a statherin peptide onto hydroxyapatite and silica surfaces by in situ ellipsometry
Santos, Olga (Biomedical Laboratory Science and Biomedical Technology, Faculty of Health and Society, Malmö University, SE-20506 Malmö, Sweden)  Kosoric, Jelena  Hector, Mark Prichard  Anderson, Paul  Lindh, Liselott Source: Journal of Colloid and Interface Science, v 318, n 2, p 175-182, Febrary 15, 2008
Database: Compendex
 
 
 
 6.  Sorption of surfactant used in CO2 flooding onto five minerals and three porous media
Grigg, R.B. (SPE, New Mexico Recovery Research Center)  Bai, B. Source: Proceedings - SPE International Symposium on Oilfield Chemistry, p 331-342, 2005, SPE International Symposium on Oilfield Chemistry Proceedings
Database: Compendex
 
 
 
 7.  Influence of charge density, sulfate group position and molecular mass on adsorption of chondroitin sulfate onto coral
Volpi, Nicola (Department of Animal Biology, Biological Chemistry, University of Modena and Reggio Emilia, Via Campi 213/d, 41100 Modena, Italy) Source: Biomaterials, v 23, n 14, p 3015-3022, 2002
Database: Compendex
 
 
 
 8.  Kinetic consequences of carbocationic grafting and blocking from and onto
Ivan, Bela (Univ of Akron, United States) Source: Polymer Bulletin, v 20, n 4, p 365-372, Oct
Database: Compendex
 
 
 
 9.  Assemblies of concanavalin A onto carboxymethylcellulose
Castro, Lizandra B.R. (Instituto de Química, Universidade de São Paulo, Av. Prof. Lineu Prestes 748, 05508-900, São Paulo, Brazil)  Petri, Denise F.S. Source: Journal of Nanoscience and Nanotechnology, v 5, n 12, p 2063-2069, December 2005
Database: Compendex
 
 
 
 10.  Surface grafting of polymers onto glass plate: Polymerization of vinyl monomers initiated by initiating groups introduced onto the surface
Tsubokawa, Norio (Niigata Univ, Niigata, Japan)  Satoh, Masayoshi Source: Journal of Applied Polymer Science, v 65, n 11, p 2165-2172, Sep 12
Database: Compendex
 
 
 
 11.  Photografting of vinyl polymers onto ultrafine inorganic particles: photopolymerization of vinyl monomers initiated by azo groups introduced onto these surfaces
Tsubokawa, Norio (Niigata Univ, Niigata, Japan)  Shirai, Yukio  Tsuchida, Hideyo  Handa, Satoshi Source: Journal of Polymer Science, Part A: Polymer Chemistry, v 32, n 12, p 2327-2332, Sept
Database: Compendex
 
 
 
 12.  Graft polymerization of methyl methacrylate initiated by pendant azo groups introduced onto γ-poly (glutamic acid)
Tsubokawa, Norio (Niigata Univ, Niigita, Japan)  Inagaki, Masatoshi  Endo, Takeshi Source: Journal of Polymer Science, Part A: Polymer Chemistry, v 31, n 2, p 563-568, Feb
Database: Compendex
 
 
 
 13.  The sorpt