Building Site Search with Lucene and Paoding

Latest recommended article published 2024-09-19 15:10:23

mudan1

Category: Lucene. Tags: lucene quartz junit Bean DAO

Article link: https://blog.csdn.net/mudan1/article/details/83851947

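Goal: build a highly portable site search that rebuilds its index on a schedule.
Approach: keep both the dictionary (dic) and the index inside the application itself.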

Preparation:
1 Lucene
Lucene Java (hereafter Lucene) is currently available as version 2.4.0. For details see http://lucene.apache.org/java/docs/index.html.

2 Paoding
Paoding is Qieqie's excellent Chinese word segmentation component for Lucene. The current version is paoding-analysis-2.0.4-beta, which targets Lucene 2.2. For details see http://code.google.com/p/paoding/.

3 Download the latest paoding-analysis-2.0.4-beta release (it bundles lucene-core-2.2.0.jar, lucene-analyzers-2.2.0.jar, lucene-highlighter-2.2.0.jar, junit.jar and commons-logging.jar).

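Getting started:
1 Trial run
Open the examples folder in the download package and run the examples (pay attention to the file encoding).

2 Integrating into the system (system structure: Action->service->dao)
1）Because this is a web system, configuring Paoding may differ somewhat from step 1.
Copy everything under the src folder of the paoding folder, together with the dic folder, into your project. Open paoding-dic-home.properties and set paoding.dic.home.config-fisrt=this so that the program uses this configuration file, then set paoding.dic.home=classpath:dic so that the dictionary lives inside the project. Save the file. I use classpath:dic here to improve portability. An absolute path needs nothing further, but if you specify classpath:dic you have to patch the Paoding code. Find the setDicHomeProperties method in PaodingMaker.java,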

and change

File dicHomeFile = getFile(dicHome);

to

File dicHomeFile2 = getFile(dicHome);
String path = "";
try {
	path = URLDecoder.decode(dicHomeFile2.getPath(), "UTF-8");
} catch (UnsupportedEncodingException e) {
	e.printStackTrace();
}
File dicHomeFile = new File(path);

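The point is to URL-decode the path. Without this, a dictionary path that contains spaces or Chinese characters fails with a "dictionary not found" exception.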
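A minimal sketch of why the decoding step matters, using only the JDK. The class name PathDecodeDemo and the sample path are made up for illustration and are not part of Paoding. A path taken from a classpath URL has spaces and non-ASCII characters percent-encoded, so it has to be decoded before it is handed to File.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, not part of Paoding.
class PathDecodeDemo {

    // Decode a percent-encoded path, such as one obtained from a classpath URL.
    static String decode(String encodedPath) {
        try {
            return URLDecoder.decode(encodedPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "%20" is an encoded space; "%E8%AF%8D%E5%85%B8" is the UTF-8 encoding of "词典".
        String encoded = "/my%20app/%E8%AF%8D%E5%85%B8/dic";
        System.out.println(decode(encoded));
    }
}
```

One caveat: URLDecoder also turns a literal "+" into a space, so this approach would alter a directory name that really contains a plus sign.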

2）Table structure

CREATE TABLE `news` (
  `id` int(11) NOT NULL auto_increment,
  `title` varchar(255) default NULL,
  `details` mediumtext,
  `author` varchar(255) default NULL,
  `publisher` varchar(100) default NULL,
  `clicks` int(11) default NULL,
  `source` varchar(255) default NULL,
  `addtime` datetime default NULL,
  `category` varchar(100) default NULL,
  `keywords` varchar(255) default NULL,
  PRIMARY KEY  (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=gbk;

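3 Writing the code
Site search is written in two steps: building the index and searching. The classes needed are SearchAction.java and TaskAction.java (in the same directory).
1）Building the index
Main task: read from an existing txt file the id of the last news item indexed by the previous run, ask the business layer for every news item with a larger id and index those items, then write the id of the last item indexed in this run back to the txt file. Paths need careful handling here. All the txt files that record ids are kept in the action directory.
Create TaskAction and add the following methods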
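The read-index-write cycle described above can be sketched without Lucene. This is only an illustration under stated assumptions: CheckpointDemo, readLastId, writeLastId and runOnce are hypothetical names, the "indexing" is reduced to counting ids, and the checkpoint file is located with java.nio rather than through the classpath.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of the incremental-indexing checkpoint; no Lucene involved.
class CheckpointDemo {

    // Read the last indexed id; a missing or blank file means nothing was indexed yet.
    static int readLastId(Path file) {
        try {
            if (!Files.exists(file)) return 0;
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
            return text.isEmpty() ? 0 : Integer.parseInt(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Overwrite the checkpoint file with the given id.
    static void writeLastId(Path file, int id) {
        try {
            Files.write(file, String.valueOf(id).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One scheduled run: handle every id above the checkpoint, then advance the checkpoint.
    static int runOnce(Path checkpoint, int[] newsIds) {
        int last = readLastId(checkpoint);
        int newest = last;
        int indexed = 0;
        for (int id : newsIds) {
            if (id > last) {
                indexed++; // a real run would add a Lucene document here
                newest = Math.max(newest, id);
            }
        }
        writeLastId(checkpoint, newest);
        return indexed;
    }

    public static void main(String[] args) {
        Path file = Paths.get(System.getProperty("java.io.tmpdir"), "newsid-demo.txt");
        writeLastId(file, 0); // start from a clean checkpoint
        System.out.println(runOnce(file, new int[] {1, 2, 3}));       // first run sees three new ids
        System.out.println(runOnce(file, new int[] {1, 2, 3, 4, 5})); // second run sees only 4 and 5
    }
}
```

Tracking the maximum id, rather than the id of whichever item happens to come last, keeps the checkpoint correct even if the business layer does not return items in ascending order.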

public void createIndex() {
  try {
    // Two arguments: where to create the index, and the file that holds
    // the id of the last news item indexed by the previous run
    createNewsIndex(getPath(TaskAction.class, "date/index/news"), "newsid.txt");
  } catch (Exception e) {
    e.printStackTrace();
  }
}

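public String getPath(Class clazz, String textName)
    throws IOException {
  // Resolve the resource relative to the class, decode %20 and other escapes,
  // then drop the leading "file:/" (6 characters). This leaves "C:/..." on
  // Windows; on Unix-like systems it also removes the root slash.
  String path = (URLDecoder.decode(
      clazz.getResource(textName).toString(), "UTF-8")).substring(6);
  return path;
}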

public void createNewsIndex(String path, String textName) throws Exception {
  // Id of the last news item indexed by the previous run; "0" means nothing has been indexed yet
  String newsId = readText(TaskAction.class, textName);
  if (null == newsId || "".equals(newsId.trim()))
    newsId = "0";
  newsId = newsId.trim();

  // Use the Paoding Chinese analyzer
  Analyzer analyzer = new PaodingAnalyzer();
  FSDirectory directory = FSDirectory.getDirectory(path);
  System.out.println(directory.toString());
  // The third argument is true on the very first run, which creates a fresh index; later runs append
  IndexWriter writer = new IndexWriter(directory, analyzer, isEmpty(TaskAction.class, textName));

  // Ask the business layer for every news item whose id is greater than the last indexed id
  // (assumed here to come back in ascending id order)
  List list = newsManageService.getNewsBigId(Integer.parseInt(newsId));
  Iterator iterator = list.iterator();
  while (iterator.hasNext()) {
    News news = (News) iterator.next();
    Document doc = new Document();
    doc.add(new Field("id", "" + news.getId(), Field.Store.YES,
        Field.Index.UN_TOKENIZED));
    doc.add(new Field("title", "" + news.getTitle(), Field.Store.YES,
        Field.Index.TOKENIZED));
    doc.add(new Field("author", "" + news.getAuthor(), Field.Store.YES,
        Field.Index.TOKENIZED));
    // Term vectors with positions and offsets are stored for the highlighted field
    doc.add(new Field("details", ""
        + Constants.splitAndFilterString(news.getDetails()),
        Field.Store.YES, Field.Index.TOKENIZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS));
    doc.add(new Field("addtime", "" + news.getAddtime(),
        Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("keywords", "" + news.getKeywords(),
        Field.Store.YES, Field.Index.TOKENIZED));
    System.out.println("Indexing news " + news.getTitle() + "...");
    try {
      writer.addDocument(doc);
      // Remember the id of the most recently indexed item
      newsId = String.valueOf(news.getId());
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
  // Optimize and close
  writer.optimize();
  writer.close();

  // Write the id of the last indexed news item back to the file
  WriteText(TaskAction.class, textName, newsId);
}

// Returns true when the id file is blank or holds "0", i.e. no index has been built yet
public boolean isEmpty(Class clazz, String textName) throws Exception {
  String articleId = readText(clazz, textName);
  if (null == articleId || "".equals(articleId.trim()))
    articleId = "0";
  boolean isEmpty = articleId.trim().equals("0");
  System.out.println(clazz.getName() + " " + isEmpty);
  return isEmpty;
}

// This method is adapted from one in the Paoding examples.
public String readText(Class clazz, String textName)
    throws IOException {
  InputStream in = clazz.getResourceAsStream(textName);
  Reader re = new InputStreamReader(in, "UTF-8");
  try {
    char[] chs = new char[1024];
    int count;
    StringBuilder content = new StringBuilder();
    while ((count = re.read(chs)) != -1) {
      content.append(chs, 0, count);
    }
    return content.toString();
  } finally {
    // Closing the reader also closes the underlying stream
    re.close();
  }
}

public String WriteText(Class clazz, String textName, String text)
    throws IOException {
  // Same path handling as getPath: decode the URL, then drop the leading "file:/"
  String path = (URLDecoder.decode(
      clazz.getResource(textName).toString(), "UTF-8")).substring(6);
  System.out.println(path);
  File file = new File(path);
  // Overwrites the file with the new id
  BufferedWriter bw = new BufferedWriter(new FileWriter(file));
  try {
    bw.write(text);
  } finally {
    bw.close();
  }
  return text;
}
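Both getPath and WriteText above turn a resource URL into a file path with substring(6), which assumes the URL starts with exactly "file:/" followed by a drive letter. As a hedged alternative, the JDK can do the conversion itself through java.net.URI. ResourcePathDemo and toLocalPath are hypothetical names used only for this sketch, and the sample URL is made up.

```java
import java.io.File;
import java.net.URI;

// Hypothetical helper, not part of the original TaskAction.
class ResourcePathDemo {

    // Convert a "file:" URL string (for example the result of
    // clazz.getResource(name).toString()) into a local path.
    // new File(URI) decodes percent-escapes and handles the prefix on every platform.
    static String toLocalPath(String fileUrl) {
        return new File(URI.create(fileUrl)).getPath();
    }

    public static void main(String[] args) {
        // The space in the directory name arrives percent-encoded as %20.
        System.out.println(toLocalPath("file:/tmp/my%20index/newsid.txt"));
    }
}
```

This only works for resources that really are files on disk. A resource inside a jar has a "jar:" URL, and writing to it is not possible in either approach.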

2）Searching

public void searchIndex(String path, String keywords) throws Exception {
  // Search the title and details fields; a match in either one is enough (SHOULD)
  String[] fields = { "title", "details" };
  BooleanClause.Occur[] flags = new BooleanClause.Occur[] {
      BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD };

  Analyzer analyzer = new PaodingAnalyzer();
  FSDirectory directory = FSDirectory.getDirectory(path);
  IndexReader reader = IndexReader.open(directory);
  Searcher searcher = new IndexSearcher(directory);

  Query query = MultiFieldQueryParser.parse(keywords, fields, flags,
      analyzer);
  // Rewrite into primitive queries so the highlighter sees the actual terms
  query = query.rewrite(reader);
  System.out.println("Searching for: " + query.toString());
  Hits hits = searcher.search(query);

  // Empty pre/post tags add no markup; pass the HTML to wrap around each hit here
  SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter(
      "", "");
  Highlighter highlighter = new Highlighter(simpleHTMLFormatter,
      new QueryScorer(query));
  // Fragments of about 200 characters
  highlighter.setTextFragmenter(new SimpleFragmenter(200));

  for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    String contents = doc.get("details");

    // Reset for every hit so a result never shows the previous result's fragment
    String highLightText = "";
    if (contents != null) {
      TokenStream tokenStream = analyzer.tokenStream("details",
          new StringReader(contents));
      highLightText = highlighter.getBestFragment(tokenStream,
          contents);
    }
    NewsDTO news = new NewsDTO();
    news.setId(Integer.parseInt(doc.get("id")));
    news.setName(doc.get("title"));
    news.setDetails(highLightText);
    news.setAddtime(doc.get("addtime"));
    news.setAuthor(doc.get("author"));
    searchResultItem.add(news);
  }
  searcher.close();
  reader.close();
}

That is most of the core code, and it includes hit highlighting, which works quite nicely.
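To make the roles of the formatter's pre/post tags and the fragment size concrete, here is a deliberately simplified stand-in written in plain Java. It is not the Lucene Highlighter and does no analysis or scoring: it just cuts a window around the first literal occurrence of a term and wraps each occurrence in the given tags. FragmentDemo and bestFragment are hypothetical names.

```java
// Plain-Java stand-in for illustration only; the real work in the article is done by
// Lucene's Highlighter, SimpleHTMLFormatter and SimpleFragmenter.
class FragmentDemo {

    // Return a window of at most `size` characters around the first hit,
    // with every occurrence of `term` inside it wrapped in pre/post.
    // Returns null when the term does not occur at all.
    static String bestFragment(String text, String term, int size, String pre, String post) {
        int hit = text.indexOf(term);
        if (term.isEmpty() || hit < 0) {
            return null;
        }
        // Try to centre the hit inside the window.
        int start = Math.max(0, hit - Math.max(0, size - term.length()) / 2);
        int end = Math.min(text.length(), start + Math.max(size, term.length()));
        return text.substring(start, end).replace(term, pre + term + post);
    }

    public static void main(String[] args) {
        String text = "Lucene is a search library";
        System.out.println(bestFragment(text, "search", 200, "<b>", "</b>"));
    }
}
```

With empty tags, as in the listing above, the fragment is still extracted but nothing marks the hit, so the tags are where the visible highlighting comes from.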

3）Finally, build the index on a schedule:
Define the beans

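<bean id="myTask" class="edu.cumt.jnotnull.action.TaskAction">
    <property name="newsManageService">
        <ref bean="newsManageService" />
    </property>
</bean>

<bean id="entity"
    class="org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean">
    <property name="targetObject">
        <ref bean="myTask" />
    </property>
    <property name="targetMethod">
        <value>createIndex</value>
    </property>
</bean>

<bean id="cron"
    class="org.springframework.scheduling.quartz.CronTriggerBean">
    <property name="jobDetail">
        <ref bean="entity" />
    </property>
    <property name="cronExpression">
        <value>0 0-5 2 * * ?</value>
    </property>
</bean>

<bean autowire="no"
    class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
    <property name="triggers">
        <list>
            <ref bean="cron" />
        </list>
    </property>
</bean>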
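A Quartz cron expression lists seconds, minutes, hours, day of month, month and day of week, in that order. The expression 0 0-5 2 * * ? therefore fires at second 0 of minutes 0 through 5 of hour 2, every day, so createIndex runs six times each night. The sketch below expands just those first three fields to show the firing times. CronFieldDemo is a hypothetical helper that understands only plain numbers and a-b ranges; it is not a Quartz API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; handles only "n" and "a-b" in the first three fields.
class CronFieldDemo {

    // Expand one numeric field such as "2" or "0-5" into its values.
    static List<Integer> expand(String field) {
        List<Integer> values = new ArrayList<>();
        int dash = field.indexOf('-');
        int from = Integer.parseInt(dash < 0 ? field : field.substring(0, dash));
        int to = dash < 0 ? from : Integer.parseInt(field.substring(dash + 1));
        for (int v = from; v <= to; v++) {
            values.add(v);
        }
        return values;
    }

    // Daily firing times (HH:mm:ss) for an expression of the form "sec min hour * * ?".
    static List<String> dailyFireTimes(String cron) {
        String[] f = cron.trim().split("\\s+");
        List<String> times = new ArrayList<>();
        for (int h : expand(f[2])) {
            for (int m : expand(f[1])) {
                for (int s : expand(f[0])) {
                    times.add(String.format("%02d:%02d:%02d", h, m, s));
                }
            }
        }
        return times;
    }

    public static void main(String[] args) {
        System.out.println(dailyFireTimes("0 0-5 2 * * ?"));
    }
}
```

If a single nightly run is enough, 0 0 2 * * ? fires once at 02:00:00.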