Splitting the Lisa Test Collection into Documents

yidi

Original address (please credit the source when reposting): http://blog.csdn.net/anzelin_ruc/article/details/8216236 ©安泽林

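Our information retrieval course recently assigned a full-text search system. We were given the Lisa test collection, and the task is to tune search on top of Lucene and build a simple Lucene-based search engine. I will leave the engine's implementation aside for now and share my results later when I have time. The Lisa collection as downloaded comes in 14 files, each containing 500 documents (except the last three). Cutting those documents out of each file by hand is clearly impractical, so I wrote a Java program for this collection. It treats a line starting with "Document" as the start of a document and a line starting with "*" as its end, names each output file after the document number, and writes the resulting 6,000-plus files to a specified directory. With minor changes, the program can be used wherever documents need to be split. The code is as follows: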

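import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.util.Scanner;

public class IO {
	// Output directory; it must already exist. Change it to suit your machine.
	private static final String OUT_DIR = "F:\\lisa_document\\";

	// Reads one Lisa file and writes each document it contains to its own file.
	public void readIn(String inFilePath) throws Exception {
		String outFilePath = null;
		StringBuilder content = new StringBuilder();
		try (Scanner cin = new Scanner(new File(inFilePath))) {
			// hasNextLine() is the matching guard for nextLine().
			while (cin.hasNextLine()) {
				String tmp = cin.nextLine();
				if (tmp.startsWith("*")) {
					// A line starting with "*" ends the current document.
					writeOut(outFilePath, content.toString());
					outFilePath = null;
					content = new StringBuilder();
					continue;
				}
				if (tmp.startsWith("Document")) {
					// The text after "Document" is the document number; trim any padding spaces.
					String index = tmp.substring("Document".length()).trim();
					outFilePath = OUT_DIR + "Document" + index + ".txt";
				}
				// Header, body, and blank lines are all kept in the output.
				content.append(tmp).append("\n");
			}
		}
		// Write the last document if the file does not end with a separator line.
		writeOut(outFilePath, content.toString());
		System.out.println("Done: 100%. Check the output directory.");
	}

	public void writeOut(String outFilePath, String content) {
		if (outFilePath == null) {
			return; // no "Document" header seen yet, so there is nothing to write
		}
		try (PrintWriter printWriter = new PrintWriter(outFilePath)) {
			printWriter.write(content);
			System.out.println("Written: " + outFilePath);
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		}
	}

	public static void main(String[] args) throws Exception {
		IO io = new IO();
		io.readIn(args[0]);
	}
}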
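
The remark that the program can be adapted to other splitting tasks can be made concrete with a small sketch. The version below is not the code used for the assignment: `DocumentSplitter`, `split`, `headerPrefix`, and `separatorPrefix` are hypothetical names introduced here, the sample lines are invented, and the documents are collected in memory rather than written to disk. Unlike the program above, it only treats a line as a header when no document is currently open.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical generalization of the splitter; all names are illustrative. */
final class DocumentSplitter {

    /**
     * Splits lines into documents. A line starting with headerPrefix opens a
     * document whose ID is the rest of that line (trimmed); a line starting
     * with separatorPrefix closes it. Returns ID -> text in input order.
     */
    static Map<String, String> split(List<String> lines,
                                     String headerPrefix,
                                     String separatorPrefix) {
        Map<String, String> documents = new LinkedHashMap<>();
        StringBuilder content = new StringBuilder();
        String id = null;
        for (String line : lines) {
            if (line.startsWith(separatorPrefix)) {
                // Separator: store the open document (if any) and reset.
                if (id != null) {
                    documents.put(id, content.toString());
                }
                id = null;
                content.setLength(0);
                continue;
            }
            if (id == null && line.startsWith(headerPrefix)) {
                id = line.substring(headerPrefix.length()).trim();
            }
            content.append(line).append('\n');
        }
        // Store the last document when the input has no trailing separator.
        if (id != null) {
            documents.put(id, content.toString());
        }
        return documents;
    }

    public static void main(String[] args) {
        // Invented sample input in the assumed "Document <n>" / "*" layout.
        List<String> sample = Arrays.asList(
            "Document    1", "FIRST TITLE", "", "Body of the first document.",
            "********************",
            "Document    2", "SECOND TITLE", "", "Body of the second document.");
        split(sample, "Document", "*").forEach((docId, text) ->
            System.out.println("Document" + docId + ".txt: " + text.length() + " chars"));
    }
}
```

For a different collection, only the two prefixes passed to `split` would need to change; each map entry can then be written to a file named after its key.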

This article on cnblogs: http://www.cnblogs.com/anzelin/archive/2012/11/23/2784293.html

Personal blog: http://sunny614.sinaapp.com/?p=73
