[lucene] Indexing XML documents with Lucene

meteorlWJ

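If XML documents are indexed without any special handling, parts of the markup are indexed along with the content, and searches then return redundant results. For example, the root element may have the same name in every document; a search for that name would then list every record, which is clearly not what we want.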
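To make the problem concrete, here is a minimal sketch of what happens when raw XML is tokenized as if it were plain text. The class name NaiveXmlTokens and the letter/digit splitting rule are illustrative assumptions standing in for a simple analyzer; this is not Lucene's actual tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class NaiveXmlTokens {
    // Hypothetical stand-in for a simple analyzer: split on anything that is
    // not a letter or digit, then lowercase each token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String xml = "<article><title>Lucene basics</title></article>";
        // The element names "article" and "title" come out as tokens
        // next to the real content.
        System.out.println(tokenize(xml));
    }
}
```

Every document that shares the same element names would contribute those same tokens, so a query for such a name would match all of them.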
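We therefore need to analyze each XML document in full, extract the useful information, and discard what is not needed, such as element names and attribute names. The example below extends HandlerBase, a class in the org.xml.sax package (the SAX1 handler base class, deprecated since SAX2 in favor of org.xml.sax.helpers.DefaultHandler). The class parses the XML and exposes the meaningful text through public String getEndStr(); it is this text that is then indexed, which keeps unrelated results out of searches.

Example XML document analyzer: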

import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import javax.xml.parsers.*;

import java.io.IOException;
import java.io.StringReader;

/**
 * Parser for XML held in a string; extracts the useful text from the XML.
 * @author 草莽熊窝
 * @version 1.0
 */
public class XMLHandlerSAX
        extends HandlerBase {
    /**
     * Constructor
     * @param xmlStr String: the XML document as a string
     * @throws ParserConfigurationException
     * @throws SAXException
     * @throws IOException
     */
    public XMLHandlerSAX(String xmlStr) throws
            ParserConfigurationException, SAXException, IOException {
        if ((xmlStr != null) && (!xmlStr.equals(""))) {
            // Parse the XML string through a character stream, so Chinese XML
            // documents are handled without any byte-level re-encoding
            SAXParserFactory spf = SAXParserFactory.newInstance();
            SAXParser parser = spf.newSAXParser();
            parser.parse(new InputSource(new StringReader(xmlStr)), this);
        }
        else {
            // Nothing to parse: the extracted text is empty
            this.xmlStr = "";
        }
    }

 /**
 * call at document start
 */
 public void startDocument() {
 xmlBuffer.setLength(0);
 xmlBuffer.append(" ");
 }

 /**
 * call at element start
 * @param localName String
 * @param atts AttributeList
 * @throws SAXException
 */
 public void startElement(String localName, AttributeList atts) throws
 SAXException {
 //注释掉部分是提取元素属性值，取消注释后则不屏蔽属性的内容
 /*
 for (int i = 0; i < atts.getLength(); i++) {
 //空格为了将不同词分开（主要为拉丁字母的语言而设）
 xmlBuffer.append(atts.getValue(i) + " ");
 }
 */

 elementBuffer.setLength(0);
 }

 /**
 * call when cdata found
 * @param text char[]
 * @param start int
 * @param length int
 */
 public void characters(char[] text, int start, int length) {
 elementBuffer.append(text, start, length);
 }

 /**
 * call at element end
 * @param localName String
 * @throws SAXException
 */
 public void endElement(String localName) throws SAXException {
 //空格为了将不同词分开（主要为拉丁字母的语言而设）
 xmlBuffer.append(elementBuffer + " ");
 }

 /**
 * call at document end
 */
 public void endDocument() {
 xmlStr = xmlBuffer.toString();
 }

 /**
 * 获得过滤后的xml内容
 * @return String
 */
 public String getEndStr() {
 if (xmlStr == null) {
 xmlStr = "";
 }
 return xmlStr;
 }

 private StringBuffer elementBuffer = new StringBuffer();
 private StringBuffer xmlBuffer = new StringBuffer();
 private String xmlStr = null;
}

Reposted from: http://www.tianyablog.com/blogger/post_show.asp?BlogID=114714&PostID=1252555
