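The previous post walked through the source of the concrete parser class HtmlParser as it parses web documents, and showed how Apache Tika handles character encoding detection.
(When parsing web pages, HtmlParser does not actually use the SAXParser object from the ParseContext context class; it relies on a separate component, TagSoup.)
This post continues with the classes Tika uses to handle SAX events when parsing XML files. The best parts are saved for later.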
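The SAX API that ships with JAXP defines four event-handling interfaces: EntityResolver, DTDHandler, ContentHandler, and ErrorHandler.
It also provides a default handler class, DefaultHandler, which implements all four interfaces. This makes it easy to write our own SAX event handlers: we only need to extend that class and override the callbacks we care about.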
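To make the "just extend DefaultHandler" point concrete, here is a minimal sketch that uses only the JDK's own SAX parser. The class name ElementCounter and its parse helper are made up for this illustration and are not part of Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration: counts elements and collects text.
// Only the two callbacks of interest are overridden; DefaultHandler supplies the rest.
class ElementCounter extends DefaultHandler {
    int elements = 0;
    final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements++;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Hypothetical helper: parses an XML string with the JDK's SAX parser
    // and returns the handler so the caller can inspect what it saw.
    static ElementCounter parse(String xml) {
        ElementCounter handler = new ElementCounter();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler;
    }
}
```

Calling ElementCounter.parse on a small document such as `<doc><p>Hello</p><p>Tika</p></doc>` should leave three elements counted and the concatenated text in the text buffer.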
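The event handlers that Apache Tika provides use the decorator pattern, with wrappers nested layer upon layer, which makes them dizzying to read. The walkthrough below covers only some of these classes; the other event handlers work similarly and are not repeated here.
First, a look at the UML model of the key classes.
ContentHandlerDecorator extends DefaultHandler, the default handler class from JAXP. As the name suggests, it follows the decorator pattern. Here is its source: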
/**
 * Decorator base class for the {@link ContentHandler} interface. This class
 * simply delegates all SAX events calls to an underlying decorated handler
 * instance. Subclasses can provide extra decoration by overriding one or more
 * of the SAX event methods.
 */
public class ContentHandlerDecorator extends DefaultHandler {

    /**
     * Decorated SAX event handler.
     */
    private ContentHandler handler;

    /**
     * Creates a decorator for the given SAX event handler.
     *
     * @param handler SAX event handler to be decorated
     */
    public ContentHandlerDecorator(ContentHandler handler) {
        assert handler != null;
        this.handler = handler;
    }

    /**
     * Creates a decorator that by default forwards incoming SAX events to
     * a dummy content handler that simply ignores all the events. Subclasses
     * should use the {@link #setContentHandler(ContentHandler)} method to
     * switch to a more usable underlying content handler.
     */
    protected ContentHandlerDecorator() {
        this(new DefaultHandler());
    }

    /**
     * Sets the underlying content handler. All future SAX events will be
     * directed to this handler instead of the one that was previously used.
     *
     * @param handler content handler
     */
    protected void setContentHandler(ContentHandler handler) {
        assert handler != null;
        this.handler = handler;
    }

    @Override
    public void startPrefixMapping(String prefix, String uri)
            throws SAXException {
        try {
            handler.startPrefixMapping(prefix, uri);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void endPrefixMapping(String prefix) throws SAXException {
        try {
            handler.endPrefixMapping(prefix);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void processingInstruction(String target, String data)
            throws SAXException {
        try {
            handler.processingInstruction(target, data);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void setDocumentLocator(Locator locator) {
        handler.setDocumentLocator(locator);
    }

    @Override
    public void startDocument() throws SAXException {
        try {
            handler.startDocument();
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            handler.endDocument();
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void startElement(
            String uri, String localName, String name, Attributes atts)
            throws SAXException {
        try {
            handler.startElement(uri, localName, name, atts);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void endElement(String uri, String localName, String name)
            throws SAXException {
        try {
            handler.endElement(uri, localName, name);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        try {
            handler.characters(ch, start, length);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        try {
            handler.ignorableWhitespace(ch, start, length);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void skippedEntity(String name) throws SAXException {
        try {
            handler.skippedEntity(name);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public String toString() {
        return handler.toString();
    }

    /**
     * Handle any exceptions thrown by methods in this class. This method
     * provides a single place to implement custom exception handling. The
     * default behaviour is simply to re-throw the given exception, but
     * subclasses can also provide alternative ways of handling the situation.
     *
     * @param exception the exception that was thrown
     * @throws SAXException the exception (if any) thrown to the client
     */
    protected void handleException(SAXException exception)
            throws SAXException {
        throw exception;
    }

}
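The decorator holds a reference to a ContentHandler object, and each of its event methods simply calls the method of the same name on that handler. Any SAXException thrown by the wrapped handler is routed to handleException, which rethrows it by default.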
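The delegation and the handleException hook are easier to see in a cut-down form. The sketch below is not Tika code: MiniDecorator is a simplified stand-in that forwards only characters(), and UpperCaseDecorator, LenientDecorator, TextSink, and DecoratorDemo are hypothetical names invented for this illustration. It shows the two ways a subclass might add behaviour, either by overriding an event method or by overriding the exception hook.

```java
import java.util.Locale;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Simplified stand-in for ContentHandlerDecorator (illustration only):
// forwards just characters() and keeps the handleException() hook.
class MiniDecorator extends DefaultHandler {
    private final ContentHandler handler;

    MiniDecorator(ContentHandler handler) {
        this.handler = handler;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            handler.characters(ch, start, length);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    protected void handleException(SAXException e) throws SAXException {
        throw e; // default: rethrow, as in the class above
    }
}

// Hypothetical decoration 1: upper-cases text before delegating.
class UpperCaseDecorator extends MiniDecorator {
    UpperCaseDecorator(ContentHandler handler) {
        super(handler);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        char[] upper = new String(ch, start, length).toUpperCase(Locale.ROOT).toCharArray();
        super.characters(upper, 0, upper.length);
    }
}

// Hypothetical decoration 2: counts and swallows errors from the wrapped handler.
class LenientDecorator extends MiniDecorator {
    int errors = 0;

    LenientDecorator(ContentHandler handler) {
        super(handler);
    }

    @Override
    protected void handleException(SAXException e) {
        errors++;
    }
}

// Innermost handler: collects whatever text reaches it.
class TextSink extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

class DecoratorDemo {
    // Sends text through the upper-casing decorator and returns what the sink received.
    static String upperCase(String input) {
        TextSink sink = new TextSink();
        char[] chars = input.toCharArray();
        try {
            new UpperCaseDecorator(sink).characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sink.text.toString();
    }

    // Sends text to a handler that always fails and returns how many errors were swallowed.
    static int swallowedErrors(String input) {
        ContentHandler failing = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                throw new SAXException("simulated failure");
            }
        };
        LenientDecorator lenient = new LenientDecorator(failing);
        char[] chars = input.toCharArray();
        try {
            lenient.characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return lenient.errors;
    }
}
```

Because every forwarding method funnels its failures through one hook, a subclass that wants different error behaviour needs to override only handleException rather than every event method.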
Next, the source of the concrete decorator class BodyContentHandler:
/**
 * Content handler decorator that only passes everything inside
 * the XHTML <body/> tag to the underlying handler. Note that
 * the <body/> tag itself is <em>not</em> passed on.
 */
public class BodyContentHandler extends ContentHandlerDecorator {

    /**
     * XHTML XPath parser.
     */
    private static final XPathParser PARSER =
        new XPathParser("xhtml", XHTMLContentHandler.XHTML);

    /**
     * The XPath matcher used to select the XHTML body contents.
     */
    private static final Matcher MATCHER =
        PARSER.parse("/xhtml:html/xhtml:body/descendant::node()");

    /**
     * Creates a content handler that passes all XHTML body events to the
     * given underlying content handler.
     *
     * @param handler content handler
     */
    public BodyContentHandler(ContentHandler handler) {
        super(new MatchingContentHandler(handler, MATCHER));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * the given writer.
     *
     * @param writer writer
     */
    public BodyContentHandler(Writer writer) {
        this(new WriteOutContentHandler(writer));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * the given output stream using the default encoding.
     *
     * @param stream output stream
     */
    public BodyContentHandler(OutputStream stream) {
        this(new WriteOutContentHandler(stream));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * an internal string buffer. The contents of the buffer can be retrieved
     * using the {@link #toString()} method.
     * <p>
     * The internal string buffer is bounded at the given number of characters.
     * If this write limit is reached, then a {@link SAXException} is thrown.
     *
     * @since Apache Tika 0.7
     * @param writeLimit maximum number of characters to include in the string,
     *                   or -1 to disable the write limit
     */
    public BodyContentHandler(int writeLimit) {
        this(new WriteOutContentHandler(writeLimit));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * an internal string buffer. The contents of the buffer can be retrieved
     * using the {@link #toString()} method.
     * <p>
     * The internal string buffer is bounded at 100k characters. If this write
     * limit is reached, then a {@link SAXException} is thrown.
     */
    public BodyContentHandler() {
        this(new WriteOutContentHandler());
    }

}
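In the end, the decorated object is initialized by calling the superclass constructor: every constructor chains to BodyContentHandler(ContentHandler), which wraps the given handler in a MatchingContentHandler and passes that to super.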
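To get a feel for the effect of that wrapping without pulling in Tika, here is a rough stand-in built on the JDK alone. It is an assumption-laden simplification: BodyTextHandler is a hypothetical class that uses a boolean flag and the element's qualified name instead of Tika's XPath matcher, ignores namespaces, and collects text in its own buffer rather than delegating to a WriteOutContentHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Simplified stand-in for BodyContentHandler (illustration only): keeps
// character events that occur inside <body> and drops everything else.
class BodyTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inBody = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("body".equals(qName)) {
            inBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(qName)) {
            inBody = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inBody) {
            text.append(ch, start, length);
        }
    }

    // Like the real class, toString() exposes the collected text.
    @Override
    public String toString() {
        return text.toString();
    }

    // Hypothetical helper: parses a well-formed XHTML string and returns the body text.
    static String extract(String xhtml) {
        BodyTextHandler handler = new BodyTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler.toString();
    }
}
```

Given a document with a title in the head and a paragraph in the body, only the paragraph text should come back; the title is dropped because it falls outside the body element.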