I worked out this method after several days of hard work. It is very simple and makes development efficient.
Steps:
1 Download the required web page
2 Tidy the web page into a standard XHTML file
2.1 Translate Entities &--> &
2.2 Enforce strict tag pairing (every tag closed) for <span> <meta> <br> <link> <img>
2.3 Add XML features: PI, encoding....
2.4 The Quote Symbol
3 Retag the current XHTML with the following rules:
method 1: add "_d(num)" to the current tag names,
where (num) is the node's depth from the document root.
method 2: add "_tl(num)" to the current tag names,
where (num) is the table depth of the current node relative to the body node.
Both rules apply to all nodes except processing instructions, comment nodes and script nodes.
4 Write out the retagged XHTML as an XML file
Remove the XHTML namespace at this point; otherwise XSLT will not work well.
5 Write the corresponding XSLT file
Note: clear your special templates or elements.
6 Write a complete schema file
7 Run the transformation to get the final XML file.
Make sure that you have the correct character encoding; otherwise MSXML will fail.
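Steps 3 and 4 can be sketched roughly as below. This is only an illustration in Python with the standard xml.etree.ElementTree module, not the code I actually used. The names retag, local_name and SKIPPED are made up for the example, and it assumes that the marker is added as a suffix, that the document root has depth 0, and that a <table> element counts toward its own table depth.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"
SKIPPED = {"script"}  # script nodes keep their original name


def local_name(tag):
    """Return the tag name without the XHTML namespace (step 4)."""
    return tag[len(XHTML_NS):] if tag.startswith(XHTML_NS) else tag


def retag(elem, method="d", depth=0, table_depth=0):
    """Rename elem and its descendants in place (step 3).

    method "d":  add _d(num), num = depth from the document root (root = 0).
    method "tl": add _tl(num), num = number of enclosing <table> elements,
                 counting the table element itself (assumed convention).
    """
    if not isinstance(elem.tag, str):
        return  # comments and processing instructions are left untouched
    name = local_name(elem.tag)
    if name in SKIPPED:
        elem.tag = name  # drop the namespace but do not retag
        return
    if name == "table":
        table_depth += 1
    num = depth if method == "d" else table_depth
    elem.tag = f"{name}_{method}{num}"
    for child in elem:
        retag(child, method, depth + 1, table_depth)


# Small hypothetical input, already tidied into XHTML (step 2).
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><script>var x;</script></head>"
    '<body><table><tr><td><a href="page.html">link</a></td></tr></table></body>'
    "</html>"
)

root = ET.fromstring(xhtml)
retag(root, method="d")
print(ET.tostring(root, encoding="unicode"))
```

Because every tag name now carries its depth, an XSLT template can match a name such as td_d4 directly instead of walking a long path. Passing method="tl" gives the table-depth variant.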
Nice steps.
My question is: how do I access the attribute value in <a href=""> </a>?