Web-Harvest:
1. Get a web page's source converted to XML
<html-to-xml> <http url="${sys.fullUrl(rooturl,nexturl)}" charset="ISO-8859-1"/>
</html-to-xml>
or just get the raw HTML
<http url="${sys.fullUrl(rooturl,nexturl)}" charset="ISO-8859-1"/>
2. SimpleDateFormat
EEE, dd MMM yyyy HH:mm:ss Z
dd-MM-yyyy hh:mm a
3.<template>${sys.fullUrl(rooturl,commenter_name)}</template>
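A minimal Java sketch of items 2 and 3, for trying the patterns outside Web-Harvest. The class name, method names and sample inputs are made up for illustration. The first pattern is written with HH (24-hour clock) because the sample has no AM/PM marker, and the second with hh because it does. fullUrl below is only an approximation of sys.fullUrl built on java.net.URI.resolve, not Web-Harvest's own code.

```java
import java.net.URI;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class HarvestNotesSketch {

    // Parse text with the given SimpleDateFormat pattern and render it as UTC.
    static String toUtc(String pattern, String text) {
        SimpleDateFormat in = new SimpleDateFormat(pattern, Locale.ENGLISH);
        in.setLenient(false); // reject impossible values instead of rolling them over
        // Only used when the pattern itself carries no zone field (no Z).
        in.setTimeZone(TimeZone.getTimeZone("UTC"));
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ENGLISH);
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            Date parsed = in.parse(text);
            return out.format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("'" + text + "' does not match " + pattern, e);
        }
    }

    // Rough stand-in for sys.fullUrl(pageUrl, link): resolve a link
    // (full, absolute or relative) against the URL of the page it was found on.
    static String fullUrl(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        // prints 2009-12-03T13:30:00Z
        System.out.println(toUtc("EEE, dd MMM yyyy HH:mm:ss Z", "Thu, 03 Dec 2009 14:30:00 +0100"));
        // prints 2009-12-03T14:30:00Z
        System.out.println(toUtc("dd-MM-yyyy hh:mm a", "03-12-2009 02:30 PM"));
        // prints http://example.com/forum/viewtopic.php?t=42
        System.out.println(fullUrl("http://example.com/forum/index.php", "viewtopic.php?t=42"));
    }
}
```

Locale.ENGLISH is set explicitly so that "Thu", "Dec" and "PM" parse the same way regardless of the machine's default locale.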
XPATH
1.data((//font[@class='subject'])[1])
2.//td[@class='tablerow' and @valign='top' and @style='height: 80px; width: 82%']/font[position() > 1]
3.a[contains(.,'1')]
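A small Java sketch for checking these expressions against a sample page. The page fragment, class name and method names are invented for illustration. The JDK's built-in engine only speaks XPath 1.0, so string() stands in for data() (an XPath 2.0 function), and the sketch assumes the third expression is meant as a contains(., '1') test on links.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathNotesSketch {

    // Made-up fragment, already well-formed XML (the kind of input html-to-xml would produce).
    static final String PAGE =
          "<html><body><table><tr>"
        + "<td class='tablerow' valign='top' style='height: 80px; width: 82%'>"
        + "<font class='subject'>First subject</font>"
        + "<font>body text</font>"
        + "<font>signature</font>"
        + "</td></tr></table>"
        + "<font class='subject'>Second subject</font>"
        + "<a href='p1'>1</a><a href='p2'>2</a>"
        + "</body></html>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("sample is not well-formed XML", e);
        }
    }

    // Evaluate an expression and return its string value.
    static String text(Document doc, String expr) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    // Evaluate an expression and return how many nodes it selects.
    static int count(Document doc, String expr) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(PAGE);
        // First font with class 'subject' in the whole document: prints First subject
        System.out.println(text(doc, "string((//font[@class='subject'])[1])"));
        // Every font after the first one inside the matching td: prints 2
        System.out.println(count(doc,
            "//td[@class='tablerow' and @valign='top' and @style='height: 80px; width: 82%']"
            + "/font[position() > 1]"));
        // Links whose text contains '1': prints 1
        System.out.println(count(doc, "//a[contains(.,'1')]"));
    }
}
```

Note the parentheses in the first expression: (//font[...])[1] takes the first match in the document, while //font[...][1] would take the first match under each parent.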
Regular Expression
<content>([\\w\\W]*?)</content>
<post>(.*?)</post>
/\\d{4}/\\d{1,2}/\\d{1,2}/ <!-- such as /2009/12/3/-->
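A short Java sketch to try these patterns with java.util.regex. The class name, method names and sample strings are made up, and it assumes the first pattern is meant to end at a closing </content> tag. The doubled backslashes are what the patterns look like inside a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexNotesSketch {

    // Return capture group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Return the first /yyyy/m/d/ style path segment found in the text, or null.
    static String datePath(String text) {
        Matcher m = Pattern.compile("/\\d{4}/\\d{1,2}/\\d{1,2}/").matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // [\w\W] matches any character including newlines, so this spans lines.
        System.out.println(firstGroup("<content>([\\w\\W]*?)</content>",
                "<content>line one\nline two</content>"));
        // '.' stops at a newline by default, so this prints null.
        System.out.println(firstGroup("<post>(.*?)</post>", "<post>a\nb</post>"));
        // The lazy *? stops at the first closing tag: prints hello
        System.out.println(firstGroup("<post>(.*?)</post>", "<post>hello</post><post>bye</post>"));
        // prints /2009/12/3/
        System.out.println(datePath("http://example.com/blog/2009/12/3/title.html"));
    }
}
```

If <post> bodies can contain line breaks, either use the [\\w\\W] form or compile the pattern with Pattern.DOTALL.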