I need to extract pieces of text (such as the question title on a StackOverflow page) from an HTML file.
I can extract the fields I need with regular expressions in Java, but I then have to decode the extracted fields.
EXAMPLE
field extracted:
Paging Lucene's search results (in the raw HTML the apostrophe is escaped as a character reference ending in **;**, for example &#39;, just before the **s**)
field after decoding:
Paging Lucene's search results
Is there a class in Java that can convert these HTML codes back to plain characters?
Solution
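Use the methods provided by Apache Commons Lang. The import shown here is for Commons Lang 2.x; in Commons Lang 3 the equivalent method is StringEscapeUtils.unescapeHtml4 in org.apache.commons.lang3, and that class has since moved to Apache Commons Text (org.apache.commons.text).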
import org.apache.commons.lang.StringEscapeUtils;
// ...
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);
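If adding a dependency is not an option, a small subset of the decoding can be done with the JDK alone. The following is only a minimal sketch, not a replacement for the library: the class name HtmlEntityDecoder and its decode method are hypothetical, and it handles only numeric character references plus five common named entities, leaving anything else untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of any library): decodes a small subset of HTML entities.
final class HtmlEntityDecoder {

    // Matches decimal (&#39;), hexadecimal (&#x27;) and named (&amp;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the result from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(resolve(m.group(1), m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or the original reference if it is not recognised.
    private static String resolve(String body, String whole) {
        if (body.charAt(0) != '#') {
            // Assumption: only these five named entities are supported in this sketch.
            switch (body) {
                case "amp":  return "&";
                case "lt":   return "<";
                case "gt":   return ">";
                case "quot": return "\"";
                case "apos": return "'";
                default:     return whole;
            }
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        int codePoint = hex
                ? Integer.parseInt(body.substring(2), 16)
                : Integer.parseInt(body.substring(1), 10);
        return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : whole;
    }
}
```

For instance, assuming the apostrophe in the extracted title was encoded as &#39;, HtmlEntityDecoder.decode("Paging Lucene&#39;s search results") gives back the title with a plain apostrophe. Each reference is decoded once in a single pass, so "&amp;lt;" becomes "&lt;" rather than "<". For full coverage of named entities, the library method is the better choice.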