I'm currently developing a text parser in Java, and I was asked to extend it to parse HTML as well.
The parser's purpose is to split the parsed file into three output files: one with all the words contained in the file, one with all the sentences, and one with all the questions.
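To make the intended split concrete, here is a minimal sketch of that step. It is an illustration under assumptions, not the actual parser: the class name `TextSplitter`, the two regular expressions, and the rule that a unit ending in `?` goes only to the questions list are all hypothetical choices. Writing the three lists to files is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits plain text into words, sentences and questions.
final class TextSplitter {
    // Assumption: a unit is a run of non-terminator characters followed by one of . ! ?
    private static final Pattern SENTENCE = Pattern.compile("[^.!?]+[.!?]");
    // Assumption: a word is a run of letters/digits, optionally with inner apostrophes.
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:'[\\p{L}\\p{N}]+)*");

    final List<String> words = new ArrayList<>();
    final List<String> sentences = new ArrayList<>();
    final List<String> questions = new ArrayList<>();

    static TextSplitter split(String text) {
        TextSplitter out = new TextSplitter();

        // Collect every word in document order.
        Matcher w = WORD.matcher(text);
        while (w.find()) {
            out.words.add(w.group());
        }

        // Collect each terminated unit and classify it by its last character.
        Matcher s = SENTENCE.matcher(text);
        while (s.find()) {
            String unit = s.group().trim().replaceAll("\\s+", " ");
            if (unit.endsWith("?")) {
                out.questions.add(unit);   // assumption: questions are kept apart
            } else {
                out.sentences.add(unit);
            }
        }
        return out;
    }
}
```

For example, `TextSplitter.split("This is a question ?\nThis is a sentence .")` puts one entry in `questions`, one in `sentences`, and eight entries in `words`.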
The *.txt part works perfectly, but I have a problem when parsing HTML.
I create a temporary file with a *.txt extension and pass it to my text parser, but if I pass a URL that points to an HTML file formed like this:
... some HTML here ...
This is a question ?
This is a sentence .
... some other text ...
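For reference, the temporary-file step described above could look roughly like the sketch below. It is a naive illustration under assumptions: the class `HtmlToText` and its methods are hypothetical, downloading the page from the URL is left out, and the markup is removed with regular expressions, which is crude and can fail on malformed or unusual HTML (a real HTML parser such as jsoup would be more robust).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: turns an HTML string into plain text and stores it in a temp .txt file.
final class HtmlToText {

    // Naive, regex-based tag removal; only a handful of entities are decoded.
    static String stripTags(String html) {
        String text = html
                // drop <script> and <style> blocks together with their content
                .replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ")
                // drop HTML comments
                .replaceAll("(?s)<!--.*?-->", " ")
                // turn <br> and a few closing block tags into line breaks
                .replaceAll("(?i)<br\\s*/?>|</(p|div|li|h[1-6]|tr)\\s*>", "\n")
                // replace every remaining tag with a space
                .replaceAll("(?s)<[^>]*>", " ")
                // decode a few common entities; &amp; goes last to avoid double decoding
                .replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");

        // collapse runs of spaces/tabs, trim around line breaks, drop empty lines
        return text.replaceAll("[ \\t]+", " ")
                .replaceAll(" ?\\n ?", "\n")
                .replaceAll("\\n{2,}", "\n")
                .trim();
    }

    // Writes the stripped text to a temporary *.txt file and returns its path.
    static Path writeTempText(String html) throws IOException {
        Path tmp = Files.createTempFile("parsed-html-", ".txt");
        Files.write(tmp, stripTags(html).getBytes(StandardCharsets.UTF_8));
        return tmp;
    }
}
```

With this sketch, `<p>This is a question ?</p><p>This is a sentence .</p>` becomes two lines of plain text, one per paragraph, which can then be handed to the text parser like any other *.txt file.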