I am writing a program that reads emails and separates spam from ham. I currently read the files with Java's BufferedReader class, and I can already remove unwanted characters such as '(' or '.' using the replaceAll() method. I also want to remove HTML tags, including entities such as &amp;. How can I achieve this?
Thanks.
EDIT:
Thanks for the response, but I already have a regex. How can I combine both needs into one? Here is the regex I am using now:
lines.replaceAll("[^a-zA-Z]", " ")
Note: I am getting lines from a txt file.
Any other suggestions, please?
Solution
Maybe this will work:
String noHTMLString = htmlString.replaceAll("\\<.*?>", "");
It uses regular expressions to remove all HTML tags in a string.
More specifically, it removes every XML-like tag from the string, so <1234> will be removed even though it is not a valid HTML tag. It does not touch entities such as &amp;, and it is not a real HTML parser, but it is good enough for most intents and purposes.
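To combine this with the letters-only regex from the question, one option is to chain the replacements: strip tags first, then entities, then everything that is not a letter. The sketch below is only an illustration under some assumptions: the class name EmailCleaner and the method clean are made up for this example, tags and entities are replaced with a space rather than an empty string so that neighbouring words do not run together, and entities are assumed to look like &name; or &#123;.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answer.
final class EmailCleaner {
    // Anything shaped like a tag, e.g. <b>, </div>, <a href="x">.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    // Character entities such as &amp;, &nbsp;, &#39; or &#x27;.
    private static final Pattern ENTITIES = Pattern.compile("&#?[a-zA-Z0-9]+;");
    // The regex from the question: every character that is not an ASCII letter.
    private static final Pattern NON_LETTERS = Pattern.compile("[^a-zA-Z]");

    // Cleans one line of text; the order matters, because the last step
    // would otherwise destroy the '<', '>', '&' and ';' that the first
    // two patterns rely on.
    static String clean(String line) {
        String noTags = TAGS.matcher(line).replaceAll(" ");
        String noEntities = ENTITIES.matcher(noTags).replaceAll(" ");
        return NON_LETTERS.matcher(noEntities).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Only the words "Win" and "save" remain, separated by spaces.
        System.out.println(clean("<p>Win &amp; save!</p>"));
    }
}
```

If a single call is preferred, an alternation such as lines.replaceAll("<[^>]*>|&#?[a-zA-Z0-9]+;|[^a-zA-Z]", " ") should behave the same on typical input, because the tag and entity alternatives are tried before the single-character one. Precompiling the patterns as above avoids recompiling the regex for every line read from the file.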
Hope this helps.