Remove the HTML tags from a piece of HTML text:
public static String htmlToStr(String input) {
    if (input == null) {
        return null;
    }
    StringBuilder result = new StringBuilder();
    // true while we are outside a <...> tag, i.e. while characters should be kept
    boolean flag = true;
    char[] a = input.toCharArray();
    int length = a.length;
    for (int i = 0; i < length; i++) {
        if (a[i] == '<') {
            // a tag starts: stop copying characters
            flag = false;
            continue;
        }
        if (a[i] == '>') {
            // the tag ends: resume copying characters
            flag = true;
            continue;
        }
        if (flag) {
            result.append(a[i]);
        }
    }
    return result.toString();
}
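To see what this scan does in practice, here is a small runnable sketch. The class name `HtmlStripDemo` is only an assumption for illustration; the method body follows the same logic as the function above. It also shows one limitation: a literal `<` or `>` in ordinary text is treated as a tag delimiter, so the text between them is lost.

```java
public class HtmlStripDemo {
    // Same idea as htmlToStr above: copy characters only while outside <...>.
    public static String htmlToStr(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder result = new StringBuilder();
        boolean outsideTag = true;
        for (char c : input.toCharArray()) {
            if (c == '<') {
                outsideTag = false;
            } else if (c == '>') {
                outsideTag = true;
            } else if (outsideTag) {
                result.append(c);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Ordinary markup: the tags disappear and the text stays.
        System.out.println(htmlToStr("<p>Hello, <b>world</b>!</p>")); // prints "Hello, world!"
        // Limitation: everything between the literal '<' and '>' is dropped.
        System.out.println(htmlToStr("1 < 2 and 3 > 2")); // prints "1  2"
    }
}
```

For input that is not well-formed, or that contains script or style blocks, a real HTML parser is likely a safer choice than this character scan.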
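Going further: once the tags are removed, the text may be poorly formatted, so we can tidy it up a little (here we only replace the closing tag of p with \n) and also drop the special symbols it contains (for example, &5476;):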
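public static String htmlToStr(String input) {
    if (input == null) {
        return null;
    }
    StringBuilder result = new StringBuilder();
    StringBuilder tag = new StringBuilder();    // text of the current tag, without < and >
    StringBuilder entity = new StringBuilder(); // pending "&..." sequence
    boolean inTag = false;
    boolean inEntity = false;
    for (char c : input.toCharArray()) {
        if (inTag) {
            if (c == '>') {
                inTag = false;
                // replace the closing tag of p with a line break
                if (tag.toString().trim().equalsIgnoreCase("/p")) {
                    result.append('\n');
                }
            } else {
                // a ';' or '&' inside a tag belongs to the tag and is ignored with it
                tag.append(c);
            }
            continue;
        }
        if (inEntity) {
            if (c == ';') {
                // a complete sequence such as &5476; : drop it
                inEntity = false;
                continue;
            }
            if (Character.isLetterOrDigit(c) || c == '#') {
                entity.append(c);
                continue;
            }
            // not a special symbol after all: keep the buffered text,
            // then handle c as an ordinary character below
            inEntity = false;
            result.append(entity);
        }
        if (c == '<') {
            inTag = true;
            tag.setLength(0);
        } else if (c == '&') {
            inEntity = true;
            entity.setLength(0);
            entity.append(c);
        } else {
            // ordinary text, including a ';' that does not close a "&..." sequence
            result.append(c);
        }
    }
    if (inEntity) {
        // the input ended inside an unterminated "&..." sequence: keep it as text
        result.append(entity);
    }
    return result.toString();
}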
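The function above only turns the closing tag of p into a line break. If other tags should break lines too, one possible approach is to collect each tag's name while scanning and look it up in a small set. The sketch below is an illustration under assumptions: the class name `HtmlToTextSketch`, the method name `htmlToText`, and the choice of tags in `BREAK_TAGS` are all made up here. To stay short it leaves `&...;` sequences untouched.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class HtmlToTextSketch {
    // Hypothetical choice of tags that should end a line.
    private static final Set<String> BREAK_TAGS =
            new HashSet<>(Arrays.asList("/p", "/div", "/li", "br", "br/"));

    public static String htmlToText(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder result = new StringBuilder();
        StringBuilder tag = new StringBuilder(); // text of the current tag, without < and >
        boolean inTag = false;
        for (char c : input.toCharArray()) {
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                    // The tag name is everything up to the first whitespace, lower-cased.
                    String name = tag.toString().trim().split("\\s+")[0]
                            .toLowerCase(Locale.ROOT);
                    if (BREAK_TAGS.contains(name)) {
                        result.append('\n');
                    }
                } else {
                    tag.append(c);
                }
            } else if (c == '<') {
                inTag = true;
                tag.setLength(0);
            } else {
                result.append(c);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Prints four lines: one, two, three, four
        System.out.println(htmlToText("<p>one</p><div>two</div>three<br/>four"));
    }
}
```

Keeping the tag names in a set means that adding or removing a line-breaking tag is a one-line change and the scanning loop stays as it is.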