The Highest-Performance Way to Parse HTML Text


weixin_30312563


Original link: http://www.cnblogs.com/NanguoCoffee/archive/2013/03/23/2977883.html

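There are many ways to parse an HTML file after it has been crawled.

The options include DOM parsing, XPath queries, and the most primitive one, plain string parsing.

When you weigh performance, simplicity, and memory consumption, plain string parsing comes out best.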
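As a rough illustration of what plain string parsing looks like in practice, the sketch below collects every `href` value from a snippet using nothing but `indexOf` and `substring`. The class name `HrefScan` and the sample HTML are made up for this example, and it assumes attributes are written exactly as `href="..."` with double quotes; real pages that deviate from that form would need extra handling.

```java
import java.util.ArrayList;
import java.util.List;

public class HrefScan {
    // Collects the value of every href="..." attribute by scanning with indexOf.
    // Assumption: attributes are lowercase and use double quotes.
    static List<String> hrefs(String html) {
        String open = "href=\"";
        List<String> out = new ArrayList<>();
        int pos = html.indexOf(open);
        while (pos != -1) {
            int start = pos + open.length();
            int end = html.indexOf('"', start);
            if (end == -1) {
                break; // unterminated attribute: stop scanning
            }
            out.add(html.substring(start, end));
            pos = html.indexOf(open, end + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/a.html\">A</a> <a href=\"/b.html\">B</a></p>";
        System.out.println(hrefs(html)); // prints [/a.html, /b.html]
    }
}
```

No document tree is built here: the only allocations are the result list and the extracted substrings, which is where the memory and speed advantage comes from.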

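import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Collects every fragment of content that lies between beginStr and endStr
 * (both delimiters included) into a List. For example:
 *   String content = "xxxxaa1111bbaa2222bbcc55aa";
 *   String beginStr = "aa";
 *   String endStr = "bb";
 *   List<String> strs = split(content, beginStr, endStr);
 *   strs contains ["aa1111bb", "aa2222bb"]
 *
 * @param content  the source string
 * @param beginStr the start delimiter
 * @param endStr   the end delimiter
 * @return the matched fragments, or an empty list if there are none
 * @throws IllegalArgumentException if beginStr or endStr is empty
 */
public static List<String> split(String content, String beginStr, String endStr) {
    if (isEmpty(beginStr) || isEmpty(endStr)) {
        throw new IllegalArgumentException("beginStr or endStr is empty!");
    }
    if (content == null || content.length() == 0) {
        return Collections.emptyList();
    }
    List<String> strs = new ArrayList<String>();
    int pos = content.indexOf(beginStr);
    int beginLen = beginStr.length();
    int endLen = endStr.length();
    while (pos != -1) {
        int prepos = pos;
        pos = content.indexOf(endStr, pos + beginLen);
        if (pos != -1) {
            // Move past endStr so the fragment includes it, as documented.
            pos += endLen;
            strs.add(content.substring(prepos, pos));
        } else {
            break;
        }
        // Resume the search after the fragment that was just consumed.
        pos = content.indexOf(beginStr, pos);
    }
    return strs;
}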
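A usage sketch for `split` on actual HTML: pulling every `<li>` element out of a list. The class name `SplitDemo` and the sample markup are assumptions for illustration. The helper is repeated here in compact form only so the block runs on its own, and this copy keeps both delimiters in each fragment, as the Javadoc describes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SplitDemo {
    // Compact copy of split that keeps both delimiters in each fragment.
    static List<String> split(String content, String beginStr, String endStr) {
        if (beginStr == null || beginStr.isEmpty() || endStr == null || endStr.isEmpty()) {
            throw new IllegalArgumentException("beginStr or endStr is empty!");
        }
        if (content == null || content.isEmpty()) {
            return Collections.emptyList();
        }
        List<String> strs = new ArrayList<>();
        int pos = content.indexOf(beginStr);
        while (pos != -1) {
            int end = content.indexOf(endStr, pos + beginStr.length());
            if (end == -1) {
                break; // no closing delimiter left
            }
            end += endStr.length();
            strs.add(content.substring(pos, end));
            pos = content.indexOf(beginStr, end);
        }
        return strs;
    }

    public static void main(String[] args) {
        String html = "<ul><li>one</li><li>two</li></ul>";
        System.out.println(split(html, "<li>", "</li>")); // prints [<li>one</li>, <li>two</li>]
    }
}
```

Note that this approach does not understand nesting: with nested tags of the same name, each fragment ends at the first closing delimiter found.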

/**
 * Returns the first piece of str that lies between beginStr and endStr.
 * This is similar to
 * str.substring(str.indexOf(beginStr), str.indexOf(endStr)),
 * except that the flags decide whether beginStr and endStr are included.
 * For example, with String content = "xxxxaa1111bbaa2222bbcc55aa":
 *   fistSubString(content, "aa", "bb", false, false) : "1111"
 *   fistSubString(content, "aa", "bb", true, false)  : "aa1111"
 *   fistSubString(content, "aa", "bb", false, true)  : "1111bb"
 *   fistSubString(content, "aa", "bb", true, true)   : "aa1111bb"
 *
 * @param str               the source string
 * @param beginStr          the start delimiter
 * @param endStr            the end delimiter
 * @param isIncluseBeginStr whether the result includes the start delimiter
 * @param isIncludeEndStr   whether the result includes the end delimiter
 * @return the first match, or null if either delimiter is not found
 */
public static String fistSubString(String str, String beginStr, String endStr, boolean isIncluseBeginStr,
        boolean isIncludeEndStr) {
    int index = str.indexOf(beginStr);
    if (index != -1) {
        // Look for endStr only after the end of beginStr.
        int endIndex = str.indexOf(endStr, index + beginStr.length());
        if (endIndex != -1) {
            if (!isIncluseBeginStr) {
                index += beginStr.length();
            }
            if (isIncludeEndStr) {
                endIndex += endStr.length();
            }
            return str.substring(index, endIndex);
        }
    }
    return null;
}
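A usage sketch for `fistSubString`: reading the page title out of an HTML string. The class name `FirstSubStringDemo` and the sample markup are assumptions for illustration, and the method is repeated in compact form only so the block runs on its own.

```java
public class FirstSubStringDemo {
    // Compact copy of fistSubString (identifier spelling kept from the article).
    static String fistSubString(String str, String beginStr, String endStr,
            boolean isIncluseBeginStr, boolean isIncludeEndStr) {
        int index = str.indexOf(beginStr);
        if (index == -1) {
            return null;
        }
        int endIndex = str.indexOf(endStr, index + beginStr.length());
        if (endIndex == -1) {
            return null;
        }
        if (!isIncluseBeginStr) {
            index += beginStr.length();
        }
        if (isIncludeEndStr) {
            endIndex += endStr.length();
        }
        return str.substring(index, endIndex);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title></head><body></body></html>";
        System.out.println(fistSubString(html, "<title>", "</title>", false, false)); // prints Hello
        System.out.println(fistSubString(html, "<title>", "</title>", true, true));   // prints <title>Hello</title>
        System.out.println(fistSubString(html, "<h1>", "</h1>", false, false));       // prints null
    }
}
```

Callers should check for `null` before using the result, since a missing delimiter is reported that way rather than with an exception.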

/**
 * Checks whether str is empty, that is, null or of length 0.
 *
 * @param str the string to check
 * @return true if str is null or has length 0
 */
public static boolean isEmpty(String str) {
    return str == null || str.length() == 0;
}

Reposted from: https://www.cnblogs.com/NanguoCoffee/archive/2013/03/23/2977883.html
