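The method below collects the src values of every occurrence of a given tag in a piece of content (for example, a web page). htmlStr is the content and type is the tag name: img for images (the comments in the code use img as the example), video for videos, and so on.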
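import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Collects the src values of every tag of the given type.
 * @param htmlStr the content to search, e.g. the HTML of a web page
 * @param type the tag name, e.g. "img" or "video"
 * @return the set of distinct src values found
 */
public static Set<String> getSrcStr(String htmlStr, String type) {
    Set<String> srcs = new HashSet<String>();
    // For type "img" this matches one opening <img ...> tag.
    // [^>]* cannot cross '>', so a match never runs on into a following tag;
    // Pattern.quote keeps any regex metacharacters in type literal.
    String regExTag = "<" + Pattern.quote(type) + "\\b[^>]*>";
    Pattern pTag = Pattern.compile(regExTag, Pattern.CASE_INSENSITIVE);
    // Reads the src value inside one tag: an optional opening double quote,
    // then the value up to the next double quote, '>' or whitespace.
    Pattern pSrc = Pattern.compile("\\ssrc\\s*=\\s*\"?(.*?)(\"|>|\\s+)", Pattern.CASE_INSENSITIVE);
    Matcher mTag = pTag.matcher(htmlStr);
    while (mTag.find()) {
        // The whole opening tag, e.g. <img src="a.png" alt="x">
        String tag = mTag.group();
        Matcher mSrc = pSrc.matcher(tag);
        while (mSrc.find()) {
            srcs.add(mSrc.group(1));
        }
    }
    return srcs;
}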
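When several tags sit on the same line, how much of the line the tag pattern consumes decides which src values end up in the result. The small sketch below compares a greedy `.*` with a tag-bounded `[^>]*` on one input; the class name GreedyDemo and the helper matches are made up for this illustration and are not part of the method above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Hypothetical helper: returns every substring of html matched by regex.
    public static List<String> matches(String regex, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(html);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<video src=\"v.mp4\"></video><img src=\"a.png\">";
        // Greedy ".*" backtracks to the LAST "src" on the line,
        // so a single match covers the <video> tag and the <img> tag.
        System.out.println(matches("<video.*src\\s*=[^>]*>", html));
        // "[^>]*" cannot cross '>', so the match ends with the <video> opening tag.
        System.out.println(matches("<video\\b[^>]*>", html));
    }
}
```

With the greedy pattern, a second pass that looks for src inside the match would also pick up the img address, even though only video was asked for.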
Usage example:
String str = "12345<video src=\"http://vd4.bdstatic.com/mda-jkjf1ab31ekxafc4/sc/mda-jkjf1ab31ekxafc4.mp4?playlist=%5B%22hd%22%2C%22sc%22%5D\" width=\"100px\"></video>,<video src=\"http://vd4.bdstatic.com/mda-jkjf1ab31ekxafc4/sc/mda-jkjf1ab31ekxafc4.mp4?playlist=%5B%22hd%22%2C%22sc%22%5g\"></video>12345";
System.out.println(getSrcStr(str, "video"));
Example output (the Set of addresses; a HashSet does not guarantee the order):
[http://vd4.bdstatic.com/mda-jkjf1ab31ekxafc4/sc/mda-jkjf1ab31ekxafc4.mp4?playlist=%5B%22hd%22%2C%22sc%22%5D, http://vd4.bdstatic.com/mda-jkjf1ab31ekxafc4/sc/mda-jkjf1ab31ekxafc4.mp4?playlist=%5B%22hd%22%2C%22sc%22%5g]
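The same idea can also be written as a single pattern that accepts double-quoted, single-quoted and unquoted src values. What follows is only a rough sketch under those assumptions, not a replacement for a real HTML parser; the class name SrcExtractor and the method extractSrc are hypothetical names chosen for this example.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SrcExtractor {
    // Hypothetical helper: one pass over the content, one match per tag that has a src.
    public static Set<String> extractSrc(String html, String tag) {
        // "<tag", then attributes that stay inside the tag ([^>]*?), then
        // whitespace + src = value, where value is "double-quoted" (group 1),
        // 'single-quoted' (group 2) or a bare token (group 3).
        Pattern p = Pattern.compile(
            "<" + Pattern.quote(tag) + "\\b[^>]*?\\ssrc\\s*=\\s*"
                + "(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);
        // LinkedHashSet keeps the addresses in the order they appear.
        Set<String> result = new LinkedHashSet<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            // Exactly one of the three groups is non-null for each match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    result.add(m.group(g));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<img src='a.png' alt=\"x\"><video src=\"v.mp4\"></video><img class=c src=b.png>";
        System.out.println(extractSrc(html, "img")); // prints [a.png, b.png]
    }
}
```

Requiring whitespace before src means an attribute such as data-src is not mistaken for src, and the video address is left out because only img was requested.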