Suppose we have the following HTML:
<p>20200310<img src="/downloadImg?id=7566876320816252412" title="835637e39dc0bdb4c29f5e1adb5528a.png" alt="835637e39dc0bdb4c29f5e1adb5528a.png"/></p><p style="line-height: 16px;"><img src="http://localhost/static/ueditor/dialogs/attachment/fileTypeImages/icon_txt.gif"/><a style="font-size:12px; color:#0066cc;" target="_blank" href="/tellEditor/previewOrDownload/6405467898840828674?&hasDownload=true" title="本地数据库连接.txt">本地数据库连接 .txt</a></p><p style="line-height: 16px;"><img src="http://localhost/static/ueditor/dialogs/attachment/fileTypeImages/icon_txt.gif"/><a style="font-size:12px; color:#0066cc;" target="_blank" href="/tellEditor/previewOrDownload/3250930489916801852?&hasDownload=true" title="工作安排计划.xls">工作安排计划.xls</a></p><p style="line-height: 16px;"><img src="http://localhost/static/ueditor/dialogs/attachment/fileTypeImages/icon_txt.gif"/><a style="font-size:12px; color:#0066cc;" target="_blank" href="/tellEditor/previewOrDownload/8833048008381252472?&hasDownload=true" title="项目开发帮助文档.docx">项目开发帮助文档.docx</a></p><p><br/></p>
The goal is to extract the ids from src="/downloadImg?id=7566876320816252412", href="/tellEditor/previewOrDownload/6405467898840828674?&hasDownload=true", href="/tellEditor/previewOrDownload/3250930489916801852?&hasDownload=true" and href="/tellEditor/previewOrDownload/8833048008381252472?&hasDownload=true". I remembered that Jsoup, which I had used before when writing a crawler, can parse HTML. My approach is as follows:
// Imports needed by the statements below (which are assumed to sit inside a method)
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
String html = "<p>20200310<img src=\"/downloadImg?id=7566876320816252412\" title=\"835637e39dc0bdb4c29f5e1adb5528a.png\" alt=\"835637e39dc0bdb4c29f5e1adb5528a.png\"/></p><p style=\"line-height: 16px;\"><img src=\"http://localhost/static/ueditor/dialogs/attachment/fileTypeImages/icon_txt.gif\"/><a style=\"font-size:12px; color:#0066cc;\" target=\"_blank\" href=\"/tellEditor/previewOrDownload/6405467898840828674?&hasDownload=true\" title=\"本地数据库连接 -自己.txt\">本地数据库连接 -自己.txt</a></p><p style=\"line-height: 16px;\"><img src=\"http://localhost/static/ueditor/dialogs/attachment/fileTypeImages/icon_txt.gif\"/><a style=\"font-size:12px; color:#0066cc;\" target=\"_blank\" href=\"/tellEditor/previewOrDownload/3250930489916801852?&hasDownload=true\" title=\"工作安排计划.xls\">工作安排计划.xls</a></p><p style=\"line-height: 16px;\"><img src=\"http://localhost/static/ueditor/dialogs/attachment/fileTypeImages/icon_txt.gif\"/><a style=\"font-size:12px; color:#0066cc;\" target=\"_blank\" href=\"/tellEditor/previewOrDownload/8833048008381252472?&hasDownload=true\" title=\"187项目开发帮助文档.docx\">187项目开发帮助文档.docx</a></p><p><br/></p>";
Document document = Jsoup.parse(html);
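Elements imgElements = document.select("img[title]");// img tags that have a title attribute: only the uploaded image, since the attachment icons carry no title
Elements aElements = document.select("a[href]");// a tags that have an href attribute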
List<String> imgStrings = new ArrayList<String>();
List<String> aStrings = new ArrayList<String>();
for(Element element:imgElements) {
    String src = element.attr("src");
    imgStrings.add(src);
}
for(Element element:aElements) {
    String href = element.attr("href");
    aStrings.add(href);
}
for(String aString:aStrings) {
    // The id sits between the "d/" that ends "previewOrDownload/" and the "?"
    System.out.println("Attachment id: "+aString.substring(aString.indexOf("d/")+2, aString.indexOf("?")));
}
for(String imgString:imgStrings) {
    // "/downloadImg?id=" is 16 characters long, so the id starts at index 16
    System.out.println("Image id: "+imgString.substring(16));
}
Running this prints the three attachment ids first, followed by the image id.
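The substring offsets used above (index 16, and the first "d/") only work while the URLs keep exactly this shape. As a rough sketch of a less fragile alternative, the ids could be pulled out with regular expressions instead. The class name IdExtractor and its two methods are hypothetical helpers of my own, not part of Jsoup, and the patterns assume the ids are always plain digits:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts ids from the src/href values collected via Jsoup.
final class IdExtractor {
    // Image URLs are assumed to look like /downloadImg?id=<digits>
    private static final Pattern IMG_ID = Pattern.compile("[?&]id=(\\d+)");
    // Attachment URLs are assumed to look like /tellEditor/previewOrDownload/<digits>?...
    private static final Pattern ATTACHMENT_ID = Pattern.compile("/previewOrDownload/(\\d+)");

    // Returns the id query parameter of an image src, or empty if there is none.
    static Optional<String> imageId(String src) {
        return firstGroup(IMG_ID, src);
    }

    // Returns the numeric path segment after previewOrDownload/, or empty if absent.
    static Optional<String> attachmentId(String href) {
        return firstGroup(ATTACHMENT_ID, href);
    }

    private static Optional<String> firstGroup(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

In the loops above, aString.substring(...) could then be swapped for IdExtractor.attachmentId(aString) and imgString.substring(16) for IdExtractor.imageId(imgString). A URL that does not match yields an empty Optional rather than a StringIndexOutOfBoundsException or a wrong slice.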