Parsing an HTML String with Jsoup

做个有素质的屌人

Article link: https://blog.csdn.net/DSJ1996/article/details/104790668

Suppose we have the following piece of HTML:

<p>20200310<img src="/downloadImg?id=7566876320816252412" title="835637e39dc0bdb4c29f5e1adb5528a.png" alt="835637e39dc0bdb4c29f5e1adb5528a.png"/></p><p style="line-height: 16px;"><img src="http://localhost/static/ueditor/dialogs/attachment/fileTypeImages/icon_txt.gif"/><a style="font-size:12px; color:#0066cc;" target="_blank" href="/tellEditor/previewOrDownload/6405467898840828674?&hasDownload=true" title="本地数据库连接.txt">本地数据库连接 .txt</a></p><p style="line-height: 16px;"><img src="http://localhost/static/ueditor/dialogs/attachment/fileTypeImages/icon_txt.gif"/><a style="font-size:12px; color:#0066cc;" target="_blank" href="/tellEditor/previewOrDownload/3250930489916801852?&hasDownload=true" title="工作安排计划.xls">工作安排计划.xls</a></p><p style="line-height: 16px;"><img src="http://localhost/static/ueditor/dialogs/attachment/fileTypeImages/icon_txt.gif"/><a style="font-size:12px; color:#0066cc;" target="_blank" href="/tellEditor/previewOrDownload/8833048008381252472?&hasDownload=true" title="项目开发帮助文档.docx">项目开发帮助文档.docx</a></p><p><br/></p>

I need to extract the ids from src="/downloadImg?id=7566876320816252412", href="/tellEditor/previewOrDownload/6405467898840828674?&hasDownload=true", href="/tellEditor/previewOrDownload/3250930489916801852?&hasDownload=true" and href="/tellEditor/previewOrDownload/8833048008381252472?&hasDownload=true". I remembered that Jsoup, which I had used before when writing a crawler, can parse HTML, so this is what I did:

		String html = "<p>20200310<img src=\"/downloadImg?id=7566876320816252412\" title=\"835637e39dc0bdb4c29f5e1adb5528a.png\" alt=\"835637e39dc0bdb4c29f5e1adb5528a.png\"/></p><p style=\"line-height: 16px;\"><img src=\"http://localhost/static/ueditor/dialogs/attachment/fileTypeImages/icon_txt.gif\"/><a style=\"font-size:12px; color:#0066cc;\" target=\"_blank\" href=\"/tellEditor/previewOrDownload/6405467898840828674?&hasDownload=true\" title=\"本地数据库连接 -自己.txt\">本地数据库连接 -自己.txt</a></p><p style=\"line-height: 16px;\"><img src=\"http://localhost/static/ueditor/dialogs/attachment/fileTypeImages/icon_txt.gif\"/><a style=\"font-size:12px; color:#0066cc;\" target=\"_blank\" href=\"/tellEditor/previewOrDownload/3250930489916801852?&hasDownload=true\" title=\"工作安排计划.xls\">工作安排计划.xls</a></p><p style=\"line-height: 16px;\"><img src=\"http://localhost/static/ueditor/dialogs/attachment/fileTypeImages/icon_txt.gif\"/><a style=\"font-size:12px; color:#0066cc;\" target=\"_blank\" href=\"/tellEditor/previewOrDownload/8833048008381252472?&hasDownload=true\" title=\"187项目开发帮助文档.docx\">187项目开发帮助文档.docx</a></p><p><br/></p>";
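		// Needs jsoup on the classpath. Imports used by this fragment:
		// org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
		// org.jsoup.select.Elements, java.util.ArrayList, java.util.List
		Document document = Jsoup.parse(html);
		// img tags that have a title attribute: this matches the uploaded image only,
		// because the attachment icon images carry no title
		Elements imgElements = document.select("img[title]");
		// a tags that have an href attribute
		Elements aElements = document.select("a[href]");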
		List<String> imgStrings = new ArrayList<String>();
		List<String> aStrings = new ArrayList<String>();
		for(Element element:imgElements) {
			String src = element.attr("src");
			imgStrings.add(src);
		}
		for(Element element:aElements) {
			String href = element.attr("href");
			aStrings.add(href);
		}
		
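		for(String aString:aStrings) {
			// The id sits between the "d/" that ends "previewOrDownload/" and the "?" that starts the query string
			System.out.println("Attachment id: "+aString.substring(aString.indexOf("d/")+2, aString.indexOf("?")));
		}
		for(String imgString:imgStrings) {
			// "/downloadImg?id=" is 16 characters long, so the id starts at index 16
			System.out.println("Image id: "+imgString.substring(16));
		}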
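
The two substring calls above depend on the exact shape of these URLs: the path has to end in "d/" and the image src has to start with exactly 16 characters before the id. As a rough sketch of a less position-dependent alternative, the helper below takes the last path segment for the attachment links and the value of the id query parameter for the image. The names IdExtractor, attachmentId and imageId are made up for illustration, and only the standard library is used, so the methods can be applied to the strings returned by element.attr("href") and element.attr("src").

```java
import java.util.Optional;

class IdExtractor {

    // Returns the last path segment of the link, ignoring any query string.
    static Optional<String> attachmentId(String href) {
        int queryStart = href.indexOf('?');
        String path = queryStart >= 0 ? href.substring(0, queryStart) : href;
        String id = path.substring(path.lastIndexOf('/') + 1);
        return id.isEmpty() ? Optional.empty() : Optional.of(id);
    }

    // Returns the value of the "id" query parameter, if the link has one.
    static Optional<String> imageId(String src) {
        int queryStart = src.indexOf('?');
        if (queryStart < 0) {
            return Optional.empty();
        }
        for (String pair : src.substring(queryStart + 1).split("&")) {
            if (pair.startsWith("id=")) {
                String value = pair.substring(3);
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // prints 6405467898840828674
        System.out.println(attachmentId("/tellEditor/previewOrDownload/6405467898840828674?&hasDownload=true").orElse("none"));
        // prints 7566876320816252412
        System.out.println(imageId("/downloadImg?id=7566876320816252412").orElse("none"));
        // prints none, because the icon link has no query string
        System.out.println(imageId("http://localhost/static/ueditor/dialogs/attachment/fileTypeImages/icon_txt.gif").orElse("none"));
    }
}
```

Returning Optional means a link that does not match (for example the icon_txt.gif icon) yields an empty result rather than a StringIndexOutOfBoundsException or a wrong slice.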