Building a Customized Crawler with HtmlParser: Day 3

hymcn

Category: HtmlParser. Tags: string, filter, class, task extension

Article link: https://blog.csdn.net/hymcn/article/details/7451376

Task: extracting text content

	import org.htmlparser.Node;
	import org.htmlparser.NodeFilter;
	import org.htmlparser.Parser;
	import org.htmlparser.Tag;
	import org.htmlparser.util.NodeList;
	import org.htmlparser.util.ParserException;

	/*
	 * Extract text by tag name and by the value of the tag's class attribute.
	 * @author hym
	 */
	public static String getTextByTagNameAndAttributeName(final String tagName,final String attributeName)
	{
		String temp = "";
		try {
			Parser parser = new Parser("http://sthaboutme.sinaapp.com/?p=66");
			NodeFilter filter = new NodeFilter()
			{
				@Override
				public boolean accept(Node node) {
					if(!(node instanceof Tag))
					{
						return false;
					}
					Tag tag = (Tag)node;
					String cls = tag.getAttribute("class");
					// getTagName() returns the name in upper case, so compare ignoring case.
					// attributeName is matched against the end of the class attribute value.
					return tag.getTagName().equalsIgnoreCase(tagName)&&cls!=null&&cls.endsWith(attributeName);
				}
			};
			NodeList nlist = parser.extractAllNodesThatMatch(filter);
			if(nlist.size()>0)
			{
				// Return the HTML of the first matching node only.
				temp = nlist.elementAt(0).toHtml();
			}
		} catch (ParserException e) {
			e.printStackTrace();
		}
		return temp;
	}
	

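The method above works well on reasonably well-formed pages. For malformed pages, the text can instead be extracted by cutting substrings out of the raw page source. Note that the method only extracts the first node that matches; use cases differ, so feel free to extend it.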
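
As a rough illustration of the substring-cutting fallback, and of returning every match rather than only the first, here is a minimal sketch that uses only the Java standard library. The class name `SubstringExtractor`, the method `extractBetween`, and the sample markup are all made up for this example; they are not part of HtmlParser, and the marker strings would have to be chosen per page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HtmlParser: cuts text out of raw HTML by marker strings.
final class SubstringExtractor {

    /**
     * Returns every substring of html that lies between startMarker and the
     * next endMarker after it, in document order. Matching is case-sensitive.
     */
    static List<String> extractBetween(String html, String startMarker, String endMarker) {
        List<String> results = new ArrayList<>();
        if (html == null || startMarker == null || endMarker == null
                || startMarker.isEmpty() || endMarker.isEmpty()) {
            return results;
        }
        int from = 0;
        while (true) {
            int start = html.indexOf(startMarker, from);
            if (start < 0) {
                break; // no further start marker
            }
            int contentStart = start + startMarker.length();
            int end = html.indexOf(endMarker, contentStart);
            if (end < 0) {
                break; // start marker without a matching end marker: stop
            }
            results.add(html.substring(contentStart, end).trim());
            from = end + endMarker.length();
        }
        return results;
    }

    public static void main(String[] args) {
        // Made-up, deliberately sloppy markup (unquoted attribute, stray <br>).
        String html = "<div class=entry>First post<br></div>"
                + "<DIV>ignored</DIV>"
                + "<div class=entry> Second post </div>";
        for (String text : extractBetween(html, "<div class=entry>", "</div>")) {
            System.out.println(text);
        }
    }
}
```

This approach is brittle by design: it depends on the exact marker text and does not understand nesting, so it is best kept as a fallback for pages the parser cannot handle.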
