Removing noise from web pages with the htmlparser package

behappy373

Article link: https://blog.csdn.net/behappy373/article/details/4249915

```java
import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.nodes.TextNode;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;

class Test {
    // Character set used both to read the file and to create the parser.
    private static final String ENCODE = "GB2312";

    /**
     * Reads a whole file into a string.
     *
     * @author nobody
     * @param szFileName path of the file to read
     * @return the file content, or an empty string if reading fails
     */
    public static String readFile(String szFileName) {
        StringBuilder szContent = new StringBuilder();
        try (BufferedReader bis = new BufferedReader(new InputStreamReader(
                new FileInputStream(new File(szFileName)), ENCODE))) {
            String szTemp;
            while ((szTemp = bis.readLine()) != null) {
                szContent.append(szTemp).append("\n");
            }
            return szContent.toString();
        } catch (Exception e) {
            System.out.println("error2");
            return "";
        }
    }

    /**
     * Removes every element with the given tag name, including its content.
     * The match is case-sensitive.
     *
     * @param tag     tag name, for example "script"
     * @param content HTML to clean; modified in place
     * @return the same buffer with the elements removed
     */
    public static StringBuffer delTag(String tag, StringBuffer content) {
        String beginTag = "<" + tag;
        String endTag = "</" + tag + ">";
        int start;
        while ((start = content.indexOf(beginTag)) != -1) {
            int end = content.indexOf(endTag, start);
            if (end != -1) {
                // Delete from the opening tag through the whole closing tag.
                content.delete(start, end + endTag.length());
            } else {
                // Unterminated element: drop the rest and close the document.
                content.delete(start, content.length());
                content.append("</body></html>");
            }
        }
        return content;
    }

    /**
     * Removes noise from an HTML document and appends the remaining text to str.
     *
     * @param str         buffer that receives the extracted text
     * @param filecontent raw HTML
     * @return str, with one line appended per kept text node
     * @throws ParserException if HTMLParser cannot parse the document
     */
    public static StringBuffer readhtml(StringBuffer str, String filecontent) throws ParserException {
        // Remove <script> elements before parsing.
        filecontent = delTag("script", new StringBuffer(filecontent)).toString();

        Parser parser = Parser.createParser(filecontent, ENCODE);
        NodeFilter textFilter = new NodeClassFilter(TextNode.class);
        NodeFilter linkFilter = new NodeClassFilter(LinkTag.class);
        OrFilter lastFilter = new OrFilter();
        lastFilter.setPredicates(new NodeFilter[] {textFilter, linkFilter});

        NodeList nodeList = parser.parse(lastFilter);
        Node[] nodes = nodeList.toNodeArray();
        int beforeLen = 0;
        for (int i = 0; i < nodes.length; ++i) {
            Node anode = nodes[i];
            if (anode instanceof TextNode) {
                TextNode textnode = (TextNode) anode;
                String textLine = textnode.toPlainTextString().trim();
                // Strip leftover HTML entities and entity fragments.
                textLine = textLine.replace("p;", "");
                textLine = textLine.replace("&gt;", "");
                textLine = textLine.replace("&lt;", "");
                textLine = textLine.replace("&nbsp;", "");
                textLine = textLine.replace("&nbs", "");
                textLine = textLine.replace("&quot;", "");
                // Site-specific placeholder text to ignore.
                if (textLine.equals("不支持flasg")) {
                    continue;
                }
                // A copyright notice among the last ten nodes marks the footer: stop there.
                if (textLine.contains("&copy;") || textLine.contains("©")
                        || textLine.contains("copyright")) {
                    if (i > nodes.length - 10) {
                        break;
                    }
                }
                if (textLine.length() > 3) {
                    // Keep the line unless the previous line was at least four times as long.
                    if (beforeLen == 0 || beforeLen / textLine.length() < 4) {
                        str.append(textLine).append("\n");
                    }
                    beforeLen = textLine.length();
                }
            } else if (anode instanceof LinkTag) {
                // Skip the node after a link, assumed to be its anchor text.
                if (i < nodes.length - 1) {
                    i++;
                }
            }
        }
        return str;
    }

    public static void main(String[] args) throws Exception {
        StringBuffer mainContent = new StringBuffer();
        String szContent = readFile("E:/test");
        System.out.println(readhtml(mainContent, szContent));
    }
}
```

I wrote the code above while working through someone else's introductory tutorial, so it is more or less a copy.
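The line-length check inside `readhtml` is easy to miss in the full listing, so here is a rough sketch that isolates it using only the standard library. The names `LengthFilterSketch` and `filterByLength` are made up for illustration. Reading the check as a way to drop short fragments that follow a long line is an interpretation; the tutorial does not say so.

```java
import java.util.ArrayList;
import java.util.List;

class LengthFilterSketch {

    /**
     * Mirrors the beforeLen logic of readhtml on a plain list of lines:
     * lines of three characters or fewer are ignored entirely, and a longer
     * line is dropped when the previous considered line was at least four
     * times as long (integer division, as in the listing).
     */
    static List<String> filterByLength(List<String> lines) {
        List<String> kept = new ArrayList<>();
        int previousLength = 0;
        for (String line : lines) {
            if (line.length() <= 3) {
                continue; // too short to count; previousLength is left unchanged
            }
            if (previousLength == 0 || previousLength / line.length() < 4) {
                kept.add(line);
            }
            // Updated even when the line was dropped, as in the listing.
            previousLength = line.length();
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "HTMLParser walks the node tree and returns text.",
                "Share",
                "Top",
                "Short links and widgets are treated as noise.");
        for (String line : filterByLength(lines)) {
            System.out.println(line);
        }
    }
}
```

In this sample, "Share" is dropped because the line before it is more than four times as long, and "Top" is ignored because it has only three characters. The two full sentences are kept.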

Web page denoising

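Pages crawled from a website are HTML files that contain a great deal of markup, and the main content is only a small part of each file. Extracting just the text that a browser would display greatly reduces the file size. Body text usually contains few links, whereas the text shown by navigation bars, advertisements and similar blocks is mostly link text and is clearly not the main content. In general these links account for nearly all the links on the page, so removing the anchor text of every link is also an effective way to remove noise.

The experiment uses the open-source HTML parsing package HTMLParser, which makes it quick to obtain the links and text in an HTML document. It visits each node of the tree built by the Parser through filters, using TextNode and LinkTag as the filter classes. A link is normally followed by its anchor text, so whenever a link is detected, the text node that follows it is assumed to belong to the link and is removed. The text that remains is closer to the body of the page.

This approach removes most of the link content, which has little to do with the body of the page. It does not work well when the body itself contains many links with short anchor text, because that text is dropped and content is lost.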
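The link-skipping idea can be tried without HTMLParser. The sketch below is a simplified stand-in, not the library's API: the `Node` class and the method `keepBodyText` are hypothetical, and a flat list of nodes replaces the filtered parser output. It only illustrates the rule "drop each link and the node right after it".

```java
import java.util.ArrayList;
import java.util.List;

class LinkNoiseSketch {

    /** Hypothetical node: either a link (value = href) or a piece of text. */
    static final class Node {
        final boolean isLink;
        final String value;

        private Node(boolean isLink, String value) {
            this.isLink = isLink;
            this.value = value;
        }

        static Node text(String s) {
            return new Node(false, s);
        }

        static Node link(String href) {
            return new Node(true, href);
        }
    }

    /**
     * Keeps text nodes, drops link nodes, and also drops the single node
     * directly after each link, which is assumed to be its anchor text.
     */
    static List<String> keepBodyText(List<Node> nodes) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < nodes.size(); i++) {
            Node node = nodes.get(i);
            if (node.isLink) {
                i++; // skip whatever follows the link
            } else {
                kept.add(node.value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
                Node.link("/home"), Node.text("Home"),
                Node.link("/news"), Node.text("News"),
                Node.text("The parser walks the tree and keeps visible text."),
                Node.text("See "), Node.link("/docs"), Node.text("the docs"),
                Node.text(" for details."));
        for (String line : keepBodyText(nodes)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

The navigation entries "Home" and "News" disappear, which is the intended effect. The inline anchor "the docs" disappears as well, leaving "See " and " for details." with a gap between them. That is the content loss that occurs when the body contains short links.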

The figure shows the denoising process.