Defining Frame Segmentation in HTML: A Web Page Segmentation Algorithm Based on HTML Document Layout

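Making effective use of large volumes of dynamic web page data is a challenge. The traditional Vision-based Page Segmentation algorithm (VIPS) runs into difficulty on DHTML pages. This article presents a new method that incorporates HTML document layout characteristics to address the problem that no visual separator can be found in DHTML pages, improving the accuracy of page block segmentation.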

Abstract:

According to statistics, as of 2010 a total of 113 million websites had existed, 99.9% of which were established within the preceding 15 years. With web data this large and this quickly replaced, using it effectively is an important problem. For large volumes of changing data whose location is not known, we usually rely on a search engine to locate it; for data at a known address, we use data extraction techniques to improve efficiency. Whether the tool is a search engine or a data extractor, the first step in analyzing a complex web page is to classify or label its blocks, filtering out noise blocks and extracting the main-text block of each topic region. This step is page segmentation.

After a Microsoft team published the Vision-based Page Segmentation algorithm (VIPS) in 2003, much page segmentation research drew on visual segmentation techniques. In recent years, however, as more and more page layouts have come to be built mainly with DHTML, the original VIPS method has shown small defects that its design did not anticipate. Later studies proposed many hybrid page segmentation algorithms to compensate, but because they patch VIPS with algorithms based on other features, the blocks produced that way lose the characteristics of visual segmentation. This paper proposes a method that keeps visual segmentation as its basis and brings in HTML document layout features (HTML Rendering-Based) to solve the problem that visual block segmentation may fail to find a visual separator on DHTML pages.
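To make the "missing separator" problem concrete, here is a minimal Python sketch. It is not VIPS itself and not the method this paper proposes. It only illustrates the general idea of a horizontal visual separator as a vertical gap that no rendered block covers. The `Box` type, the `horizontal_separators` function, and the pixel coordinates are all hypothetical and chosen for illustration.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """Rendered vertical extent of a block in pixels (hypothetical type)."""
    top: int
    bottom: int


def horizontal_separators(boxes: List[Box], page_height: int) -> List[Tuple[int, int]]:
    """Return the (start, end) vertical gaps that no box covers.

    Simplified stand-in for separator detection: a gap between rendered
    blocks is treated as a candidate horizontal separator.
    """
    separators = []
    cursor = 0  # lowest y coordinate covered so far
    for box in sorted(boxes, key=lambda b: b.top):
        if box.top > cursor:
            separators.append((cursor, box.top))
        cursor = max(cursor, box.bottom)
    if cursor < page_height:
        separators.append((cursor, page_height))
    return separators


# Assumed example: a flow layout where blocks are stacked with gaps.
static_layout = [Box(0, 100), Box(120, 400), Box(420, 500)]

# Assumed example: a DHTML-style layout where positioned blocks overlap.
dhtml_layout = [Box(0, 130), Box(110, 410), Box(400, 500)]

print(horizontal_separators(static_layout, 500))
print(horizontal_separators(dhtml_layout, 500))
```

In the stacked layout the gaps between blocks show up as separators. In the overlapping layout the function returns an empty list, which is the kind of situation the abstract describes: a purely visual approach has no separator to cut on.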
