Two parallel HTML modules: A Parallel Web Page Acquisition Method Combining URL Patterns and HTML Structure

Abstract:

A parallel corpus is a fundamental resource for statistical machine translation, cross-lingual information retrieval, and other information processing technologies. Although the amount of parallel data on the web is continually increasing, the heterogeneity and complexity of parallel websites still make collecting such parallel texts a challenge. This paper presents a new approach to mining parallel web pages that combines URL patterns with HTML structure. First, HTML structure is used to recursively visit parallel pages. Then, URL patterns are used to optimize the traversal order of the parallel website's topology. The result is an efficient and accurate parallel page mining system. Compared with the traditional approach, experiments on two parallel websites (www.un.org and www.gov.hk) show that this approach saves more than 50% of processing time and improves accuracy by 15%, resulting in a significant increase in the translation quality of the MT system.

Key words: parallel pages mining; parallel corpus; URL pattern; HTML structure
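The abstract does not spell out the URL patterns it uses, so the following is only a minimal sketch of one common form of URL-pattern matching: swapping a language marker in the URL path to guess the address of a page's counterpart in the other language. The marker table `LANG_MARKERS`, the function name `candidate_parallel_urls`, and the example URLs are assumptions made for illustration, not details taken from the paper.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical source -> target language markers (an assumption for
# illustration; the paper's actual patterns are not given in the abstract).
LANG_MARKERS = {"en": "zh", "english": "chinese", "eng": "chi"}


def candidate_parallel_urls(url, markers=LANG_MARKERS):
    """Return candidate counterpart URLs made by swapping one language marker."""
    parts = urlsplit(url)
    candidates = []
    for src, dst in markers.items():
        # Match the marker only as a whole path token: preceded by / _ - or .
        # and followed by one of those delimiters or the end of the path.
        pattern = re.compile(r"(?<=[/_\-.])" + re.escape(src) + r"(?=[/_\-.]|$)")
        new_path, count = pattern.subn(dst, parts.path, count=1)
        if count:
            candidates.append(urlunsplit(parts._replace(path=new_path)))
    return candidates


if __name__ == "__main__":
    print(candidate_parallel_urls("https://www.un.org/en/about/index.html"))
```

A crawler could try such candidates before exploring other links, so that likely counterparts are visited first; whether a candidate really is a translation would still need to be checked, for example by comparing the HTML structure of the two pages.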
