Two parallel HTML modules: A Parallel Web Page Acquisition Method Combining URL Patterns and HTML Structure

上海积分吴老师

Published 2021-06-17 12:45:06

Article tags: two parallel HTML modules

Abstract:

Parallel corpora are a fundamental resource for statistical machine translation, cross-lingual information retrieval, and other information processing technologies. Although the amount of parallel data on the web is continually increasing, the heterogeneity and complexity of parallel websites still make collecting such parallel texts a challenge. This paper presents a new parallel web page mining approach that combines URL patterns with HTML structure. First, we use HTML structure to recursively visit parallel pages. Then, URL patterns are used to optimize the traversal order of the parallel website topology. The result is an efficient and accurate parallel page mining system. Compared with the traditional approach, experiments on two parallel websites (www.un.org and www.gov.hk) show that this approach saves more than 50% of processing time and improves accuracy by 15%, resulting in a significant increase in the translation quality of the MT system.

Key words: parallel pages mining; parallel corpus; URL pattern; HTML structure
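The abstract does not spell out which URL patterns the system uses, so the following is only a minimal sketch of the general idea of URL-pattern matching: given the URL of a page in one language, propose candidate URLs for its counterpart by swapping a language marker. The marker pairs in `LANGUAGE_MARKERS`, the function name `candidate_parallel_urls`, and the example.com URLs are all assumptions made for illustration, not details taken from the paper.

```python
import re

# Hypothetical language-marker pairs (source marker, target marker).
# The paper's actual URL patterns are not given in the abstract.
LANGUAGE_MARKERS = [("en", "zh"), ("english", "chinese"), ("eng", "chi")]


def candidate_parallel_urls(url):
    """Return candidate counterpart URLs made by swapping one language marker."""
    candidates = []
    for src, tgt in LANGUAGE_MARKERS:
        # Match the marker only as a whole token, i.e. not preceded or
        # followed by a letter or digit, so "en" does not fire inside "english".
        pattern = re.compile(
            r"(?<![A-Za-z0-9])" + re.escape(src) + r"(?![A-Za-z0-9])"
        )
        # Replace only the first occurrence of the marker.
        swapped = pattern.sub(tgt, url, count=1)
        if swapped != url and swapped not in candidates:
            candidates.append(swapped)
    return candidates


if __name__ == "__main__":
    for page in ("http://example.com/en/about/index.html",
                 "http://example.com/english/index.html"):
        print(page, "->", candidate_parallel_urls(page))
```

In a crawler, candidates like these would still need to be verified, for example by fetching them and comparing the HTML structure of the two pages, since a URL that matches a pattern is not guaranteed to exist or to be a translation.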
