This paper presents a method for extracting the post body and the comments from blog pages.
Donglin Cao, Xiangwen Liao, Hongbo Xu, Shuo Bai. Blog Post and Comment Extraction Using Information Quantity of Web Format. In Proceedings of the 2008 Asia Information Retrieval Symposium (AIRS-2008), January 15-18, 2008, Harbin, China.
Abstract: As research on the blogosphere develops, extracting the post and the comments from a blog page has become increasingly important for improving search performance. In this paper, we present a two-stage method. First, we combine visual information with effective text information to locate the main text, which represents the theme of the blog page. Second, we use the information quantity of separators to detect the boundary between the post and the comments. According to our experiments, this method achieves good extraction performance and improves the performance of blog search.
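The abstract does not define "information quantity", so the following is only a rough sketch of how the second stage might look, not the authors' algorithm. It assumes that the main text has already been split into blocks, that each gap between blocks is labelled with a separator string (the labels `br`, `hr`, and `div.comment` are made up for illustration), and that information quantity means the self-information -log2(p) of a separator, with p estimated from its relative frequency on the page. Under these assumptions, separators that repeat between comments carry little information, and the rarest separator is taken as the boundary between the post and the comments. The function names are hypothetical.

```python
import math
from collections import Counter


def separator_information(separators):
    """Self-information -log2(p) of each separator.

    Assumption: p is the separator's relative frequency on the page.
    The paper may define information quantity differently.
    """
    counts = Counter(separators)
    total = len(separators)
    return [-math.log2(counts[s] / total) for s in separators]


def guess_post_comment_boundary(separators):
    """Index of the first separator with the highest self-information.

    Returns None when the page has no separators.
    """
    if not separators:
        return None
    info = separator_information(separators)
    return max(range(len(info)), key=info.__getitem__)


# Hypothetical page: two line breaks inside the post, one horizontal
# rule, then three comment containers.
separators = ["br", "br", "hr", "div.comment", "div.comment", "div.comment"]
boundary = guess_post_comment_boundary(separators)
print(boundary, separators[boundary])
```

In this toy page the lone `hr` is the rarest separator, so it is picked as the boundary. A real page would need the first stage (locating the main text) to remove navigation and sidebar blocks first, otherwise rare separators in the page chrome could win.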