April 2011_fxjtoday


[Original] boilerpipe (Boilerplate Removal and Fulltext Extraction from HTML pages) source code analysis

The open-source Java module boilerpipe (1.1.0), http://code.google.com/p/boilerpipe/ Usage example: URL url = new URL("http://www.example.com/some-location/index.html"); // NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you String text = ArticleExtractor.IN

2011-04-13 13:30:00 7331

[Original] decruft (A library to extract meaningful data from a webpage) source code analysis

An open-source Python module, http://code.google.com/p/decruft/ decruft usage example: from decruft import Document #import urllib2 #f = urllib2.urlopen('url') f = open('index.html', 'r') print Document(f.read()).summary() Looking at how summary is implemented: overall there is no complicated theory behind it; it mainly works from each paragraph's word number, link

2011-04-13 11:33:00 3556
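To make the idea in the decruft excerpt concrete, here is a minimal sketch of scoring paragraphs by word count and by how much of their text sits inside links. This is not decruft's actual summary() code. The excerpt is cut off after "link", so treating the second signal as link density is an assumption, and the names ParagraphStats and score_paragraphs are hypothetical. It uses only the Python 3 standard library.

```python
from html.parser import HTMLParser


class ParagraphStats(HTMLParser):
    """Hypothetical helper: for each <p>, count all words and words inside <a>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # one [total_words, link_words] pair per <p>
        self._in_p = False
        self._a_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraphs.append([0, 0])
            self._in_p = True
        elif tag == "a":
            self._a_depth += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False
        elif tag == "a" and self._a_depth:
            self._a_depth -= 1

    def handle_data(self, data):
        if not self._in_p:
            return
        n = len(data.split())
        self.paragraphs[-1][0] += n
        if self._a_depth:
            self.paragraphs[-1][1] += n


def score_paragraphs(html):
    """Return one score per <p>: word count scaled down by link density (assumed rule)."""
    parser = ParagraphStats()
    parser.feed(html)
    parser.close()
    scores = []
    for words, link_words in parser.paragraphs:
        # An empty paragraph is treated as pure boilerplate (density 1.0).
        density = link_words / words if words else 1.0
        scores.append(words * (1.0 - density))
    return scores


sample = '<p>one two three four</p><p><a href="#">home</a> <a href="#">about</a></p>'
print(score_paragraphs(sample))
```

Under this rule a paragraph of plain prose keeps its full word count as its score, while a navigation-style paragraph made only of links drops to zero. A real extractor would combine more signals than this.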

[Original] Python's standard logging module

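When developing in Python I have always used a logging module I wrote myself, which is rather crude... Today I found that Python's standard module handles this quite well, so I am noting it here and will use the standard module for logging from now on. There are already plenty of introductions to this module online, so I do not need to write my own. The better ones are: http://crazier9527.iteye.com/blog/290018 Python's standard logging module http://blog.endlesscode.com/2010/06/03/python-logging-module/ P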

2011-04-07 16:00:00 75371 6
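For the logging entry above, a minimal sketch of what replacing a hand-written logger with the standard logging module can look like. The helper name build_logger, the logger name "demo" and the format string are illustrative assumptions, not taken from the post. Everything else is standard-library API, written for Python 3.

```python
import io
import logging


def build_logger(name, stream, level=logging.INFO):
    """Hypothetical helper: return a logger that writes formatted records to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # keep records out of the root logger's handlers
    # Drop handlers left over from earlier calls so lines are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger


buf = io.StringIO()
log = build_logger("demo", buf)
log.debug("hidden")  # below INFO, so it is filtered out
log.info("started")
log.warning("disk at %d%%", 91)  # arguments are formatted lazily by logging
print(buf.getvalue(), end="")
```

Passing sys.stderr or an open file instead of the StringIO buffer sends the same records there. Level filtering and formatting stay in one place rather than being scattered through print statements.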
