Programming Collective Intelligence reading notes, part 3

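searchengine:
A simplified searchengine skeleton... it follows links down to a given depth...
I tried it, but the network at my company is too slow for the crawl to get anywhere
...the bottlenecks of a searchengine are bandwidth and concurrent computation
Python's BeautifulSoup library is very good; PHP and Ruby do not seem to have one this good, so extracting links there means writing your own matching function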
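
To make the link-extraction step concrete, here is a minimal sketch that uses only the Python 3 standard library (`html.parser` and `urllib.parse`) rather than BeautifulSoup. The `LinkParser` class, the `extract_links` helper, and the sample HTML are hypothetical names and data made up for illustration; they are not from the book.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect absolute URLs from the href attribute of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the absolute link targets found in an HTML string."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, for illustration only.
sample = (
    '<p><a href="Citeseer.html">Citeseer</a> '
    '<a name="top">no href here</a> '
    '<a href="/wiki/Noctis.html">Noctis</a></p>'
)
links = extract_links("http://kiwitobes.com/wiki/", sample)
print(links)
```

Anchors without an `href` are skipped, which is the same check the crawler makes with `'href' in dict(link.attrs)`.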

>>> import searchengine
>>> c=searchengine.crawler('')
>>> c.crawl()
indexing http://kiwitobes.com/wiki/
indexing http://kiwitobes.com/wiki/Citeseer.html
indexing http://kiwitobes.com/wiki/Insert_%2528SQL%2529.html
indexing http://kiwitobes.com/wiki/Spacecraft_propulsion.html
indexing http://kiwitobes.com/wiki/Noctis.html
indexing http://kiwitobes.com/wiki/Methods.html
indexing http://kiwitobes.com/wiki/32_%2528number%2529.html
...


searchengine.py
------------------------
import urllib2
from BeautifulSoup import *
from urlparse import urljoin

# Words to ignore when indexing
ignorewords=set(['the','of','to','and','a','in','is','it'])
# Default starting page(s) for the crawl
baseUrl = set(['http://kiwitobes.com/wiki/'])

class crawler:
    # Most methods are stubs for now; only crawl() does real work
    def __init__(self,dbname):
        pass

    def __del__(self):
        pass

    def dbcommit(self):
        pass

    def getentryid(self,table):
        return None

    def gettextonly(self,soup):
        return None

    def separatewords(self,text):
        return None

    def addlinkref(self,urlFrom,urlTo,linkText):
        pass

    def createindextables(self):
        pass

    def addtoindex(self,url,soup):
        print 'indexing %s' % url

    def isindexed(self,url):
        return False

    # Starting from pages, follow links level by level, depth times
    def crawl(self,pages=baseUrl,depth=2):
        for i in range(depth):
            newpages = set()
            for page in pages:
                try:
                    c = urllib2.urlopen(page)
                except Exception:
                    print "could not open %s" % page
                    continue
                soup = BeautifulSoup(c.read())
                self.addtoindex(page,soup)

                # Collect every <a> tag that has an href
                links = soup('a')
                for link in links:
                    if ('href' in dict(link.attrs)):
                        # Resolve relative links against the current page
                        url = urljoin(page,link['href'])
                        if not self.isindexed(url):
                            newpages.add(url)
                        text = self.gettextonly(link)
                        self.addlinkref(page,url,text)

                self.dbcommit()

            # The links found on this level become the next level's pages
            pages = newpages
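
Since the network was too slow to watch the loop run, here is a small Python 3 sketch of the same level-by-level traversal that works offline. It makes several assumptions: pages are fetched through a `fetch_links` function passed in by the caller, a `seen` set stands in for `isindexed` (which is still a stub in the listing), and `GRAPH` is a made-up link graph rather than real crawl data.

```python
def crawl(fetch_links, start_pages, depth=2):
    """Visit pages level by level and return them in the order indexed."""
    indexed = []
    seen = set()  # stand-in for isindexed(): pages already visited
    pages = set(start_pages)
    for _ in range(depth):
        newpages = set()
        # Sorted only so that the visiting order is repeatable.
        for page in sorted(pages):
            if page in seen:
                continue
            seen.add(page)
            indexed.append(page)
            for url in fetch_links(page):
                if url not in seen:
                    newpages.add(url)
        # Links found on this level become the pages of the next level.
        pages = newpages
    return indexed


# Hypothetical in-memory link graph standing in for the network.
GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["a", "d"],
    "d": ["e"],
}


def fetch_from_graph(page):
    return GRAPH.get(page, [])


order = crawl(fetch_from_graph, ["a"], depth=2)
print(order)
```

With `depth=2` only the start page and the pages it links to are indexed; each extra level of depth indexes the links collected on the previous pass.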

--------------
chenjinlai
2008-05-07