from BeautifulSoup import *
from urlparse import urljoin
# Stop words the keyword-based index skips (conjunctions, articles, etc.).
# NOTE(review): the name is misspelled ("ignaorewords"); addtoindex() reads
# this exact spelling, so renaming requires changing both places together.
ignaorewords=set(['the','of','to','and','a','in','is','it'])
# Our search engine is keyword-based, so conjunctions and articles are ignored.
# The code below is the crawler: it stores each page's text in our sqlite
# database. You only need to know what these functions are for.
from sqlite3 import dbapi2 as sqlite
import urllib2
class crawler:
def __init__(self,dbname):
self.con=sqlite.connect(dbname)
#連接並建立數據庫, dbname 隨意, 'xxx.db'就可以
def __del__(self):
self.con.close()
def dbcommit(self):
self.con.commit()
def getentryid(self,table,field,value,createnew=True):
cur=self.con.execute(
"select rowid from %s where %s='%s'" %(table,field,value))
res=cur.fetchone()
if res==None:
cur=self.con.execute(
"insert into %s (%s) values ('%s')" % (table,field,value))
return cur.lastrowid
else:
return res[0]
def addtoindex(self,url,soup):
if self.isindexed(url): return
print 'Indexing',url
#Get words
text=self.gettextonly(soup)
words=self.separatewords(text)
#Get URL id
urlid=self.getentryid('urllist','url',url)
# Link word to url
for i in range(len(words)):
word=words[i]
if word in ignaorewords: continue
wordid=self.getentryid('wordlist','word',word)
self.con.execute("insert into wordlocation(urlid,wordid,location) \
values(%d,%d,%d)" % (urlid,wordid,i))
def gettextonly(self,soup):
v=soup.string
if v==None: