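Documentation for coreseek is really scarce, and what exists is not user-friendly.
The official docs are a must-read (I am still working through them):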
http://www.tapy.org/sphinx1.0/sphinx.html
http://www.coreseek.cn/docs/coreseek_4.1-sphinx_2.0.1-beta.html
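MongoDB is not a data source that coreseek/sphinx supports natively, so I read the MongoDB data with Python, expose it as a Python data source, and let coreseek build the index from that.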
# -*- coding:utf-8 -*-
# author: Hao
import pymongo

class MainSource(object):
    def __init__(self, conf):
        self.conf = conf
        self.idx = 0
        self.data = []
        self.conn = None
        self.cur = None

    def GetScheme(self):  # describe the schema: docid, text fields, integer attributes
        return [
            ('id', {'docid': True}),
            ('subject', {'type': 'text'}),
            ('context', {'type': 'text'}),
            ('published', {'type': 'integer'}),
            ('author_id', {'type': 'integer'}),
        ]

    def GetFieldOrder(self):  # choose the searchable fields and their priority order
        return [('subject', 'context')]

    def Connected(self):  # for a database-backed source, connect here
        if self.conn is None:
            # read the data from MongoDB
            # (pymongo.Connection was removed in pymongo 3.0; use pymongo.MongoClient there)
            self.conn = pymongo.Connection('localhost', 27017)
            db = self.conn.test
            dbCollection = db.database
            commentset = []
            cursor = dbCollection.find()
            for commentPiece in cursor:
                sentence = commentPiece['content']
                commentset.append(sentence)
            for i in range(len(commentset)):
                # docid must be a non-zero positive integer, hence i + 1
                self.data.append({
                    'id': i + 1,
                    'subject': u'number' + str(i),
                    'context': commentset[i],
                    'published': 1,
                    'author_id': 1,
                })

    def NextDocument(self, _=None):  # called once for each document record
        # the default argument lets the demo below call this without arguments
        if self.idx < len(self.data):
            item = self.data[self.idx]
            self.id = item['id']  # 'docid': True
            self.subject = item['subject'].encode('utf-8')
            self.context = item['context'].encode('utf-8')
            self.published = item['published']
            self.author_id = item['author_id']
            self.idx += 1
            return True
        else:
            return False

if __name__ == "__main__":  # standalone demo
    conf = {}
    source = MainSource(conf)
    source.Connected()
    while source.NextDocument():
        print("id=%d, context=%s" % (source.id, source.context))
#eof
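The calling protocol used by the demo section above (Connected() once, then NextDocument() until it returns False) can be dry-run without a MongoDB server. The following is only a sketch: ListSource and drain are hypothetical names invented for illustration, and the assumption that the indexer drives a source the same way the __main__ demo does is mine, not something confirmed here.

```python
class ListSource(object):
    """Hypothetical stand-in for MainSource that serves rows from a plain list."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts with 'id' and 'context' keys
        self.idx = 0
        self.connected = False

    def Connected(self):
        # A real source would open its database connection here.
        self.connected = True

    def NextDocument(self, _=None):
        # Copy the next row onto attributes; return False when exhausted.
        if self.idx >= len(self.rows):
            return False
        row = self.rows[self.idx]
        self.id = row['id']
        self.context = row['context']
        self.idx += 1
        return True


def drain(source):
    """Drive a source like the __main__ demo and collect (id, context) pairs."""
    source.Connected()
    seen = []
    while source.NextDocument():
        seen.append((source.id, source.context))
    return seen


rows = [{'id': 1, 'context': 'first'}, {'id': 2, 'context': 'second'}]
result = drain(ListSource(rows))
```

Checking a source this way before handing it to coreseek makes it easier to tell apart bugs in the Python class from problems in the index configuration.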
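For a while I simply could not get the document id accepted. At first I suspected that the format of
'id' , {'docid':True, }
was wrong, but it turns out this is a fixed idiom. Judging from other people's examples, with this declaration the docid can simply be 1, 2, 3, 4, 5...
It was frustrating, and I could not find anything about it online, so once I had worked it out by trial and error I wrote this post for everyone.
self.data.append({'id':i+1, 'subject':u'number'+str(i),'context':commentset[i], 'published':1, 'author_id':1 })
The real cause: docid must be a non-zero positive integer. Counting with i in range() starts at 0, so using i directly as the id kept raising an error; i+1 fixes it.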
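The docid rule can be made explicit in code. This is a minimal sketch, not part of the original source class: build_records and check_docids are hypothetical helper names, the record layout just mirrors the append call above, and the extra uniqueness check reflects Sphinx's general requirement that document ids be unique.

```python
def build_records(comments):
    """Turn a list of comment strings into records with docids starting at 1."""
    records = []
    for docid, text in enumerate(comments, start=1):
        records.append({
            'id': docid,
            # same labels as in the post: number0, number1, ...
            'subject': u'number' + str(docid - 1),
            'context': text,
            'published': 1,
            'author_id': 1,
        })
    return records


def check_docids(records):
    """Raise ValueError if an id is not a positive integer or is repeated."""
    seen = set()
    for rec in records:
        docid = rec['id']
        if not isinstance(docid, int) or docid <= 0:
            raise ValueError('docid must be a positive integer, got %r' % (docid,))
        if docid in seen:
            raise ValueError('duplicate docid %r' % (docid,))
        seen.add(docid)


records = build_records([u'good', u'bad'])
check_docids(records)  # passes silently: ids are 1 and 2
```

Using enumerate with start=1 removes the i versus i+1 bookkeeping, and running check_docids before indexing turns a confusing indexer error into a clear Python exception.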
Also, I first tried loading the MongoDB data in the __init__ method, and it failed. I assumed the loading had to live in the Connected method, but that failure was again caused by the ids starting from 0. If you do not mind the ugliness, this block
        # read the data from MongoDB
        conn = pymongo.Connection('localhost', 27017)
        db = conn.test
        dbCollection = db.database
        commentset = []
        cursor = dbCollection.find()
        for commentPiece in cursor:
            sentence = commentPiece['content']
            commentset.append(sentence)
        for i in range(len(commentset)):
            self.data.append({'id':i+1, 'subject':u'number'+str(i),'context':commentset[i], 'published':1, 'author_id':1 })
can go in the __init__ method instead.
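To illustrate that the place where the data is loaded does not change the result, here is a small sketch. Everything in it is hypothetical: load_rows, EagerSource, LazySource and fake_fetch are made-up names, and fake_fetch stands in for iterating over dbCollection.find() so the example runs without MongoDB.

```python
def load_rows(fetch):
    """Build index rows from whatever fetch() returns; ids start at 1."""
    return [{'id': n, 'context': text}
            for n, text in enumerate(fetch(), start=1)]


class EagerSource(object):
    """Loads everything in __init__."""

    def __init__(self, fetch):
        self.data = load_rows(fetch)

    def Connected(self):
        pass  # nothing left to do


class LazySource(object):
    """Defers loading until Connected() is called."""

    def __init__(self, fetch):
        self.fetch = fetch
        self.data = []
        self.loaded = False

    def Connected(self):
        if not self.loaded:  # guard so a second call does not reload
            self.data = load_rows(self.fetch)
            self.loaded = True


def fake_fetch():
    # Hypothetical replacement for iterating dbCollection.find()
    return ['good', 'bad', 'ugly']


eager = EagerSource(fake_fetch)
lazy = LazySource(fake_fetch)
lazy.Connected()
lazy.Connected()  # harmless thanks to the guard
```

Both variants end up with the same rows. Keeping the load in Connected only delays the database work until the source is actually used.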
Still exploring. Good luck to you and me.