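I am trying to index a large number of articles from a database that is encoded in Latin-1. I have solved the encoding problem with charset, but I cannot add every row to the index.
I tried: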
1) writer.add_document(Id=unicode(row["Id"]), Body=unicode(row["Body"]), Name=unicode(row["Name"]), Brand=unicode(row["Brand"]), Familia=unicode(row["Familia"]))
This indexes the document, but it does not take the index labels into account.
2)
^{pr2}$
This reports the error "add_document() takes exactly 1 argument (2 given)".
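As far as I can tell, that error is what Python raises when a method that accepts only keyword arguments is handed one positional value (self plus the tuple makes two). Below is a minimal sketch of that behaviour. It assumes Python 3, and StubWriter is a hypothetical stand-in I wrote for illustration, not the real writer class; the sample row values are made up as well.

```python
class StubWriter(object):
    """Hypothetical stand-in for an index writer; NOT the real library class."""

    def __init__(self):
        self.docs = []

    def add_document(self, **fields):
        # Assumption: like the real method, only keyword arguments are accepted.
        self.docs.append(fields)


# Made-up row, shaped like what a dictionary cursor would return.
row = {"Id": 7, "Body": "cuerpo", "Name": "nombre",
       "Brand": "marca", "Familia": "familia"}
writer = StubWriter()

# Passing a single positional tuple fails: Python counts self plus the tuple.
try:
    writer.add_document((row["Id"], row["Body"]))
    positional_failed = False
except TypeError:
    positional_failed = True

# Building a dict of field name -> text and unpacking it with ** works.
fields = {name: str(value) for name, value in row.items()}
writer.add_document(**fields)
```

If this matches the real behaviour, building the fields as a dict and unpacking it with ** would avoid the error, while a tuple would not.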
Here is the full code:
# Open a writer for the index
with ix.writer() as writer:
    con = mdb.connect(host="myhost",
                      user="myuser",
                      passwd="pass",
                      db="db",
                      charset="utf8",
                      use_unicode=True)
    with con:
        cur = con.cursor(mdb.cursors.DictCursor)
        #cur.execute("SELECT Id, Body, Name, Brand, Familia FROM articles")
        rows = cur.fetchall()
        for row in rows:
            print row
            doc6 = row["Brand"]
            doc2 = row["Name"]
            print doc2
            print 'body'
            doc3 = row["Body"].replace("á", "a")
            doc3 = doc3.replace("é", "e")
            doc3 = doc3.replace("í", "i")
            doc3 = doc3.replace("ó", "o")
            doc3 = doc3.replace("ú", "u")
            doc3 = doc3.replace("ñ", "n")
            doc3 = doc3.replace('"', "")
            print doc3
            print 'familia'
            doc4 = row["Familia"]
            print doc4
            print 'id'
            doc5 = row["Id"]
            print doc5
            writer.add_document(Id = unicode(row["Id"]),Body = unicode(row["Body"]), Name = unicode(row["Name"]), Brand = unicode(row["Brand"]), Familia = unicode(row["Familia"]))
            #
            # doc = unicode(doc5),unicode(doc3), unicode(doc2), unicode(doc6), unicode(doc4)
            # writer.add_document(doc) #reports add_document() takes exactly 1 argument (2 given) Error
            #writer.add_document(Id = unicode(doc5),Body = unicode(doc3), Name = unicode(doc2), Brand = unicode(doc6), Familia = unicode(doc4))
numdocs = ix.doc_count_all()
print "docs indexed =", numdocs
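For reference, the chain of replace calls on the body text could probably be written more compactly. The following is only a sketch, assuming Python 3 and the standard unicodedata module; strip_accents is a name I made up, and it removes every combining mark, so it is broader than the six replacements in my code (it would also turn "ü" into "u", for example).

```python
import unicodedata


def strip_accents(text):
    # Decompose each character (NFD), then drop the combining marks,
    # so that "á" becomes "a" and "ñ" becomes "n".
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Made-up sample text; the double quotes are removed as in the original loop.
cleaned = strip_accents('camión "señal" útil').replace('"', "")
```

The cleaned value is what I would then pass as the Body field, rather than the raw row["Body"].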
Thanks in advance, everyone!