Whoosh User Manual (Schema) (Part 3)

twsxtd

About schema and fields

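The schema specifies the fields of the documents in an index.
Each document can have multiple fields, such as title, content, url, date, and so on.
Some fields can be indexed, and some fields can be stored with the document so that the field value is available in search results. Some fields are both indexed and stored.
The schema is the set of all possible fields in a document. Each individual document might use only a subset of the fields available in the schema. For example, a simple schema for indexing emails might have fields like from_addr, to_addr, subject, body, and attachments, where the attachments field lists the names of the attachments to the email. For emails without attachments, you would omit the attachments field.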

Built-in field types

Whoosh provides a number of useful predefined field types:

whoosh.fields.TEXT

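This type is for body text. It indexes the text (and optionally stores it) and stores term positions to allow phrase searching.

TEXT fields use StandardAnalyzer by default. To specify a different analyzer, pass the analyzer keyword argument to the constructor, e.g. TEXT(analyzer=analysis.StemmingAnalyzer()). See TextAnalysis.
By default, TEXT fields store position information for each indexed term to allow phrase searching. If you don't need phrase searching, you can turn positions off to save space: TEXT(phrase=False).
By default, TEXT fields are not stored. Usually you will not want to store the body text in the index: you normally have the indexed documents themselves available to link to from the search results, so there is no need to keep a copy in the index. In some circumstances it can be useful, though; use TEXT(stored=True) to specify that the text should be stored.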

whoosh.fields.KEYWORD

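This type is designed for space- or comma-separated keywords. It is indexed and searchable (and optionally stored). To save space, it does not support phrase searching.
To store the value of the field in the index, use stored=True in the constructor. To lowercase the keywords automatically before indexing them, use lowercase=True.
By default, the keywords are space-separated. To separate them by commas instead (which allows keywords containing spaces), use commas=True.
If your users will use the keyword field for searching, use scorable=True.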

whoosh.fields.ID

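The ID field type simply indexes (and optionally stores) the entire value of the field as a single unit, that is, it does not break the value up into separate terms. This field type does not store frequency information, so it is quite compact, but not very useful for scoring.
Use ID for fields such as a file path, URL, date, or category: fields where the value must be treated as a whole and each document has only one value.
By default, ID fields are not stored. Use ID(stored=True) to specify that the value should be stored. For example, you would want to store the value of a url field so that you can provide links in your search results.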

whoosh.fields.STORED

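This field is stored with the document, but is not indexed and not searchable. It is useful for information that you want to display in search results but that does not need to be searchable.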

whoosh.fields.NUMERIC

This field stores int, long, or floating point numbers in a compact, sortable format.

whoosh.fields.DATETIME

This field stores datetime objects in a compact, sortable format.

whoosh.fields.BOOLEAN

This field simply stores boolean values and lets users search for yes, no, true, false, 1, 0, t, or f.

whoosh.fields.NGRAM

TBD.
Expert users can create their own field types.

Creating a Schema

from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(from_addr=ID(stored=True),
                to_addr=ID(stored=True),
                subject=TEXT(stored=True),
                body=TEXT(analyzer=StemmingAnalyzer()),
                tags=KEYWORD)

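If you are not passing any keyword arguments to the constructor, you can leave off the parentheses (for example, fieldname=TEXT instead of fieldname=TEXT()). Whoosh will instantiate the class for you.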

Alternatively, you can create a schema declaratively by subclassing SchemaClass:

from whoosh.fields import SchemaClass, TEXT, KEYWORD, ID, STORED

class MySchema(SchemaClass):
        path = ID(stored=True)
        title = TEXT(stored=True)
        content = TEXT
        tags = KEYWORD

You can pass the class itself, rather than an instance of it, to create_in() or create_index().

Modifying the schema after indexing

After you have created an index, you can add or remove fields with the add_field() and remove_field() methods. These methods are on the Writer object:

from whoosh import fields

writer = ix.writer()
writer.add_field("fieldname", fields.TEXT(stored=True))
writer.remove_field("content")
writer.commit()

(If you are going to modify the schema and add documents with the same writer, you must call add_field() and/or remove_field() before you add any documents.)

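These methods are also available on the Index object as a convenience. However, when you call them on an Index, the Index object simply creates a writer, calls the corresponding method on it, and commits. So if you want to add or remove more than one field, it is more efficient to use a writer yourself.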

ix.add_field("fieldname", fields.KEYWORD)

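In the filedb backend, removing a field simply removes that field from the schema. The index will not get smaller, and the data for that field will remain in the index until you optimize. Optimizing compacts the index and removes any information associated with deleted fields: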

writer = ix.writer()
writer.add_field("uuid", fields.ID(stored=True))
writer.remove_field("path")
writer.commit(optimize=True)

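Because data is stored on disk under the field name, do not add a new field with the same name as a field you have removed without optimizing the index in between: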

writer = ix.writer()
writer.remove_field("path")
# Don't do this!!!
writer.add_field("path", fields.KEYWORD)

(A future version of Whoosh may handle this error automatically.)

Dynamic fields

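Dynamic fields let you associate a field type with any field name that matches a given wildcard ("glob") pattern.
To add a dynamic field to a new schema, use the add() method with the glob keyword argument set to True: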

schema = fields.Schema(...)
# Any name ending in "_d" will be treated as a stored
# DATETIME field
schema.add("*_d", fields.DATETIME(stored=True), glob=True)

To set up dynamic fields on an existing index, use the IndexWriter.add_field method as you would to add a regular field, making sure to pass glob=True:

writer = ix.writer()
writer.add_field("*_d", fields.DATETIME(stored=True), glob=True)
writer.commit()

To remove a dynamic field, use the IndexWriter.remove_field() method with the glob as the name:

writer = ix.writer()
writer.remove_field("*_d")
writer.commit()

For example, to allow documents to contain any field name that ends in _id, and to associate all such fields with the ID field type:

from whoosh import fields, index, qparser

schema = fields.Schema(path=fields.ID)
schema.add("*_id", fields.ID, glob=True)

# The "myindex" directory must already exist
ix = index.create_in("myindex", schema)

w = ix.writer()
w.add_document(path=u"/a", test_id=u"alfa")
w.add_document(path=u"/b", class_id=u"MyClass")
# ...
w.commit()

qp = qparser.QueryParser("path", schema=schema)
q = qp.parse(u"test_id:alfa")
with ix.searcher() as s:
    results = s.search(q)

Advanced schema setup

Field boosts

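You can specify a field boost for a field. This is a multiplier applied to the score of any term found in that field. For example, to make terms found in the title field score twice as high as terms found in any other field: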
schema = Schema(title=TEXT(field_boost=2.0), body=TEXT)

Field types

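The predefined field types listed above are subclasses of fields.FieldType. FieldType is a fairly simple class whose attributes contain the information that defines the behavior of a field.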

Attribute	Type	Description
format	fields.Format	Defines what kind of information a field records about each term, and how the information is stored on disk.
vector	fields.Format	Optional: if defined, the format in which to store per-document forward-index information for this field.
scorable	bool	If True, the length of (number of terms in) the field in each document is stored in the index. Slightly misnamed, since field lengths are not required for all scoring. However, field lengths are required to get proper results from BM25F.
stored	bool	If True, the value of this field is stored in the index.
unique	bool	If True, the value of this field may be used to replace documents with the same value when the user calls `update_document()` on an `IndexWriter`.

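The constructors for the predefined field types have parameters that let you customize these attributes. For example:
Most of the predefined field types take a stored keyword argument that sets FieldType.stored.
The TEXT() constructor takes an analyzer keyword argument that is passed on to a format object.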

Formats

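A Format object defines what kind of information a field records about each term, and how that information is stored on disk.
For example, the Existence format would store postings like this: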

Doc
10
20
30

Whereas the Positions format would store postings like this:

Doc	Positions
10	`[1,5,23]`
20	`[45]`
30	`[7,12]`

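The indexing code passes the unicode string for a field to the field's Format object. The Format object calls its analyzer to break the string into tokens, then encodes information about each token.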
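The idea behind the different formats can be mimicked in plain Python. This is a conceptual sketch only, not Whoosh's implementation or on-disk encoding: the tokenizer is a simple lowercase-and-split stand-in for an analyzer, and the documents are invented.

```python
from collections import defaultdict


def tokenize(text):
    # Stand-in for an analyzer: lowercase and split on whitespace.
    return text.lower().split()


def build_postings(docs):
    existence = defaultdict(set)    # term -> {docnum}
    frequency = defaultdict(dict)   # term -> {docnum: count}
    positions = defaultdict(dict)   # term -> {docnum: [positions]}
    for docnum, text in docs.items():
        for pos, token in enumerate(tokenize(text)):
            existence[token].add(docnum)
            frequency[token][docnum] = frequency[token].get(docnum, 0) + 1
            positions[token].setdefault(docnum, []).append(pos)
    return existence, frequency, positions


docs = {
    10: "apple pie and apple juice",
    20: "bear",
    30: "apple bear apple",
}
existence, frequency, positions = build_postings(docs)

print(sorted(existence["bear"]))
print(frequency["apple"])
print(positions["apple"])
```

Each structure records strictly more than the previous one, which is why the richer formats take more space.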

Class name	Description
Stored	A "null" format for fields that are stored but not indexed.
Existence	Records only whether a term is in a document or not, i.e. it does not store term frequency. Useful for identifier fields (e.g. path or id) and "tag"-type fields, where the frequency is expected to always be 0 or 1.
Frequency	Stores the number of times each term appears in each document.
Positions	Stores the number of times each term appears in each document, and at what positions.

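The STORED field type uses the Stored format (which does nothing, so STORED fields are not indexed). The ID type uses the Existence format. The KEYWORD type uses the Frequency format. The TEXT type uses the Positions format if it is instantiated with phrase=True (the default), or the Frequency format if phrase=False.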

In addition, the following formats are implemented for the possible convenience of expert users, but are not currently used in Whoosh:

Class name	Description
DocBoosts	Like Existence, but also stores per-document boosts
Characters	Like Positions, but also stores the start and end character indices of each term
PositionBoosts	Like Positions, but also stores per-position boosts
CharacterBoosts	Like Positions, but also stores the start and end character indices of each term and per-position boosts

Vectors

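The main index is an inverted index: it maps terms to the documents they appear in. It is also sometimes useful to store a forward index, also known as a term vector, which maps documents to the terms that appear in them.
For example, imagine an inverted index like this for a field: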

Term	Postings
apple	`[(doc=1, freq=2), (doc=2, freq=5), (doc=3, freq=1)]`
bear	`[(doc=2, freq=7)]`

The corresponding forward index, or term vector, would be:

Doc	Postings
1	`[(text=apple, freq=2)]`
2	`[(text=apple, freq=5), (text=bear, freq=7)]`
3	`[(text=apple, freq=1)]`
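The relationship between the two tables can be expressed in a few lines of plain Python. This is a conceptual sketch using the example data above, not how Whoosh stores vectors internally; the function name is made up for the illustration.

```python
# Inverted index from the example: term -> [(docnum, freq), ...]
inverted = {
    "apple": [(1, 2), (2, 5), (3, 1)],
    "bear": [(2, 7)],
}


def to_forward(inverted_index):
    """Turn term -> postings into docnum -> [(term, freq), ...]."""
    forward = {}
    for term in sorted(inverted_index):
        for docnum, freq in inverted_index[term]:
            forward.setdefault(docnum, []).append((term, freq))
    return forward


forward = to_forward(inverted)
for docnum in sorted(forward):
    print(docnum, forward[docnum])
```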
