django - whoosh full-text search, jieba word segmentation

Ace96

Article link: https://blog.csdn.net/qq_16033847/article/details/103386220

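whoosh full-text search

A full-text search engine written in pure Python; for small sites, whoosh is sufficient
Whoosh is fast and pure Python, so it runs anywhere Python runs, with no compiler required
Compared with many other search libraries, Whoosh creates small indexes.
All text indexed by Whoosh must be unicode.
With Whoosh, you can store arbitrary Python objects alongside indexed documents.
Whoosh is not really a search engine; it is a programmer's library for building search engines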

Install third-party libraries

pip install django-haystack   # full-text search framework
pip install whoosh            # full-text search engine
pip install jieba             # Chinese word segmentation package

Enable the full-text search framework

INSTALLED_APPS = [
	...
	'haystack'
]

# Configure the full-text search engine
HAYSTACK_CONNECTIONS = {
	'default': {
		'ENGINE':
		'haystack.backends.whoosh_backend.WhooshEngine',
		'PATH': os.path.join(BASE_DIR, 'whoosh_index'),
	}
}

# Maintain the index automatically
HAYSTACK_SIGNAL_PROCESSOR='haystack.signals.RealtimeSignalProc

# Number of results per page; the default is 20
HAYSTACK_SEARCH_RESULTS_PER_PAGE = 10
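
For reference, the settings above can be collected into one fragment that runs on its own. This is a sketch under two assumptions: `BASE_DIR` is set to a placeholder here (a generated settings.py already defines it), and the signal processor is written out in full as `haystack.signals.RealtimeSignalProcessor`, the real-time processor class that django-haystack ships, since the listing above is cut off at that point.

```python
import os

# Placeholder so the fragment runs standalone; a settings.py created by
# `django-admin startproject` already defines BASE_DIR.
BASE_DIR = os.path.abspath(".")

INSTALLED_APPS = [
    # ... the project's other apps ...
    "haystack",
]

HAYSTACK_CONNECTIONS = {
    "default": {
        "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
        # Directory where Whoosh stores its index files.
        "PATH": os.path.join(BASE_DIR, "whoosh_index"),
    }
}

# Update the index whenever a model instance is saved or deleted.
HAYSTACK_SIGNAL_PROCESSOR = "haystack.signals.RealtimeSignalProcessor"

# Number of results per page (haystack's default is 20).
HAYSTACK_SEARCH_RESULTS_PER_PAGE = 10
```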

Write the model class

from django.db import models


class User(models.Model):
	username = models.CharField(max_length=100)
	password = models.CharField(max_length=32)

	class Meta:
		# Meta must be nested inside the model class to take effect
		db_table = "t_user"

Add an index class under the app

Module name: search_indexes.py (this name is fixed and must be spelled exactly)

from haystack.indexes import SearchIndex, Indexable, CharField
from .models import User


class UserSearchIndex(SearchIndex, Indexable):
	# The field that is searched; it is required and must set document=True
	text = CharField(document=True, use_template=True)
	
	def get_model(self):
		return User
		
	def index_queryset(self, using=None):
		return self.get_model().objects.all()

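Set the indexed fields

In the templates directory, create the following path

templates -> search -> indexes -> appName -> modelname_text.txt

PS: the model name is all lowercase, and the _text suffix cannot be changed because it matches the name of the index field (text). This file is mainly used when the index is built and maintained.
The file contents are as follows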

{{ object.username }}
{{ object.password }}

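username and password are the corresponding attributes of the model. This file defines which attributes are full-text searchable; not every attribute of the model needs to be listed.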
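
The naming rule for this template file can be written down as a small helper. This is only an illustrative sketch: `index_template_path` is a hypothetical function, not part of haystack, and the app label `accounts` is an assumed example.

```python
import posixpath


def index_template_path(app_label: str, model_name: str, field_name: str = "text") -> str:
    """Return the path, relative to a templates directory, of the data
    template used by an index field declared with use_template=True."""
    # The model name is lowercased; the suffix is the name of the index field.
    filename = f"{model_name.lower()}_{field_name}.txt"
    return posixpath.join("search", "indexes", app_label, filename)


print(index_template_path("accounts", "User"))
```

For the `User` model and the `text` field used in this post, the file name therefore ends in `user_text.txt`.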

Configure the route for the search endpoint

url('^search/', include('haystack.urls'))

Write the search box template page

<form action="/search/" method="get">
	<input type="text" name="q" />
	<input type="submit" value="搜索"/>
</form>

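name = q is required and cannot be changed;
the value of q is looked up in the index, and the fields that are searched are the ones listed as {{ object.xxx }} in the index field template;
the form submits to the address configured for the search route, using the GET method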
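
Submitting the form is the same as requesting the search route with a `q` query parameter. The sketch below builds such a URL with the standard library; `build_search_url` is a hypothetical helper, and `/search/` is the route assumed from the configuration above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(keyword: str, base: str = "/search/") -> str:
    """Build the GET URL that the search form would request."""
    # urlencode percent-encodes non-ASCII keywords as UTF-8.
    return f"{base}?{urlencode({'q': keyword})}"


url = build_search_url("全文检索")
print(url)

# Decoding the query string gives back the original keyword.
print(parse_qs(urlsplit(url).query)["q"][0])
```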

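Handle the search results

The search endpoint renders a template; its location and other details are as follows

templates -> search -> search.html

In search.html, read the data returned by the search

The context passed to the template commonly contains the following keys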

{
	'page': page,
	'paginator': paginator,
	'query': query,
}

page: the page object from Django's paginator
paginator: the paginator object
query: the search query
{{ page.object_list }} returns a list, which can be iterated over to display the results
```
{% for p in page.object_list %}
{{ p.object }}
{% endfor %}
```

p.object gives the model object. To highlight the matched text:

{% load highlight %}
{% for p in page.object_list %}
	{% highlight p.object.username with query %}
{% endfor %}

Add styling:

<style>
	span.highlighted { color: red;}
</style>

A span tag is used by default; this can be changed:
{% highlight p.object.username with query html_tag 'em' %}

The highlighted output shows … and omits part of the content

Workaround:

Edit haystack/utils/highlighting.py, around line 160:
Original: highlighted_chunk = '%s' % highlighted_chunk
Change to: highlighted_chunk = '%s%s' % (self.text_block[:start_offset], highlighted_chunk)
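
The effect of this change can be illustrated with a simplified stand-in. The function below is not haystack's code: `render_window` is a hypothetical helper, the `...` prefix is assumed from the symptom described above, and wrapping matches in HTML tags is left out.

```python
def render_window(text_block: str, start_offset: int, end_offset: int,
                  keep_prefix: bool = False) -> str:
    """Return the slice of text_block that is shown to the user."""
    chunk = text_block[start_offset:end_offset]
    if start_offset > 0:
        if keep_prefix:
            # Patched behaviour: put back the text before the window.
            chunk = text_block[:start_offset] + chunk
        else:
            # Assumed default behaviour: mark the cut with an ellipsis.
            chunk = "..." + chunk
    return chunk


text = "administrator_alice"
start = text.find("alice")  # the window starts at the first match

print(render_window(text, start, len(text)))                    # ...alice
print(render_window(text, start, len(text), keep_prefix=True))  # administrator_alice
```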

Chinese tokenizer Jieba

Copy haystack/backends/whoosh_backend.py into the project and rename it to whoosh_cn_backend.py
Modify the whoosh_cn_backend.py file

Import the Chinese analyzer
from jieba.analyse import ChineseAnalyzer
Find
schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=StemmingAnalyzer(), field_boost=field_class.boost, sortable=True)
Change it to
schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=ChineseAnalyzer(), field_boost=field_class.boost, sortable=True)

Update the engine in HAYSTACK_CONNECTIONS so that it points to the copied backend

HAYSTACK_CONNECTIONS = {
	'default': {
		'ENGINE': 'mwhoosh.whoosh_cn_backend.WhooshEngine',
		'PATH': os.path.join(BASE_DIR, 'whoosh_index'),
	}
}

Rebuild the index
python manage.py rebuild_index
