The Flask Mega-Tutorial, Chapter 16: Full Text Search

Kungreye


Article link: https://blog.csdn.net/Kungreye/article/details/80879699


Introduction

Add full-text search to Microblog: for a given search term, return all posts that contain it, sorted by relevance in descending order.

Intro to Full-Text Search Engines

1. Open-source full-text search engines:

Elasticsearch
Apache Solr
Whoosh
Xapian
Sphinx

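2. Databases with search capabilities:

SQLite, MySQL, PostgreSQL
MongoDB, CouchDB

Relational databases do offer search features, but SQLAlchemy does not support them, so you would have to write raw SQL yourself, or find a library that provides high-level access to text search and works alongside SQLAlchemy.

Elasticsearch, one member of the ELK stack (Elasticsearch, Logstash, Kibana; used for indexing logs), is very popular and is the engine chosen for this project.

Note: keep all text indexing and searching functions in a separate module. If the search engine is replaced later, only the functions in that module need to be rewritten.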
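To make the idea of a swappable search module concrete, here is a minimal in-memory sketch. It is not the Elasticsearch-backed module from this chapter: the function names, their simplified signatures, and the word-overlap score are all assumptions chosen for illustration. The point is only that callers depend on three functions, so replacing the engine means rewriting one file.

```python
# Hypothetical stand-in for a search module. The rest of the application would
# call only these three functions, so swapping engines touches just this file.
from collections import defaultdict

_index = defaultdict(dict)  # index name -> {document id: text}


def add_to_index(index, doc_id, text):
    """Add a document, or overwrite it if the id already exists."""
    _index[index][doc_id] = text


def remove_from_index(index, doc_id):
    """Remove a document; ids that were never indexed are ignored."""
    _index[index].pop(doc_id, None)


def query_index(index, query):
    """Return the ids of matching documents, best match first.

    The score is the number of distinct query words found in the text,
    a crude stand-in for the relevance scoring of a real engine.
    """
    words = set(query.lower().split())
    scored = []
    for doc_id, text in _index[index].items():
        score = len(words & set(text.lower().split()))
        if score:
            scored.append((score, doc_id))
    # Higher score first; ties are broken by the lower id for a stable result.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


add_to_index('posts', 1, 'this is a test')
add_to_index('posts', 2, 'a second test')
add_to_index('posts', 3, 'nothing relevant here')
results = query_index('posts', 'this test')
print(results)  # [1, 2]
```

Document 1 contains both query words and document 2 only one, so document 1 ranks first; document 3 matches nothing and is left out.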

Installing Elasticsearch

1. Before installing Elasticsearch, install JDK 8 first

How to Install Java 8 on Debian 9/8/7 via PPA
How to Install JAVA 8 on Ubuntu 18.04/16.04, LinuxMint 18/17
Two ways to install JDK 7 / JDK 8 on Ubuntu

1-1 Add Java 8 PPA

Create a new apt configuration file, /etc/apt/sources.list.d/java-8-debian.list:

sudo vim /etc/apt/sources.list.d/java-8-debian.list

Add the following content:

deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main
deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main

Import the GPG key (used to verify packages before installation):

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys EEA14886

1-2 Install Java 8

sudo apt-get update
sudo apt-get install oracle-java8-installer

1-3 Verify the Java installation

Set the default version:

sudo apt-get install oracle-java8-set-default

The apt repository provides the package oracle-java8-set-default, which sets Java 8 as the default Java version.

Check the version:

$ java -version

java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

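1-4 Set the JAVA_HOME and JRE_HOME environment variables

Edit the environment variables for the current user:

vim ~/.bashrc

For a system-wide setting, edit /etc/environment instead.
Append the following to ~/.bashrc: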

# set oracle jdk environment
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JRE_HOME=${JAVA_HOME}/jre

Apply the environment variables immediately:

source ~/.bashrc

2. Install Elasticsearch

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.0.deb
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.0.deb.sha512
shasum -a 512 -c elasticsearch-6.3.0.deb.sha512 
sudo dpkg -i elasticsearch-6.3.0.deb

The shasum command compares the SHA-512 of the downloaded Debian package with the published checksum; the output should be elasticsearch-{version}.deb: OK.
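For readers who want to see what that check does, the sketch below reproduces it with Python's standard hashlib. It assumes the checksum file holds a single line of the form "<hex digest>  <file name>", and it runs on a throwaway file rather than the real package; the helper names are made up for this example.

```python
import hashlib
import os
import tempfile


def sha512_of(path, chunk_size=65536):
    """Return the hex SHA-512 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha512()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(package_path, checksum_path):
    """Compare a file's digest with the first field of the checksum file."""
    with open(checksum_path) as f:
        expected = f.read().split()[0].lower()
    return sha512_of(package_path) == expected


# Demo on a temporary file instead of the real .deb package.
with tempfile.TemporaryDirectory() as tmp:
    package = os.path.join(tmp, 'demo.deb')
    checksum = package + '.sha512'
    with open(package, 'wb') as f:
        f.write(b'pretend package contents')
    with open(checksum, 'w') as f:
        f.write('%s  demo.deb\n' % sha512_of(package))
    ok = matches_checksum_file(package, checksum)

    # Any change to the file makes the comparison fail.
    with open(package, 'ab') as f:
        f.write(b'tampered')
    tampered_ok = matches_checksum_file(package, checksum)

print(ok, tampered_ok)  # True False
```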


3. Start / stop Elasticsearch

Running / Stopping Elasticsearch with systemd

sudo systemctl start elasticsearch.service
sudo systemctl stop elasticsearch.service

To start Elasticsearch automatically at boot:

sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable elasticsearch.service

Verify that Elasticsearch is running:

http://localhost:9200


4. Install the Python client for Elasticsearch

(venv) $ pip install elasticsearch

Note: update requirements.txt.

Elasticsearch Tutorial

1. Create an Elasticsearch connection

>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch('http://localhost:9200')

Instantiate the client, passing the URL as an argument.

2. Write data (JSON) to an index: es.index()

>>> es.index(index='my_index', doc_type='my_index', id=1, body={'text': 'this is a test'})
>>> es.index(index='my_index', doc_type='my_index', id=2, body={'text': 'a second test'})

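index: the storage container in Elasticsearch.
doc_type: the document type; one index can store documents of several types.
id: the unique identifier of the document.
body: a JSON object with the data, made up of field names and their values.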

3. Search: es.search()

>>> es.search(index='my_index', doc_type='my_index',
... body={'query': {'match': {'text': 'this test'}}})

Note the format of body: {'query': {'match': {<field>: <expression>}}}

The response is a Python dict:

{
    'took': 1,
    'timed_out': False,
    '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
    'hits': {
        'total': 2, 
        'max_score': 0.5753642, 
        'hits': [
            {
                '_index': 'my_index',
                '_type': 'my_index',
                '_id': '1',
                '_score': 0.5753642,
                '_source': {'text': 'this is a test'}
            },
            {
                '_index': 'my_index',
                '_type': 'my_index',
                '_id': '2',
                '_score': 0.25316024,
                '_source': {'text': 'a second test'}
            }
        ]
    }
}
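A short sketch of how such a response can be unpacked. The dict below is a trimmed copy of the response shown above, and summarize() is a hypothetical helper; the shape is the one returned by the 6.x setup in these notes, and other Elasticsearch versions may structure hits.total differently.

```python
# Trimmed copy of the response shown above (Elasticsearch 6.x shape).
response = {
    'hits': {
        'total': 2,
        'max_score': 0.5753642,
        'hits': [
            {'_id': '1', '_score': 0.5753642,
             '_source': {'text': 'this is a test'}},
            {'_id': '2', '_score': 0.25316024,
             '_source': {'text': 'a second test'}},
        ],
    },
}


def summarize(response):
    """Return (id, score, text) tuples in the order Elasticsearch ranked them."""
    return [(int(hit['_id']), hit['_score'], hit['_source']['text'])
            for hit in response['hits']['hits']]


rows = summarize(response)
total = response['hits']['total']
for doc_id, score, text in rows:
    print(doc_id, score, text)
```

The hits arrive already sorted by _score in descending order, so the first tuple is the best match. Note that _id comes back as a string, which is why it is converted with int().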

4. Delete an index

>>> es.indices.delete('my_index')

Note: to delete a single document by its id instead:

es.delete(index=index, doc_type=index, id=<id>)

Elasticsearch Configuration

1. ELASTICSEARCH_URL

config.py: elasticsearch configuration.

import os

class Config(object):
    # ...
    ELASTICSEARCH_URL = os.environ.get('ELASTICSEARCH_URL')

Update .env:

ELASTICSEARCH_URL=http://localhost:9200

2. Initialize Elasticsearch

Elasticsearch is not a Flask extension, so it cannot be instantiated in the global scope, where no app instance (and therefore no configuration) is available yet.

app/__init__.py: Elasticsearch instance.

# ...
from elasticsearch import Elasticsearch

# ...

def create_app(config_class=Config):
    app = Flask(__name__)
    app.config.from_object(config_class)

    # ...
    app.elasticsearch = Elasticsearch([app.config['ELASTICSEARCH_URL']]) \
        if app.config['ELASTICSEARCH_URL'] else None

    # ...

If the ELASTICSEARCH_URL environment variable is not configured, app.elasticsearch is None.
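The "client or None" pattern can be tried without Flask or Elasticsearch installed. In this sketch FakeClient stands in for the real Elasticsearch class, and make_search_client(), query(), and the environment variable name are hypothetical; only the conditional mirrors the create_app() code above.

```python
import os


class FakeClient:
    """Hypothetical stand-in for elasticsearch.Elasticsearch."""

    def __init__(self, url):
        self.url = url


def make_search_client(url, factory=FakeClient):
    """Return a client when a URL is configured, otherwise None."""
    return factory(url) if url else None


def query(client, expression):
    """Callers check for a missing client and degrade to empty results."""
    if not client:
        return [], 0
    return ['(would search %s for %r)' % (client.url, expression)], 1


configured = make_search_client('http://localhost:9200')
# os.environ.get() returns None for a variable that is not set.
missing = make_search_client(os.environ.get('MICROBLOG_DEMO_UNSET_SEARCH_URL'))

print(configured.url)          # http://localhost:9200
print(missing)                 # None
print(query(missing, 'test'))  # ([], 0)
```

With this design the application still runs when no search service is configured; search simply returns nothing.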

A Full-Text Search Abstraction

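Purpose of the abstraction:

Not tied to Elasticsearch, so the search engine can be swapped easily.
Generic over models, not limited to Post.

1. Add __searchable__ = [] to the model

Add a __searchable__ attribute to each model that needs indexing, listing the fields to include in the index.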

app/models.py:

class Post(db.Model):
    __searchable__ = ['body']
    # ...

Note: __searchable__ is just a variable. It has no behavior of its own and only helps the functions shown later.

2. Wrap the search calls in app/search.py

from flask import current_app

def add_to_index(index, model):
    if not current_app.elasticsearch:
        return
    payload = {}
    for field in model.__searchable__:
        payload[field] = getattr(model, field)
    current_app.elasticsearch.index(index=index, doc_type=index, id=model.id,
                                    body=payload)

def remove_from_index(index, model):
    if not current_app.elasticsearch:
        return
    current_app.elasticsearch.delete(index=index, doc_type=index, id=model.id)

def query_index(index, query, page, per_page):
    if not current_app.elasticsearch:
        return [], 0
    search = current_app.elasticsearch.search(
        index=index, doc_type=index,
        body={'query': {'multi_match': {'query': query, 'fields': ['*']}},
              'from': (page - 1) * per_page, 'size': per_page})
    ids = [int(hit['_id']) for hit in search['hits']['hits']]
    return ids, search['hits']['total']

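The application talks to Elasticsearch only through app/search.py, which makes it easier to switch search engines later.

Notes:

id=model.id: the document id in Elasticsearch is the same as the SQLAlchemy primary key, which later makes targeted deletes and the CASE-based ordering of search results straightforward.
add_to_index() handles both add and update: indexing an existing id replaces the stored document.
multi_match searches across multiple fields.
'fields': ['*'] tells Elasticsearch to look in all the fields (those listed in __searchable__), i.e. to search the entire index. This keeps the function generic, since different models can have different field names in the index.
There is no equivalent of SQLAlchemy's paginate() here, so the offset has to be computed by hand: 'from': (page - 1) * per_page.
A list comprehension extracts the ids from the hits.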
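The pagination arithmetic is easy to check in isolation. The helper below is a hypothetical extraction of the body that query_index() builds; the dict layout is copied from the function above, while the function name and the argument validation are additions for this sketch.

```python
def build_search_body(query, page, per_page):
    """Build the search body used by query_index() for a 1-based page number."""
    if page < 1 or per_page < 1:
        raise ValueError('page and per_page must be positive')
    return {
        'query': {'multi_match': {'query': query, 'fields': ['*']}},
        'from': (page - 1) * per_page,  # number of results to skip
        'size': per_page,               # number of results to return
    }


first = build_search_body('one two three', page=1, per_page=10)
third = build_search_body('one two three', page=3, per_page=10)
print(first['from'], first['size'])  # 0 10
print(third['from'], third['size'])  # 20 10
```

Page 1 skips nothing, and page 3 with 10 results per page skips the first 20.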

3. Test

Before testing, add some posts first.

>>> from app.search import add_to_index, remove_from_index, query_index
>>> for post in Post.query.all():
...     add_to_index('posts', post)
>>> query_index('posts', 'one two three four five', 1, 100)
([15, 13, 12, 4, 11, 8, 14], 7)

Clean up the test data:

>>> app.elasticsearch.indices.delete('posts')

Integrating Searches with SQLAlchemy

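The approach in app/search.py has two problems:

1. query_index() returns ids, not model objects

We want model objects directly, so that they can be passed to templates for rendering.

Solution: build a SQL query from the ids to load the corresponding model objects.

2. add_to_index() and remove_from_index() must be called explicitly whenever posts are added or removed

This is error-prone: a single missed call leaves Elasticsearch and the SQLAlchemy database increasingly out of sync.

Solution: use SQLAlchemy events to listen on db.session, so that Elasticsearch is updated automatically whenever the SQLAlchemy database changes.

To solve both problems, create a mixin class: SearchableMixin.

The mixin acts as the glue layer between SQLAlchemy and Elasticsearch.
A model that inherits from SearchableMixin automatically manages its associated full-text index.

1. app/models.py: SearchableMixin class.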

from app.search import add_to_index, remove_from_index, query_index

class SearchableMixin(object):
    @classmethod
    def search(cls, expression, page, per_page):
        ...  # body shown in 1-1

    @classmethod
    def before_commit(cls, session):
        ...  # body shown in 1-2

    @classmethod
    def after_commit(cls, session):
        ...  # body shown in 1-2

    @classmethod
    def reindex(cls):
        ...  # body shown in 1-3

1-1 search()

    @classmethod
    def search(cls, expression, page, per_page):
        ids, total = query_index(cls.__tablename__, expression, page, per_page)
        if total == 0:
            return cls.query.filter_by(id=0), 0
        when = []
        for i in range(len(ids)):
            when.append((ids[i], i))
        return cls.query.filter(cls.id.in_(ids)).order_by(
            db.case(when, value=cls.id)), total

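search() calls query_index() from app/search.py with index = cls.__tablename__.
when = [(ids[i], i), ...] pairs each id with its position in the result list.
In the returned cls.query.filter(), cls.id.in_(ids) is a SQLAlchemy column expression (note: this requires filter(), not filter_by()).
In order_by(), the CASE expression compares value (cls.id) with the id in each tuple of when; where cls.id == ids[i], it yields i as the sort key.

As a result, search() returns the model objects in the same order as the ids, i.e. by relevance.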
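The CASE trick can be seen in plain SQL with the standard sqlite3 module. This sketch does not use SQLAlchemy; the table, its rows, and the id list are made up for the example, and the statement is only an approximation of what the order_by(db.case(...)) call above asks the database to do.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE post (id INTEGER PRIMARY KEY, body TEXT)')
conn.executemany('INSERT INTO post (id, body) VALUES (?, ?)',
                 [(i, 'post %d' % i) for i in range(1, 6)])

ids = [4, 1, 3]  # relevance order, as a search engine might return it

# Build "CASE id WHEN ? THEN ? ... END" with bound parameters: each id is
# mapped to its position in the list, and that position is the sort key.
whens = ' '.join('WHEN ? THEN ?' for _ in ids)
placeholders = ', '.join('?' for _ in ids)
sql = ('SELECT id FROM post WHERE id IN (%s) ORDER BY CASE id %s END'
       % (placeholders, whens))

params = list(ids)  # parameters for the IN (...) clause come first
for position, post_id in enumerate(ids):
    params.extend([post_id, position])

ordered = [row[0] for row in conn.execute(sql, params)]
print(ordered)  # [4, 1, 3]
conn.close()
```

Without the ORDER BY CASE clause the database is free to return the rows in any order, typically by primary key, which would lose the relevance ranking.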

1-2 before_commit and after_commit

    @classmethod
    def before_commit(cls, session):
        session._changes = {
            'add': list(session.new),
            'update': list(session.dirty),
            'delete': list(session.deleted)
        }

    @classmethod
    def after_commit(cls, session):
        for obj in session._changes['add']:
            if isinstance(obj, SearchableMixin):
                add_to_index(obj.__tablename__, obj)
        for obj in session._changes['update']:
            if isinstance(obj, SearchableMixin):
                add_to_index(obj.__tablename__, obj)
        for obj in session._changes['delete']:
            if isinstance(obj, SearchableMixin):
                remove_from_index(obj.__tablename__, obj)
        session._changes = None

Notes:

before_commit saves the pending objects of db.session in session._changes just before the commit.
Once the session is committed, these objects can no longer be obtained from the session attributes (session.new / session.dirty / session.deleted), which is why they are saved beforehand.
session._changes holds every changed object, not only instances of models that use SearchableMixin.
After the commit, after_commit therefore checks whether each obj in session._changes is an instance of SearchableMixin.
after_commit passes index=obj.__tablename__ to add_to_index() and remove_from_index(), not cls.__tablename__. The handlers are registered on SearchableMixin and run for every commit, whichever model changed, so cls does not identify the model of the object being processed; obj.__tablename__ always refers to the model that was actually committed.
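The two-step hand-off between the handlers can be simulated without SQLAlchemy. Everything in this sketch is a simplified assumption: FakeSession is not the real session API, the handlers are plain functions, and "indexing" just appends to a list. It only illustrates why the snapshot must be taken before the commit and why the isinstance() check is needed.

```python
class FakeSession:
    """Hypothetical, heavily simplified stand-in for a SQLAlchemy session."""

    def __init__(self):
        self.new, self.dirty, self.deleted = [], [], []
        self.listeners = {'before_commit': [], 'after_commit': []}

    def commit(self):
        for fn in self.listeners['before_commit']:
            fn(self)
        # As with a real commit, the pending state is gone afterwards.
        self.new, self.dirty, self.deleted = [], [], []
        for fn in self.listeners['after_commit']:
            fn(self)


class Searchable:
    """Marker base class playing the role of SearchableMixin."""


class Post(Searchable):
    __tablename__ = 'post'


class User:  # not searchable, must be skipped
    __tablename__ = 'user'


index_calls = []  # records what would be sent to the search engine


def before_commit(session):
    session._changes = {'add': list(session.new),
                        'update': list(session.dirty),
                        'delete': list(session.deleted)}


def after_commit(session):
    for obj in session._changes['add'] + session._changes['update']:
        if isinstance(obj, Searchable):
            index_calls.append(('index', obj.__tablename__))
    for obj in session._changes['delete']:
        if isinstance(obj, Searchable):
            index_calls.append(('remove', obj.__tablename__))
    session._changes = None


session = FakeSession()
session.listeners['before_commit'].append(before_commit)
session.listeners['after_commit'].append(after_commit)
session.new = [Post(), User()]
session.deleted = [Post()]
session.commit()
print(index_calls)  # [('index', 'post'), ('remove', 'post')]
```

The User object is part of the same commit but never reaches the index, and the table name is always read from the object itself.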

1-3 reindex

    @classmethod
    def reindex(cls):
        for obj in cls.query:
            add_to_index(cls.__tablename__, obj)

Adds all the model objects in the database to the search index.
Note: iterating over cls.query yields the same objects as iterating over cls.query.all().

1-4 db.event

SQLAlchemy events

Signature: sqlalchemy.event.listen(target, identifier, fn, *args, **kw)

db.event.listen(db.session, 'before_commit', SearchableMixin.before_commit)
db.event.listen(db.session, 'after_commit', SearchableMixin.after_commit)

Test

>>> Post.reindex()

>>> query, total = Post.search('one two three four five', 1, 5)
>>> total
7
>>> query.all()
[<Post five>, <Post two>, <Post one>, <Post one more>, <Post one>]

Note: the returned query is a SQLAlchemy query object, so query.all() can be called on it:

query = cls.query.filter(cls.id.in_(ids)).order_by(db.case(when, value=cls.id))

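The search term should travel in the URL as a q argument, so that search results can be reached directly through a link, as in https://www.google.com/search?q=python

For the term submitted by the client to appear in the URL's query string, the request method must be GET.

POST is used to submit the form data of the app's forms (as shown in earlier chapters).
GET is the request method used when a URL is typed into the browser or a link is clicked.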
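What "the search term in the query string" looks like can be shown with the standard urllib.parse module. The search_url() helper and the base URL are assumptions for illustration; in the application itself Flask builds and parses these URLs.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(base, term, page=None):
    """Build a search link with the term (and optionally the page) encoded."""
    params = {'q': term}
    if page is not None:
        params['page'] = page
    return base + '?' + urlencode(params)


url = search_url('http://localhost:5000/search', 'flask search')
print(url)  # http://localhost:5000/search?q=flask+search

# The receiving side recovers the original term from the query string.
term = parse_qs(urlsplit(url).query)['q'][0]
print(term)  # flask search
```

Because the whole search is described by the URL, the link can be bookmarked or shared, which would not be possible with a POST submission.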

1. Create the form: app/main/forms.py: Search form.

from flask import request

class SearchForm(FlaskForm):
    q = StringField(_l('Search'), validators=[DataRequired()])

    def __init__(self, *args, **kwargs):
        if 'formdata' not in kwargs:
            kwargs['formdata'] = request.args
        if 'csrf_enabled' not in kwargs:
            kwargs['csrf_enabled'] = False
        super(SearchForm, self).__init__(*args, **kwargs)

The form has only one text field, q, and no submit button. For a form that has a text field, the browser will submit the form when you press Enter with the focus on the field, so a button is not needed.
formdata determines where Flask-WTF looks for the form submission; the default is request.form. For a GET request it is set to request.args, so that Flask-WTF reads the form data from the query string.
csrf_enabled: forms have CSRF protection by default, implemented through a CSRF token added to the form ({{ form.hidden_tag() }}). For clickable search links to work, CSRF validation has to be bypassed. Newer Flask-WTF releases deprecate csrf_enabled in favor of meta={'csrf': False}.

2. Display the search form (visible on all pages except error pages)

The conventional approach: create a form object in every route, then pass the form to all the templates.

Instead, instantiate it once in a before_request handler: g.search_form = SearchForm()

app / main / routes.py:

from flask import g
from app.main.forms import SearchForm

@bp.before_app_request
def before_request():
    if current_user.is_authenticated:
        current_user.last_seen = datetime.utcnow()
        db.session.commit()
        g.search_form = SearchForm()
    g.locale = str(get_locale())

g is tied to the request and lives for the whole life of that request, and so does the search_form attached to it.
When the before_request handler ends and the view function for the requested URL is invoked, g is unchanged.
The g variable is specific to each request and each client, so even if your web server is handling multiple requests at a time for different clients, you can still rely on g to work as private storage for each request, independently of what goes on in other requests that are handled concurrently.
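The isolation described here can be illustrated, by analogy only, with the standard contextvars module and two threads playing the role of concurrent requests. This is not a description of how Flask implements g; the variable name, the handle_request() function, and the client names are invented for the sketch.

```python
import contextvars
import threading

# Plays the role of an attribute stored on g.
current_search_term = contextvars.ContextVar('current_search_term',
                                             default=None)
seen = {}
barrier = threading.Barrier(2)


def handle_request(client, term):
    current_search_term.set(term)  # comparable to g.search_form = ...
    barrier.wait(timeout=5)        # both "requests" are now in flight
    # Each thread reads back its own value, not the other thread's.
    seen[client] = current_search_term.get()


threads = [threading.Thread(target=handle_request, args=args)
           for args in [('alice', 'flask'), ('bob', 'python')]]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(seen == {'alice': 'flask', 'bob': 'python'})  # True
print(current_search_term.get())                    # None
```

Each thread runs in its own context, so a value set while handling one request is invisible to the other and to the main thread.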

3. Insert g.search_form into app/templates/base.html

            ...
            <div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
                <ul class="nav navbar-nav">
                    ... home and explore links ...
                </ul>
                {% if g.search_form %}
                <form class="navbar-form navbar-left" method="get"
                        action="{{ url_for('main.search') }}">
                    <div class="form-group">
                        {{ g.search_form.q(size=20, class='form-control',
                            placeholder=g.search_form.q.label.text) }}
                    </div>
                </form>
                {% endif %}
                ...

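The {% if g.search_form %} check renders the form only when g.search_form exists.
method="get": the form data must be submitted through a GET request so that it ends up in the query string.
action="{{ url_for('main.search') }}": the earlier forms had an empty action because they were submitted to the same page that rendered the form. The search form appears on every page, so it must state explicitly where it is submitted.
The action attribute specifies the URL that receives the form submission.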

Search View Function

1. Create the view function that handles search requests (http://localhost:5000/search?q=search-words)

app/main/routes.py: search view function.

@bp.route('/search')
@login_required
def search():

    if not g.search_form.validate():
        return redirect(url_for('main.explore'))
    # just validate field values, without checking how the data was submitted. 

    page = request.args.get('page', 1, type=int)
    per_page = current_app.config['POSTS_PER_PAGE']

    posts, total = Post.search(g.search_form.q.data, page, per_page)

    next_url = url_for('main.search', q=g.search_form.q.data, page=page+1) \
        if total > page * per_page else None
    prev_url = url_for('main.search', q=g.search_form.q.data, page=page-1) \
        if page > 1 else None

    return render_template('search.html', title=_('Search'), posts=posts,
                            next_url=next_url, prev_url=prev_url)

# url_for() only builds the URL; the browser issues the GET request when the link is followed.
# q is the query-string argument in http://localhost:5000/search?q=search-words, just like Google.

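form.validate() only validates the field values and does not check how the data was submitted (form.validate_on_submit() additionally requires a submitting method such as POST).
The search() classmethod of SearchableMixin, called as Post.search(), returns the search results together with the total number of matches.
The submitted g.search_form.q.data serves as the query expression.
page and per_page are handled as in the other view functions.
The second return value, total, decides whether there is a next page: next_url is set only if total > page * per_page.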
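The next/previous logic can be checked on its own with a small pure function. page_links() and the hard-coded /search path are assumptions for this sketch; the two conditions are the same ones used in the view function above, with urlencode standing in for url_for().

```python
from urllib.parse import urlencode


def page_links(term, page, per_page, total, path='/search'):
    """Return (prev_url, next_url); either is None when there is no such page."""

    def link(target_page):
        return path + '?' + urlencode({'q': term, 'page': target_page})

    next_url = link(page + 1) if total > page * per_page else None
    prev_url = link(page - 1) if page > 1 else None
    return prev_url, next_url


# 7 matches at 5 per page: page 1 has a next page, page 2 is the last one.
first = page_links('flask', page=1, per_page=5, total=7)
last = page_links('flask', page=2, per_page=5, total=7)
print(first)  # (None, '/search?q=flask&page=2')
print(last)   # ('/search?q=flask&page=1', None)
```

Carrying q along in both links is what keeps the search term alive while the user moves between result pages.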

2. Create the search.html template

app / templates / search.html: search results template.

{% extends "base.html" %}

{% block app_content %}
    <h1>{{ _('Search Results') }}</h1>
    {% for post in posts %}
        {% include '_post.html' %}
    {% endfor %}
    <nav aria-label="...">
        <ul class="pager">
            <li class="previous{% if not prev_url %} disabled{% endif %}">
                <a href="{{ prev_url or '#' }}">
                    <span aria-hidden="true">&larr;</span>
                    {{ _('Previous results') }}
                </a>
            </li>
            <li class="next{% if not next_url %} disabled{% endif %}">
                <a href="{{ next_url or '#' }}">
                    {{ _('Next results') }}
                    <span aria-hidden="true">&rarr;</span>
                </a>
            </li>
        </ul>
    </nav>
{% endblock %}

