Building a full-text index application for CSDN blogs
–2018.08.11
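- The first problem to solve is the interface Python uses to access ElasticSearch
- On top of the Django web framework, forward user requests to ElasticSearch and return the results
- Every search keyword a user submits needs to be saved
- Concurrency and reliability need to be guaranteed
Two clients are available for talking to ElasticSearch from Python:
- elasticsearch-py
- elasticsearch-dsl
elasticsearch-dsl is built on top of elasticsearch-py and, by comparison, fits the habits of Python users better.
elasticsearch-py is more flexible and easier to extend.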
Since I was using Django, which is written in Python, it was easy to interact with ElasticSearch. There are two client libraries for interacting with ElasticSearch from Python. There's elasticsearch-py, which is the official low-level client. And there's elasticsearch-dsl, which is built upon the former but gives a higher-level abstraction with a bit less functionality.
elasticsearch-py is used as follows:
from datetime import datetime
from elasticsearch import Elasticsearch

# Connect to the default node at localhost:9200
es = Elasticsearch()
doc = {
    'author': 'kimchy',
    'text': 'Elasticsearch: cool. bonsai cool.',
    'timestamp': datetime.now(),
}
# Index the document under id 1
res = es.index(index="test-index", doc_type='tweet', id=1, body=doc)
print(res['result'])
# Fetch it back by id
res = es.get(index="test-index", doc_type='tweet', id=1)
print(res['_source'])
# Refresh so the document becomes visible to search
es.indices.refresh(index="test-index")
res = es.search(index="test-index", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print("%(timestamp)s %(author)s: %(text)s" % hit["_source"])
The query part stays closer to the native ElasticSearch DSL syntax.
When debugging ElasticSearch queries in Kibana, we usually write them in the syntax taught in the ElasticSearch documentation, and that syntax runs directly under elasticsearch-py:
from elasticsearch import Elasticsearch

client = Elasticsearch()
# Note: the "filtered" query was removed in Elasticsearch 5.0,
# so this body only runs as-is against older clusters.
response = client.search(
    index="my-index",
    body={
        "query": {
            "filtered": {
                "query": {
                    "bool": {
                        "must": [{"match": {"title": "python"}}],
                        "must_not": [{"match": {"description": "beta"}}]
                    }
                },
                "filter": {"term": {"category": "search"}}
            }
        },
        "aggs": {
            "per_tag": {
                "terms": {"field": "tags"},
                "aggs": {
                    "max_lines": {"max": {"field": "lines"}}
                }
            }
        }
    }
)
for hit in response['hits']['hits']:
    print(hit['_score'], hit['_source']['title'])
for tag in response['aggregations']['per_tag']['buckets']:
    print(tag['key'], tag['max_lines']['value'])
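Because the "filtered" query no longer exists from Elasticsearch 5.0 onward, the same search can be expressed with a plain bool query that carries a filter clause. The following is a minimal sketch of that rewrite; the function name build_search_body and its parameters are made up for illustration and are not part of any library.

```python
def build_search_body(title, excluded_description, category):
    """Build a body equivalent to the "filtered" example, using bool + filter."""
    return {
        "query": {
            "bool": {
                # Scored clauses, same as the inner bool of the original query
                "must": [{"match": {"title": title}}],
                "must_not": [{"match": {"description": excluded_description}}],
                # Non-scoring clause that replaces the outer "filter"
                "filter": [{"term": {"category": category}}],
            }
        },
        "aggs": {
            "per_tag": {
                "terms": {"field": "tags"},
                "aggs": {"max_lines": {"max": {"field": "lines"}}},
            }
        },
    }


body = build_search_body("python", "beta", "search")
print(body["query"]["bool"]["filter"])
```

The resulting dictionary can be passed as body= to client.search in the same way as the example above.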
elasticsearch-dsl is used as follows; it wraps the DSL in a layer of functions:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch()
s = Search(using=client, index="my-index") \
    .filter("term", category="search") \
    .query("match", title="python") \
    .exclude("match", description="beta")
s.aggs.bucket('per_tag', 'terms', field='tags') \
    .metric('max_lines', 'max', field='lines')
response = s.execute()
for hit in response:
    print(hit.meta.score, hit.title)
for tag in response.aggregations.per_tag.buckets:
    print(tag.key, tag.max_lines.value)
Seen this way, elasticsearch-py provides the full set of infrastructure, including the DSL, while elasticsearch-dsl only wraps the search functionality and itself still depends on the low-level framework that elasticsearch-py provides.
try:
    import _pickle as pickle
except ImportError:
    import pickle
import os
from elasticsearch import Elasticsearch


class loadJson(object):
    def loadAllFiles(self, path):
        # Unpickle every file in the directory and index its content
        for name in os.listdir(path):
            filename = os.path.join(path, name)
            with open(filename, 'rb') as filehandler:
                jsonObj = pickle.load(filehandler)
            self.saveToElasticSearch(jsonObj)
            print(jsonObj)

    def saveToElasticSearch(self, doc):
        es = Elasticsearch("http://192.168.1.112:9200")
        es.index(index="csdnblog", doc_type="CSDNPost", body=doc)


utlLoader = loadJson()
utlLoader.loadAllFiles("G:\\SideProjects\\CSDN_Blogs\\PostThread\\")
The code above takes the blogs I crawled from CSDN, which were saved as local Json files, deserializes those files, and finally stores them in ElasticSearch for full-text indexing.
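The loader above sends one request per document. If the number of posts grows, one option is to batch them with the bulk helper from elasticsearch-py (elasticsearch.helpers.bulk). The sketch below is only an illustration under a few assumptions: iter_actions and bulk_load are hypothetical names, the "_type" field mirrors the doc_type used above and only applies to clusters that still support mapping types, and the demo at the bottom uses fake posts in a temporary directory so that it runs without a cluster.

```python
import os
import pickle
import tempfile


def iter_actions(path, index, doc_type):
    """Yield one bulk action per pickled document found in `path`."""
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name), "rb") as handle:
            doc = pickle.load(handle)
        yield {"_index": index, "_type": doc_type, "_source": doc}


def bulk_load(es, path, index="csdnblog", doc_type="CSDNPost"):
    """Send all actions through the bulk helper (needs a reachable cluster)."""
    # Imported lazily so the demo below runs without the client installed
    from elasticsearch import helpers
    return helpers.bulk(es, iter_actions(path, index, doc_type))


# Demo without a cluster: pickle two fake posts into a temporary directory
demo_dir = tempfile.mkdtemp()
for i, post in enumerate([{"title": "a"}, {"title": "b"}]):
    with open(os.path.join(demo_dir, "post%d.pkl" % i), "wb") as handle:
        pickle.dump(post, handle)

actions = list(iter_actions(demo_dir, "csdnblog", "CSDNPost"))
print(len(actions), actions[0]["_source"])
```

With a live cluster, bulk_load(Elasticsearch("http://192.168.1.112:9200"), path) would replace the per-document es.index calls.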
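Installing the ElasticSearch client
Before implementing the functionality above, we still have to install the ElasticSearch client into the Django project that lives in a virtualenv.
Go to the virtualenv directory, activate the virtualenv environment, and install the elasticsearch client: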
activate.bat
pip3 install elasticsearch
pip3 list
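Once the installation finishes, use pip3 list to check the installed packages.
What gets installed at this point is the low-level elasticsearch client, the one whose usage stays close to the elasticsearch DSL syntax; elasticsearch-dsl is a library built on top of it. To install that one, just append the -dsl suffix to the package name: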
pip3 install elasticsearch-dsl
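Besides reading the output of pip3 list, the installation can also be checked from inside the virtualenv's Python interpreter. This is a small sketch using only the standard library; is_installed is a hypothetical helper name, and note that the elasticsearch-dsl package is imported as elasticsearch_dsl.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


for name in ("elasticsearch", "elasticsearch_dsl"):
    print(name, "installed" if is_installed(name) else "missing")
```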
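Providing an entry point to elasticsearch
In the earlier Django project, we already got requests and view functions working together under SqlHub. Building on that, add a form to Index.html that points to a view function we are about to create, which returns the results requested from elasticsearch.
The key is to create the FullTextSearch action in SqlHub\Index.html and to set up the matching view function fulltextsearch in views.py so that it can display the results.
Creating the search form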
<form action="/SqlHub/FullTextSearch" method="post">
    {% csrf_token %}
    Search Key Word:<input type="text" name="keyword"><br>
    <input type="submit">
</form>
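References:
https://medium.freecodecamp.org/elasticsearch-with-django-the-easy-way-909375bc16cb
This article shows how to implement CRUD operations with elasticsearch-dsl. It also shows that a Django project needs no elasticsearch configuration: installing the elasticsearch library and importing it correctly is enough.
I only used it as a reference here, so this post sticks to plain elasticsearch-py.
https://elasticsearch-py.readthedocs.io/en/master/
This is the official documentation for the elasticsearch Python client. Everything about accessing elasticsearch from Python can be found there.
Implementing a simple view function for elasticsearch full-text search
This view function receives the request the user submits and hands it to elasticsearch. Once the results come back, it renders the elasticsearch results page (es.html) to display them.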
from django.shortcuts import render
from SqlHub.models import SqlNew
from django.http import HttpResponseRedirect
import datetime
from elasticsearch import Elasticsearch


def archive(request):
    posts = SqlNew.objects.all()
    curtime = datetime.datetime.now()
    context = {"posts": posts, "curtime": curtime}
    return render(request, 'Index.html', context)


def newone(request):
    curtime = datetime.datetime.now()
    oneblog = SqlNew()
    oneblog.title = request.POST["title"]
    oneblog.body = request.POST["body"]
    oneblog.timestamp = curtime
    oneblog.save()
    return HttpResponseRedirect('/SqlHub')


def fulltextsearch(request):
    # Hosts are passed as a list of "host:port" strings
    es = Elasticsearch(["192.168.1.10:9200"])
    # The search term is hard-coded to "cluster" for now;
    # the keyword posted by the form is not used yet.
    ret = es.search(
        index="csdnblog2",
        body={
            "query": {
                "term": {"pageContent": "cluster"}
            }
        }
    )
    resultback = ret["hits"]["hits"]
    context_rs = {"results": resultback}
    return render(request, 'es.html', context_rs)
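The view above always searches for the fixed term "cluster", while the form posts a field named keyword. One possible way to connect the two is sketched below. build_keyword_query is a hypothetical helper, and it uses a match query rather than term on the assumption that pageContent is an analyzed text field, so that the keyword goes through the same analysis as the indexed text.

```python
def build_keyword_query(keyword, field="pageContent"):
    """Return a search body for the posted keyword, or None if it is blank."""
    keyword = (keyword or "").strip()
    if not keyword:
        return None
    return {"query": {"match": {field: keyword}}}


# In the view, the argument would come from request.POST.get("keyword")
print(build_keyword_query("  cluster "))
print(build_keyword_query(""))
```

Inside fulltextsearch, a None result could be used to skip the call to es.search and render an empty result list instead.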
Providing a template that displays the elasticsearch full-text search results
This template also has to let the user submit an elasticsearch request.
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<form action="/SqlHub/FullTextSearch" method="post">
    {% csrf_token %}
    Search Key Word:<input type="text" name="keyword"><br>
    <input type="submit">
</form>
{% for item in results %}
    {% for key,value in item.items %}
        {% if key == "_source" %}
            {% for key1,value1 in value.items %}
                {% if key1 == "article_url" %}
                    {{ value1 }}<br>
                {% endif %}
            {% endfor %}
        {% endif %}
    {% endfor %}
{% endfor %}
</body>
</html>
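Django templates cannot use dot notation to look up a dictionary key that starts with an underscore, such as _source, so nested loops like the ones above are the workaround. Another option is to convert the dictionaries into objects.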
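Since template variables may not begin with an underscore, another way around the nested loops is to unwrap _source in the view before the data reaches the template. The following is a rough sketch; extract_sources is a hypothetical helper and the sample hits are invented for the demonstration.

```python
def extract_sources(hits):
    """Return the _source dictionary of each hit, dropping the metadata."""
    return [hit.get("_source", {}) for hit in hits]


# Invented sample shaped like ret["hits"]["hits"]
sample_hits = [
    {"_id": "1", "_score": 1.2, "_source": {"article_url": "https://example.com/a"}},
    {"_id": "2", "_score": 0.8, "_source": {"article_url": "https://example.com/b"}},
]
results = extract_sources(sample_hits)
print([r["article_url"] for r in results])
```

With a context of {"results": extract_sources(ret["hits"]["hits"])}, the template could use a single loop and write {{ item.article_url }} directly, because keys without a leading underscore can be reached with dot notation.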
Finally, the mapping between the form actions and the view functions has to be configured:
from django.urls import path, include
import SqlHub.views

urlpatterns = [
    path(r'', SqlHub.views.archive),
    path(r'New', SqlHub.views.newone),
    path(r'FullTextSearch', SqlHub.views.fulltextsearch),
]