Python + Elasticsearch. First steps.

cumei1658

Original article: https://www.pybloggers.com/2015/02/python-elasticsearch-first-steps/

Lately, here at Tryolabs, we have become interested in big data and search-related platforms, which give us excellent resources for building our complex web applications. One of them is Elasticsearch.

Elastic{ON}15, the first ES conference, is coming. Since we see a lot of interest in this technology nowadays, we are taking the opportunity to give an introduction and a simple example for Python developers out there who want to begin using it or give it a try.

1. What is Elasticsearch?

Elasticsearch is a distributed, real-time, search and analytics platform.

2. Yeah, but what IS Elasticsearch?

Good question! In the previous definition you can see all these hype-sounding tech terms (distributed, real-time, analytics), so let’s try to explain.

ES is distributed: it organizes information in clusters of nodes, so it can run on multiple servers if we want it to.

ES is real-time: because data is indexed, we get responses to our queries super fast!

And last but not least, it does searches and analytics. The main problem we are solving with this tool is exploring our data!

A platform like ES is the foundation for any respectable search engine.

3. How does it work?

Through a RESTful API, Elasticsearch saves data and indexes it automatically. It assigns types to fields, so a search can be done smartly and quickly using filters and different queries.

It runs on the JVM in order to be as fast as possible. It splits each index into “shards” of data and replicates those shards on different nodes, so the data is distributed and a cluster can keep working even if not all nodes are operational. Adding nodes is super easy, and that’s what makes it so scalable.
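
To make the shard idea a bit more concrete, here is a minimal sketch in Python. The keys number_of_shards and number_of_replicas are the settings Elasticsearch accepts when an index is created; the values 3 and 1 and the helper total_shard_copies are our own illustration, not part of the setup used in this post.

```python
# Hypothetical index settings: 3 primary shards, each with 1 replica copy.
# The numbers are assumptions chosen for illustration only.
index_settings = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
    }
}


def total_shard_copies(settings):
    """Count primaries plus replicas, i.e. every shard copy the cluster has to place."""
    s = settings["settings"]
    return s["number_of_shards"] * (1 + s["number_of_replicas"])


print(total_shard_copies(index_settings))  # 3 primaries + 3 replicas = 6
```

With one replica per primary, every shard exists twice, which is what lets a cluster keep answering when a single node goes away.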

ES uses Lucene to resolve searches. This is quite an advantage compared with, for example, Django query strings. A RESTful API call allows us to perform searches using JSON objects as parameters, which makes it much more flexible and lets us give each search parameter within the object a different weight, importance and/or priority.
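
As a rough illustration of weighting search parameters, the sketch below builds a JSON search body as a plain Python dict. The field names title and description, the function weighted_query and the boost value are assumptions made up for this example; bool, should, match and boost are standard query DSL keys.

```python
import json


def weighted_query(text, title_boost=2.0):
    """Build a search body where a match on `title` counts more than one on `description`.

    Both field names are hypothetical; replace them with fields from your own index.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"title": {"query": text, "boost": title_boost}}},
                    {"match": {"description": {"query": text}}},
                ]
            }
        }
    }


# The dict serializes directly to the JSON that the REST API expects.
print(json.dumps(weighted_query("star wars"), indent=2))
```

Documents matching either clause are returned, and a hit on the boosted field pushes a document further up the ranking.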

The final result ranks the objects that comply with the search query requirements. You can even use synonyms, autocompletion and spelling suggestions, and correct typos. While the usual query strings provide results that follow certain logic rules, ES queries give you a ranked list of results that may match different criteria, and their order depends on how well they comply with a certain rule or filter.

ES can also provide answers for data analysis, such as averages, the number of unique terms and/or other statistics. This is done using aggregations. To dig a little deeper into this feature, check the documentation on aggregations.
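
A minimal sketch of what such an aggregation request could look like, again as a Python dict. The avg and cardinality aggregations are standard; the function name stats_request, the aggregation labels and the two field names passed in are placeholders we made up, and the numeric field has to be mapped as a number for avg to work.

```python
def stats_request(numeric_field, term_field):
    """Build a request body that asks only for aggregations, not for matching documents."""
    return {
        "size": 0,  # skip the hits, we only want the aggregated numbers
        "aggs": {
            "average_value": {"avg": {"field": numeric_field}},
            "unique_terms": {"cardinality": {"field": term_field}},
        },
    }


# Hypothetical field names, chosen only for illustration.
request_body = stats_request("price", "category")
print(request_body)
```

The response carries one entry per label under its aggregations key, so the labels average_value and unique_terms are what you would look up there.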

4. Should I use ES?

The main point is scalability and getting results and insights very fast. In most cases, Lucene alone could be enough to cover all you need.

It sometimes seems that these tools are designed for projects with tons of data and are distributed in order to handle tons of users. Startups dream of growing into that scenario, but they may think small at first to build a prototype and only start thinking about scaling problems once the data is there.

Does it make sense, and does it pay off, to be prepared to grow A LOT? Why not? Elasticsearch has no obvious drawbacks and is easy to use, so it’s just a matter of deciding to use it to be prepared for the future.

I’m going to give you a quick example of a dead simple project that uses Elasticsearch to quickly and beautifully search some example data. It will be quick to do, Python powered and ready to scale in case we need it to: the best of both worlds.

5. Easy first steps with ES

For the following part it helps to be familiar with concepts like Cluster, Node, Document and Index. Take a look at the official guide if you have doubts.

First things first, get ES from here.

I followed this video tutorial to get things started in just a minute. I recommend you all check it out later.

Once you have downloaded ES, it’s as simple as running bin/elasticsearch and you will have your ES cluster with one node running! You can interact with it at http://localhost:9200/

If you hit it, the node answers with a short JSON document describing itself.

Creating another node is as simple as:

bin/elasticsearch -Des.node.name=Node-2

It automatically detects the old node as its master and joins our cluster. By default we will be able to communicate with this new node on port 9201 (http://localhost:9201). Now we can talk to either node and receive the same data; they are supposed to be identical.

6. Let’s Pythonize this thing!

Using ES with our all-time favorite language, Python, gets easier if we install the elasticsearch-py package (published on PyPI as elasticsearch, so pip install elasticsearch does the job).

Now we will be able to use this package to index and search data using Python.

7. Let’s add some public data to our cluster

So, I wanted to make this project a “real world example”. I really did, but after I found out there is a Star Wars API (http://swapi.co/), I couldn’t resist it, and it ended up being a fictional, “galaxy far, far away” example. The API is dead simple to use, so we will get some data from there.

I’m using an IPython Notebook to do this test. I started with a sample request to make sure we can hit the ES server.
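
The sample request itself is not reproduced in this text, so here is one way it could look using only the standard library. The function name cluster_info and its injectable opener argument are our own choices for this sketch; the URL is the default address mentioned earlier.

```python
import json
from urllib.request import urlopen


def cluster_info(url="http://localhost:9200/", opener=urlopen):
    """Fetch the node's root endpoint and decode the JSON document it returns."""
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))


# With a node running locally you could call:
#   print(cluster_info())
```

The call raises an error from urllib if nothing is listening on that port, which is exactly the check we want before going any further.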

Then we connect to our ES server using Python and the elasticsearch-py library:
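
A minimal sketch of that connection, assuming the elasticsearch package is installed and a node is listening on the default address. The helper name make_client is ours; Elasticsearch and ping come from the library.

```python
def make_client(url="http://localhost:9200"):
    """Return an elasticsearch-py client pointed at the given node."""
    # Imported here so the sketch can be loaded even when the package is missing.
    from elasticsearch import Elasticsearch

    return Elasticsearch(url)


# With a node running locally you could then do:
#   es = make_client()
#   es.ping()  # True when the node answers
```

Pointing a second client at http://localhost:9201 would talk to the other node of the same cluster.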

I added some data to test, and then deleted it. I’m skipping that part for this guide, but you can check it out in the notebook.

Now, using The Force, we connect to the Star Wars API and index some fictional people.
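
The indexing loop could be sketched roughly as follows. This is not the notebook’s code: the helper names fetch_person and index_people are ours, and the document= keyword is the spelling used by recent elasticsearch-py releases (older clients take body= and also require a doc_type argument).

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_person(person_id, base_url="http://swapi.co/api/people/"):
    """Return one character as a dict, or None when the API answers with an HTTP error."""
    try:
        with urlopen(f"{base_url}{person_id}") as response:
            return json.loads(response.read().decode("utf-8"))
    except HTTPError:
        return None


def index_people(es, start=1, fetch=fetch_person, index="sw"):
    """Index consecutive characters until one is missing; return the id that failed."""
    person_id = start
    while True:
        person = fetch(person_id)
        if person is None:
            break
        # Assumption: a recent client. Older ones use body=person and doc_type="people".
        es.index(index=index, id=person_id, document=person)
        person_id += 1
    return person_id
```

Passing the fetch function in as an argument keeps the loop easy to try out against canned data before pointing it at the real API.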

Please notice that we automatically created an index “sw” and a “doc_type” with the indexing command. We get 17 responses from swapi and index them with ES. I’m sure there are many more “people” in the swapi DB, but it seems we are getting a 404 with http://swapi.co/api/people/17. Bug report here! 🙂

Anyway, to check that everything worked with these few results, we try to get the document with id=5.

We will get Princess Leia:
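
A hedged sketch of that lookup: get is a real client method and the stored document comes back under the _source key, while the helper name get_person_name is ours and older clients additionally expect a doc_type argument.

```python
def get_person_name(es, person_id, index="sw"):
    """Fetch one document by id and return the name stored in it."""
    result = es.get(index=index, id=person_id)
    return result["_source"]["name"]


# With the data indexed as described above:
#   get_person_name(es, 5)
```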

Now, let’s add more data, this time using node 2! And let’s start at the 18th person, where we stopped.

We got the rest of the characters just fine.

8. Now, let’s try an interesting search

Where is Darth Vader? Here is our search query:
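
One plausible form of that query is a match query on the name field, sketched below together with a small helper that flattens the response. Both function names are ours; the hits.hits, _id, _score and _source keys are the standard layout of a search response.

```python
def match_name(text):
    """Build a full-text match query against the `name` field."""
    return {"query": {"match": {"name": text}}}


def summarize_hits(response):
    """Reduce a search response to (id, score, name) tuples, best match first."""
    return [
        (hit["_id"], hit["_score"], hit["_source"]["name"])
        for hit in response["hits"]["hits"]
    ]


# Older clients accept the whole dict as body=, newer ones take the inner part as query=:
#   response = es.search(index="sw", body=match_name("Darth Vader"))
#   print(summarize_hits(response))
```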

This will give us both Darth Vader AND Darth Maul, with id 4 and id 44 (notice that they are in the same index, even though we used different node clients to call the index command). Both results have a score, although Darth Vader’s is much higher than Darth Maul’s (2.77 vs 0.60), since Vader is an exact match. Take that, Darth Maul!

So, this query will give us results if the word is contained exactly in our indexed data. What if we want to build some kind of autocomplete input where we get the names that contain the characters we are typing?

There are many ways to do that, and a great number of other queries to choose from. Take a look here to learn more. I picked this one to get all documents with the prefix “lu” in their name field:
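
A sketch of such a prefix query, assuming the name field was indexed with the default analyzer, which lowercases terms. The helper name prefix_name is ours.

```python
def prefix_name(prefix):
    """Build a prefix query on the `name` field."""
    # Prefix queries are not analyzed, so we lowercase the input ourselves
    # to line up with the lowercased terms stored in the index.
    return {"query": {"prefix": {"name": prefix.lower()}}}


print(prefix_name("Lu"))
```

Calling this on every keystroke of an input box is a simple, if not the most efficient, way to get the autocomplete behaviour described above.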

We will get Luke Skywalker and Luminara Unduli, both with the same 1.0 score, since they match with the same 2 initial characters.

There are many other interesting queries we can run. If, for example, we want all elements that are similar in some way, as in a related-items or spelling-correction search, we can use something like this:
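
The exact query is not shown in this text, so the sketch below swaps in a match query with the fuzziness option, which tolerates small typos. The helper name fuzzy_name and the misspelling “jaba” are assumptions made for illustration.

```python
def fuzzy_name(text, fuzziness="AUTO"):
    """Build a match query on `name` that also accepts terms a few edits away."""
    return {
        "query": {
            "match": {
                "name": {"query": text, "fuzziness": fuzziness}
            }
        }
    }


# "jaba" is one edit away from "jabba", so a fuzzy match can still find it.
print(fuzzy_name("jaba"))
```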

And we got Jabba although we had a typo in our search query. That is powerful!

9. Next Steps

This was just a simple overview of how to set up your Elasticsearch server and start working with some data using Python. The code used here is publicly available in this IPython notebook.

We encourage you to learn more about ES, and especially to take a look at the Elastic stack, where you will be able to see beautiful analytics and insights with Kibana and go through logs using Logstash.

In the following posts we will talk about more advanced ES features, and we will try to extend this simple test and use it to show a more interesting Django app powered by this data and by ES.