[NOTE] Indexing MongoDB with ES
What is Full Text Search?
Omitted here (refer to 'Information Retrieval (IR)').
Why not just use MongoDB for full-text search?
MongoDB supports full text search.
For example:
- Step 1: Create a database called fulltext in MongoDB:
$ mongo
> use fulltext
- Step 2: Add a collection called articles:
> db.createCollection('articles')
- Step 3: Insert some articles, each with two fields, title and content (make sure at least one document contains the word we'll search for below):
> db.articles.insert({title: 'Chinese food', content: 'An article about Chinese cuisine.'})
- Step 4: Now that we have documents, we need to index them with a MongoDB text index. Create one on both the title and content fields of the articles collection:
> db.articles.createIndex({title: 'text', content: 'text'})
- Finally, with the index created, it's time to search for something:
> db.articles.find({$text: {$search: 'chinese'}})
Seems to be working fine, so why not use it? Try searching for the prefix 'chi':
> db.articles.find({$text: {$search: 'chi'}})
The result is empty! As you can see, one of the biggest limitations of MongoDB is that a text index cannot do what is called partial matching.
How to use Elasticsearch?
The advantage of a dedicated indexing engine is not only that it provides partial matching, but that it satisfies many other indexing requirements as well; the above is just one simple example. So, how do we use Elasticsearch for full-text search over MongoDB?
ES Installation: Since ES is built on Java, just make sure you have Java installed and the JAVA_HOME environment variable set.
Once you have installed ES, this is the overall process we’ll follow:
- Create the index for our documents.
- Import our MongoDB collection into ES with a tool called mongo-connector.
- Copy the documents from the index created by mongo-connector in ES into the index we created in step 1.
- Try out our new index and see how new documents keep getting indexed as long as mongo-connector is running.
So, let's go through the details.
How to create an ES index?
What is an Analysis Chain?
We'll have to define an analysis chain, which is the pipeline that every document we insert into the index goes through in order to be indexed.
An analysis chain is formed by analysers. In short, an analyser is composed of three functions:
- A character filter: cleans up the string before it is tokenized.
- A tokenizer: splits the string (e.g. by spaces) into terms.
- A token filter: modifies the terms to optimize them for the index's purpose.
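To see an analysis chain in action, you can send text through the _analyze API (a quick illustration, assuming ES is already running on localhost:9200; the request-body syntax below matches ES 2.x/5.x). The built-in standard analyser, for instance, tokenizes on word boundaries and lowercases each term:
$ curl \
-H 'Content-Type: application/json' \
-X POST 'http://localhost:9200/_analyze?pretty' \
-d '{"analyzer": "standard", "text": "Chinese Food"}'
This returns the two terms chinese and food.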
Use one of the token filters: Edge N-grams
ES provides many built-in analysers, tokenizers and token filters. The one we need here is called edge_ngram.
Explanation:
From the n-gram article on Wikipedia: an n-gram is a contiguous sequence of n items from a given sequence of text or speech.
For example, for the word 'blueberry':
The 1-grams (unigrams) will be: [b, l, u, e, b, e, r, r, y]
The 2-grams (bigrams) will be: [bl, lu, ue, eb, be, er, rr, ry]
etc.
Edge N-grams: Edge n-grams are anchored to the beginning of the word.
For example, for the word 'blueberry':
The edge N-grams will be : [b, bl, blu, blue, blueb, bluebe, blueber, blueberr, blueberry]
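You can reproduce this with the _analyze API by defining an edge_ngram tokenizer inline in the request (inline definitions like this are supported from ES 5.x; min_gram and max_gram are the same parameters we'll use in our filter below):
$ curl \
-H 'Content-Type: application/json' \
-X POST 'http://localhost:9200/_analyze?pretty' \
-d '{"tokenizer": {"type": "edge_ngram", "min_gram": 1, "max_gram": 9}, "text": "blueberry"}'
The returned tokens are exactly the edge n-grams listed above.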
So we can create the filter named autocomplete_filter:
{
  "filter": {
    "autocomplete_filter": {
      "type": "edge_ngram",
      "min_gram": 3,
      "max_gram": 20
    }
  }
}
I used 3 as the minimum because, for very big databases, having unigrams would slow performance down a lot, since lots of documents would match every search.
And now we need to define our custom analyser named autocomplete:
{
  "analyzer": {
    "autocomplete": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "autocomplete_filter"
      ]
    }
  }
}
Here we set two filtering steps: lowercase and then autocomplete_filter.
Use the REST API
We have defined the filter and the analyser, so let's create the index. Using the curl command:
$ curl \
-H 'Content-Type: application/json' \
-X PUT http://localhost:9200/fulltext_opt \
-d '{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}'
The {"acknowledged": true} response means our index was successfully created. The fulltext_opt at the end of the URL tells ES the name of the new index.
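Before adding the mappings, it's worth sanity-checking the new analyser against the index we just created:
$ curl \
-H 'Content-Type: application/json' \
-X POST 'http://localhost:9200/fulltext_opt/_analyze?pretty' \
-d '{"analyzer": "autocomplete", "text": "Chinese"}'
You should get back the terms chi, chin, chine, chines and chinese, which is exactly why a search for 'chi' will be able to match this word later.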
The last thing we have to do for our index fulltext_opt is to create the mappings. We'll create a mapping type called articles and define the properties title and content on it:
$ curl \
-H 'Content-Type: application/json' \
-X PUT http://localhost:9200/fulltext_opt/_mapping/articles \
-d '{
  "articles": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete"
      },
      "content": {
        "type": "text"
      }
    }
  }
}'
Again, the {"acknowledged": true} response means the mapping was added.
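You can confirm the mapping (and that title is wired to the autocomplete analyser) at any time with:
$ curl 'http://localhost:9200/fulltext_opt/_mapping?pretty'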
That's all! Our index in ES has been created!
How to import documents from MongoDB into ES?
There’s a tool called mongo-connector, which is what we need!
- Step 1: Install mongo-connector using the Python package manager pip. You'll also need to install elastic2-doc-manager, which provides the support for copying data from MongoDB into Elasticsearch 2.X or 5.X (6.X is not supported presently).
$ pip install mongo-connector
$ pip install elastic2-doc-manager
- Step 2: Start our MongoDB server as a replica set. This step is the same as for Solr, since both Solr and ES use mongo-connector to build the relationship with MongoDB. (For more details about Solr, see Indexing MongoDB Data in Apache Solr.)
$ mongod --replSet rs0
$ mongo
> rs.initiate()
> exit
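If mongo-connector fails to connect later on, first check that the replica set actually came up:
$ mongo
> rs.status()
> exit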
- Step 3: Go into your ES installation directory, run ES, and create an index like fulltext_opt as shown above:
$ ./bin/elasticsearch
$ curl .../fulltext_opt ...
- Step 4: It's time to run mongo-connector (for ES 2.X or 5.X):
$ mongo-connector -m 127.0.0.1:27017 -t 127.0.0.1:9200 -d elastic2_doc_manager
Now you can see the two indices, fulltext (created from MongoDB) and the fulltext_opt we created, using the command:
$ curl localhost:9200/_cat/indices?v
- Step 5: We now have two indices: one created by mongo-connector, which is not optimized but holds our two documents, and another one that is optimized but empty. All we have to do now is copy the documents between the indices.
There is a great tool for this purpose called elasticdump which makes this task extremely easy.
$ npm install -g elasticdump
$ elasticdump \
--input=http://localhost:9200/fulltext \
--output=http://localhost:9200/fulltext_opt
Run the indices query again and you'll see that docs.count of fulltext_opt has changed from 0 to 2:
$ curl localhost:9200/_cat/indices?v
That's it: the new index fulltext_opt now holds the documents copied over from MongoDB's fulltext.
- Finally, try out our new index using the search command:
$ curl \
-H 'Content-Type: application/json' \
'localhost:9200/fulltext_opt/articles/_search?pretty' \
-d '{
  "query": {
    "match": {
      "title": {
        "query": "chi",
        "analyzer": "standard"
      }
    }
  }
}'
Got our document back! (Note that we override the analyzer at search time with standard, so the query string 'chi' is looked up as-is instead of being split into edge n-grams itself.)
What about the incremental indexing?
So far we have moved all the existing MongoDB documents into the fulltext_opt index, but there is a problem: if we keep mongo-connector running, every new document inserted into MongoDB will be indexed into the unoptimized fulltext index in ES, not into fulltext_opt (unless we run elasticdump again).
The way to solve this problem is to configure mongo-connector a bit more. There are many configuration options, which you can find in the mongo-connector documentation.
You can see how to configure mongo-connector via a JSON file. Here I’ll just use the command line.
$ mongo-connector -m 127.0.0.1:27017 -t 127.0.0.1:9200 -d elastic2_doc_manager -n fulltext.articles -g fulltext_opt.articles
In the command, namespaces.include (-n on the command line) selects which MongoDB namespaces to sync, and namespaces.mapping (-g) maps each of them onto the target ES index.
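For reference, the equivalent JSON config file might look roughly like this (a sketch: the field names mainAddress, docManagers and targetURL follow the mongo-connector config format, but double-check them against the documentation for your version). You would pass it with mongo-connector -c config.json:
{
  "mainAddress": "127.0.0.1:27017",
  "docManagers": [
    {
      "docManager": "elastic2_doc_manager",
      "targetURL": "127.0.0.1:9200"
    }
  ],
  "namespaces": {
    "include": ["fulltext.articles"],
    "mapping": {
      "fulltext.articles": "fulltext_opt.articles"
    }
  }
}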
Conclusion
What did we learn in this note?
- How to use MongoDB to search full-text?
- Why ES?
- What are analysers?
- How to create an ES index?
- How to import MongoDB documents into ES?
- How to implement the incremental indexing?