Kirill Zonov

Chop-chop, MongoDB! Leveraging indexes power. Part 3. Text indexes

November 28, 2017 | 8 Minute Read

Hey folks! Today we will try to find some text in our collection. Then we will add a text index and see how things get better (or not). Let’s grab a beer and start.

Index creation.

Text indexes have been around for a while, but the latest version (version 3) was released in MongoDB 3.2, so keep that in mind when you try the things I describe here. Let’s create some data first:

var bulk = db.posts.initializeUnorderedBulkOp();
bulk.insert( { title: "Chop-chop, MongoDB! Leveraging indexes power. Part 3. Text indexes", url: "http://zonov.me/chop-chop-mongodb-leveraging-indexes-power-part-3-text-indexes/" } );
bulk.insert( { title: "Chop-chop, MongoDB! Leveraging indexes power. Part 2. Multikey indexes", url: "http://zonov.me/chop-chop-mongodb-leveraging-indexes-power-part-2-multikey-indexes/" } );
bulk.insert( { title: "Chop-chop, MongoDB! Leveraging indexes power. Part 1", url: "http://zonov.me/chop-chop-mongodb-leveraging-indexes-power-part-1/" } );
bulk.insert( { title: "Introduction to Terraform. Terraform + Github", url: "http://zonov.me/terrafom-introduction-with-github/" } );
bulk.insert( { title: "How to install and use PostgreSQL (or whatever you want) using Docker", url: "http://zonov.me/how-to-install-postgresql-using-docker/" } );
bulk.insert( { title: "Pair programming. Do’s and don’ts. And hows.", url: "http://zonov.me/pair-programming-dos-and-donts-and-hows/" } );
bulk.insert( { title: "Python for Data Analysis book review", url: "http://zonov.me/python-for-data-analysis-book-review/" } );
bulk.insert( { title: "PostgreSQL transactions Isolation levels", url: "http://zonov.me/postgresql-transactions-isolation-levels/" } );
bulk.insert( { title: "Comparison of Ruby and Python’s Pandas for Data refinement", url: "http://zonov.me/comparison-of-ruby-and-pythons-pandas-for-data-refinement/" } );
bulk.execute();

Let’s try to find all the posts where I wrote anything related to PostgreSQL:

db.posts.find( { $text: { $search: "PostgresQl" } } )

Baaang:

Error: error: {
	"ok" : 0,
	"errmsg" : "text index required for $text query",
	"code" : 27,
	"codeName" : "IndexNotFound"
}

I’m pretty sure anything is better than that, so let’s add an index. I believe there is no way to screw it up even more.

db.posts.createIndex( { title: "text" } )

Note that you have to mark a text index explicitly by using "text" as the index type. (As you may remember, multikey indexes worked just fine without any explicit mark.) Search again, and voilà!

/* 1 */
{
"_id" : ObjectId("5a1cfe4ed2f7c8316ff62b30"),
"title" : "How to install and use PostgreSQL (or whatever you want) using Docker",
"url" : "http://zonov.me/how-to-install-postgresql-using-docker/"
}
/* 2 */
{
"_id" : ObjectId("5a1cfe4ed2f7c8316ff62b33"),
"title" : "PostgreSQL transactions Isolation levels",
"url" : "http://zonov.me/postgresql-transactions-isolation-levels/"
}

Keep in mind: you can create only one text index per collection! But that one text index can cover all the string fields in the collection. If you try it, don’t forget to drop the previous index first:

db.posts.createIndex( { "$**": "text" } )

And now you can search in all fields, for example:

db.posts.find( { $text: { $search: "indexes-power" } } )

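This query searches in both the title and url fields. The hyphen in "indexes-power" is treated as a delimiter, not as a negation, so the search looks for "indexes" or "power".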
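Since a collection can hold only one text index, switching from the single-field index to the wildcard one takes two steps. Here is a minimal sketch, assuming the first index kept its default name `title_text` (check `db.posts.getIndexes()` if unsure). The helper name `switchToWildcardTextIndex` is made up for illustration:

```javascript
// Hypothetical helper: replaces the text index on `title` with a wildcard text index.
// Assumption: the old index still has its default name "title_text".
function switchToWildcardTextIndex(db) {
  // A collection can hold only one text index, so drop the old one first.
  db.posts.dropIndex("title_text");
  // "$**" tells MongoDB to index every field that contains string data.
  return db.posts.createIndex({ "$**": "text" });
}
```

In the shell you would call it as `switchToWildcardTextIndex(db)`.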

Sensitivity

Like a human, a text index can be sensitive. Case sensitive. Or insensitive. There are two kinds of sensitivity:

  1. Case sensitivity. A search can be sensitive to the case of the text. If it's insensitive, it treats "monGoDb" as "mongodb". Case insensitive is the default.
  2. Diacritic sensitivity. Diacritics are marks added to letters, as in "é" or "ê". Diacritic insensitivity means that such letters are treated the same as the plain "e", so a search for "cafe" also matches "café". This is also the default for version 3 text indexes. (You can find more diacritic characters in the Unicode spec.)
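Both kinds of sensitivity can be switched on per query with the `$caseSensitive` and `$diacriticSensitive` options of `$text`. A small sketch of how such a filter could be built; the helper name `sensitiveTextFilter` and its `options` argument are my own invention, only the `$`-prefixed keys come from MongoDB:

```javascript
// Hypothetical helper: builds a $text filter with explicit sensitivity flags.
function sensitiveTextFilter(search, options = {}) {
  return {
    $text: {
      $search: search,
      // Both flags fall back to false, which matches MongoDB's defaults.
      $caseSensitive: options.caseSensitive === true,
      $diacriticSensitive: options.diacriticSensitive === true,
    },
  };
}
```

For example, `db.posts.find(sensitiveTextFilter("PostgreSQL", { caseSensitive: true }))` should only match titles where the word is written in exactly that case.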

Weights

By default, all text fields in the index have the same “importance” for search. I can check it on my index (after dropping the previous text index):

db.posts.createIndex( { title: "text", url: "text" } )
db.posts.getIndexes()
/* 1 */
[
    {...},
    {
        "v" : 2,
        "key" : {
            "_fts" : "text",
            "_ftsx" : 1
        },
        "name" : "title_text_url_text",
        "ns" : "local.posts",
        "weights" : {
            "title" : 1,
            "url" : 1
        },
        "default_language" : "english",
        "language_override" : "language",
        "textIndexVersion" : 3
    }
]

If you want, you can set the weights for your fields manually. With the following index, title matches score higher than url matches, so they are more likely to come first when you sort the results by relevance.

db.posts.createIndex( { title: "text", url: "text" }, { weights: { title: 50, url: 10 } } )

This is useful if you have several fields and you are sure that some of them matter more to a customer. For example, in an e-commerce solution you may give a product's title a higher weight, its description a lower one, and its comments an even lower one (yes, you can also add fields of embedded documents to your index).
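One thing to be aware of: a `$text` query does not return documents in score order by default, so weights only show up in the ordering once you sort by the `textScore` metadata. A minimal sketch, where the helper name `searchByRelevance` and the output field name `score` are arbitrary choices of mine:

```javascript
// Hypothetical helper: runs a text search and returns the best matches first.
function searchByRelevance(db, term) {
  return db.posts
    .find(
      { $text: { $search: term } },
      // Expose the computed relevance score as a `score` field.
      { score: { $meta: "textScore" } }
    )
    // Sorting on textScore metadata is always descending (best match first).
    .sort({ score: { $meta: "textScore" } });
}
```

Calling `searchByRelevance(db, "indexes")` before and after changing the weights is an easy way to see how they shift the ranking.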

More

Some more things to keep in mind when using text indexes.

  • You don't need to stem your words manually; MongoDB does it automatically.
  • Keep in mind that a text index stores an entry for every unique stemmed word in every indexed field of every document, so it can grow large very quickly.
  • MongoDB supports different languages for search. The language matters because it determines the stemming rules and the list of stop words (like "a", "and", "the") that are ignored.
  • You can combine a text key with ordinary ascending or descending keys in a compound index.
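The language and compound-index points can be sketched together. This is only an illustration: the `category` field and both helper names are hypothetical (the sample posts above have no `category`), and any existing text index has to be dropped first because a collection can hold just one.

```javascript
// Hypothetical helper: creates a compound index with a regular key and a text key.
// Assumption: documents carry a `category` field, which the sample data does not have.
function createCompoundTextIndex(db) {
  return db.posts.createIndex(
    { category: 1, title: "text" },
    // "english" is the default; other supported values include "russian" or "german".
    { default_language: "english" }
  );
}

// Hypothetical helper: searches within one category.
// When regular keys precede the text key, the query must include an equality match on them.
function searchInCategory(db, category, term) {
  return db.posts.find({ category: category, $text: { $search: term } });
}
```

With English stemming, `searchInCategory(db, "mongodb", "index")` should also match titles that contain "indexes".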

I would definitely consider using MongoDB text indexes for an early-stage startup or MVP, when there is just no time to wrestle with a more comprehensive but also more complex solution like Elasticsearch or Sphinx. That is it for MongoDB indexes for now. It’s still a big topic, though, and if I find it interesting, I will write more about this NoSQL DB and its indexes. Be as fast as O(1) and have a beautiful day! :)