What Are Text Indexes in MongoDB?


MongoDB offers text indexes for search queries which include strings in their contents. At most, a collection in MongoDB can have no more than one text index. So the question is: how to build a text index?

Like other indexes, a text index can also be created using the db.collection.createIndex() method. Such an index can be built on a string array too. To define a text index for a field, you have to type “text” like the following instance.

db.employees.createIndex( {name:”text”} )

In this example, a text index has been built on the “name” field. Similarly, other fields can also be defined by using the same index.

Weights

While working with text indexes, you must become familiar with the concept of weight. Weight refers to an indexed field and marks its importance in comparison to other fields (which are also indexed) by processing the score for text search.

In all the indexed fields of a document, the match number is multiplied with the weight and the output is summed. MongoDB then takes the sum value and processes it to generate the document’s score.

By default, each indexed field carries a weight of 1. Weights can be modified in the db.collection.createIndex() method.

Wildcard Specifier

In MongoDB, there is also a wildcard specifier ($**). When this specifier is used in conjunction with a text index then it is referred to as a wildcard text index. What this does is that it applies indexing on every field which stores data in the form of strings for all the collection’s document. A wildcard specifier can be defined by using the following method.

db.collection.createIndex( {“S**”: “text”} )

Basically, wildcard text indexes can be seen as text indexes which work on more than a single field. To govern the query results’ ranking, weights can be specified for certain fields while building text indexes.

Case Insensitivity

The 3rd (latest) version of the text index offers support for the simple s and common c. The special T case folding found in Turkish is also supported.

 

The case insensitivity of the text index is further improved with support for diacritic insensitivity (a mark which represents different pronunciation) like É and é. This means that the text index does not differentiate between e, E, é, and É.

Tokenization

For purposes related to tokenization, the text index version 3 supports the following delimiters.

  • Hyphen
  • Dash
  • Quotation_Mark
  • White_Space
  • Terminal_Punctuation
  • Dash

For instance, if a text index finds the following text string, then it would treat spaces and “« »” as delimiters.

«Messi est l’un des plus grands footballeurs de tous les temps»

Sparse

By default, text indexes are “sparse” and therefore they do not need to be explicitly defined with the sparse option. When a document does not have a field indexed with a text index, then MongoDB does not make the document’s entry with the text index. When insertion occurs, then MongoDB does insert the document, however no addition occurs with the text index.

Limitations

Earlier, we talked about how there can be no more one text index for a collection. There are some more limitations for text indexes.

  • When a query entails the $text operator, then it is not possible to use the hint() method.
  • It is not possible for sort operations to utilize the text index’s arrangement or ordering, even if it is a compound text index.
  • It is possible to add a text index key to generate a compound index. Though, they have some limitations. For example, it is not possible to use special index types like geospatial index with a compound text index.

Moreover, while building a compound text index, all text index keys have to be defined adjacently. This specifying of index must come in the index specification document.

Lastly, if there are keys which precede the key of text index, then for executing a search with $text operation, it is necessary for the query predicate to use quality match conditions for the preceding keys.

  • For dropping a text index, it is mandatory to mention the index name in the method of db.collection.dropIndex(). In case you do not know the index name, the db.collection.getIndexes() method can be used.
  • There is no support for collation while working with text indexes. However, they do provide support for simple binary comparison.

Performance and Storage Constraints

While using text indexes, it is necessary to realize their impact on the performance and storage of your application

  • The creation of a text index is not too dissimilar to creating a huge multikey index. A text index takes considerably more time in comparison to a basic ordered index.
  • Considering the nature of applications, it is possible for text indexes to be “enormous”. They carry a single entry for an index in correspondence with every unique or special post-stemmed word for all the indexed fields whenever documents are inserted.
  • Text indexes do not save information or phrases related to a word’s proximity within the document. Therefore, queries with phrases run better in comparison if the complete collection is fitted into the RAM.
  • Text indexes have an effect on insertion operations in MongoDB. This is because MongoDB has to include entries for index with every post-stemmed string in indexed fields related to every newly-created source document.
  • While creating a big text index for a collection which exists for some time, make sure that you have a strong limit for open file descriptions.

$text Operator

The $text operator is used to conduct a textual search for a field’s contents which is text-indexed. An expression with the text operator has the following components.

  • $search – A single or multiple terms which is used by MongoDB for parsing and querying text indexes.
  • $language – An optional component which represents the language that is to be used for tokenizer, stemmer, and stop-words.
  • $caseSensitive – An optional component which is used to turn on or off the case sensitive search.
  • $diacriticSensitive – An optional component which is used to turn on or off the diacritic sensitive search.
Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s