Find similar documents based on a string in MongoDB

I need to find all documents in a MongoDB database that have a property containing a string similar to the search term, while allowing a certain percentage of divergence.

In plain JavaScript I could, for example, use https://www.npmjs.com/package/string-similarity and match all documents with a similarity score above 90%.
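To make the in-memory approach concrete, here is a minimal sketch of a bigram-based Dice coefficient, which is, as far as I know, the kind of score that package computes. This is an independent illustration and not the package's code; the names `bigrams`, `similarity` and `filterSimilar` are hypothetical.

```javascript
// Count the character bigrams of a string (lowercased, whitespace removed).
function bigrams(str) {
  const s = str.toLowerCase().replace(/\s+/g, '');
  const counts = new Map();
  for (let i = 0; i < s.length - 1; i++) {
    const gram = s.slice(i, i + 2);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Dice coefficient: 2 * shared bigrams / total bigrams, in the range 0..1.
function similarity(a, b) {
  const ga = bigrams(a);
  const gb = bigrams(b);
  let total = 0;
  let shared = 0;
  for (const n of ga.values()) total += n;
  for (const n of gb.values()) total += n;
  if (total === 0) return a === b ? 1 : 0; // strings too short to have bigrams
  for (const [gram, n] of ga) shared += Math.min(n, gb.get(gram) || 0);
  return (2 * shared) / total;
}

// Keep only documents whose field scores above the threshold (assumed 0.9).
function filterSimilar(docs, field, term, threshold = 0.9) {
  return docs.filter((doc) => similarity(String(doc[field]), term) > threshold);
}
```

This has to load and score every document, which is exactly what does not scale to millions of records.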

I'd like to do this as a MongoDB query and have it be as performant as possible, since the database contains millions of documents.

What possible options do I have in this situation?

  • I found something about $text search, but it doesn't seem to help much.
  • I was thinking about creating a signature for each document, such as a hash that tolerates some divergence.

I would appreciate any ideas for solving this in the best possible way.

7 months ago · Juan Pablo Isaza
1 answer

The common solution to this problem is a search engine database such as Elasticsearch or Atlas Search (by the MongoDB team). I won't go into much detail on how these databases work, but generally speaking they are built on an inverted index: your data is tokenized on insert, and queries run against the tokens rather than the raw data set.
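As a rough illustration of the inverted index idea (not how any of these products is actually implemented), the sketch below tokenizes documents on insert and answers queries from the token map alone. The document shape `{ id, text }` and all function names are assumptions made for the example.

```javascript
// Split text into lowercase alphanumeric tokens.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Build a map of token -> Set of document ids (done once, at insert time).
function buildInvertedIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const token of tokenize(doc.text)) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(doc.id);
    }
  }
  return index;
}

// Return the ids of documents that contain every token of the query.
function search(index, query) {
  const tokens = tokenize(query);
  if (tokens.length === 0) return [];
  let result = null;
  for (const token of tokens) {
    const ids = index.get(token) || new Set();
    result = result === null
      ? new Set(ids)
      : new Set([...result].filter((id) => ids.has(id)));
  }
  return [...result];
}
```

The query never touches the original text, only the token map, which is why lookups stay cheap as the data set grows.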

This approach is very powerful and can help with many "search engine" problems like autocomplete or in your case what is called a "fuzzy" search.
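For Atlas Search, a fuzzy query could look roughly like the aggregation pipeline below. It is a sketch under assumptions: the collection has an Atlas Search index named `default` covering the queried field, and `buildFuzzySearchPipeline`, the field name and the result limit are hypothetical. Note that `maxEdits` is an edit distance (1 or 2), not a percentage, so it only approximates a "90% similar" threshold.

```javascript
// Build an aggregation pipeline that runs a fuzzy text query via $search.
// Assumes an Atlas Search index named 'default' exists on the collection.
function buildFuzzySearchPipeline(term, path, maxEdits = 1) {
  return [
    {
      $search: {
        index: 'default',
        text: {
          query: term,
          path: path,
          fuzzy: { maxEdits: maxEdits }, // allowed character edits per term
        },
      },
    },
    { $limit: 20 },
    // Return the matched field together with the relevance score.
    { $project: { [path]: 1, score: { $meta: 'searchScore' } } },
  ];
}

// With the Node.js driver this would be used along the lines of:
//   const docs = await collection.aggregate(
//     buildFuzzySearchPipeline('mongodb', 'title')
//   ).toArray();
```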

Let's see how Elasticsearch deals with this by reading the description of its fuzzy query:

To find similar terms, the fuzzy query creates a set of all possible variations, or expansions, of the search term within a specified edit distance. The query then returns exact matches for each expansion.

Basically, they generate every possible variation of the query term within the given parameters. I would personally recommend using one of these databases, since they offer this out of the box. However, if you want to build a "pseudo" search engine in Mongo, you can use the same approach, with the downside that Mongo's indexes are B-trees, so these queries force tree scans instead of running on a database designed for this.
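A minimal sketch of that "pseudo" approach in Mongo: generate every string within one edit of the search term and match them with `$in`. It assumes the field stores lowercase values over the letters a-z and has a regular index; `expansions` and `buildFuzzyFilter` are hypothetical helper names, and this is not how Elasticsearch itself implements the expansion.

```javascript
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz';

// All strings reachable from `term` with at most one deletion,
// substitution or insertion (edit distance <= 1), including the term itself.
function expansions(term) {
  const out = new Set([term]);
  for (let i = 0; i <= term.length; i++) {
    const head = term.slice(0, i);
    const tail = term.slice(i);
    if (tail) out.add(head + tail.slice(1)); // delete the character at i
    for (const c of ALPHABET) {
      if (tail) out.add(head + c + tail.slice(1)); // substitute at i
      out.add(head + c + tail); // insert before i
    }
  }
  return [...out];
}

// Build a MongoDB filter that matches any expansion exactly.
function buildFuzzyFilter(field, term) {
  return { [field]: { $in: expansions(term.toLowerCase()) } };
}

// With the Node.js driver: collection.find(buildFuzzyFilter('name', 'cat'))
```

The list of expansions grows with the term length and much faster with the allowed edit distance, so this is only practical for short terms and a distance of one or two.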

7 months ago · Juan Pablo Isaza