MongoDB - Find documents with earliest occurrence of duplicate value

I have a collection in MongoDB (4.4, but the version isn't important to me) in which one of the document fields is an array of URLs. There will be multiple documents, each with multiple URLs in the array, and some of those URLs will already exist in other documents. I would like to select, for each URL, the document with its earliest occurrence (the intention being to mark it as the 'origin').

MongoPlayground link for sample collection - https://mongoplayground.net/p/ZAgCqr517-8

[
  {
    "title": "story1_first",
    "isoDate": "2022-01-01T00:00:00.000Z",
    "links": [
      "www.first.com/article1",
      "www.anotherdomain.com"
    ]
  },
  {
    "title": "story1_mention",
    "isoDate": "2022-01-10T00:00:00.000Z",
    "links": [
      "www.first.com/article1",
      "www.somesite.com"
    ]
  },
  {
    "title": "story2_first",
    "isoDate": "2022-01-20T00:00:00.000Z",
    "links": [
      "www.newstory.com/article2",
      "www.anothercompany.com"
    ]
  },
  {
    "title": "story2_mention",
    "isoDate": "2022-01-20T00:00:00.000Z",
    "links": [
      "www.newstory.com/article2",
      "www.anothercompany.com"
    ]
  }
]

In this example, I would like the query / aggregation to return the two documents with "first" in the title, as they are the earliest-dated documents among those that share a common URL within 'links'. This is similar to how a search engine ranks a site higher depending on how many other sites link to it.

13 days ago · Santiago Trujillo

1 answer

You can do the following in an aggregation pipeline:

  1. $unwind links so that there is one document per link
  2. $sort on isoDate so that the earliest document for each link comes first
  3. $group by links to get the count within each group and the identifier of the first (earliest) document. In your example, title is used as the unique identifier.
  4. $match on count > 1 to keep only the links shared by more than one document
  5. $group again to dedupe the unique identifiers found in step 3
  6. $lookup the original documents and tidy the output with $unwind and $replaceRoot

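db.collection.aggregate([
  // 1. One document per (document, link) pair.
  {
    $unwind: "$links"
  },
  // 2. Earliest documents first.
  {
    $sort: {
      isoDate: 1
    }
  },
  // 3. Per link: title of the earliest document and number of documents.
  {
    $group: {
      _id: "$links",
      first: {
        $first: "$title"
      },
      count: {
        $sum: 1
      }
    }
  },
  // 4. Keep only links that appear in more than one document.
  {
    $match: {
      count: {
        $gt: 1
      }
    }
  },
  // 5. Dedupe the titles found in step 3.
  {
    $group: {
      _id: "$first"
    }
  },
  // 6. Fetch the original documents by title and return them as the root.
  {
    $lookup: {
      from: "collection",
      localField: "_id",
      foreignField: "title",
      as: "rawDocument"
    }
  },
  {
    $unwind: "$rawDocument"
  },
  {
    $replaceRoot: {
      newRoot: "$rawDocument"
    }
  }
])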

Here is the Mongo playground for your reference.
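
To sanity-check the logic without a database, the same steps can be mirrored in plain JavaScript. This is only a rough in-memory sketch, not part of the original answer: `findOrigins` is a hypothetical helper name, and the sample data is copied from the question as shown. Note that the two story2 documents share the same isoDate there, so which one counts as "first" depends on tie order. The sketch keeps input order on ties; in MongoDB, a $sort on isoDate alone does not guarantee an order between ties, so adding a secondary sort key such as _id is probably worth considering.

```javascript
// Sample documents copied from the question.
const docs = [
  { title: "story1_first", isoDate: "2022-01-01T00:00:00.000Z",
    links: ["www.first.com/article1", "www.anotherdomain.com"] },
  { title: "story1_mention", isoDate: "2022-01-10T00:00:00.000Z",
    links: ["www.first.com/article1", "www.somesite.com"] },
  { title: "story2_first", isoDate: "2022-01-20T00:00:00.000Z",
    links: ["www.newstory.com/article2", "www.anothercompany.com"] },
  { title: "story2_mention", isoDate: "2022-01-20T00:00:00.000Z",
    links: ["www.newstory.com/article2", "www.anothercompany.com"] },
];

// Hypothetical helper: returns the earliest document for every link
// that appears in more than one document.
function findOrigins(documents) {
  // Like $unwind: one row per (document, link) pair.
  const rows = documents.flatMap((doc) =>
    doc.links.map((link) => ({ link, doc }))
  );

  // Like $sort: ISO 8601 strings in the same format compare correctly
  // as plain strings. Array.prototype.sort is stable (ES2019), so rows
  // with equal dates keep their input order.
  rows.sort((a, b) => {
    if (a.doc.isoDate < b.doc.isoDate) return -1;
    if (a.doc.isoDate > b.doc.isoDate) return 1;
    return 0;
  });

  // Like the first $group: earliest document and count per link.
  const groups = new Map();
  for (const { link, doc } of rows) {
    const group = groups.get(link);
    if (group) {
      group.count += 1;
    } else {
      groups.set(link, { first: doc, count: 1 });
    }
  }

  // Like $match plus the second $group: keep shared links only and
  // dedupe by title. Holding the document itself replaces the $lookup.
  const origins = new Map();
  for (const { first, count } of groups.values()) {
    if (count > 1) origins.set(first.title, first);
  }
  return [...origins.values()];
}

const originTitles = findOrigins(docs).map((doc) => doc.title);
console.log(originTitles);
```

With the sample data above, this should log the two titles containing "first", matching the result the question asks for.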

13 days ago · Santiago Trujillo