diff() between two collections in MongoDB

Question

I did some research. I apologize if this is a duplicate question, but the answers to the other questions did not quite fit my case, so I am asking a new one.

What is the best way to use JavaScript to compare two collections?

I have thousands of these headers in this Mongo document format:

    {
        "url": "google.com",
        "headers": {
            "location": "http://www.google.com/",
            "content-type": "text/html; charset=UTF-8",
            "date": "Mon, 25 Mar 2013 18:12:08 GMT",
            "expires": "Wed, 24 Apr 2013 18:12:08 GMT",
            "cache-control": "public, max-age=2592000",
            "server": "gws",
            "content-length": "219",
            "x-xss-protection": "1; mode=block",
            "x-frame-options": "SAMEORIGIN"
        }
    }

I ran my scraper today. I would like to run it again in the future and save the results in a second collection. In addition, I would like to be able to compare three specific header fields, namely server, x-aspnet-version and x-powered-by, and determine whether the integers in them have incremented.

What is the best way to iterate through the two collections and perform a diff()?

Am I doing it right? Any suggestions would be really appreciated.

javascript node.js diff mongodb

theGreenCabbage Mar 25 '13 at 18:16

1 answer

marr75 · Accepted Answer · 2013-03-25T23:49:34+0000

A few suggestions:

You can use a combination of the url and the scrape date (at least part of the datetime object) as the _id for these objects, since from what I can tell, you plan to scrape each URL once a month.

Example:

    {
        "_id": {
            "url": "www.google.com",
            "date": ISODate("2013-03-01")
        },
        // Other attributes
    }

This pays dividends in performance, uniqueness and querying (see this 4sq blog post). You can then run a query like:

    db.collection.find({
        "_id": {
            "$gte": { "url": yourUrl, "date": rangeStart },
            "$lt": { "url": yourUrl, "date": rangeEnd }
        }
    })

This returns a nicely sorted result set (by URL, then by date, which seems to be exactly what you want). You can also use this index to run covered queries (over the _id field only) if you just want a list of all the URLs and months that you have scraped (this could be useful for looping over each URL one at a time).
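As a minimal sketch of the range query above, the helper below builds the filter for one URL and one calendar month. The name monthRangeFilter and the choice of UTC month boundaries are assumptions made for illustration; they are not part of the original answer.

```javascript
// Hypothetical helper: builds the _id range filter for one URL and one month.
// Assumes _id is an embedded document { url, date } with the fields in that order.
function monthRangeFilter(url, year, month) {
  // month is 1-based (1 = January); boundaries are computed in UTC (an assumption).
  const rangeStart = new Date(Date.UTC(year, month - 1, 1));
  const rangeEnd = new Date(Date.UTC(year, month, 1)); // first day of the next month
  return {
    _id: {
      $gte: { url: url, date: rangeStart },
      $lt: { url: url, date: rangeEnd },
    },
  };
}

// Example: the filter for all scrapes of one URL in March 2013.
const filter = monthRangeFilter("www.google.com", 2013, 3);
console.log(JSON.stringify(filter, null, 2));
```

In the mongo shell the result could be passed as db.collection.find(filter), or as db.collection.find(filter, { _id: 1 }) to return only the _id values, which should let the _id index answer the query without reading the documents themselves.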

If there are specific attributes of the document that you want to compare (for example headers.server), and a specific comparison that you want to make on them (for example, any increase in the version number), I would use a regular expression to extract the parts related to the version number (a quick and dirty approach is to grab all the numeric elements) and plot them for each URL (I assume this would let you visualize changes in the server software over time). You can also easily report when any of these attributes has changed by scanning them in order and firing some event when the strings are not identical (perhaps then reporting the change, or just the numeric part of the change).
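The scan described above could look roughly like the following sketch. It assumes the snapshots for one URL have already been fetched in date order; the function names, the "direction" field and the sample data are all made up for illustration, and the digit-grabbing regular expression is the quick and dirty variant, not a real version parser.

```javascript
// The three headers the question wants to track.
const TRACKED_HEADERS = ["server", "x-aspnet-version", "x-powered-by"];

// Quick and dirty: pull every run of digits out of a header value.
// For example, "Apache/2.2.22 (Ubuntu)" yields [2, 2, 22].
function versionNumbers(value) {
  const matches = String(value || "").match(/\d+/g);
  return matches ? matches.map(Number) : [];
}

// Compare two numeric arrays component by component.
// Returns 1 if b is higher than a, -1 if it is lower, 0 if they are equal.
function compareVersions(a, b) {
  const length = Math.max(a.length, b.length);
  for (let i = 0; i < length; i++) {
    const left = a[i] || 0;
    const right = b[i] || 0;
    if (right > left) return 1;
    if (right < left) return -1;
  }
  return 0;
}

// Walk the snapshots of one URL in order and report every tracked header
// whose string value differs between consecutive snapshots.
function headerChanges(snapshots) {
  const changes = [];
  for (let i = 1; i < snapshots.length; i++) {
    const before = snapshots[i - 1].headers || {};
    const after = snapshots[i].headers || {};
    for (const name of TRACKED_HEADERS) {
      if (before[name] === after[name]) continue;
      changes.push({
        header: name,
        from: before[name],
        to: after[name],
        direction: compareVersions(
          versionNumbers(before[name]),
          versionNumbers(after[name])
        ),
      });
    }
  }
  return changes;
}

// Made-up sample data: two scrapes of the same URL.
const sample = [
  { headers: { server: "Apache/2.2.21" } },
  { headers: { server: "Apache/2.2.22" } },
];
console.log(headerChanges(sample));
```

Comparing the raw strings first keeps the change detection independent of the version parsing, so a change is still reported even when the numeric comparison cannot tell which value is newer (direction 0).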
