Efficiently finding duplicates in a constrained many-to-many data set?

I need to write a bulk version of an operation that our webapp already lets you do, on a more limited basis, from the user interface. The desired operation is to assign objects to a category. A category can have several objects, but a given object can be in only one category.

Workflow for the task:

1) Using a browser, a file of the following form is uploaded:

    # ObjectID, CategoryID
    Oid1, Cid1
    Oid2, Cid1
    Oid3, Cid2
    Oid4, Cid2
    [etc.]
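Parsing is not the interesting part, but for concreteness, each data line is just an (ObjectID, CategoryID) pair separated by a comma. A minimal, purely illustrative sketch of how the lines could be split (error handling omitted; none of these names exist in our code):

    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Purely illustrative: turn each non-comment line into a raw (ObjectID, CategoryID) string pair.
    static List<Map.Entry<String, String>> parsePairs(List<String> lines) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and the header comment
            }
            String[] parts = trimmed.split(",");
            pairs.add(new SimpleEntry<>(parts[0].trim(), parts[1].trim()));
        }
        return pairs;
    }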

A file will most likely have tens to hundreds of lines, but it can definitely have thousands of lines.

In an ideal world, a given ObjectID would occur only once in the file (which reflects the fact that an object can be assigned to only one category). But since the file is created outside our control, there is no guarantee that this is actually true, and the processing has to deal with that possibility.

2) The server receives the file, parses it, pre-processes it and shows a preview page like this:

    723 objects to be assigned to 126 categories
    142 objects not found
    42 categories not found

    Do you want to continue?  [Yes]  [No]

3) If the user clicks Yes, the server actually does the work.

Since I do not want to parse the file in both steps (2) and (3), I need, as part of (2), to build a container that will live across requests and hold a useful representation of the data, one that lets me easily supply the data to fill the preview page and lets me do the actual work efficiently. (While we obviously have sessions, we normally keep very little session state in memory.)

There is an existing

    assignObjectsToCategory(Set<ObjectId> objectIds, CategoryId categoryId)

which is used when the assignment is done through the user interface. It is highly desirable that the bulk operation also use this API, since it performs a bunch of other business logic besides the simple assignment, and we need that same business logic to run when this bulk assignment is done.

Initially, it would have been fine if the file "illegally" specified several categories for a given object: it would have been acceptable to assign the object to any one of the categories associated with it in the file.

So I initially thought that in step (2), as I went through the file, I would build up and place in the cross-request container a Map<CategoryId, Set<ObjectId>> (specifically a HashMap, for fast lookup and insertion), and then, when it was time to do the work, I could just walk the map, pull out the associated Set<ObjectId> for each CategoryId and pass it to assignObjectsToCategory().
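Concretely, something like this sketch is what I had in mind, assuming parsedPairs is the list of (ObjectId, CategoryId) pairs read out of the file (the names are illustrative only):

    // Sketch of the original plan: build only the forward map as the file is read.
    // Assumes parsedPairs is a List<Map.Entry<ObjectId, CategoryId>> produced from the file.
    Map<CategoryId, Set<ObjectId>> objectsByCategory = new HashMap<>();

    for (Map.Entry<ObjectId, CategoryId> pair : parsedPairs) {
        objectsByCategory
            .computeIfAbsent(pair.getValue(), k -> new HashSet<>())  // the CategoryId
            .add(pair.getKey());                                     // the ObjectId
    }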

However, the requirement for handling duplicate ObjectIds has changed. They now need to be handled as follows:

  • If an ObjectId appears several times in the file and is always associated with the same CategoryId, assign the object to that category.
  • If an ObjectId appears several times in the file and is associated with different CategoryIds, consider that an error and report it on the preview page.

This seems to mess up my Map<CategoryId, Set<ObjectId>> strategy, because it does not provide a good way to detect that the ObjectId I just read out of the file is already associated with a CategoryId.

So my question is: what is the most efficient way to detect and track these duplicate ObjectIds?

What occurred to me is to use both a "forward" map and a "reverse" map:

    public class CrossRequestContainer {
        ...
        Map<CategoryId, Set<ObjectId>> objectsByCategory;     // HashMap
        Map<ObjectId, List<CategoryId>> categoriesByObject;   // HashMap
        Set<ObjectId> illegalDuplicates;
        ...
    }

Then, as each (ObjectId, CategoryId) pair is read, it would be put into both maps. After the file has been fully read, I could do:

    for (Map.Entry<ObjectId, List<CategoryId>> entry : categoriesByObject.entrySet()) {
        List<CategoryId> categories = entry.getValue();
        if (categories.size() > 1) {
            ObjectId object = entry.getKey();
            if (!all_categories_are_equal(categories)) {
                illegalDuplicates.add(object);
                // Since this is an "illegal" duplicate, I need to remove it
                // from every category it appeared with in the file.
                for (CategoryId category : categories) {
                    objectsByCategory.get(category).remove(object);
                }
            }
        }
    }

When this loop completes, objectsByCategory will no longer contain any "illegal" duplicates, and illegalDuplicates will contain all the "illegal" duplicates to report as needed. I can then iterate over objectsByCategory, get the Set<ObjectId> for each category, and call assignObjectsToCategory() to perform the assignments.
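That final step would then just be a loop like this sketch (again, purely illustrative):

    // Step (3): hand each category's surviving objects to the existing single-category API.
    for (Map.Entry<CategoryId, Set<ObjectId>> entry : objectsByCategory.entrySet()) {
        if (!entry.getValue().isEmpty()) {   // a category may have lost all its objects to illegalDuplicates
            assignObjectsToCategory(entry.getValue(), entry.getKey());
        }
    }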

But while I think this will work, I am concerned about storing the data twice, especially when the input file is huge. And I am also concerned that I am missing something regarding efficiency, and that this will run very slowly.

Are there ways to do this that do not use double the memory but can still run quickly? Am I missing something that, even with double the memory, will still make this run a lot slower than I expect?

1 answer

Given the constraints you have described, I do not see a way to do this using much less memory.

One possible optimization, though, is to maintain lists of categories only for those objects that are listed in multiple categories, and otherwise just map each object to its single category, i.e.:

    Map<CategoryId, Set<ObjectId>> objectsByCategory;   // HashMap
    Map<ObjectId, CategoryId> categoryByObject;         // HashMap
    Map<ObjectId, Set<CategoryId>> illegalDuplicates;   // HashMap

Yes, this adds yet another container, but it will (hopefully) hold only a few entries; it also reduces the memory requirements of the categoryByObject map (cutting out one list's overhead per entry).

The logic is a little more complicated, of course: when a duplicate is first detected, the object has to be removed from the categoryByObject map and added to the illegalDuplicates map, and before adding an object to the categoryByObject map you first have to check the illegalDuplicates map.
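A minimal sketch of that per-pair logic, assuming ObjectId and CategoryId have sensible equals()/hashCode() (the method name here is just illustrative):

    // Called for each (ObjectId, CategoryId) pair as the file is read.
    void addPair(ObjectId object, CategoryId category) {
        Set<CategoryId> conflicting = illegalDuplicates.get(object);
        if (conflicting != null) {
            conflicting.add(category);               // already known illegal; just record the category
            return;
        }
        CategoryId existing = categoryByObject.get(object);
        if (existing == null) {
            categoryByObject.put(object, category);  // first time this object is seen
        } else if (!existing.equals(category)) {
            // Conflicting category: move the object out of categoryByObject into illegalDuplicates.
            categoryByObject.remove(object);
            Set<CategoryId> categories = new HashSet<>();
            categories.add(existing);
            categories.add(category);
            illegalDuplicates.put(object, categories);
        }
        // else: same category seen again -- a harmless repeat, nothing to do.
    }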

Finally, it probably will not hurt performance to build the objectsByCategory map in a separate loop, after the other two maps have been built, and it will simplify the code a bit.
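For example, something along these lines:

    // Second pass, after the whole file has been read: invert categoryByObject into
    // the per-category sets that assignObjectsToCategory() expects.
    for (Map.Entry<ObjectId, CategoryId> entry : categoryByObject.entrySet()) {
        objectsByCategory
            .computeIfAbsent(entry.getValue(), k -> new HashSet<>())
            .add(entry.getKey());
    }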
