I need to write a bulk version of an operation that our webapp lets users perform on a more limited basis through the user interface. The desired operation is to assign objects to categories. A category can have multiple objects, but a given object can be in only one category.
Workflow for the task:
1) Using a browser, a file of the following form is uploaded:
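Roughly speaking, each line pairs an object identifier with a category identifier; the exact layout doesn't matter for this question, but think of something like the following (the identifiers are made up):

    # ObjectId, CategoryId
    Obj-1001, Cat-17
    Obj-1002, Cat-17
    Obj-1003, Cat-42
    ...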
The file will most likely have tens to hundreds of lines, but it could definitely have thousands of lines.
In an ideal world, a given object identifier would only occur once in the file (reflecting the fact that an object can be assigned to only one category). But since the file is created outside our control, there is no guarantee that this is actually true, and the processing has to deal with that possibility.
2) The server will receive the file, parse it, pre-process it, and show a page something like this:
    723 objects to be assigned to 126 categories
    142 objects not found
     42 categories not found

    Do you want to continue?  [Yes]  [No]
3) If the user clicks Yes, the server will actually do the work.
Since I don't want to parse the file in both steps (2) and (3), as part of (2) I need to build a container that will live across requests and hold a useful representation of the data, one that lets me easily supply the data to populate the preview page and lets me do the actual work efficiently. (While we obviously have sessions, we normally keep very little session state in memory.)
There is an existing API,
    assignObjectsToCategory(Set<ObjectId> objectIds, CategoryId categoryId)
which is used when assignment is done through the user interface. It is highly desirable that the bulk operation also use this API, since it performs a bunch of other business logic in addition to the basic assignment, and we need that same business logic to run when this bulk assignment is done.
Initially, it would have been acceptable, if the file "illegally" specified several categories for a given object, to simply assign the object to any one of the categories it was associated with in the file.
So I originally thought that in step (2), as I went through the file, I would build up and place in the cross-request container a Map<CategoryId, Set<ObjectId>> (specifically a HashMap, for quick lookup and insertion), and then when it was time to do the work I could just iterate over the map and, for each CategoryId, pull out the associated Set<ObjectId> and pass them to assignObjectsToCategory().
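A minimal sketch of that original plan, assuming some parsing step that yields an objectId and a categoryId per line (the parsing is not shown, and the variable names are just placeholders):

    Map<CategoryId, Set<ObjectId>> objectsByCategory = new HashMap<>();

    // Step (2): for each (objectId, categoryId) pair read from the file...
    objectsByCategory
        .computeIfAbsent(categoryId, k -> new HashSet<>())
        .add(objectId);

    // ...and in step (3), iterate the map and call
    // assignObjectsToCategory(set, categoryId) once per entry.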
However, the requirement for handling duplicate ObjectIds has changed, and they now need to be handled as follows:
- If an ObjectId appears multiple times in the file and is always associated with the same CategoryId, assign the object to that category.
- If an ObjectId appears multiple times in the file and is associated with different CategoryIds, consider that an error and mention it on the preview page.
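To make the two cases concrete (made-up identifiers, one object/category pair per line):

    Obj-1001, Cat-17
    Obj-1001, Cat-17    <- legal duplicate: same category both times, assign normally
    Obj-1002, Cat-17
    Obj-1002, Cat-42    <- illegal duplicate: conflicting categories, report on the preview page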
That seemed to spoil my Map<CategoryId, Set<ObjectId>> strategy, because it doesn't give me a good way to detect that the ObjectId I've just read from the file has already been associated with some CategoryId.
So my question is: what is the most efficient way to detect and track these duplicate ObjectIds?
What occurred to us was to use both a "forward" and a "reverse" map:
    public class CrossRequestContainer {
        ...
        Map<CategoryId, Set<ObjectId>> objectsByCategory;     // HashMap
        Map<ObjectId, List<CategoryId>> categoriesByObject;   // HashMap
        Set<ObjectId> illegalDuplicates;
        ...
    }
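The loading pass would then just extend the single-map version: as each (objectId, categoryId) pair is read from the file, it is inserted into both maps (line parsing again not shown, names are placeholders):

    objectsByCategory
        .computeIfAbsent(categoryId, k -> new HashSet<>())
        .add(objectId);
    categoriesByObject
        .computeIfAbsent(objectId, k -> new ArrayList<>())
        .add(categoryId);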
So every (ObjectId, CategoryId) pair read from the file ends up in both maps. After the file has been fully read, I can do:
    for (Map.Entry<ObjectId, List<CategoryId>> entry : categoriesByObject.entrySet()) {
        List<CategoryId> categories = entry.getValue();
        if (categories.size() > 1) {
            ObjectId object = entry.getKey();
            if (!all_categories_are_equal(categories)) {
                illegalDuplicates.add(object);
                // Since this is an "illegal" duplicate, I need to remove it
                // from every category it appeared with in the file.
                for (CategoryId category : categories) {
                    objectsByCategory.get(category).remove(object);
                }
            }
        }
    }
When this loop finishes, objectsByCategory will no longer contain any "illegal" duplicates, and illegalDuplicates will contain all the "illegal" duplicates to report as needed. I can then iterate over objectsByCategory, get the Set<ObjectId> for each category, and call assignObjectsToCategory() to do the assignments.
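That final step would be roughly the following (the empty-set check is my own addition, since the dedup pass could in principle strip every object out of a category's set):

    // Step (3): after the user clicks Yes, push each category's surviving
    // objects through the existing single-category API.
    for (Map.Entry<CategoryId, Set<ObjectId>> entry : objectsByCategory.entrySet()) {
        Set<ObjectId> objects = entry.getValue();
        if (!objects.isEmpty()) {
            assignObjectsToCategory(objects, entry.getKey());
        }
    }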
But while I think this will work, I'm worried about storing the data twice, especially when the input file is huge. I'm also worried that I'm missing something about efficiency and this will run very slowly.
Are there any ways to do this that don't use double the memory but can still run quickly? Or am I missing something such that, even with the double memory, it will still run a lot slower than I expect?