Efficiency of guava table and multiple hash maps

I came across some code that was doing something like this:

Map<String,String> fullNameById = buildMap1(dataSource1); Map<String,String> nameById = buildMap2(dataSource2); Map<String,String> nameByFullName = new HashMap<String,String>(); Map<String,String> idByName = new HashMap<String,String>(); Set<String> ids = fullNameById.keySet(); for (String nextId : ids) { String name = nameById.get(nextId); String fullName = fullNameById.get(nextId); nameByFullName.put(fullName, name); idByName.put(name, nextId); } 

I had to look at him for several minutes to understand what was happening. All this boils down to a connection operation by id and inversion of one of the source maps. Since Id, FullName and Name are always 1: 1: 1, it seemed to me that there should be some way to simplify this. I also found that the first two cards are never used again, and I find the code above is a little hard to read. So I'm thinking about replacing this with something like this, which (for me) reads a lot cleaner

 Table<String, String, String> relations = HashBasedTable.create(); addRelationships1(dataSource1, relations); addRelationships2(dataSource2, relations); Map<String,String> idByName = relations.column("hasId"); Map<String,String> nameByFullName = relations.column("hasName"); relations = null; // not used hereafter 

In addRelationships1 I do

 relations.put(id, "hasFullName", fullname); 

And in addRelationships2, where my query gives the values โ€‹โ€‹for id and name , I do

 relations.put(relations.remove(id,"hasFullName"), "hasName", name); relations.put(name, "hasId", id); 

So my questions are:

  • Is there any hidden inefficiency in what I did, either with a processor or memory, or with a GC load? I donโ€™t think so, but I am not familiar with table efficiency. I know that the Table object will not be GC'd after relations = null , I just want to report that it is not used again in a rather long section of the following code.
  • Have I Got Efficiency? I persuade and unconvincingly that I have and not.
  • Do you think this is more readable? Or is it just easy for me to read because I wrote it? I'm a little concerned about this front due to the fact that Table not very well known. On the other hand, at the top level, itโ€™s now quite clearly said: โ€œCollect data from two sources and make these two cards from them.โ€ I also like that this does not leave you wondering if / where two other cards are used (or not).
  • Do you have an even better, cleaner, faster and easier way to do this than any of the above?

Please do not optimize the early / late discussion here. I know this trap well. If it improves readability without sacrificing performance, I am satisfied. Improving efficiency would be a good bonus.

Note. my variable and method names were sanitized here to distract the business sphere from discussion, I will definitely not call them addRelationships1 or datasource1! Similarly, the final code will, of course, use constants, not raw strings.

+4
source share
2 answers

So, I did some mini-benchmarking and came up with the conclusion that the difference in the two methods in terms of runtime is small. I saved the total size of the data being processed constant using trade runs for the size of the data set. I made 4 runs and selected the lowest time for each implementation from all 4 runs. Reassuring both implementations have always been the fastest in the same run. My code can be found here . Here are my results:

 Case Maps (ms) Table (ms) Table vs Maps 100000 runs of size 10 2931 3035 104% 10000 runs of size 100 2989 3033 101% 1000 runs of size 1000 3129 3160 101% 100 runs of size 10000 4126 4429 107% 10 runs of size 100000 5081 5866 115% 1 run of size 1000000 5489 5160 94% 

Thus, for small datasets, using a table looks a little slower. Something interesting happens around 100,000, and then 1 million tables are actually faster. My data will be displayed in the range from 100 to 1000, so at least at runtime, the performance should be almost the same.

Regarding readability, I think that if someone tries to figure out what is happening nearby and reads the code, it will be much easier to see the intention. If they need to debug this bit of code, it can be a little more complicated, since Table is less common and requires some difficulty to understand.

Another thing I'm not sure about is whether it is more efficient to create hash maps or simply query the table directly in case of iteration of all the map keys. However, another question :)

And the comedic end is that in fact, when I analyzed the code further (hundreds of lines), I found that the only significant use of nameByFullname.get () outside of logging (of dubious value) is to pass the result to idByName.get () . Therefore, in the end, I will actually build an idByFullName map and an idByName map without requiring any attachment, and still drop the entire table. But that made for an interesting SO question, I think.

+16
source

tl; dr, but I'm afraid you need to take a bigger step from the original design. Simulating database tables can be a good exercise, but for me your code is not really readable.

  • Is there a hidden inefficiency in what I did ... I donโ€™t know.
  • Have I Got Efficiency? I'm afraid you need to measure this first. Of course, removing some indirectness helps, but using a more complex data structure can compensate for it. And performance is generally too complicated.
  • Do you think this is more readable? I'm afraid not.
  • Do you have an even better, cleaner, faster and easier way to do this than any of the above? I hope so....

Where I got lost in such code is using strings for everything - it's too easy to pass the wrong string as an argument. Therefore, I propose combining them into an object and providing maps for accessing objects through any part of them. Something as trivial as this should do:

 class IdNameAndFullName { String id, name, fullName; } class IdNameAndFullNameMaps { Map<String, IdNameAndFullName> byId; Map<String, IdNameAndFullName> byName; Map<String, IdNameAndFullName> byFullName; } 

Obviously, you replace the IdNameAndFullNameMaps class with Table . However, in addition to using a good existing data structure, I do not see any advantages in it. Disadvantages:

  • loss of effectiveness
  • loss of readability (I would not use Table here for the same reason Tuple should be avoided )
  • using string keys (your "hasId" and "hasName").
+5
source

All Articles