Recursive data processing performance using Java and SQLite

Even if your answer is not specific to Java / SQLite, I would be happy to read it.

Environment

I store items in a database with the following schema:

###################
# Item
###################
# _id            # This is the primary key
# parent_id      # If set, it is the ID of the item containing this item
# date           # An ordinary date
# geocontext_id  # Foreign key to a pair of named coordinates
###################

###################
# Geocontext
###################
# _id   # This is the primary key
# name  # A way for the user to label a pair of coordinates (e.g. "home", "work")
# x     # One of the coordinates
# y     # The other one
###################
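For concreteness, this is roughly how the two tables could be declared on Android; the column types are my own guesses for illustration, not something I am committed to:

    import android.database.sqlite.SQLiteDatabase;

    // Rough Android declaration of the two tables above; the column types
    // are illustrative guesses, not a requirement.
    public class Schema {
        public static void createTables(SQLiteDatabase db) {
            db.execSQL("CREATE TABLE geocontext ("
                    + "_id INTEGER PRIMARY KEY, "
                    + "name TEXT, "   // user label, e.g. "home", "work"
                    + "x REAL, "      // one of the coordinates
                    + "y REAL)");     // the other one

            db.execSQL("CREATE TABLE item ("
                    + "_id INTEGER PRIMARY KEY, "
                    + "parent_id INTEGER REFERENCES item(_id), "            // containing item, if any
                    + "date INTEGER, "                                       // an ordinary date
                    + "geocontext_id INTEGER REFERENCES geocontext(_id))"); // named coordinates
        }
    }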

Problem

I have to filter elements according to geocontext and date. It would be an easy task if all the objects were on the same level, but the tricky part is that the structure is recursive. E.g.:

 root
  |_ item 1
  |_ item 2
  |   |_ item 4
  |   |_ item 5
  |       |_ item 6
  |_ item 3
  |   |_ item 8
  |   |_ item 10
  |_ item 11
  |   |_ item 12
  |_ item 7

There is no explicit limit to recursive depth.

Now, if we are inside any node and filter on the date "April 1", we should see not only the elements directly contained in that node that match the date, but also the elements that contain elements matching the date.

E.g.: suppose we are in "item 2". If "item 6" matches the date, then we consider that "item 5" matches it too, and we must keep it. If we are at the root, then "item 2" should be displayed.
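To make that rule concrete, here is roughly what the naive, brute-force check would look like in Java. The Item class, its fields and the method names below are simplified placeholders, not my real model:

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    // Placeholder model, just enough to show the rule.
    class Item {
        long date;
        List<Item> children = new LinkedList<Item>();
    }

    class DateFilter {
        // An item is kept if it matches the date itself,
        // or if any of its descendants does.
        static boolean matchesDeep(Item item, long date) {
            if (item.date == date) {
                return true;
            }
            for (Item child : item.children) {
                if (matchesDeep(child, date)) {
                    return true;
                }
            }
            return false;
        }

        // What the user sees at the current level: a flat list of its
        // direct children, filtered with the recursive rule above.
        static List<Item> visibleChildren(Item current, long date) {
            List<Item> visible = new ArrayList<Item>();
            for (Item child : current.children) {
                if (matchesDeep(child, date)) {
                    visible.add(child);
                }
            }
            return visible;
        }
    }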

The same applies to the geocontext, but it is even more complicated because:

  • It is stored in another table.
  • Matching the context is an expensive math calculation.

Of course, brute-forcing the matching would slow the software down and lead to a very bad user experience.

Note: I do not need to display the tree. I display a flat list of filtered data coming from a tree; we should see only the top-level elements. The challenge is to decide whether or not to display each element, depending on all of its children in the hierarchy.

How I tried to solve it

I thought I could alleviate the problem a bit by using more tables to cache flat data:

 ###################
 # Geocontext_cache
 ###################
 # item_id        # I can join the items table on this field
 # child_id       # I can delete / update a child, and so delete / update the cache
 # geocontext_id  # I can delete / update a geocontext, and so delete / update the cache
 # x              # Here, I can brute force :-)
 # y              #
 ###################

 ###################
 # Date_cache
 ###################
 # item_id   #
 # child_id  #
 # date      #
 ###################

This seems reasonable, but I have not tried it yet. It would, however, have the following disadvantages:

  • I shift the expensive processing to set / create / delete, which would have to maintain the cached data. That will be nasty code to write and maintain. An element five levels deep would trigger a recursive update of its five parents.

  • The size of the database can become HUGE. An element five levels deep caches data for its five ancestors. I don't know whether this matters, since it is a single-user application with manual input. I don't think anyone will ever enter more than 1000 items with more than 10 levels of depth.

Now the good news: we go from the bottom of the pyramid to the top, not the other way around as it might seem. Dealing with the deletion of a parent element will be another nice headache, but I'll keep that for another question ;-).

Now my question

How would you store the data and handle the filtering in the most efficient way?

Additionally:

Should I set an explicit limit on the recursion depth? Should I perform the filtering in SQL or in Java? SQL would certainly be faster, but the geocontext matching is much easier to do in Java.

Since I work on the Android platform, I have the following limitations:

  • Java is the only language available, and not even with the entire standard library.

  • SQLite is the only DBMS available.

  • Performance and memory are important issues. If a choice has to be made, battery life, and therefore performance, is the priority.

  • Exotic external libraries may not be available.

PS: I dug around SO and found interesting pieces of information (especially What is the most efficient / elegant way to parse a flat table into a tree?). They gave me hints, but not a solution to this problem.

+4
4 answers

1) First, let's look at how feasible it is to simply put everything in memory. It is a simple, flexible and, above all, fast solution. The downsides are that you have to read everything into memory at startup (give the user a pretty loading screen and they won't even notice), and that you may have to do a little work to get everything flushed back to disk at the right moments so that no data is lost.

In this analysis I make some general assumptions about Android / Dalvik that I don't actually know for sure, so I hope it is reasonably accurate :) Remember that the G1 has 192 MB of RAM. Also, your own estimate above was roughly 1000 items at most.

 Object superclass                          ~  8 bytes
 parent/child pointer                       ~  4 bytes
 date (long)                                ~  8 bytes
 name (non-interned string, avg 32 chars)   ~ 64 bytes
 x point (int)                              ~  4 bytes
 y point (int)                              ~  4 bytes
 Total                                      =  92 bytes
 + possible memory alignment + fudge factor = 128 bytes

  1000 items = 125 kB
 10000 items = 1.22 MB

Note: I realize that while a child can have only one parent pointer, a parent can have several children. However, the total number of parent->child pointers is (items - 1), so the average cost of the parent->child pointers is (items - 1) / items ≈ 1 per item, or 4 bytes. This assumes a child collection that does not allocate unused memory, such as a LinkedList (as opposed to an ArrayList).
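To make the accounting concrete, the kind of node I am pricing above might look something like this; the names are purely illustrative:

    import java.util.LinkedList;
    import java.util.List;

    // One node of the in-memory tree priced above. Field names are illustrative.
    class Node {
        Node parent;                                        // the "parent/child pointer" (~4 bytes amortized)
        List<Node> children = new LinkedList<Node>();       // no unused slots, unlike ArrayList
        long date;                                          // ~8 bytes
        String name;                                        // geocontext label, avg 32 chars, non-interned
        int x;                                              // ~4 bytes
        int y;                                              // ~4 bytes
    }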

2) The nerd in me says this would be a fun place to use a B+-tree, but I think that is overkill for what you need at the moment :) However, whatever solution you end up with, if you do not keep everything in memory you will definitely want to cache as many of the upper levels of the tree in memory as possible. That can dramatically cut down disk activity.

3) If you do not want to go the all-in-memory route, another possible solution is the following. Bill Karwin describes a rather nifty RDBMS structure called the Closure Table that optimizes tree-based reads at the price of more complicated writes. Combining it with a cache of the top levels could give you performance benefits, although I would benchmark it before giving my word:

When evaluating a match, use what you have in memory to evaluate as many of the children as you can. For the children that are not in memory, use a SQL join between the closure table and the flat item table, with an appropriate WHERE clause, to find out whether there are any matching children. If there are, that node belongs in the result list.

Hope this all makes sense; it sounds like it should work for what you need.

+5

I followed Sonil's advice and tried the closure table. I added the following table:

 ################
 # Closure
 ################
 # ancestor_id  #
 # item_id      #
 ################

If, like me, you have never used this model before, here is how it works:

You add a row for each direct or indirect relationship in the hierarchy. If C is a child of B and B is a child of A, then you have:

 ancestor   item
    B        C
    A        B
    A        C     # you add the indirect relationship
    A        A
    B        B
    C        C     # don't forget: any item is in a relationship with itself

However, with this schema you are missing an important piece of information: which relationships are direct? What if you only want the direct children of an item?

To fix that, you can either add a boolean is_direct column to the closure table, or simply keep the parent_id column in the item table. That is what I did, because it saved me from rewriting a lot of my earlier code.

The good part is that I can now check whether an item matches the date or the geocontext in a single query.

E.g., if I am browsing all the elements contained in item number 4, and I want only those that match, or that contain children matching, the date D:

 SELECT ti.parent_id, ti._id, ti.title
 FROM item AS di                      -- item to filter by date
 JOIN closure AS c                    -- closure table
   ON (di._id = c.item_id)
 JOIN item AS ti                      -- top item to display
   ON (c.ancestor_id = ti._id)
 WHERE di.date = D                    -- here you filter by date
   AND ti.parent_id = 4               -- here you ensure you get only the top items
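For what it's worth, this is roughly how I run that query on Android, with the literal values replaced by bound parameters (simplified, error handling omitted):

    import android.database.Cursor;
    import android.database.sqlite.SQLiteDatabase;

    import java.util.ArrayList;
    import java.util.List;

    public class FilteredList {

        // Same query as above, with bound parameters. Note: without DISTINCT the
        // same top item can come back once per matching descendant, so either
        // deduplicate here or add DISTINCT if that matters to you.
        private static final String QUERY =
                "SELECT ti.parent_id, ti._id, ti.title "
                + "FROM item AS di "
                + "JOIN closure AS c ON (di._id = c.item_id) "
                + "JOIN item AS ti ON (c.ancestor_id = ti._id) "
                + "WHERE di.date = ? AND ti.parent_id = ?";

        // Returns the titles of the top items under parentId that match the date
        // themselves or contain a descendant that matches it.
        public static List<String> topItemTitles(SQLiteDatabase db, long date, long parentId) {
            List<String> titles = new ArrayList<String>();
            Cursor cursor = db.rawQuery(QUERY,
                    new String[] { String.valueOf(date), String.valueOf(parentId) });
            try {
                while (cursor.moveToNext()) {
                    titles.add(cursor.getString(2)); // column 2 is ti.title
                }
            } finally {
                cursor.close();
            }
            return titles;
        }
    }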

So I can throw away all my *_cache tables. I still have quite a lot of work to do on UPDATE / DELETE / CREATE, but everything is centralized, and most of it is procedural rather than recursive. Pretty awesome.

The only pain is that I have to attach a new element to every one of its ancestors. But getting the ancestors is a single query, so it is really reasonable. And of course the closure table takes up a lot of space, but in my case I don't care. Don't forget to index it if you are after performance...
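Here is roughly what that maintenance looks like on creation, assuming the new item's parent is already known; for a root item only the self-row applies. The class and method names are just mine for illustration:

    import android.database.sqlite.SQLiteDatabase;

    public class ClosureMaintenance {

        // Called right after inserting a new item whose parent is parentId.
        public static void linkNewItem(SQLiteDatabase db, long newItemId, long parentId) {
            // Link the new item to all of its ancestors in one shot: every ancestor
            // of the parent (including the parent itself, thanks to its self-row)
            // becomes an ancestor of the new item.
            db.execSQL(
                    "INSERT INTO closure (ancestor_id, item_id) "
                    + "SELECT ancestor_id, ? FROM closure WHERE item_id = ?",
                    new Object[] { newItemId, parentId });

            // And don't forget: any item is in a relationship with itself.
            db.execSQL(
                    "INSERT INTO closure (ancestor_id, item_id) VALUES (?, ?)",
                    new Object[] { newItemId, newItemId });
        }
    }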

I love this SQL trick, thanks a lot guys! It is a bit hard to grasp at first, but so obvious once you have done it.

+2

It may be off-topic, but... have you thought about using serialization?

Google Protocol Buffers can serialize data very efficiently (in both time and space); you would then need to build a suitable tree structure (look in any CS book) to help with the searching.

I mention Protocol Buffers because, being a Google library, they should be usable on Android.

Just a thought.

+1

AFAICT you can use hierarchical queries (google for "CONNECT BY" "START WITH") in SQLite ...

-1
