Data structure issue

I have a database table with many rows and one numeric column, and I want to represent this data in memory. I could just use one large integer array, and it will be very fast, but the number of rows may be too large for that.

Most rows (over 99%) have a value of 0. Is there an efficient data structure that I could use that will allocate memory only for rows with nonzero values โ€‹โ€‹and will be almost as fast as an array?

Update: As an example, I tried using Hashtable by reading the source table and adding non-zero values โ€‹โ€‹entered by the row number in the source table. I got a value with a function that returned 0 if the requested index was not found, otherwise the value is in a Hashtable. This works, but slower than dirt compared to a regular array - I may not be doing it right.

Update 2: here is a sample code.

private Hashtable _rowStates; private void SetRowState(int rowIndex, int state) { if (_rowStates.ContainsKey(rowIndex)) { if (state == 0) { _rowStates.Remove(rowIndex); } else { _rowStates[rowIndex] = state; } } else { if (state != 0) { _rowStates.Add(rowIndex, state); } } } private int GetRowState(int rowIndex) { if (_rowStates.ContainsKey(rowIndex)) { return (int)_rowStates[rowIndex]; } else { return 0; } } 
+4
source share
8 answers

I would expect a map / dictionary / hash table of non-zero values โ€‹โ€‹to be a quick and economical solution.

In Java, using the Hashtable class, a lock will be entered, since it must be thread safe. Perhaps something like this slowed down your implementation.

--- update: using Google-fu, it can be assumed that C # Hashtable carries the overhead of thread safety. Try a dictionary instead.

+2
source

This is an example of a rare data structure and there are many ways to implement such sparse arrays (or matrices) - it all depends on how you intend to use them. Two strategies are possible:

  • Keep only non-zero values. For each element other than zero, store a pair (index, value), all other values โ€‹โ€‹are known to be zero by default. You will also need to keep the total number of items.
  • Compression of consecutive zero values. Store several pairs (quantity, value). For example, if you have 12 zeros in a row, followed by 200 and another 22 zeros, then save (12, 0), (1, 200), (22, 0).
+3
source

How exactly you are not going to implement it, depends on your requirements, this is a compromise between memory and speed. A pure integer array is the fastest, with a constant search for complexity.

Using a hash-based collection like Hashtable or Dictionary (Hashtable seems to be slower, but thread-safe - as others have pointed out) will give you very low memory usage for a sparse data structure like yours, but can be slightly more expensive when doing a search. You store a key-value pair for each index and non-zero value.

You can use ContainsKey to find out if a key exists, but it is much faster to use TryGetValue to check and fetch data at a time. For dense data, it may be useful to catch exceptions for missing elements, as this will only lead to value in the exceptional case, and not to every search.

Edited again when I am confused - this will teach me to publish when I should sleep.

+1
source

You pay a boxing penalty using Hashtable. Try switching to Dictionary<int, int> . Also, how many lines are we talking about - and how fast do you need it?

+1
source

Create an integer array for non-zero values โ€‹โ€‹and indicators for storing bit arrays if a specific string contains a non-zero value.

Then you can find the required element in the first array summing the bits in the second array, starting from the index position from 0 to the row.

0
source

I am not sure about the effectiveness of this solution, but you can try. Therefore, it depends on which scenario you will use, but I will write here two of them that I have in mind. The first solution is if you have only one integer, you can simply use a common list of integers:

 List<int> myList = new List<int>(); 

The second is almost the same, but you can create a list of your own type, for example, if you have two fields, count and a nonzero value, you can create a class that will have two properties, and then you can create a list of your class and store information in it. But you can also try generic link lists. Thus, the code for solution two can be like this:

 public class MyDbFields { public MyDbFields(int count, int nonzero) { Count = count; NonZero = nonzero; } public int Count { get; set; } public int NonZero { get; set; } } 

Then you can create a list like this:

 List<MyDbFields> fields_list = new List<MyDbFields>(); 

and then fill it with data:

 fields_list.Add(new MyDbFields(100, 11)); 

I'm not sure if this will completely help you solve your problem, but just my suggestion.

0
source

If I understand correctly, you cannot just select non-zero rows, because for each row index (or PK value), your data structure will need to be able to report not only the value, but also whether it is there at all. So, assuming 0, if you didn't find it in your data structure, it might not be a good idea.

Just to make sure - exactly how many lines are we talking about here? Millions? A million integers would occupy only 4 MB of RAM as an array. Not really. I think it should be at least 100'000'000 rows.

Basically, I would suggest a sorted array of integer pairs to store non-zero values. The first element in each pair will be the PK value, and this is what the array will sort. The second element will be the value. You can choose a database that, of course, returns only these nonzero values. Since the array will be sorted, you can use binary search to find your values.

If there are no "holes" in the PK values, then the only thing you need, besides this, is the minimum and maximum PK values โ€‹โ€‹so that you can determine if the given index belongs to your data set.

If unused PK values โ€‹โ€‹are used between used ones, you will need a different mechanism to determine which PK values โ€‹โ€‹are valid. Perhaps a bitmask or other array of real (or invalid, whichever is less) PK values.

If you choose the bit mask method, there is another idea. Use two bits for each PK value. The first bit indicates whether the PK value is valid or not. The second bit indicates whether it is equal to zero or not. Store all nonzero values โ€‹โ€‹in another array. This, however, will have the disadvantage that you will not know which element of the array corresponds to which bitmask. You would have to count all the way from the start to find out. This can be mitigated by some indices. Let's say for every 1000 entries in the value array you store another integer that tells you where this entry is in the bitmask.

0
source

Perhaps you are looking for the wrong area - all you store for each value is the row number of the database row, which suggests that maybe you are just using this to retrieve the row? Why not try indexing the table in a numerical column - this will provide quick access to the table rows for any given numerical value (which, apparently, is the ultimate goal here?) If it is still too slow, you can move the index itself into memory and etc. I want to say that your database can solve this problem more elegantly than you can.

0
source

All Articles