Representing large amounts of data in memory

I am currently working on a project where I need to bring gigabytes of data to a client machine in order to perform some tasks. The tasks require the whole data set, because they perform data analysis that supports the decision-making process.

So the question is: what are the best methods and the most appropriate approach for managing this large amount of data in memory, without hampering the performance of the client machine or the application?

Note: while loading the application we may spend time transferring data from the database to the client machine, which is quite acceptable in our case. But once the data is loaded into the application at startup, performance is very important.

1 answer

It is a little difficult to answer without knowing the specific problems you are currently facing, but the following are some thoughts based on recent experience we had in a similar scenario. Be warned, though: it is a lot of work to move to a model of this type, so it also depends on how much you can invest in trying to "fix" it, and I can't promise that "your problems" are the same as "our problems", if you see what I mean. So don't be cross if the following approach does not work for you!


Loading that much data into memory will always have some effect, but I think I can see what you are doing...

When loading this amount of data you will, naively, have many (millions?) of objects and a similar or greater number of references. Obviously you will want to use x64, so those references add up; but in terms of performance, the biggest problem will be garbage collection. You have many objects that can never be collected, yet the GC will notice that you are using a ton of memory and will try periodically anyway. I have looked at this in more detail in the post below; the graph there shows the impact, and in particular those "spikes" are all the GC kicking in:

http://marcgravell.blogspot.co.uk/2011/10/assault-by-gc.html
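
For contrast, here is a minimal sketch (the names are invented here, not taken from the answer) of what that naive class-based model looks like: every row becomes a separate heap object with its own header, and every cross-reference is a pointer the GC must trace on each collection.

    // Hypothetical sketch of the naive class-based model described above:
    // every row is a separate heap allocation that the GC has to track.
    using System.Collections.Generic;

    class FooRecord
    {
        public int Id;
        public double Value;
        public FooRecord Related;   // object reference for the GC to chase
    }

    static void LoadNaively()
    {
        // Millions of these give the GC an enormous graph to walk
        // periodically, even though none of them can ever be collected.
        var records = new List<FooRecord>();
        for (int i = 0; i < 5000000; i++)
        {
            records.Add(new FooRecord { Id = i, Value = i * 0.5 });
        }
    }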

For this scenario (a huge amount of data, loaded once and never released), we switched to using structs, i.e. we loaded the data into:

    struct Foo
    {
        private readonly int id;
        private readonly double value;

        public Foo(int id, double value)
        {
            this.id = id;
            this.value = value;
        }

        public int Id { get { return id; } }
        public double Value { get { return value; } }
    }

and stored them directly in arrays (not lists):

 Foo[] foos = ... 

The significance of this is that some of these structs are quite large, so we did not want them being copied around on the stack all the time; with an array, you can avoid that:

    private void SomeMethod(ref Foo foo)
    {
        if (foo.Value == ...) { blah blah blah }
    }

    // call ^^^
    int index = 17;
    SomeMethod(ref foos[index]);

Note that we passed the value directly into the method; it was never copied, and foo.Value actually looks directly inside the array. The tricky bit starts when you need relationships between objects. You cannot store a reference here, since Foo is a struct. What you can store, however, is the index into the array. For example:

    struct Customer
    {
        ... // more not shown
        public int FooIndex { get { return fooIndex; } }
    }

Not as convenient as customer.Foo, but the following works beautifully:

    Foo foo = foos[customer.FooIndex];
    // or, when passing to a method:
    SomeMethod(ref foos[customer.FooIndex]);

Key points:

  • we now use half the size for "references" (an int is 4 bytes, an x64 reference is 8 bytes); see the rough arithmetic after this list
  • we do not have several million object headers in memory
  • we do not have a huge object graph for the GC to deal with, just a small number of arrays that the GC can scan incredibly quickly
  • but it is a little less convenient to work with, and requires some initial processing when loading
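
To make the first two points concrete, here is some rough back-of-envelope arithmetic. The figures are approximate x64 assumptions on my part (exact overheads depend on the runtime), not numbers from the answer:

    // Rough arithmetic, approximate x64 figures (my assumptions):
    // class-based:  ~16 bytes per-object overhead + 16 bytes of fields
    //               (the int is padded alongside the double) + an 8-byte
    //               reference in the array: about 40 bytes per record
    // struct array: 16 bytes per element, in one contiguous allocation
    const int count = 10000000;
    long classBytes = count * (16L + 16L + 8L); // roughly 380 MB, plus GC cost
    long structBytes = count * 16L;             // roughly 150 MB, one block
    System.Console.WriteLine("class-based ~" + (classBytes / (1024 * 1024))
        + " MB, struct array ~" + (structBytes / (1024 * 1024)) + " MB");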

Additional notes:

  • Strings are a killer; if you have millions of strings, that is problematic. At a minimum, if you have strings that repeat, make sure you do some kind of custom interning (not string.Intern, that would be bad) so that you only hold one instance of each repeated value, rather than 800,000 strings with the same content (a minimal interner sketch follows this list)
  • If you have repeated data of finite length, rather than sub-lists/arrays, you might consider a fixed array; this requires unsafe code, but avoids another set of objects and references (see the second sketch below)
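
The answer does not show an interner, so here is a minimal sketch of the kind of custom interning it describes; StringPool is a hypothetical name, not something from the answer. Unlike string.Intern, nothing here is pinned for the process lifetime, so the pool can simply be dropped once loading is finished:

    // Minimal custom-interning sketch (StringPool is a hypothetical name).
    using System.Collections.Generic;

    class StringPool
    {
        private readonly Dictionary<string, string> pool =
            new Dictionary<string, string>();

        public string Intern(string value)
        {
            if (value == null) return null;
            string existing;
            if (pool.TryGetValue(value, out existing))
                return existing;        // reuse the canonical instance
            pool.Add(value, value);     // first occurrence becomes canonical
            return value;
        }
    }

While loading, route every incoming string through pool.Intern(...); 800,000 identical values then collapse to a single shared instance.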
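And a sketch of the fixed-array idea (field names invented here; the project must be compiled with /unsafe):

    // Fixed-buffer sketch (hypothetical names; requires /unsafe).
    // The eight doubles live inline in the struct itself, rather than as a
    // separate heap array object plus a reference to it.
    unsafe struct Sample
    {
        public fixed double Values[8]; // inline storage, no extra object
    }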

As an additional footnote: with this much data, you should think seriously about your serialization protocol, i.e. how you send the data down the wire. I would strongly suggest staying away from things like XmlSerializer, DataContractSerializer or BinaryFormatter. If you want pointers on this topic, let me know.
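
The answer deliberately leaves the choice of serializer open, but as one illustration of the direction (my assumption, not the answer's recommendation), a simple length-prefixed binary format can be read straight into the Foo[] array with no per-record object allocation:

    // Hedged sketch (an assumption, not the answer's recommendation):
    // length-prefixed binary data read directly into the struct array.
    static Foo[] LoadFoos(System.IO.Stream stream)
    {
        using (var reader = new System.IO.BinaryReader(stream))
        {
            int count = reader.ReadInt32();   // record count written first
            Foo[] foos = new Foo[count];
            for (int i = 0; i < count; i++)
            {
                int id = reader.ReadInt32();
                double value = reader.ReadDouble();
                foos[i] = new Foo(id, value); // fill the array in place
            }
            return foos;
        }
    }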
