The in-memory search index for our application takes up too much memory - any suggestions?

In our desktop application, we implemented a simple search engine using an inverted index.

Unfortunately, some of our users' datasets can be very large, occupying, for example, ~1 GB of memory before the inverted index is even created. The inverted index itself then takes up almost as much memory as the indexed data (another ~1 GB of RAM).

Obviously this causes out-of-memory problems, since we run into the 32-bit Windows limit of 2 GB of memory per application, and users with smaller machines struggle to meet the memory requirement.

Our inverted index is stored as:

Dictionary<string, List<ApplicationObject>> 

It is built during data loading as each object is processed: the words from each ApplicationObject's keyword string and description are added to the inverted index.
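Roughly, the build looks like this (a simplified sketch; the Keyword and Description properties here stand in for the real fields):

    // Simplified sketch of how the inverted index is populated.
    class SearchIndex
    {
        private readonly Dictionary<string, List<ApplicationObject>> index =
            new Dictionary<string, List<ApplicationObject>>(StringComparer.OrdinalIgnoreCase);

        public void Add(ApplicationObject obj)
        {
            // Split the keyword string and description into individual words.
            var words = (obj.Keyword + " " + obj.Description)
                .Split(new[] { ' ', ',', '.', ';' }, StringSplitOptions.RemoveEmptyEntries);

            foreach (string word in words)
            {
                List<ApplicationObject> postings;
                if (!index.TryGetValue(word, out postings))
                    index[word] = postings = new List<ApplicationObject>();
                postings.Add(obj);
            }
        }

        public IEnumerable<ApplicationObject> Find(string word)
        {
            List<ApplicationObject> postings;
            return index.TryGetValue(word, out postings)
                ? postings
                : (IEnumerable<ApplicationObject>)new ApplicationObject[0];
        }
    }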

So my question is: is there a way to store the search index more space-efficiently? Perhaps a different structure or strategy is needed? Alternatively, could I create some kind of CompressedDictionary? Since it stores a lot of strings, I would expect it to be highly compressible.

+6
optimization c# memory search search-engine
7 answers

If it's 1 GB... put it on disk. Use something like Berkeley DB. It will still be very fast.

Here is a project that provides a .NET interface for it:

http://sourceforge.net/projects/libdb-dotnet
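
To illustrate the idea (this is not the libdb-dotnet API, just a hypothetical key-value facade over whatever embedded store you pick): keep only object IDs in each posting list and write each list to disk as a record keyed by the word.

    // Hypothetical key-value facade - substitute the real Berkeley DB calls.
    interface IKeyValueStore
    {
        void Put(string key, byte[] value);
        byte[] Get(string key);               // null if the key is absent
    }

    static class DiskPostings
    {
        // Save a word's posting list as a packed array of object IDs.
        public static void Save(IKeyValueStore store, string word, List<int> objectIds)
        {
            var bytes = new byte[objectIds.Count * sizeof(int)];
            Buffer.BlockCopy(objectIds.ToArray(), 0, bytes, 0, bytes.Length);
            store.Put(word, bytes);
        }

        // Load it back; only the lists for the words actually searched stay in RAM.
        public static List<int> Load(IKeyValueStore store, string word)
        {
            byte[] bytes = store.Get(word);
            if (bytes == null) return new List<int>();
            var ids = new int[bytes.Length / sizeof(int)];
            Buffer.BlockCopy(bytes, 0, ids, 0, bytes.Length);
            return new List<int>(ids);
        }
    }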

+3

I see several solutions:

  • If you have the ApplicationObjects in an array, store just the index into that array rather than the object reference - it may be smaller (see the sketch after this list).
  • You could use a bit of C++/CLI to store the dictionary, with the strings kept as UTF-8.
  • Rather than storing all the distinct strings separately, use a trie.
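
A rough sketch of the first point, assuming the objects are already held in a single array (the LoadObjects and ExtractWords helpers are hypothetical): the posting lists then hold small integer positions instead of object references, which are easy to narrow or delta-encode later.

    // All objects live in one array; the index stores positions, not references.
    ApplicationObject[] allObjects = LoadObjects();            // hypothetical loader

    var index = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);

    for (int i = 0; i < allObjects.Length; i++)
    {
        foreach (string word in ExtractWords(allObjects[i]))   // hypothetical tokenizer
        {
            List<int> postings;
            if (!index.TryGetValue(word, out postings))
                index[word] = postings = new List<int>();
            postings.Add(i);   // store the array position instead of the reference
        }
    }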
+3

I suspect you will find that you have a lot of very small lists.

I suggest you find out what the frequency distribution is: how many of your dictionary entries have a one-element list, how many have two-element lists, and so on. You could potentially keep several separate dictionaries - one for "I've only got one element" (a direct mapping to the element), then one for "I've got two elements" (mapping to a Pair struct with two references in it), and so on until it becomes silly - quite possibly at around 3 entries - at which point you fall back to normal lists. Encapsulate the whole lot behind a simple interface (add entry / retrieve entries). That way you'll have far less wasted space (mostly empty buffers, counts, etc.).

If that doesn't make sense, let me know and I'll try to come up with some code.
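
A rough sketch of the scheme described above might look like this (just an illustration, with KeyValuePair standing in for the Pair struct):

    // Posting storage that specializes the very common 1- and 2-element cases.
    class CompactPostings
    {
        private readonly Dictionary<string, ApplicationObject> single =
            new Dictionary<string, ApplicationObject>();
        private readonly Dictionary<string, KeyValuePair<ApplicationObject, ApplicationObject>> pair =
            new Dictionary<string, KeyValuePair<ApplicationObject, ApplicationObject>>();
        private readonly Dictionary<string, List<ApplicationObject>> many =
            new Dictionary<string, List<ApplicationObject>>();

        public void Add(string word, ApplicationObject obj)
        {
            List<ApplicationObject> list;
            ApplicationObject one;
            KeyValuePair<ApplicationObject, ApplicationObject> two;

            if (many.TryGetValue(word, out list))
            {
                list.Add(obj);
            }
            else if (pair.TryGetValue(word, out two))
            {
                // Promote 2 entries -> general list.
                pair.Remove(word);
                many[word] = new List<ApplicationObject> { two.Key, two.Value, obj };
            }
            else if (single.TryGetValue(word, out one))
            {
                // Promote 1 entry -> pair.
                single.Remove(word);
                pair[word] = new KeyValuePair<ApplicationObject, ApplicationObject>(one, obj);
            }
            else
            {
                single[word] = obj;
            }
        }

        public IEnumerable<ApplicationObject> Get(string word)
        {
            List<ApplicationObject> list;
            ApplicationObject one;
            KeyValuePair<ApplicationObject, ApplicationObject> two;

            if (single.TryGetValue(word, out one)) return new[] { one };
            if (pair.TryGetValue(word, out two)) return new[] { two.Key, two.Value };
            if (many.TryGetValue(word, out list)) return list;
            return new ApplicationObject[0];
        }
    }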

+3

I agree with bobwienholt, but if you are indexing datasets I assume they came out of a database somewhere. Would it make sense to simply search that with a dedicated search engine such as dtSearch or Lucene.NET?
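
For example, pushing the index to disk with Lucene.NET could look roughly like this (a sketch against the Lucene.NET 3.x API; the field names and index path are made up):

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Store;
    using Version = Lucene.Net.Util.Version;

    var analyzer = new StandardAnalyzer(Version.LUCENE_30);
    var directory = FSDirectory.Open(new System.IO.DirectoryInfo("search-index"));

    using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        var doc = new Document();
        doc.Add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("text", "keyword and description words here",
                          Field.Store.NO, Field.Index.ANALYZED));
        writer.AddDocument(doc);
    }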

+1

You could take the approach Lucene takes. First, you create a random-access stream in RAM (System.IO.MemoryStream); this stream mirrors the one on disk, but only a portion of it (if you have the wrong portion, load another one from disk). This causes one headache: you need a file-mappable format for your dictionary. Wikipedia has a description of the paging technique.
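
A rough sketch of that paging idea (the page size and file layout here are arbitrary assumptions):

    // Keeps only one fixed-size page of the on-disk index in memory at a time.
    class PagedIndexFile
    {
        private const int PageSize = 4 * 1024 * 1024;   // 4 MB pages (arbitrary)
        private readonly string path;
        private readonly byte[] buffer = new byte[PageSize];
        private long currentPage = -1;
        private System.IO.MemoryStream current;

        public PagedIndexFile(string path) { this.path = path; }

        // Returns an in-memory stream positioned over the page containing 'offset'.
        public System.IO.MemoryStream GetPage(long offset)
        {
            long page = offset / PageSize;
            if (page != currentPage)
            {
                using (var fs = System.IO.File.OpenRead(path))
                {
                    fs.Seek(page * PageSize, System.IO.SeekOrigin.Begin);
                    int read = fs.Read(buffer, 0, PageSize);
                    current = new System.IO.MemoryStream(buffer, 0, read, false);
                }
                currentPage = page;
            }
            current.Position = offset % PageSize;
            return current;
        }
    }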

On the file-mapping scenario: if you open Reflector and reflect over the Dictionary class, you will see that it is made up of buckets. You could probably use each of these buckets as a page in the physical file (that way inserts are faster). You can then also loosely delete values by simply writing an "item x deleted" marker into the file and cleaning the file up every so often.

By the way, buckets hold values with identical hashes. It is very important that the values you store override the GetHashCode() method (and the compiler will warn you about Equals(), so override that as well). Doing so will give you a significant speed increase in lookups.
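
For example (assuming ApplicationObject has some stable identifier such as an Id field, which is an assumption here):

    class ApplicationObject
    {
        public int Id;              // assumed stable identifier
        public string Keyword;
        public string Description;

        public override int GetHashCode()
        {
            return Id;              // hash on the stable identity
        }

        public override bool Equals(object obj)
        {
            var other = obj as ApplicationObject;
            return other != null && other.Id == Id;
        }
    }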

+1

How about using the Win32 memory-mapped file API to transparently back your in-memory structure?

http://www.eggheadcafe.com/articles/20050116.asp has the P/Invoke declarations needed to implement it.
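
The core declarations look roughly like this (a trimmed sketch with error handling omitted; on .NET 4 and later, System.IO.MemoryMappedFiles.MemoryMappedFile wraps the same API for you):

    using System;
    using System.Runtime.InteropServices;

    static class MemoryMapped
    {
        const uint PAGE_READWRITE = 0x04;
        const uint FILE_MAP_WRITE = 0x0002;
        static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

        [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Auto)]
        static extern IntPtr CreateFileMapping(IntPtr hFile, IntPtr lpAttributes,
            uint flProtect, uint dwMaximumSizeHigh, uint dwMaximumSizeLow, string lpName);

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern IntPtr MapViewOfFile(IntPtr hMapping, uint dwDesiredAccess,
            uint dwFileOffsetHigh, uint dwFileOffsetLow, UIntPtr dwNumberOfBytesToMap);

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern bool UnmapViewOfFile(IntPtr lpBaseAddress);

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern bool CloseHandle(IntPtr hObject);

        // Creates a 64 MB pagefile-backed mapping and returns a pointer to its view.
        public static IntPtr CreateView(out IntPtr mapping)
        {
            const uint size = 64 * 1024 * 1024;
            mapping = CreateFileMapping(INVALID_HANDLE_VALUE, IntPtr.Zero,
                                        PAGE_READWRITE, 0, size, "SearchIndexMap");
            return MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, UIntPtr.Zero);
        }
    }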

+1

Is the index only ever added to, or do you also remove keys from it?

0
