I am working on an application that runs on Windows Mobile 6, which should be able to retrieve all the elements from the element table that contain the specified row (provided by the end user) in the element description field. The problem is that the table has about 170,000 items. Since I need to return all elements containing a string anywhere in the description, I have to use the string LIKE %%, which excludes the possibility of using an index. The structure of the data and tables is initially based on the Progress database, which has a wonderful statement that contains all the fields marked with a word. This does not apply to our mobile application because it uses SQL Server Compact 3.5.
Basically, my DAL launches the query and retrieves the SqlCeDataReader, and then uses the ItemFactory to create a List object that contains only the matched elements. This, obviously, allows us to keep our domain / business objects separate from the data access level.
Lovely and dandy, with the exception of 8 m and 42 sec, which is required to extract the elements when I look for all elements that contain something like “golf” in the description. Obviously, this is not an acceptable time frame for the end user.
My first attempt was to instead return all the items from the database using the SELECT * FROM Item "(with the order by clause in one of the main indexed fields). At this point, I ran an IndexOf check when I ran through SqlCeDataReader, and if ItemFactory only added items to the List object, if they contained the requested description text, this improved the speed to 1 m 46. Not too shabby, but still too slow.
Then I tried a different approach, which showed a promise ... almost ... While the application was starting, I tried to create a list containing all the objects of the objects in the database (it takes about 2 minutes to start the query and populate the entire list, but at least this only once when the application initializes ... else ... ugh). After the list is completed, I can easily run queries on this list, doing things like the following (I hope that my syntax is right ... I am not working right now and I do not have Visual Studio on the computer I’m sitting on in):
List<Item> specificItems = AllItems.FindAll(i => i.Description.IndexOf(searchString, StringComparison.OrdinalIgnoreCase) >= 0);
This approach brought him down to the 21st. Very nice (still slow, albeit in a great scheme of things). However, the problem is that the memory usage is too big if I load all the items from the database. I had to actually cut off the last 20,000 items (so the 21st time frame would probably be more like 25 seconds) during bootstrapping because an OutOfMemoryException was thrown. According to the memory manager on the emulator, I still had about 20 MB of free memory, but I heard that the process can only have 32 MB or RAM associated with it (not sure if this is true for WM 6, but it appears like this) .
To make sure that this was not because I used the List object to store all the elements (which I created using the necessary capacity in its constructor to avoid dynamic resizing), which I also read, can cause additional memory usage when it fuzziness causes EnsureCapacity, I tried to use the Item [] array (size in advance). This still had a memory problem, and the size difference was negligible.
Enough dawn. I know that I will probably have to somehow limit the records returned from the database using the datareader (via some indexed search in a different type of field), and then most likely will use indexOf for this smaller subset of elements to Get maximum performance (thus skipping the Like statement together). This will force the end user to enter more than just a description search, though (perhaps information about the hierarchy of elements to limit what type of elements need to be searched inside).
Any ideas? Am I going about it wrong?
Thanks for listening (sorry, this post is long, I'm kind of out loud).
Oh, I have to add (just in the summary) that I use:
- Windows Mobile 6
- Sql Server Compact Edition 3.5
- C # 3.5
UPDATE: although the Bloom Filter approach mentioned below seemed interesting, I could not fulfill one requirement (which I did not specifically indicate above). I can’t match the words contained in other words (for example, “club” will not return “clubs”). Because of this, I was forced to use a completely different approach (Kent Fredrick ... thanks for pointing this out). I marked Kent's answer as correct, since his approach was the one that filled most of the requirements (Mitch, you had a similar problem, like the Bloom filter suggested by Jaunder). However, I went a different way (for now ...) than his way.
What I did is pulling all the objects of the object into memory, only with element numbers and descriptions (which keeps it under memory restrictions, however it still causes a longer initialization than I like ... multithreading and loading this information backstage, while the application is running, it can take care of what I think). To do the searches, I wrote my own program. The procedure is written in unmanaged C # code, which uses two pointers and a couple of loops to execute the description and the required matching text. If it finds a match anywhere in the description, it adds the item number to the array. After searching for all the elements, the new query returns to the database and captures only the corresponding item numbers (which is very fast due to the index in the integer field). Then these elements are created in the List with all the information (and not just with the position number and description). The whole operation takes about 5 to 10 seconds (depending on the description), which is good enough at the moment.
I will continue to study further optimization (it may be possible to keep track of how many characters the search query contains ... if there are fewer characters in the description of the element than the required text, the cycle can continue right up to the next paragraph).
Any suggestions are still welcome. At this point, I have marked Kent's answer as “the most correct” for my question.
Props for Dolch to help me write the contained procedure.