Writing an inverted index in C# for an information retrieval application

I am writing an in-house application that holds a number of pieces of text, along with metadata about each piece. This data will be stored in a database (SQL Server, although that may change) in the order it is entered.

I would like to be able to search these pieces of text and have the most relevant results appear at the top. I originally looked into SQL Server full-text search, but it is not as flexible for my other needs as I had hoped, so it seems I need to develop my own solution.

From what I understand, I need an inverted index, and the results retrieved from that index then need to be modified based on the additional information (although that part can be left for a later date; for now I just want the inverted index to index the body text from the database table/rows).

I have had a go at writing this in Java using a Hashtable with words as keys and a list of each word's occurrences as values, but to be honest I'm still fairly new to C# and have only really used things like DataSets and DataTables for handling information. If necessary, I will upload the Java code soon, once I have cleaned the viruses off this laptop.

Given a set of records from a table, or a list of rows, how can I create an inverted index in C#, preferably one that can be stored in a DataSet/DataTable?

EDIT: I forgot to mention that I have already tried Lucene and Nutch, but I need my own solution, as modifying Lucene to meet my needs would take much longer than writing an inverted index. I will be handling a lot of metadata that also needs processing once the basic inverted index is complete, so all I need for now is a basic full-text search on one field using the inverted index. Finally, working on an inverted index is not something I get to do every day, so it would be great to have a crack at it.

+6
3 answers

Here is a rough overview of an approach that I have used successfully in C# in the past:

    struct WordInfo
    {
        public int position;
        public int fieldID;
    }

    Dictionary<string, List<WordInfo>> invertedIndex = new Dictionary<string, List<WordInfo>>();

    public void BuildIndex()
    {
        foreach (int fieldID in GetDatabaseFieldIDS())
        {
            string textField = GetDatabaseTextFieldForID(fieldID);
            string word;
            int position = 0;
            while (GetNextWord(textField, out word, ref position))
            {
                // Look up the list of occurrences for this word, creating it the first time the word is seen.
                List<WordInfo> occurrences;
                if (!invertedIndex.TryGetValue(word, out occurrences))
                {
                    occurrences = new List<WordInfo>();
                    invertedIndex.Add(word, occurrences);
                }

                // Record where this occurrence was found.
                WordInfo wi = new WordInfo();
                wi.position = position;
                wi.fieldID = fieldID;
                occurrences.Add(wi);
            }
        }
    }

Notes:

GetNextWord() iterates through the field and returns the next word and its position. To implement it, use string.IndexOf() and the char classification methods (char.IsLetter, char.IsLetterOrDigit, etc.).

GetDatabaseTextFieldForID() and GetDatabaseFieldIDS() are self-explanatory; implement them as needed.
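For illustration, here is a minimal sketch of one way GetNextWord() could look. It is only an assumption about the behaviour described in the notes: a "word" is taken to be a run of letters or digits, words are lower-cased, and position is left just past the word that was returned (subtract word.Length to get its start). The Tokenizer class name is hypothetical.

```csharp
using System;

public static class Tokenizer
{
    // Hypothetical GetNextWord: scans 'text' starting at 'position', returns the
    // next run of letters/digits (lower-cased) in 'word', and advances 'position'
    // to just past that word. Returns false when no words remain.
    public static bool GetNextWord(string text, out string word, ref int position)
    {
        // Skip separators (anything that is not a letter or digit).
        while (position < text.Length && !char.IsLetterOrDigit(text[position]))
            position++;

        int start = position;

        // Consume the word itself.
        while (position < text.Length && char.IsLetterOrDigit(text[position]))
            position++;

        if (position == start)
        {
            word = string.Empty;
            return false;
        }

        word = text.Substring(start, position - start).ToLowerInvariant();
        return true;
    }
}
```

Lower-casing at this point means that queries must be lower-cased the same way before they are looked up in the index.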

+4

Lucene.net may be your best bet. It's a mature full-text search engine that uses inverted indexes.

http://codeclimber.net.nz/archive/2009/09/02/lucene.net-your-first-application.aspx

UPDATE:

I wrote a small library for indexing in-memory collections using Lucene.net, which may be useful here: https://github.com/mcintyre321/Linqdex

+2

If you want to create your own, the Dictionary<TKey, TValue> class is likely to be your foundation, much like your Java Hashtable. As for what to store as the values in the dictionary, that is hard to say from the information you have given, but search algorithms usually use some kind of Set structure so that you can run unions and intersections. LINQ gives you most of this functionality on any IEnumerable, although a specialized Set class may perform better.

One such set implementation is in Wintellect PowerCollections. I'm not sure whether it would give you any performance advantage over LINQ.
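As a rough sketch of the union/intersection idea, the example below uses the framework's own HashSet<int> rather than PowerCollections or LINQ. The PostingSets class, its method names, and the choice of storing record IDs as the dictionary values are all assumptions made for illustration.

```csharp
using System.Collections.Generic;

public static class PostingSets
{
    // AND query: IDs of records that contain every word.
    // 'index' maps each word to the set of record IDs containing it.
    public static HashSet<int> MatchAll(Dictionary<string, HashSet<int>> index, IEnumerable<string> words)
    {
        HashSet<int> result = null;
        foreach (string word in words)
        {
            HashSet<int> ids;
            if (!index.TryGetValue(word, out ids))
                return new HashSet<int>();       // an unknown word means nothing can match

            if (result == null)
                result = new HashSet<int>(ids);  // copy so the index itself is not modified
            else
                result.IntersectWith(ids);
        }
        return result ?? new HashSet<int>();
    }

    // OR query: IDs of records that contain at least one of the words.
    public static HashSet<int> MatchAny(Dictionary<string, HashSet<int>> index, IEnumerable<string> words)
    {
        var result = new HashSet<int>();
        foreach (string word in words)
        {
            HashSet<int> ids;
            if (index.TryGetValue(word, out ids))
                result.UnionWith(ids);
        }
        return result;
    }
}
```

If positions within the text are needed as well (for phrase matching, say), the sets would have to hold something richer than a bare record ID.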

As for storing it in a DataSet, I'm not sure what you have in mind. I don't know of anything that writes to a DataSet "automatically". I suspect you will have to write this yourself, especially since you have said several times that the third-party options are not flexible enough.
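If you do write it yourself, one possible shape is to flatten the dictionary into a DataTable with one row per occurrence. This is only a sketch: the Posting struct, the IndexTable class, and the table and column names are all hypothetical.

```csharp
using System.Collections.Generic;
using System.Data;

// Hypothetical record of a single occurrence of a word.
public struct Posting
{
    public int FieldID;
    public int Position;
}

public static class IndexTable
{
    // Flattens word -> occurrences into rows of (Word, FieldID, Position).
    public static DataTable ToDataTable(Dictionary<string, List<Posting>> index)
    {
        var table = new DataTable("InvertedIndex");
        table.Columns.Add("Word", typeof(string));
        table.Columns.Add("FieldID", typeof(int));
        table.Columns.Add("Position", typeof(int));

        foreach (KeyValuePair<string, List<Posting>> entry in index)
        {
            foreach (Posting p in entry.Value)
                table.Rows.Add(entry.Key, p.FieldID, p.Position);
        }
        return table;
    }
}
```

The rows for one word can then be fetched with something like table.Select("Word = 'cat'"), although a Dictionary lookup is likely to be faster for the search itself.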

+1