Customize scoring for specific query terms in Lucene TFIDF

I have a program that accepts an input query and ranks similar documents by their TFIDF scores. The thing is, I want to add a few keywords and treat them as part of the input. These keywords will be different for each query.

For example, for the query "Logic Based Knowledge Representation", the (stemmed) keywords are as follows:

Level 0 keywords: [logic, base, knowledg, represent]

Level 1 keywords: [tempor, modal, logic, resolut, method, decis, problem,
                   reason, revis, hybrid, represent]

Level 2 keywords: [classif, queri, process, techniqu, candid, semant, data, 
                   model, knowledg, base, commun, softwar, engin, subsumpt,
                   kl, undecid, classic, structur, object, field]

I want to weight the matches differently: for a document term that matches a level-0 keyword, multiply the score by 1; for a term that matches a level-1 keyword, multiply by 0.8; and for a term that matches a level-2 keyword, multiply by 0.64.

My goal is to expand the input query, while still ranking documents that contain more level-0 keywords higher, and weighting matches on level-1 and level-2 keywords lower (even though the query is expanded). I have not implemented this yet. So far my program only computes the TFIDF score of every document against the query and ranks the results:

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.PrintWriter;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Ranking {

    private static int maxHits = 2000000;

    public static void main(String[] args) throws Exception {        
        System.out.println("Enter your paper title: ");
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));

        String paperTitle = br.readLine();

       // CitedKeywords ckeywords = new CitedKeywords();
       // ckeywords.readDataBase(paperTitle);

        String querystr = args.length > 0 ? args[0] : paperTitle;
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
        Query q = new QueryParser(Version.LUCENE_42, "title", analyzer)
            .parse(querystr);

        IndexReader reader = DirectoryReader.open(
                             FSDirectory.open(
                             new File("E:/Lucene/new_bigdataset_index")));        

        IndexSearcher searcher = new IndexSearcher(reader);

        VSMSimilarity vsmSimilarity = new VSMSimilarity();
        searcher.setSimilarity(vsmSimilarity);
        TopDocs hits = searcher.search(q, maxHits);
        ScoreDoc[] scoreDocs = hits.scoreDocs;

        PrintWriter writer = new PrintWriter("E:/Lucene/result/1.txt", "UTF-8");

        int counter = 0;
        for (int n = 0; n < scoreDocs.length; ++n) {
            ScoreDoc sd = scoreDocs[n];
            float score = sd.score;
            int docId = sd.doc;
            Document d = searcher.doc(docId);
            String fileName = d.get("title");
            String year = d.get("pub_year");
            String paperkey = d.get("paperkey");
            System.out.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
            writer.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
            ++counter;
        }
        writer.close();      
    }
}    

-

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class VSMSimilarity extends DefaultSimilarity{

    // Weighting codes
    public boolean doBasic     = true;  // Basic tf-idf
    public boolean doSublinear = false; // Sublinear tf-idf
    public boolean doBoolean   = false; // Boolean

    //Scoring codes
    public boolean doCosine    = true;
    public boolean doOverlap   = false;

    private static final long serialVersionUID = 4697609598242172599L;

    // term frequency in document = 
    // measure of how often a term appears in the document
    public float tf(int freq) {     
        // Sublinear tf weighting. Equation taken from [1], pg 127, eq 6.13.
        if (doSublinear){
            if (freq > 0){
                return 1 + (float)Math.log(freq);
            } else {
                return 0;
            }
        } else if (doBoolean){
            return 1;
        }
        // else: doBasic
        // The default behaviour of Lucene is sqrt(freq), 
        // but we are implementing the basic VSM model
        return freq;
    }

    // inverse document frequency = 
    // measure of how often the term appears across the index
    public float idf(int docFreq, int numDocs) {
        if (doBoolean || doOverlap){
            return 1;
        }
        // The default behaviour of Lucene is 
        // 1 + log (numDocs/(docFreq+1)), 
        // which is what we want (default VSM model)
        return super.idf(docFreq, numDocs); 
    }

    // normalization factor so that queries can be compared 
    public float queryNorm(float sumOfSquaredWeights){
        if (doOverlap){
            return 1;
        } else if (doCosine){
            return super.queryNorm(sumOfSquaredWeights);
        }
        // else: can't get here
        return super.queryNorm(sumOfSquaredWeights);
    }

    // number of terms in the query that were found in the document
    public float coord(int overlap, int maxOverlap) {
        if (doOverlap){
            return 1;
        } else if (doCosine){
            return 1;
        }
        // else: can't get here
        return super.coord(overlap, maxOverlap);
    }

    // Note: this happens at index time, which we don't take advantage of
    // (too many indices!)
    public float computeNorm(String fieldName, FieldInvertState state){
        if (doOverlap){
            return 1;
        } else if (doCosine){
            return super.computeNorm(state);
        }
        // else: can't get here
        return super.computeNorm(state);
    }
}

Below is sample output from my current program (without any boosting):

3086,Logic Based Knowledge Representation.,1999,5.165
33586,A Logic for the Representation of Spatial Knowledge.,1991,4.663
328937,Logic Programming for Knowledge Representation.,2007,4.663
219720,Logic for Knowledge Representation.,1984,4.663
487587,Knowledge Representation with Logic Programs.,1997,4.663
806195,Logic Programming as a Representation of Knowledge.,1983,4.663
806833,The Role of Logic in Knowledge Representation.,1983,4.663
744914,Knowledge Representation and Logic Programming.,2002,4.663
1113802,Knowledge Representation in Fuzzy Logic.,1989,4.663
984276,Logic Programming and Knowledge Representation.,1994,4.663

Can someone please tell me how to add the weighting described above? Does Lucene provide such a function? Can I integrate it into the VSMSimilarity class?

EDIT: I found this in the Lucene documentation:

 public void setBoost(float b)

Sets the boost for this query clause to b. Documents matching this clause will (in addition to the normal weightings) have their score multiplied by b.
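For example, under the Lucene 4.x API that the code above already uses, setBoost can be applied per clause while building the expanded query. The following is only a sketch under that assumption: the buildBoostedQuery name, the SHOULD clauses, and the decay parameter are my choices, not part of Lucene.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryExpansion {

    // levels[n] holds the level-n keywords; each clause gets boost decay^n,
    // so for decay = 0.8f: level 0 -> 1.0, level 1 -> 0.8, level 2 -> 0.64.
    static Query buildBoostedQuery(String field, String[][] levels, float decay) {
        BooleanQuery expanded = new BooleanQuery();
        float boost = 1.0f;
        for (String[] level : levels) {
            for (String keyword : level) {
                TermQuery clause = new TermQuery(new Term(field, keyword));
                clause.setBoost(boost);
                // SHOULD: a document need not match every expansion term
                expanded.add(clause, BooleanClause.Occur.SHOULD);
            }
            boost *= decay;
        }
        return expanded;
    }
}
```

The resulting Query can be passed to searcher.search() exactly like the parsed query in the code above.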


This is supported directly in Lucene via term boosting:

https://lucene.apache.org/core/5_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Boosting_a_Term

Expand your query with the additional keywords (combined as OR clauses) and give each expansion term a boost:

logic base knowledge representation temporal^0.8 modal^0.8 classification^0.64...
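If you prefer to assemble such a query string from the keyword lists programmatically, a minimal sketch in plain Java might look like this. The BoostedQueryBuilder name and the single 0.8 decay factor are assumptions based on your multipliers (since 0.64 = 0.8²):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class BoostedQueryBuilder {

    // Builds a Lucene query string where level-n keywords get boost decay^n.
    // Level 0 keywords are left unboosted (implicit boost of 1).
    static String build(List<List<String>> levels, double decay) {
        StringBuilder sb = new StringBuilder();
        for (int level = 0; level < levels.size(); level++) {
            double boost = Math.pow(decay, level);
            for (String kw : levels.get(level)) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(kw);
                if (level > 0) {
                    // Locale.ROOT keeps the decimal separator as '.' for the query parser
                    sb.append('^').append(String.format(Locale.ROOT, "%.2f", boost));
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<List<String>> levels = Arrays.asList(
            Arrays.asList("logic", "base"),
            Arrays.asList("tempor", "modal"),
            Arrays.asList("classif"));
        System.out.println(build(levels, 0.8));
        // logic base tempor^0.80 modal^0.80 classif^0.64
    }
}
```

The resulting string can be fed straight to the QueryParser you already use, which applies the `^` boosts for you.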


PS: the LUCENE_42 constant suggests you are on a fairly old Lucene release; consider upgrading to a newer version.
