I have a program that takes an input query and ranks similar documents by their TF-IDF score. I now want to add a few extra keywords and treat them as part of the input. These keywords are different for each query.
For example, if the query is "Logic Based Knowledge Representation", the keywords are as follows:
Level 0 keywords: [logic, base, knowledg, represent]
Level 1 keywords: [tempor, modal, logic, resolut, method, decis, problem,
reason, revis, hybrid, represent]
Level 2 keywords: [classif, queri, process, techniqu, candid, semant, data,
model, knowledg, base, commun, softwar, engin, subsumpt,
kl, undecid, classic, structur, object, field]
I want to weight matches differently depending on the level. If a term in a document matches a level 0 keyword, its score is multiplied by 1. If it matches a level 1 keyword, the score is multiplied by 0.8. And if it matches a level 2 keyword, the score is multiplied by 0.64.
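To make the weighting concrete, here is a minimal plain-Java sketch of what I have in mind, independent of Lucene. The multipliers 1, 0.8 and 0.64 are just 0.8 raised to the level number. The class name LevelWeights and its methods are my own, and two things are assumptions on my part: a keyword that appears at several levels (such as "logic" at levels 0 and 1) keeps its highest weight, and a document's score is simply the sum of weight times TF-IDF over the matching terms. Lucene's real scoring also applies query and length normalization, so this only illustrates the intended ordering.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LevelWeights {
    // Decay per expansion level: level 0 -> 1, level 1 -> 0.8, level 2 -> 0.64.
    static final double DECAY = 0.8;

    static double weightFor(int level) {
        return Math.pow(DECAY, level);
    }

    // Maps each keyword to its weight. Assumption: a keyword listed at
    // several levels keeps the highest weight (its lowest level).
    static Map<String, Double> termWeights(List<List<String>> levels) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (int level = 0; level < levels.size(); level++) {
            double w = weightFor(level);
            for (String term : levels.get(level)) {
                weights.merge(term, w, Math::max);
            }
        }
        return weights;
    }

    // Toy score: sum of weight * tfidf over the keywords present in the document.
    // docTfIdf holds hypothetical per-term TF-IDF values for one document.
    static double score(Map<String, Double> docTfIdf, Map<String, Double> weights) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            Double tfidf = docTfIdf.get(e.getKey());
            if (tfidf != null) {
                total += e.getValue() * tfidf;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> levels = Arrays.asList(
                Arrays.asList("logic", "base", "knowledg", "represent"),
                Arrays.asList("tempor", "modal", "logic"),
                Arrays.asList("classif", "queri", "knowledg"));
        System.out.println(termWeights(levels));
    }
}
```

With these weights, a document that matches only level 1 or level 2 keywords still gets a score, but a lower one than a document with the same TF-IDF values on level 0 keywords.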
My goal is to expand the input query while making sure that documents containing more level 0 keywords are ranked as more important, and that matches on level 1 and level 2 keywords count for less (even though the query has been expanded). I have not implemented this yet. So far my program only computes the TF-IDF score of every document against the query and ranks the results:
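import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class Ranking {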
private static int maxHits = 2000000;
public static void main(String[] args) throws Exception {
System.out.println("Enter your paper title: ");
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String paperTitle = null;
paperTitle = br.readLine();
String querystr = args.length > 0 ? args[0] :paperTitle;
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
Query q = new QueryParser(Version.LUCENE_42, "title", analyzer)
.parse(querystr);
IndexReader reader = DirectoryReader.open(
FSDirectory.open(
new File("E:/Lucene/new_bigdataset_index")));
IndexSearcher searcher = new IndexSearcher(reader);
VSMSimilarity vsmSimiliarty = new VSMSimilarity();
searcher.setSimilarity(vsmSimiliarty);
TopDocs hits = searcher.search(q, maxHits);
ScoreDoc[] scoreDocs = hits.scoreDocs;
PrintWriter writer = new PrintWriter("E:/Lucene/result/1.txt", "UTF-8");
int counter = 0;
for (int n = 0; n < scoreDocs.length; ++n) {
ScoreDoc sd = scoreDocs[n];
float score = sd.score;
int docId = sd.doc;
Document d = searcher.doc(docId);
String fileName = d.get("title");
String year = d.get("pub_year");
String paperkey = d.get("paperkey");
System.out.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
writer.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
++counter;
}
writer.close();
}
}
-
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;
public class VSMSimilarity extends DefaultSimilarity{
public boolean doBasic = true;
public boolean doSublinear = false;
public boolean doBoolean = false;
public boolean doCosine = true;
public boolean doOverlap = false;
private static final long serialVersionUID = 4697609598242172599L;
public float tf(int freq) {
if (doSublinear){
if (freq > 0){
return 1 + (float)Math.log(freq);
} else {
return 0;
}
} else if (doBoolean){
return 1;
}
return freq;
}
public float idf(int docFreq, int numDocs) {
if (doBoolean || doOverlap){
return 1;
}
return super.idf(docFreq, numDocs);
}
public float queryNorm(float sumOfSquaredWeights){
if (doOverlap){
return 1;
} else if (doCosine){
return super.queryNorm(sumOfSquaredWeights);
}
return super.queryNorm(sumOfSquaredWeights);
}
public float coord(int overlap, int maxOverlap) {
if (doOverlap){
return 1;
} else if (doCosine){
return 1;
}
return super.coord(overlap, maxOverlap);
}
public float computeNorm(String fieldName, FieldInvertState state){
if (doOverlap){
return 1;
} else if (doCosine){
return super.computeNorm(state);
}
return super.computeNorm(state);
}
}
Below is sample output from my current program (without any boosting):
3086,Logic Based Knowledge Representation.,1999,5.165
33586,A Logic for the Representation of Spatial Knowledge.,1991,4.663
328937,Logic Programming for Knowledge Representation.,2007,4.663
219720,Logic for Knowledge Representation.,1984,4.663
487587,Knowledge Representation with Logic Programs.,1997,4.663
806195,Logic Programming as a Representation of Knowledge.,1983,4.663
806833,The Role of Logic in Knowledge Representation.,1983,4.663
744914,Knowledge Representation and Logic Programming.,2002,4.663
1113802,Knowledge Representation in Fuzzy Logic.,1989,4.663
984276,Logic Programming and Knowledge Representation.,1994,4.663
Can someone please tell me how to apply the per-level weights described above? Does Lucene provide such a feature? Can I integrate it into the VSMSimilarity class?
EDIT: I found this in the Lucene documentation:
public void setBoost(float b)
Sets the boost for this query clause to b. Documents matching this clause will (in addition to the normal weightings) have their score multiplied by b.
However, I haven't figured out how to use it to apply a different boost per level, for example 1 for level 0 and 0.8 for level 1.
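One route I can imagine, as a rough sketch only: Lucene's classic query parser accepts a term^boost suffix, so the expanded query could be built as a string with one boosted clause per keyword and then passed to the same QueryParser.parse call used in Ranking. The helper class BoostedQueryString is hypothetical (my own name). I am assuming the keywords contain no characters that need escaping for the query parser, and that the title field was indexed with the same stemming as the keywords; otherwise stems such as "knowledg" would not match anything.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class BoostedQueryString {
    // Builds e.g. "title:logic^1.00 title:tempor^0.80" from a term -> boost map.
    // Assumption: terms are plain stems that need no query-parser escaping.
    static String build(String field, Map<String, Double> termBoosts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : termBoosts.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(field)
              .append(':')
              .append(e.getKey())
              .append('^')
              .append(String.format(Locale.ROOT, "%.2f", e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("logic", 1.0);    // level 0
        boosts.put("tempor", 0.8);   // level 1
        boosts.put("classif", 0.64); // level 2
        System.out.println(build("title", boosts));
    }
}
```

The resulting string would replace querystr in Ranking. Since the parser combines clauses with OR by default, each keyword becomes an optional clause and its boost scales that clause's contribution to the score.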