I initially tried posting a similar question on the elasticsearch mailing list ( https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/BZLFJSEpl78 ) but didn't get any useful answers, so I thought I would try Stack Overflow. This is my first post on SO, so I apologize if it doesn't quite fit the format it is intended for.
I am currently working with a university, helping them implement a set of tests to further refine some of the research they conduct. Their research is based on dynamic schema searches. After spending some time evaluating various open source search solutions, I settled on elasticsearch as the base platform, and I am now wondering how best to proceed. I have spent about a week looking at the elasticsearch documentation and the code itself, as well as reading the Lucene documentation, but I am struggling to see a clear path forward.
The goal of the project is to provide the researchers with software into which they can plug revised search algorithms for verification and refinement. They would like to be able to write the plug-in algorithm in languages other than Java that are supported by the JVM, such as Groovy, Python, or Clojure, but this is not a hard requirement. Part of this will be providing them with a front end to run queries and view the output, and an admin interface for adding documents to an index. I am comfortable with all of that thanks to the very powerful and complete REST API. What I am not so sure about is how to get started with the pluggable search algorithm.
The researchers' algorithm requires four inputs to work:
- The query terms.
- A word (term) x document matrix across the index.
- A document x word (term) matrix across the index.
- A list of word (term) frequencies across the index, that is, how many times each word occurs across the whole index.
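To make the three index-level inputs concrete, here is a minimal sketch in plain Python that builds them from a toy corpus. It assumes each document is already tokenized, and it reads the third input as the transpose of the second, which is my interpretation rather than something the researchers have confirmed. The function name `build_index_inputs` is made up for illustration and has nothing to do with the elasticsearch or Lucene APIs.

```python
from collections import Counter


def build_index_inputs(documents):
    """Build the three index-level inputs from a list of tokenized documents.

    `documents` is a list of token lists; in the setup described here each
    one would be a single sentence ("text event").
    """
    # Fixed, sorted vocabulary so that row/column positions are stable.
    vocabulary = sorted({term for doc in documents for term in doc})
    per_doc_counts = [Counter(doc) for doc in documents]

    # document x term matrix: one row per document, one column per term.
    doc_term = [[counts[term] for term in vocabulary] for counts in per_doc_counts]

    # term x document matrix: the transpose of the above (an assumption).
    term_doc = [list(column) for column in zip(*doc_term)]

    # Index-wide frequency of each term.
    term_freq = {term: sum(row) for term, row in zip(vocabulary, term_doc)}

    return vocabulary, term_doc, doc_term, term_freq


docs = [["dynamic", "schema", "search"], ["schema", "search", "search"]]
vocabulary, term_doc, doc_term, term_freq = build_index_inputs(docs)
print(vocabulary)  # ['dynamic', 'schema', 'search']
print(term_doc)    # [[1, 0], [1, 1], [1, 2]]
print(term_freq)   # {'dynamic': 1, 'schema': 2, 'search': 3}
```

In a real index these would be large and sparse, so dense lists like this are only useful for showing the shape of the data.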
For their purposes, a document does not correspond to a real-world document (they actually call them text events). Rather, at the moment, it corresponds to a single sentence (though making this configurable might be useful). I believe the best way to handle this is to split documents into their sentences (using Apache Tika or something similar) and put each sentence into the index as its own document. I am confident I can do this in the admin UI I provide, using the mapper attachments plugin as a starting point. The downside is that splitting a document before handing it to elasticsearch is not a very configurable way of doing this. If they want to change the resolution for their algorithm, they would need to re-add all documents to the index. It would be ideal if the index stored the complete documents as is and the search algorithm could choose which resolution to work at for each query. I'm not sure whether this is possible, though.
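To illustrate what I mean by "resolution", here is a rough sketch of deriving text events from a stored document, with the number of sentences per event as a parameter. The regex splitter is a naive stand-in for a proper sentence detector, and `split_into_events` and `sentences_per_event` are names I invented for this example.

```python
import re

# Naive sentence boundary: ., ! or ? followed by whitespace. A real setup
# would use a proper sentence detector; this is only for illustration.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_events(text, sentences_per_event=1):
    """Split `text` into text events of `sentences_per_event` sentences each."""
    if sentences_per_event < 1:
        raise ValueError("sentences_per_event must be at least 1")
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    return [
        " ".join(sentences[i:i + sentences_per_event])
        for i in range(0, len(sentences), sentences_per_event)
    ]


text = "Documents are stored whole. Events are derived later. Resolution can change."
print(split_into_events(text))
print(split_into_events(text, sentences_per_event=2))
```

If the split could happen at query time like this, changing the resolution would not require re-indexing, which is the behaviour I am hoping for.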
The next problem is how to get the three index-level inputs they require and pass them into their plug-in search algorithm. I am really at a loss as to where to start with this one. From looking at Lucene, it seems I need to provide my own search/query implementation, but I'm not sure whether this is right. There are also no search plugins listed on the elasticsearch site, so I'm not even sure it is possible. The important point is that the algorithm needs to work at the index level, using the query terms to generate its schema before using that schema to score each document in the index. From what I can tell, this means the scripting interface provided by elasticsearch won't be of any use. The description of the scripting interface in the elasticsearch reference makes it sound like a script operates at the document level, not the index level. Other considerations are the ability to write this algorithm in several languages (much like the scripting interface allows) and the ability to extend what the REST API returns for a search to include the schema generated by the algorithm (which I believe means I will need to define my own REST endpoint).
Can anyone give me some pointers on where to get started here? It seems like I will have to write my own search plugin that can accept scripts as its core algorithm. The plugin would be responsible for assembling the four inputs I described earlier before handing control to the script. It would also be responsible for taking the output of the script and returning it via its own REST API. Does this seem reasonable? If so, how do I get started? Which parts of the code do I need to look at?
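To show the kind of contract I have in mind between the plugin and the script, here is a small language-agnostic sketch written in Python. Everything in it is hypothetical: `IndexInputs`, `run_search` and `toy_algorithm` are names I made up, nothing here is an elasticsearch API, and the toy scoring is only a placeholder for the researchers' real algorithm.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class IndexInputs:
    """Index-level inputs the plugin would gather (hypothetical container)."""
    vocabulary: List[str]
    term_doc: List[List[int]]   # term x document counts
    doc_term: List[List[int]]   # document x term counts
    term_freq: Dict[str, int]   # index-wide term counts


# An algorithm takes the query terms plus the index inputs and returns
# (schema, one score per document). The schema type is left open here.
Algorithm = Callable[[List[str], IndexInputs], Tuple[dict, List[float]]]


def run_search(algorithm: Algorithm, query_terms: List[str], inputs: IndexInputs) -> dict:
    """Invoke the algorithm and shape a response carrying hits and schema."""
    schema, scores = algorithm(query_terms, inputs)
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    hits = [{"doc": doc_id, "score": score} for doc_id, score in ranked if score > 0]
    return {"schema": schema, "hits": hits}


def toy_algorithm(query_terms, inputs):
    """Stand-in algorithm: weight rare query terms higher, then sum per document."""
    weights = {
        term: 1.0 / inputs.term_freq[term]
        for term in query_terms
        if term in inputs.term_freq
    }
    scores = []
    for row in inputs.doc_term:
        counts = dict(zip(inputs.vocabulary, row))
        scores.append(sum(weights[t] * counts[t] for t in weights))
    return {"weights": weights}, scores


inputs = IndexInputs(
    vocabulary=["dynamic", "schema", "search"],
    term_doc=[[1, 0], [1, 1], [1, 2]],
    doc_term=[[1, 1, 1], [0, 1, 2]],
    term_freq={"dynamic": 1, "schema": 2, "search": 3},
)
response = run_search(toy_algorithm, ["dynamic", "schema"], inputs)
print(response)
```

The point of the sketch is the two-phase shape: the schema is built once per query from index-wide data, then applied to every document, and the response carries the schema alongside the hits.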