How to build a conceptual search engine?

I would like to create an internal search engine (I have a very large collection of thousands of XML files) that can display concept queries. For example, if I am looking for “big cats”, I would like to get highly rated results to return documents with “big cats”. But I may also be interested in bringing back “huge animals”, albeit with a much lower relevance score.

I am currently looking at natural language processing in a Python book, and it seems that WordNet has some word mappings that may be useful, although I'm not sure how to integrate them into a search engine. Can I use Lucene for this? How?

From further research, it seems that “hidden semantic analysis” is relevant to what I'm looking for, but I'm not sure how to implement it.

Any tips on how to do this?

+6
python search nlp lucene lsa
source share
4 answers

I'm not sure how to integrate this into a search engine. Can I use Lucene for this? How?

Step 1. Stop.

Step 2. Get something to work.

Step 3. By then you will understand more about Python and Lucene and other tools and how to integrate them.

Do not start trying to solve integration problems. Software can always be integrated. This is what the operating system does. It integrates software. Sometimes you need tougher integration, but never solve the first problem.

The first problem to solve is finding your search object or concept or something else that it will work as a command line application with a dumb old one. Or a couple of applications knit together, transferring files around or knit along with OS pipes or something like that.

Later, you can try and understand how to make the user interface seamless.

But do not start with integration and do not stop because of integration issues. Set integration aside and get something to work with.

+9
source share

This is an incredibly difficult problem, and it cannot be solved in such a way as to always give adequate results. I would suggest sticking to some simple principles so that the results are at least predictable. I think you need 2 things: some basic morphology mechanism plus a dictionary of synonyms.

Whenever a search query arrives, for each word you

  • Look for a literary match
  • “Normalize / canonically” a word using the mechanism of morphology, i.e. make it singular, first shape, etc. and look for matches
  • Search for synonyms for the word

Then repeat for all combinations of input words, that is, "big cats", "big cat", "huge cats" huge cat, etc.

In fact, you also need to store index data in canonical form (singluar, first form, etc.) along with the literal form.

As for concepts, such as cats, are also animals - it becomes difficult here. This never worked because otherwise Google already returned conceptual matches, but it doesn’t.

+1
source share

Firstly, I agree with most of the tips here on how to get started slowly, and first built the bits of this grand plan, developing a minimal first product and continuing from there. Secondly, if you want to use some features of Wordnet with Lucene, there is a tab for interacting Lucene queries with Wordnet. I have no idea if this is transferred to the pill. Good luck and be careful.

+1
source share

First write a piece of code in python that will return you pineapple, orange, papaya when you enter the apple. By focusing on the "is" relationship of the semantic network. Then continue with has a relationship etc.

I think that in the end you can get enough code snippet for a school project.

0
source share

All Articles