Add Field to Lucene Document

Hello, I have a 32-megabyte file. It is a simple dictionary file, encoded in code page 1250, with 2.8 million lines in it. Each line contains exactly one unique word:

    cat
    dog
    god
    ...

I want to use Lucene to find every anagram of a given word in the dictionary. For instance:

I want to find any anagram of the word dog, and Lucene should search my dictionary and return dog and god. In my webapp, I have a Word entity:

    public class Word {
        private Long id;
        private String word;
        private String baseLetters;
        private String definition;
    }

Here baseLetters holds the letters of the word sorted alphabetically, which is what I use to find anagrams (the words god and dog have the same baseLetters: dgo). I have managed to find such anagrams in my database using baseLetters in other services, but I have a problem creating the index for my dictionary file. I know which fields I need to add:

word and baseLetters, but I have no idea how to do this :( Can someone point me in the right direction?

Now I only have something like this:

    public class DictionaryIndexer {

        private static final Logger logger = LoggerFactory.getLogger(DictionaryIndexer.class);

        @Value("${dictionary.path}")
        private String dictionaryPath;

        @Value("${lucene.search.indexDir}")
        private String indexPath;

        public void createIndex() throws CorruptIndexException, LockObtainFailedException {
            try {
                IndexWriter indexWriter = getLuceneIndexer();
                createDocument();
            } catch (IOException e) {
                logger.error(e.getMessage(), e);
            }
        }

        private IndexWriter getLuceneIndexer() throws CorruptIndexException, LockObtainFailedException, IOException {
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_36, analyzer);
            indexWriterConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
            Directory directory = new SimpleFSDirectory(new File(indexPath));
            return new IndexWriter(directory, indexWriterConfig);
        }

        private void createDocument() throws FileNotFoundException {
            File sjp = new File(dictionaryPath);
            Reader reader = new FileReader(sjp);
            Document dictionary = new Document();
            dictionary.add(new Field("word", reader));
        }
    }

PS: One more question. If I register DictionaryIndexer as a bean in Spring, will the index be created or appended to every time I redeploy my webapp? And will the same happen with the future DictionarySearcher?

+4
2 answers

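The createDocument() method should read the file line by line and add one document per word. It needs the IndexWriter, so pass it in from createIndex():

    private void createDocument(IndexWriter indexWriter) throws IOException {
        File sjp = new File(dictionaryPath);
        BufferedReader reader = new BufferedReader(new FileReader(sjp));
        try {
            String readLine;
            while ((readLine = reader.readLine()) != null) {
                readLine = readLine.trim();
                Document dictionary = new Document();
                // Store each value and index it as a single, untokenized term.
                dictionary.add(new Field("word", readLine,
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                // toAnagram sorts the letters in the word and also makes it
                // case-insensitive.
                dictionary.add(new Field("anagram", toAnagram(readLine),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                indexWriter.addDocument(dictionary);
            }
        } finally {
            reader.close();
        }
    }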
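The toAnagram helper is referenced above but not shown. A minimal sketch of what it might look like, assuming plain lower-casing and a sort by character value are good enough for your dictionary (the class name AnagramKey is made up for illustration):

```java
import java.util.Arrays;
import java.util.Locale;

final class AnagramKey {
    private AnagramKey() {}

    // Hypothetical helper: lower-cases the word and returns its letters
    // sorted by character value, so all anagrams share the same key.
    static String toAnagram(String word) {
        char[] letters = word.toLowerCase(Locale.ROOT).toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    public static void main(String[] args) {
        System.out.println(toAnagram("dog")); // prints dgo
        System.out.println(toAnagram("God")); // prints dgo
    }
}
```

Searching is then a matter of running the query word through the same helper and matching the result exactly against the anagram field.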

If you need more extensive functionality than Lucene alone offers, consider using Apache Solr, a search platform built on top of Lucene.

You can also model your index with only one document per anagram group.

    {"anagram" : "scare", "words":["cares", "acres"]}
    {"anagram" : "shoes", "words":["hoses"]}
    {"anagram" : "spore", "words":["pores", "prose", "ropes"]}

This will require updating existing documents in the index while processing the dictionary file. In such cases, Solr helps with a higher-level API. For example, Lucene's IndexWriter cannot modify a document in place (its updateDocument method deletes the old document and adds a replacement), whereas Solr supports updates.

Such an index will return a single result for each anagram search.

Hope this helps.

+3

Lucene is not the best tool for this because you are not doing a search: you are doing a lookup. All the real work takes place in the "indexer", and then you just store the results of that work. The lookup can be O(1) in any hash-based storage engine.

Here is what your indexer should do:

  • Read the entire dictionary into a simple structure, such as a SortedSet or a String[]
  • Create an empty HashMap<String,List<String>> (possibly sized to match the dictionary, for performance) to store the results
  • Iterate through the dictionary in alphabetical order (in fact, any order will work, just make sure you hit all entries)
    • Sort the letters in the word
    • Look up the sorted letters in your storage map
    • If the lookup succeeds, add the current word to the list; otherwise, create a new list containing this word and put it in the storage map
  • If you need this map later, save it to disk; otherwise keep it in memory
  • Discard the dictionary
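The indexer steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the class and method names (AnagramIndexer, sortLetters, buildIndex) are made up, the dictionary is already loaded into memory, and words are lower-cased before their letters are sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class AnagramIndexer {
    private AnagramIndexer() {}

    // Sorted, lower-cased letters of the word: the shared key for all its anagrams.
    static String sortLetters(String word) {
        char[] letters = word.toLowerCase(Locale.ROOT).toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Groups every dictionary word under its sorted-letters key.
    static Map<String, List<String>> buildIndex(Iterable<String> dictionary) {
        Map<String, List<String>> index = new HashMap<>();
        for (String word : dictionary) {
            // Reuse the existing list for this key, or create one on first sight.
            index.computeIfAbsent(sortLetters(word), k -> new ArrayList<>()).add(word);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> index = buildIndex(Arrays.asList("cat", "dog", "god"));
        System.out.println(index.get("dgo")); // prints [dog, god]
    }
}
```

With 2.8 million words you may want to pass an initial capacity to the HashMap constructor to avoid repeated resizing.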

Here is what your search process should do:

  • Sort the letters in the sample word
  • Look up the sorted letters in your storage map
  • Print the List returned by the lookup (or null), taking care to exclude the sample word from the output
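A rough sketch of the search side, under the same kind of assumptions: AnagramLookup and findAnagrams are made-up names, the map is keyed by lower-cased sorted letters, and an empty list is returned instead of null when nothing matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class AnagramLookup {
    private AnagramLookup() {}

    // Must produce the same key the indexer used when it built the map.
    static String sortLetters(String word) {
        char[] letters = word.toLowerCase(Locale.ROOT).toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Returns all stored anagrams of the sample word, excluding the sample itself.
    static List<String> findAnagrams(Map<String, List<String>> index, String sample) {
        List<String> result = new ArrayList<>();
        List<String> group = index.get(sortLetters(sample));
        if (group != null) {
            for (String word : group) {
                if (!word.equalsIgnoreCase(sample)) {
                    result.add(word);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tiny hand-built map standing in for the one the indexer produces.
        Map<String, List<String>> index = new HashMap<>();
        index.put("dgo", Arrays.asList("dog", "god"));
        index.put("act", Arrays.asList("cat"));
        System.out.println(findAnagrams(index, "dog")); // prints [god]
        System.out.println(findAnagrams(index, "cat")); // prints []
    }
}
```

The only per-query work is sorting the letters of one word and a single hash lookup.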

If you want to save a ton of space, consider using a DAWG (directed acyclic word graph). You will find that you can represent the entire dictionary of English words in a few hundred kilobytes instead of 32 MiB. I will leave this as an exercise for the reader.

Good luck with your homework.

+6
