About LDA inference

Right now I'm using the LDA topic modeling tool from the MALLET package to do some topic detection on my documents. Training went fine, and I got 20 topics out of it. However, when I try to infer topics for a new document using the model, the result is puzzling.

For example, I deliberately ran the model on a document I created by hand that contains nothing but keywords from one of the topics, "FLU", yet the topic distribution I got back was < 0.1 for every topic. Then I tried the same thing on one of the documents from the training set, which has a high score of 0.7 for one of the topics. The same thing happened again.

Can someone tell me the reason?

I tried asking on the MALLET mailing list, but apparently no one has replied.

3 answers

I also know very little about MALLET, but the docs mention this ...

Topic Inference

--inferencer-filename [FILENAME] Create a topic inference tool based on the current, trained model. Use the MALLET command bin/mallet infer-topics --help to get information on using topic inference.

Note that you must make sure that the new data is compatible with your training data. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.

Did you forget to do this? It sounds like the data you trained on is not in the same format as the data you are testing on.


I had the same problem with MALLET. Later I found out that the new documents must be read in through the same pipe that was used to read in the training documents.

Here is a sample to read in the training documents:

    // ImportExample is an example class in MALLET for importing documents.
    ImportExample importer = new ImportExample();
    InstanceList training = importer.readDirectory(new File(trainingDir));
    training.save(new File(outputFile));

When reading in the documents for topic inference:

    InstanceList training = InstanceList.load(new File(outputFile));
    Pipe pipe = training.getPipe();
    ImportExample importer = new ImportExample();
    importer.pipe = pipe; // use the same pipe as in training
    InstanceList testing = importer.readDirectory(new File(testDir));

I got the hint from a question posted in their mailing list archive: http://thread.gmane.org/gmane.comp.ai.mallet.devel/829
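To see why the pipe matters, here is a minimal sketch in plain Java. It does not use MALLET at all: ToyAlphabet is a hypothetical stand-in for a vocabulary that maps words to integer ids in order of first appearance, and the sample texts are made up. Under that assumption, a document encoded with a fresh vocabulary gets ids that mean something entirely different to a model trained on the original one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PipeMismatchDemo {

    /** Hypothetical toy vocabulary (not a MALLET class): ids follow first appearance. */
    static class ToyAlphabet {
        private final Map<String, Integer> ids = new LinkedHashMap<>();
        private boolean frozen = false;

        /** After freezing, unknown words are no longer added. */
        void freeze() {
            frozen = true;
        }

        /** Returns the id of a word, or -1 if the alphabet is frozen and the word is unknown. */
        int lookup(String word) {
            Integer id = ids.get(word);
            if (id != null) {
                return id;
            }
            if (frozen) {
                return -1;
            }
            ids.put(word, ids.size());
            return ids.size() - 1;
        }

        /** Lowercases, splits on whitespace, and maps each token to its id. */
        List<Integer> encode(String text) {
            List<Integer> out = new ArrayList<>();
            for (String w : text.toLowerCase().split("\\s+")) {
                out.add(lookup(w));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        // "Training": the vocabulary is built from the training text, then frozen.
        ToyAlphabet trainAlphabet = new ToyAlphabet();
        trainAlphabet.encode("stock market prices fell flu fever cough vaccine");
        trainAlphabet.freeze();

        String newDoc = "flu fever cough";

        // Reusing the training vocabulary: ids match what the model learned.
        System.out.println(trainAlphabet.encode(newDoc));     // [4, 5, 6]

        // Fresh vocabulary: the same words get ids 0, 1, 2, which the
        // trained model would read as "stock market prices".
        System.out.println(new ToyAlphabet().encode(newDoc)); // [0, 1, 2]
    }
}
```

With mismatched ids, a document made purely of "FLU" keywords looks like unrelated words to the model, which would be consistent with the flat distribution described in the question.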


Disclosure: I am familiar with the techniques and math generally used for topic inference, but I have minimal exposure to MALLET.
I hope these semi-educated guesses will lead you to a solution. No guarantees ;-)

I assume that you are using the mallet hlda command to train the model.
A few things that may have gone wrong:

  • Make sure you used the --keep-sequence option during the import phase of the process. By default, MALLET saves the inputs as plain bags of words, losing the order in which the words were originally found. This may be fine for basic classification tasks, but not for topic modeling.
  • Remember that the Gibbs sampling used by MALLET is a stochastic process; expect variation, in particular with small samples. During tests, you can specify the same random seed for each run to ensure reproducible results.
  • What is the size of your training data? 20 topics seems like a lot for initial tests, which are typically based on small, hand-crafted and/or quickly assembled training and test sets.
  • Remember that topic inference is based on sequences of words, not on isolated keywords (your description of the manually created test document mentions "keywords", not "expressions" or "phrases").
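The point about random seeds can be illustrated with a small sketch in plain Java. This is not MALLET's sampler: sampleTopics is a hypothetical helper that draws topic indices from a fixed, made-up distribution, and it only shows that fixing the seed of java.util.Random makes a stochastic procedure repeatable.

```java
import java.util.Arrays;
import java.util.Random;

public class SeededSamplingDemo {

    /** Draws n topic indices from a fixed categorical distribution using the given seed. */
    static int[] sampleTopics(double[] topicProbs, int n, long seed) {
        Random rng = new Random(seed);
        int[] draws = new int[n];
        for (int i = 0; i < n; i++) {
            double u = rng.nextDouble();
            double cumulative = 0.0;
            // Fallback to the last topic guards against floating-point rounding.
            int topic = topicProbs.length - 1;
            for (int k = 0; k < topicProbs.length; k++) {
                cumulative += topicProbs[k];
                if (u < cumulative) {
                    topic = k;
                    break;
                }
            }
            draws[i] = topic;
        }
        return draws;
    }

    public static void main(String[] args) {
        double[] probs = {0.7, 0.2, 0.1}; // made-up topic probabilities

        int[] a = sampleTopics(probs, 20, 42L);
        int[] b = sampleTopics(probs, 20, 42L);
        int[] c = sampleTopics(probs, 20, 7L);

        // Same seed: the two runs are guaranteed to match.
        System.out.println("same seed, identical draws: " + Arrays.equals(a, b)); // true

        // Different seed: the runs will usually differ.
        System.out.println("different seed, identical draws: " + Arrays.equals(a, c));
    }
}
```

When comparing inference results across runs, fixing the seed this way separates real changes in the model or data from plain sampling noise.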