Converting a one-document-per-line file into the Blei lda-c / dtm format for topic modeling?

I am running Latent Dirichlet Allocation analyses for several studies and keep running into the same problem. Most LDA programs expect documents as doclines, that is, a CSV or other delimited file in which each line holds an entire document. However, Blei's lda-c and the dynamic topic model (dtm) software require the data in the format: [M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count] where [M] is the number of unique terms in the document, and the [count] associated with each term is the number of times that term appears in the document. Note that [term_1] is an integer that indexes the term; it is not a string.
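To make the format concrete, here is a minimal Python sketch that formats one already-tokenized document as a single lda-c line. The function name ldac_line and the vocab mapping are hypothetical, introduced only for illustration, and the zero-based term indices are an assumption.

```python
from collections import Counter


def ldac_line(tokens, vocab):
    """Format one tokenized document as a single lda-c line.

    `vocab` is a hypothetical dict mapping each term string to its
    integer index (assumed zero-based here).
    """
    # Count occurrences per term index rather than per term string.
    counts = Counter(vocab[t] for t in tokens)
    # [M] comes first, followed by one index:count pair per unique term.
    fields = [str(len(counts))]
    fields += [f"{idx}:{n}" for idx, n in sorted(counts.items())]
    return " ".join(fields)


vocab = {"a": 0, "major": 1, "headache": 2, "have": 3, "i": 4}
line = ldac_line(["i", "have", "a", "major", "headache", "a"], vocab)
print(line)  # 5 0:2 1:1 2:1 3:1 4:1
```

The document has five unique terms, so [M] is 5, and "a" appears twice, which gives the pair 0:2.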

Does anyone know of a utility that would let me quickly convert to this format? Thanks.

+5
3 answers

If you are working in R, the package lda contains a function, lexicalize, that converts raw text into the lda-c format the lda package needs.

library(lda)

example <- c("I am the very model of a modern major general",
             "I have a major headache")

# lexicalize() tokenizes the text and returns the documents plus the vocabulary
corpus <- lexicalize(example, lower=TRUE)

Similarly, the package topicmodels has a function, dtm2ldaformat, that converts a document-term matrix to the lda format. You can turn plain text documents into a document-term matrix using the package tm, also in R.

With these existing functions, there is a lot of flexibility in getting text into R for topic modeling.

+4

Gensim has a Blei corpus class. You can read your CSV in Python and write it out in lda-c format with gensim.
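If gensim is not an option, the same CSV-to-lda-c conversion can be sketched with the Python standard library alone. This is a plain-Python stand-in, not gensim's API: the function name csv_to_ldac, the text_column parameter, the file names, and the naive lowercase whitespace tokenizer are all assumptions made for illustration.

```python
import csv
from collections import Counter


def csv_to_ldac(csv_path, dat_path, vocab_path, text_column=0):
    """Convert a one-document-per-row CSV into an lda-c data file
    plus a vocabulary file with one term per line.

    Term ids are zero-based and assigned in order of first appearance,
    so line k of the vocabulary file is the term with id k.
    """
    vocab = {}  # term string -> integer id
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(dat_path, "w", encoding="utf-8") as dat:
        for row in csv.reader(src):
            if not row:
                continue  # skip completely blank lines
            # Naive tokenizer (assumption): lowercase, split on whitespace.
            tokens = row[text_column].lower().split()
            # setdefault hands out the next free id to unseen terms.
            counts = Counter(vocab.setdefault(t, len(vocab)) for t in tokens)
            fields = [str(len(counts))]
            fields += [f"{idx}:{n}" for idx, n in sorted(counts.items())]
            dat.write(" ".join(fields) + "\n")
    with open(vocab_path, "w", encoding="utf-8") as out:
        for term in sorted(vocab, key=vocab.get):
            out.write(term + "\n")
    return vocab
```

Calling csv_to_ldac("docs.csv", "docs.dat", "vocab.txt") on a file with one document per row would write the term counts to docs.dat and the matching vocabulary to vocab.txt. Real corpora would likely need a better tokenizer and stop-word handling than this sketch provides.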

+2

MALLET, from the University of Massachusetts Amherst, is another option.

And here's an excellent step-by-step demonstration of how to use MALLET:

You can use MALLET with plain text files as the input source.

+2