Converting a single document into a string in the Blei lda-c / dtm format for modeling topics?

Question

I am running Latent Dirichlet Allocation analyses for several studies and keep running into the same problem. Most LDA programs require documents in a doclines format, that is, a CSV or other delimited file in which each line holds an entire document. However, Blei's lda-c and the dynamic topic model (dtm) software require the data in this format: [M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count], where [M] is the number of unique terms in the document and the [count] associated with each term is the number of times that term appears in the document. Note that [term_1] is an integer that indexes the term; it is not a string.

Does anyone know of a utility that will let me convert to this format quickly? Thanks.

+5

nlp lda

Trey Jan 05 '12 at 10:53

3 answers

Ben · Answer 1 · 2012-12-07T01:39:46+0000

If you are working in R, the lda package contains a function, lexicalize, that converts raw text into the lda-c format that the lda package needs.

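library(lda)

# Two toy documents, one string per document
example <- c("I am the very model of a modern major general",
             "I have a major headache")

# Lowercase the text and convert it to the lda package's corpus format
corpus <- lexicalize(example, lower=TRUE)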

Similarly, the topicmodels package has a function, dtm2ldaformat, that converts a document-term matrix to the lda format. You can convert plain text documents into a document-term matrix using the tm package, also in R.

Thus, with these existing functions, there is a lot of flexibility for getting text into R for topic modeling.

Karsten · Answer 2 · 2013-01-04T15:29:20+0000

Gensim supports the Blei corpus format. You could read your CSV in Python and then save it in lda-c format using gensim.
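For illustration, here is a minimal sketch of converting one-document-per-line text to lda-c lines in plain Python, using only the standard library rather than gensim. The function name docs_to_ldac and the naive lowercase whitespace tokenizer are assumptions made for this example, not part of any library; a real pipeline would probably want proper tokenization and stop-word handling.

```python
from collections import Counter


def docs_to_ldac(docs, vocab=None):
    """Convert an iterable of raw document strings to lda-c lines.

    Returns (lines, vocab), where vocab maps each term string to its
    integer index. Hypothetical helper: tokenization is a naive
    lowercase whitespace split.
    """
    vocab = {} if vocab is None else vocab
    lines = []
    for doc in docs:
        counts = Counter()
        for token in doc.lower().split():
            # Unseen terms get the next free integer index.
            index = vocab.setdefault(token, len(vocab))
            counts[index] += 1
        # "[M] [term_1]:[count] ...", where M is the number of unique terms.
        pairs = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
        lines.append(f"{len(counts)} {pairs}".rstrip())
    return lines, vocab


docs = ["I am the very model of a modern major general",
        "I have a major headache"]
lines, vocab = docs_to_ldac(docs)
print("\n".join(lines))
```

The documents could just as well come from csv.reader over your doclines file. Keep the vocab mapping around (for example, write the terms out in index order), since the lda-c lines only carry the integer indices.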

Mountain · Answer 3 · 2013-02-25T08:52:47+0000

Mallet, from the University of Massachusetts Amherst, is another option.

And here's an excellent step-by-step demonstration of how to use Mallet:

http://programminghistorian.org/lessons/topic-modeling-and-mallet

You can use Mallet with plain text files as the input source.
