How to build a Term-Document-Matrix from a set of texts and a specific set of terms (tags)?

Question

I have two data sets:

a set of tags (single words such as php, html, etc.)
a set of texts

Now I want to build a Term-Document-Matrix that gives the number of occurrences of each tag in each text.

I looked at the R package tm and its TermDocumentMatrix function, but I see no way to specify the tags as input.

Is there any way to do this?

I am open to any tool (R, Python, etc.), although using R would be great.

Set up the data as follows:

TagSet <- data.frame(c("c","java","php","javascript","android"))
colnames(TagSet)[1] <- "tag"

TextSet <- data.frame(c("How to check if a java file is a javascript script java blah","blah blah php"))
colnames(TextSet)[1] <- "text"

Now I would like a TermDocumentMatrix of TextSet whose terms are restricted to the tags in TagSet.

I tried this:

library(tm)

myCorpus <- Corpus(VectorSource(TextSet$text))
tdm <- TermDocumentMatrix(myCorpus, control = list(removePunctuation = TRUE, stopwords = TRUE))


> inspect(tdm)
A term-document matrix (7 terms, 2 documents)

Non-/sparse entries: 8/6
Sparsity           : 43%
Maximal term length: 10 
Weighting          : term frequency (tf)

            Docs
Terms        1 2
  blah       1 2
  check      1 0
  file       1 0
  java       2 0
  javascript 1 0
  php        0 1
  script     1 0

But this counts every word that occurs in the texts, whereas I want to count only the tags I have already defined.

r term-document-matrix

tucson 31 . '13 11:56

DocumentTermMatrix(docs, list(dictionary = Dictionary$Var1))

Fiona_Wang 19 . '16 9:45
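
The dictionary idea in the answer above can be sketched on the question's own data as follows. This is only an illustration under a few assumptions: the names tags, texts, corpus, tdm.tags and m are introduced here, the data are held as plain character vectors rather than data frames, and wordLengths is lowered because tm's documented default minimum term length of 3 would otherwise discard the one-letter tag "c".

```r
library(tm)

# Data from the question, as plain character vectors (names are illustrative)
tags  <- c("c", "java", "php", "javascript", "android")
texts <- c("How to check if a java file is a javascript script java blah",
           "blah blah php")

corpus <- Corpus(VectorSource(texts))

# dictionary: tabulate only against the tags, no other terms are listed
# wordLengths: allow terms of length 1 so that "c" is not filtered out
tdm.tags <- TermDocumentMatrix(
  corpus,
  control = list(dictionary = tags, wordLengths = c(1, Inf))
)

# Convert to an ordinary matrix (terms in rows, documents in columns)
m <- as.matrix(tdm.tags)
print(m)
```

Because the dictionary is applied while the matrix is built, no filtering step is needed afterwards.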

jwijffels · Accepted Answer · 2013-10-31T14:26:29+0000

tdm.onlytags <- tdm[rownames(tdm)%in%TagSet$tag,]
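
Subsetting as above keeps only the tags that already appear as rows of tdm, so for the inspect output in the question the result would contain java, javascript and php but no rows for c or android. If a row for every tag is wanted, including tags that never occur, one possible cross-check without tm is the base R sketch below. It is not taken from either answer; the names tags, texts, tokens and counts are introduced here, and the tokenizer is a deliberately simple assumption (lower-case, split on non-alphanumeric characters), so a tag containing punctuation would not be matched.

```r
# Data from the question, as plain character vectors (names are illustrative)
tags  <- c("c", "java", "php", "javascript", "android")
texts <- c("How to check if a java file is a javascript script java blah",
           "blah blah php")

# Lower-case each text and split it on runs of non-alphanumeric characters
tokens <- strsplit(tolower(texts), "[^[:alnum:]]+")

# Tabulate each document's tokens against the tag set only:
# factor(levels = tags) drops every non-tag word and keeps zero-count tags
counts <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = tags))),
  integer(length(tags))
)

# Label rows with the tags and columns with the document index
dimnames(counts) <- list(Terms = tags, Docs = as.character(seq_along(texts)))

print(counts)
```

Here java is counted twice and javascript once in the first text, php once in the second, and c and android appear as rows of zeros.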
