I am doing some text analysis in Python. Unfortunately, I need to switch to R to use a specific package that cannot easily be replicated in Python.
Currently, the text is parsed into bigram counts, reduced to a vocabulary of about 11,000 bigrams, and then saved as a dictionary:
{id1: {'bigrams': [(bigram1, count), (bigram2, count), ...]}, id2: {'bigrams': ...}, ...}
I need to get this into a dgCMatrix in R, where the rows are id1, id2, ... and the columns are the different bigrams, so that each cell holds the "count" for that id-bigram pair.
Any suggestions? I was thinking of simply expanding it into a massive CSV, but that seems very inefficient and probably infeasible given memory constraints.
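One alternative to a dense CSV that I am considering is to write only the non-zero cells as (row, column, count) triplets in Matrix Market coordinate format. Below is a minimal sketch using only the standard library. The function name `write_matrix_market`, the three output files, and the choice to join tuple bigrams with a space are my own assumptions for illustration, not an established recipe.

```python
import os
import tempfile


def write_matrix_market(docs, mtx_path, rows_path, cols_path):
    """Write {id: {'bigrams': [(bigram, count), ...]}} as sparse triplets.

    Hypothetical helper: produces a Matrix Market coordinate file plus two
    plain-text files holding the row names (ids) and column names (bigrams).
    """
    row_ids = list(docs)   # row order follows dict insertion order
    col_index = {}         # bigram -> 0-based column, in order of first appearance
    entries = []           # (row, col, count), 1-based as the format requires

    for i, doc_id in enumerate(row_ids, start=1):
        merged = {}        # sum counts if a bigram is listed twice for one id
        for bigram, count in docs[doc_id]["bigrams"]:
            j = col_index.setdefault(bigram, len(col_index))
            merged[j] = merged.get(j, 0) + count
        for j, count in merged.items():
            if count:      # only non-zero cells are stored
                entries.append((i, j + 1, count))

    with open(mtx_path, "w", encoding="utf-8") as f:
        f.write("%%MatrixMarket matrix coordinate integer general\n")
        f.write(f"{len(row_ids)} {len(col_index)} {len(entries)}\n")
        for i, j, count in entries:
            f.write(f"{i} {j} {count}\n")

    with open(rows_path, "w", encoding="utf-8") as f:
        for doc_id in row_ids:
            f.write(f"{doc_id}\n")

    with open(cols_path, "w", encoding="utf-8") as f:
        for bigram in col_index:  # dict order matches the column numbering
            name = bigram if isinstance(bigram, str) else " ".join(bigram)
            f.write(f"{name}\n")


# Tiny made-up example, written to a temporary directory.
example = {
    "id1": {"bigrams": [(("new", "york"), 2), (("york", "city"), 1)]},
    "id2": {"bigrams": [(("york", "city"), 3)]},
}
out_dir = tempfile.mkdtemp()
write_matrix_market(
    example,
    os.path.join(out_dir, "counts.mtx"),
    os.path.join(out_dir, "rows.txt"),
    os.path.join(out_dir, "cols.txt"),
)
```

On the R side, my understanding is that `Matrix::readMM()` can read such a file, that the result can be coerced with `as(m, "CsparseMatrix")` to get a dgCMatrix, and that the row and column names can be attached from the two text files via `readLines()`. I have not verified this end to end, so treat it as an assumption.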
Craig