Suppose you have only two documents, "I like pizza" and "I like pasta". Your entire vocabulary consists of these words: (I, like, pizza, pasta). Each word in the vocabulary is assigned an index, for example (1, 2, 3, 4). Given a document such as "I like pasta", it can then be converted to the vector [1, 2, 4]. This is what learn.preprocessing.VocabularyProcessor does.

The max_document_length parameter ensures that every document is represented by a vector of exactly max_document_length entries: documents shorter than max_document_length are padded, and documents longer than max_document_length are truncated.

I hope this helps.
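To make the mapping concrete, here is a minimal sketch in plain Python of the idea described above. It is not the library's implementation: build_vocab and to_ids are hypothetical helper names, splitting on whitespace stands in for the real tokenizer, and using 0 for padding and for unknown words is an assumption of this sketch.

```python
def build_vocab(documents):
    """Assign each distinct word an index starting at 1 (0 is kept for padding)."""
    vocab = {}
    for doc in documents:
        for word in doc.split():  # naive whitespace tokenizer (assumption)
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def to_ids(document, vocab, max_document_length):
    """Map words to indices, then truncate or zero-pad to a fixed length."""
    ids = [vocab.get(word, 0) for word in document.split()]  # unknown word -> 0
    ids = ids[:max_document_length]                           # cut off long documents
    return ids + [0] * (max_document_length - len(ids))       # pad short documents


docs = ["I like pizza", "I like pasta"]
vocab = build_vocab(docs)

print(vocab)                             # {'I': 1, 'like': 2, 'pizza': 3, 'pasta': 4}
print(to_ids("I like pasta", vocab, 5))  # [1, 2, 4, 0, 0]
print(to_ids("I like pasta", vocab, 2))  # [1, 2]
```

With a length of 5 the three-word document is padded out to five entries, and with a length of 2 it is cut off after the second word, which is the effect of max_document_length.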