Doc2vec: TaggedLineDocument ()

Question

I'm trying to learn and understand Doc2Vec, and I'm following this tutorial. My input is a list of documents, i.e. a list of word lists. This is what my code looks like:

  input = [["word1","word2",..."wordn"],["word1","word2",..."wordn"],...]
  documents = TaggedLineDocument(input)
  model = doc2vec.Doc2Vec(documents, size = 50, window = 10, min_count = 2, workers=2)

But I get what looks like a unicode error (I searched for this error, but found nothing useful):

  TypeError('don\'t know how to handle uri %s' % repr(uri))

Can someone please help me understand where I am going wrong? Thanks!

python nlp gensim

Auk Apr 21 '16 at 20:47

2 answers

aberger · Answer 1 · 2016-04-21T21:02:05+0000

TaggedLineDocument must be created from a file path, not from a list. Make sure the file is formatted so that each line holds exactly one document.

  documents = TaggedLineDocument('myfile.txt')
  documents = TaggedLineDocument('compressed_text.txt.gz')

From the source code, the uri (which is what you are passing when you create a TaggedLineDocument) can be:

 1. a URI for the local filesystem (compressed ``.gz`` or ``.bz2`` files handled automatically): `./lines.txt`, `/home/joe/lines.txt.gz`, `file:///home/joe/lines.txt.bz2`
 2. a URI for HDFS: `hdfs:///some/path/lines.txt`
 3. a URI for Amazon S3 (can also supply credentials inside the URI): `s3://my_bucket/lines.txt`, `s3://my_aws_key_id:key_secret@my_bucket/lines.txt`
 4. an instance of the boto.s3.key.Key class.
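Since the question starts from word lists that are already in memory, one way to follow this answer is to write them to a file first, one document per line with tokens separated by spaces. The following is only a minimal sketch, assuming gensim 4.x (where the `size` parameter from the question is named `vector_size`); `write_line_corpus`, `train_from_file`, the toy data, and the file name are hypothetical and used for illustration only:

```python
import os
import tempfile


def write_line_corpus(token_lists, path):
    """Write one document per line, tokens joined by single spaces."""
    with open(path, "w", encoding="utf-8") as handle:
        for tokens in token_lists:
            handle.write(" ".join(tokens) + "\n")
    return path


def train_from_file(path):
    """Train Doc2Vec on a one-document-per-line file (requires gensim)."""
    # Imported lazily so write_line_corpus works without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

    # TaggedLineDocument expects a file path (the uri described above),
    # not a list; each line becomes a document tagged with its line number.
    documents = TaggedLineDocument(path)
    # vector_size is the gensim 4.x name; older releases called it size.
    return Doc2Vec(documents, vector_size=50, window=10, min_count=2, workers=2)


# Toy data standing in for the question's list of word lists.
toy_docs = [["word1", "word2", "word3"], ["word2", "word3", "word4"]]
corpus_path = write_line_corpus(
    toy_docs, os.path.join(tempfile.mkdtemp(), "lines.txt")
)
```

Calling `train_from_file(corpus_path)` would then build the model. This only works if no token contains whitespace, because TaggedLineDocument splits each line on whitespace.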

LunaRivolxoxo · Answer 2 · 2017-09-22T08:06:08+0000

For data, I have a list formatted the same way as yours:

[['aw', 'wb', 'ce', 'uw', 'qqg'], ['g', 'e', 'ent', 'va'], ['a'], ...]

For labels, I have a list: [1, 0, 0, ...]. It gives the class of each sentence above; any class (tag) can be used here, not only 1 or 0.

Since we already have lists as shown above, we can use TaggedDocument rather than TaggedLineDocument:

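  from gensim.models.doc2vec import TaggedDocument

  def myDataFlow(self, data, labels):
      # Pair each word list with its label, used as the document tag
      for i, j in zip(data, labels):
          yield TaggedDocument(i, [j])

  # A generator can only be consumed once, so build a list before training
  model = gensim.models.Doc2Vec(list(self.myDataFlow(data, labels)))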
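One thing to watch when pairing data with labels: `zip` silently stops at the shorter of the two lists, so a mismatch would drop documents without any warning. A small sketch that checks the lengths first and returns a plain list; `pair_with_tags` and `to_tagged_documents` are hypothetical helper names, and the lazy import assumes gensim is installed when the second helper is called:

```python
def pair_with_tags(data, labels):
    """Return (words, [tag]) pairs, refusing mismatched inputs."""
    if len(data) != len(labels):
        raise ValueError(
            f"{len(data)} documents but {len(labels)} labels"
        )
    return [(list(words), [tag]) for words, tag in zip(data, labels)]


def to_tagged_documents(pairs):
    """Wrap the pairs in gensim TaggedDocument objects (requires gensim)."""
    # Imported lazily so pair_with_tags can be used without gensim.
    from gensim.models.doc2vec import TaggedDocument

    return [TaggedDocument(words, tags) for words, tags in pairs]


# Sample data in the same shape as the lists shown in this answer.
data = [["aw", "wb", "ce", "uw", "qqg"], ["g", "e", "ent", "va"], ["a"]]
labels = [1, 0, 0]
pairs = pair_with_tags(data, labels)
```

The list returned by `to_tagged_documents(pairs)` can be passed to `gensim.models.Doc2Vec(...)`. Keep in mind that documents sharing a tag (here, a class label) share one learned vector.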
