What is the meaning of orth, lemma, tag and pos?
See https://spacy.io/docs/usage/pos-tagging#pos-schemes
What is the difference between print(word) and print(word.orth_)?
In super brief:
word.orth_ and word.text are the same. In spaCy, a property whose name ends in an underscore is the string form of the integer-valued property without it, so word.orth is an integer ID and word.orth_ is the corresponding string.
In short:
When you access the word.orth_ property at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L537 , it uses the word's integer ID as an index into the string store where all the words of the vocabulary are kept:

    property orth_:
        def __get__(self):
            return self.vocab.strings[self.c.lex.orth]

(See "In long" below for an explanation of self.c.lex.orth.)
And word.text returns a string representation of the word that simply wraps around the orth_ property, see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L128
    property text:
        def __get__(self):
            return self.orth_
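To make the relationship between orth, orth_ and text concrete, here is a minimal pure-Python sketch. ToyStringStore and ToyToken are hypothetical stand-ins, not spaCy classes, and the sequential integer IDs are an assumption of the sketch rather than how spaCy assigns them.

```python
class ToyStringStore:
    """Hypothetical stand-in for vocab.strings: maps strings to integer IDs and back."""

    def __init__(self):
        self._ids = {}      # string -> integer ID
        self._strings = []  # integer ID -> string

    def add(self, string):
        # Intern the string: reuse the existing ID if it was seen before.
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def __getitem__(self, key):
        # An integer ID resolves to its string.
        return self._strings[key]


class ToyToken:
    """Hypothetical token that stores only an integer ID, not the text itself."""

    def __init__(self, strings, orth):
        self.strings = strings
        self.orth = orth  # integer ID, playing the role of Token.orth

    @property
    def orth_(self):
        # String form of the ID, looked up in the string store.
        return self.strings[self.orth]

    @property
    def text(self):
        # Thin wrapper around orth_.
        return self.orth_


strings = ToyStringStore()
words = [ToyToken(strings, strings.add(w)) for w in "this is a foo bar foo".split()]
foo_first, foo_second = words[3], words[5]
print(foo_first.orth, foo_first.orth_, foo_first.text)
```

Both occurrences of "foo" share one integer ID, and orth_ and text resolve that ID to the same string.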
And when you call print(word), Python calls the __str__ dunder method (__repr__ simply delegates to it), which returns word.__unicode__() on Python 3 or word.__bytes__() on Python 2; both are built from word.text, see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L55

    cdef class Token:
        """
        An individual token --- ie a word, punctuation symbol, whitespace, etc.
        """
        def __cinit__(self, Vocab vocab, Doc doc, int offset):
            self.vocab = vocab
            self.doc = doc
            self.c = &self.doc.c[offset]
            self.i = offset

        def __hash__(self):
            return hash((self.doc, self.i))

        def __len__(self):
            """
            Number of unicode characters in token.text.
            """
            return self.c.lex.length

        def __unicode__(self):
            return self.text

        def __bytes__(self):
            return self.text.encode('utf8')

        def __str__(self):
            if is_config(python3=True):
                return self.__unicode__()
            return self.__bytes__()

        def __repr__(self):
            return self.__str__()
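The delegation in the class above can be reproduced with a small Python 3 sketch. PrintableToken is a hypothetical class, not spaCy's Token, and the Python 2 bytes branch is left out for brevity.

```python
class PrintableToken:
    """Hypothetical token whose dunder methods mirror the delegation shown above."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        # print() and str() call __str__, which hands back the text.
        return self.text

    def __repr__(self):
        # repr() (used by the interactive prompt) delegates to __str__.
        return self.__str__()


word = PrintableToken("foo")
print(word)       # goes through __str__
print(word.text)  # prints the attribute directly
```

Both print calls emit the same string, which is why print(word) and print(word.text) look identical.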
In long:
Let's go through this step by step:

    >>> import spacy
    >>> nlp = spacy.load('en')
    >>> doc = nlp(u'This is a foo bar sentence.')
    >>> type(doc)
    <type 'spacy.tokens.doc.Doc'>

After the sentence is passed to the nlp() function, it produces a spacy.tokens.doc.Doc object. From the docs:

    cdef class Doc:
        """
        A sequence of `Token` objects. Access sentences and named entities,
        export annotations to numpy arrays, losslessly serialize to compressed
        binary strings.

        Aside: Internals
            The `Doc` object holds an array of `TokenC` structs.
            The Python-level `Token` and `Span` objects are views of this
            array, ie they don't own the data themselves.

        Code: Construction 1
            doc = nlp.tokenizer(u'Some text')

        Code: Construction 2
            doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)])
        """
So, the spacy.tokens.doc.Doc object is a sequence of spacy.tokens.token.Token objects. Inside the Token class, we see a whole list of Cython property declarations, e.g. at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

    property orth:
        def __get__(self):
            return self.c.lex.orth
Tracking it, we see that self.c = &self.doc.c[offset] :
    cdef class Token:
        """
        An individual token --- ie a word, punctuation symbol, whitespace, etc.
        """
        def __cinit__(self, Vocab vocab, Doc doc, int offset):
            self.vocab = vocab
            self.doc = doc
            self.c = &self.doc.c[offset]
            self.i = offset

Without thorough documentation, we don't really know what self.c means, but from the look of it, it refers to one of the tokens within self.doc.c, where doc is the Doc that was passed to the __cinit__ function. So most probably it is a shortcut for accessing that token's data.
Looking at Doc.c :
    cdef class Doc:
        def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
            self.vocab = vocab
            size = 20
            self.mem = Pool()
            # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
            # However, we need to remember the true starting places, so that we can
            # realloc.
            data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
            cdef int i
            for i in range(size + (PADDING*2)):
                data_start[i].lex = &EMPTY_LEXEME
                data_start[i].l_edge = i
                data_start[i].r_edge = i
            self.c = data_start + PADDING

Now we see that Doc.c points into data_start, a Cython array of TokenC structs whose memory is allocated to hold the tokens of the spacy.tokens.doc.Doc object (please correct me if I got the explanation of <TokenC*> wrong).
So, going back to self.c = &self.doc.c[offset], it is basically taking the address of a memory location inside that array, more specifically of the "offset-th" item in the array, which is the data behind a spacy.tokens.token.Token.
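The "view into an array" idea can be sketched in plain Python. ToyTokenC, ToyDoc and ToyToken are hypothetical names for illustration only; the sketch flattens the lex indirection into a single orth field, and uses a Python object reference where the Cython code takes a pointer with &self.doc.c[offset].

```python
from dataclasses import dataclass


@dataclass
class ToyTokenC:
    """Hypothetical stand-in for the TokenC struct: holds the per-token data."""
    orth: int


class ToyDoc:
    """Hypothetical Doc: owns one array of ToyTokenC records."""

    def __init__(self, orths):
        self.c = [ToyTokenC(orth) for orth in orths]

    def __getitem__(self, offset):
        # Build a lightweight view on demand; no token data is copied.
        return ToyToken(self, offset)


class ToyToken:
    """Hypothetical Token: a view onto doc.c[offset] that owns no data itself."""

    def __init__(self, doc, offset):
        self.doc = doc
        self.i = offset
        self.c = doc.c[offset]  # reference to the record, like &self.doc.c[offset]

    @property
    def orth(self):
        return self.c.orth


doc = ToyDoc([10, 11, 12])
first_view = doc[1]
second_view = doc[1]
doc.c[1].orth = 99  # change the underlying record once
print(first_view.orth, second_view.orth)
```

Because both views point at the same record, a change to the Doc's array is visible through every token object created from it.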
Going back to the property:

    property orth:
        def __get__(self):
            return self.c.lex.orth

We see that self.c.lex refers to data_start[i].lex from spacy.tokens.doc.Doc, and self.c.lex.orth is just an integer: the ID under which the word's string is stored in the vocabulary that the Doc holds (self.vocab.strings).
Thus, we see that the orth_ property looks up self.vocab.strings with the index from self.c.lex.orth, see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

    property orth_:
        def __get__(self):
            return self.vocab.strings[self.c.lex.orth]