spaCy Documentation for [orth, pos, tag, lemma and text]

I am new to spaCy. I added this entry for documentation and simplified it for beginners like me.

    import spacy

    nlp = spacy.load('en')
    doc = nlp(u'KEEP CALM because TOGETHER We Rock !')
    for word in doc:
        print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
        print(word.orth_)

I would like to understand what orth, lemma, tag and pos mean. The code above prints their values. Also, what is the difference between print(word) and print(word.orth_)?

2 answers

What is the meaning of orth, lemma, tag and pos?

See https://spacy.io/docs/usage/pos-tagging#pos-schemes

What is the difference between print(word) and print(word.orth_)?

In super short:

word.orth_ and word.text are the same. In spaCy, an attribute whose name ends with an underscore returns the string form of a value, while the same name without the underscore returns its integer ID.

In short:

When you access the word.orth_ property at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L537 , it uses the token's integer ID to look up the string in self.vocab.strings, where all the strings of the vocabulary are kept:

    property orth_:
        def __get__(self):
            return self.vocab.strings[self.c.lex.orth]

(For more details, see the "In long" part below for an explanation of self.c.lex.orth.)

And word.text returns a string representation of the word that simply wraps around the orth_ property, see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L128

    property text:
        def __get__(self):
            return self.orth_
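To make the relationship between orth, orth_ and text concrete, here is a minimal pure-Python sketch. ToyStringStore and ToyToken are hypothetical stand-ins, not spaCy classes, and the list-position IDs used here are an assumption made for simplicity; spaCy's real string store is implemented in Cython and assigns IDs differently.

```python
class ToyStringStore:
    """Hypothetical stand-in for vocab.strings: maps strings to integer IDs and back."""

    def __init__(self):
        self._strings = []  # integer ID -> string
        self._ids = {}      # string -> integer ID

    def add(self, string):
        # Reuse the existing ID if the string was seen before.
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def __getitem__(self, key):
        # Look up the string that belongs to an integer ID.
        return self._strings[key]


class ToyToken:
    """Hypothetical stand-in for Token: stores only an integer ID."""

    def __init__(self, strings, orth_id):
        self.strings = strings
        self.orth = orth_id  # integer ID, in the role of word.orth

    @property
    def orth_(self):
        # String form, resolved through the string store.
        return self.strings[self.orth]

    @property
    def text(self):
        # Thin wrapper around orth_, as in the snippet above.
        return self.orth_


strings = ToyStringStore()
sentence = "KEEP CALM because TOGETHER We Rock !"
tokens = [ToyToken(strings, strings.add(w)) for w in sentence.split()]
for tok in tokens:
    print(tok.orth, tok.orth_, tok.text)
```

The token itself carries only the integer; both orth_ and text go through the same lookup, which is why they always agree.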

And when you call print(word), Python calls the __str__ dunder method (which __repr__ also delegates to). It returns word.__unicode__() or word.__bytes__(), depending on the Python version, and both are built from word.text; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L55

    cdef class Token:
        """
        An individual token --- ie a word, punctuation symbol, whitespace, etc.
        """
        def __cinit__(self, Vocab vocab, Doc doc, int offset):
            self.vocab = vocab
            self.doc = doc
            self.c = &self.doc.c[offset]
            self.i = offset

        def __hash__(self):
            return hash((self.doc, self.i))

        def __len__(self):
            """
            Number of unicode characters in token.text.
            """
            return self.c.lex.length

        def __unicode__(self):
            return self.text

        def __bytes__(self):
            return self.text.encode('utf8')

        def __str__(self):
            if is_config(python3=True):
                return self.__unicode__()
            return self.__bytes__()

        def __repr__(self):
            return self.__str__()
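The delegation chain in the class above can be imitated in plain Python 3. PrintableToken is a hypothetical class written only for this illustration; it drops the Python 2 branch and keeps just the parts that matter when printing.

```python
class PrintableToken:
    """Hypothetical stand-in mirroring the __str__/__repr__ delegation (Python 3 only)."""

    def __init__(self, text):
        self.text = text

    def __unicode__(self):
        return self.text

    def __str__(self):
        # print(token) and str(token) end up here.
        return self.__unicode__()

    def __repr__(self):
        # repr(token) and the interactive prompt end up here.
        return self.__str__()


word = PrintableToken("Rock")
print(word)        # goes through __str__
print(word.text)   # prints the plain string attribute
print([word])      # containers show their items through __repr__
```

The printed text is identical in the first two cases, but word is still an object of its own class while word.text is an ordinary string.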

In long:

Try to go step by step:

    >>> import spacy
    >>> nlp = spacy.load('en')
    >>> doc = nlp(u'This is a foo bar sentence.')
    >>> type(doc)
    <type 'spacy.tokens.doc.Doc'>

After the sentence is passed to the nlp() function, it creates a spacy.tokens.doc.Doc object. From the docs:

    cdef class Doc:
        """
        A sequence of `Token` objects. Access sentences and named entities,
        export annotations to numpy arrays, losslessly serialize to compressed
        binary strings.

        Aside: Internals
            The `Doc` object holds an array of `TokenC` structs.
            The Python-level `Token` and `Span` objects are views of this
            array, ie they don't own the data themselves.

        Code: Construction 1
            doc = nlp.tokenizer(u'Some text')

        Code: Construction 2
            doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)])
        """

So, the spacy.tokens.doc.Doc object is a sequence of spacy.tokens.token.Token objects. Inside the Token object, we see a whole series of cython property declarations, e.g. at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

    property orth:
        def __get__(self):
            return self.c.lex.orth

Tracing it back, we see that self.c = &self.doc.c[offset] :

    cdef class Token:
        """
        An individual token --- ie a word, punctuation symbol, whitespace, etc.
        """
        def __cinit__(self, Vocab vocab, Doc doc, int offset):
            self.vocab = vocab
            self.doc = doc
            self.c = &self.doc.c[offset]
            self.i = offset

Without thorough documentation, we don't really know what self.c means, but from the look of it, it refers to one of the tokens within self.doc, i.e. the Doc doc that was passed to the __cinit__ function. So it is most probably a shorthand for accessing the tokens.

Looking at Doc.c :

    cdef class Doc:
        def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
            self.vocab = vocab
            size = 20
            self.mem = Pool()
            # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
            # However, we need to remember the true starting places, so that we can
            # realloc.
            data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
            cdef int i
            for i in range(size + (PADDING*2)):
                data_start[i].lex = &EMPTY_LEXEME
                data_start[i].l_edge = i
                data_start[i].r_edge = i
            self.c = data_start + PADDING

Now we see that Doc.c points into data_start, a cython array of TokenC structs that is allocated to store the tokens of the spacy.tokens.doc.Doc object (please correct me if I got the explanation of <TokenC*> wrong).
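The padding trick in the constructor can be imitated in pure Python. This is only a rough model: PaddedArray is a hypothetical class, the PADDING value is an assumption chosen for the example, and since Python lists have no pointer arithmetic, the shifted base pointer is imitated by adding PADDING to every index.

```python
PADDING = 5  # assumed value, for illustration only


class PaddedArray:
    """Hypothetical model of `self.c = data_start + PADDING`."""

    def __init__(self, size, empty="<EMPTY>"):
        # Allocate size + 2 * PADDING slots, all set to a shared empty marker.
        self._data = [empty] * (size + PADDING * 2)
        self.size = size

    def _slot(self, i):
        # Logical indices from -PADDING up to size + PADDING - 1 are in bounds.
        if not -PADDING <= i < self.size + PADDING:
            raise IndexError(i)
        return i + PADDING

    def __getitem__(self, i):
        return self._data[self._slot(i)]

    def __setitem__(self, i, value):
        self._data[self._slot(i)] = value


c = PaddedArray(size=3)
c[0], c[1], c[2] = "This", "is", "foo"
print(c[0], c[2])
print(c[-1], c[3])  # reads just outside the real tokens land in the padding
```

Because every slot starts out as the empty marker, code that looks a little before the first token or after the last one gets a harmless placeholder instead of reading out of bounds.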

So, returning to self.c = &self.doc.c[offset] , it basically takes the address of the memory location where the array is stored, and more specifically the address of the "offset-th" element of the array.

That element is what the spacy.tokens.token.Token refers to.
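The idea that a Token is only a view onto data owned by the Doc can be sketched in pure Python. ToyDoc and ToyTokenView are hypothetical names; a mutable dict plays the role of a TokenC struct, and a plain object reference plays the role of the pointer. For simplicity the record stores the word string directly rather than an integer ID.

```python
class ToyDoc:
    """Hypothetical stand-in for Doc: owns the token records."""

    def __init__(self, words):
        # One mutable record per token, in the role of a TokenC struct.
        self.c = [{"word": w} for w in words]

    def __getitem__(self, offset):
        return ToyTokenView(self, offset)


class ToyTokenView:
    """Hypothetical stand-in for Token: a view that owns no token data."""

    def __init__(self, doc, offset):
        self.doc = doc
        self.i = offset
        # Reference to the record inside the doc, like &self.doc.c[offset].
        self.c = doc.c[offset]

    @property
    def word(self):
        return self.c["word"]


doc = ToyDoc(["This", "is", "a", "foo", "bar", "sentence", "."])
token = doc[3]
print(token.i, token.word)

# Changing the record owned by the doc is visible through the existing view.
doc.c[3]["word"] = "baz"
print(token.word)
```

Nothing is copied when the view is created, so the view and the doc always agree on the token's data.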


Returning to the property:

    property orth:
        def __get__(self):
            return self.c.lex.orth

We see that self.c.lex refers to data_start[i].lex from spacy.tokens.doc.Doc , and self.c.lex.orth is simply an integer that identifies the word in the vocabulary kept internally by spacy.tokens.doc.Doc .

Thus, we see that the orth_ property accesses self.vocab.strings with the integer index from self.c.lex.orth ; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

    property orth_:
        def __get__(self):
            return self.vocab.strings[self.c.lex.orth]

1) When you print word , you are basically printing an instance of spaCy's Token class, which is set up to return a string when printed. You can see this in the source of the Token class. So it is different from printing word.orth_ or word.text , which are plain strings and are printed directly.

2) I'm not sure about word.orth_ ; it seems to be the same as word.text in most cases. word.lemma_ is the lemmatized form of the given word, e.g. is , am and are will all be mapped to be in word.lemma_ .
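As a rough illustration of what lemmatization means, here is a toy lookup in pure Python. TOY_LEMMAS and toy_lemma are hypothetical names and the table is an assumption made for this example; spaCy's real lemmatizer is more elaborate than a three-entry dictionary.

```python
# Assumed toy lookup table, for illustration only.
TOY_LEMMAS = {"is": "be", "am": "be", "are": "be"}


def toy_lemma(word):
    # Fall back to the lowercased word itself when the table has no entry.
    key = word.lower()
    return TOY_LEMMAS.get(key, key)


for w in ["is", "am", "are", "Rock"]:
    print(w, "->", toy_lemma(w))
```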
