I have expanded and adjusted code samples from sense2vec.
You go from this input text:
“As for Saudi Arabia and its motives, it’s also very simple. Good money and arithmetic. Faced with the painful choice of losing money maintaining current production at US$60 a barrel, or taking two million barrels per day off the market and losing much more money, it’s an easy choice: take the path that is less painful. If there are secondary reasons, such as hurting US tight oil producers or hurting Iran and Russia, that’s great, but it’s really just about the money.”
For this:
as | ADV far | ADV as | ADP saudi_arabia | ENT and | CCONJ him | ADJ motif | NOUN that | ADJ […] | ADP money | NOUN and | CCONJ arithmetic | NOUN faces | VERB with | ADP painful_choice | NOUN of | ADP losses | VERB money | NOUN supports | VERB current_production | NOUN at | ADP us $ | SYM 60 | MONEY for | ADP barrel | NOUN or | CCONJ accepts | VERB two_million | CARDINAL barrel | NOUN per | ADP day | NOUN off | ADP market | NOUN and | CCONJ losses | VERB much_more_money | NOUN it | PRON 's | VERB easy_choice | NOUN take | VERB path | NOUN that | ADJ is | VERB less | ADV painful | ADJ if | ADP there | ADV are | VERB secondary_reason | NOUN like | ADP hurting | VERB us | ENT tight_oil_producer | NOUN or | CCONJ hurting | VERB iran | ENT and | CCONJ russia | ENT 's | VERB great | ADJ but | CCONJ it | PRON 's | VERB really | ADV just | ADV about | ADP money | NOUN
- Double line breaks are interpreted as document boundaries.
- URLs are recognized as such, shortened to their domain.tld, and tagged as URL.
- Nouns (including nouns that are part of noun chunks) are lemmatized (e.g. motives becomes motive).
- Words with POS tags such as DET (determiner) and PUNCT (punctuation) are discarded.
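To make the first two conventions concrete, here is a small standalone sketch using the same regexes the script itself defines; the sample string is made up for illustration:

import re

# Blank lines separate documents; URLs are reduced to their domain.tld part.
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
double_linebreak_re = re.compile('\n{2,}')

raw = "First document.\n\nSecond document, see https://example.com/page"
docs = double_linebreak_re.split(raw)
print(len(docs))        # 2

match = url_re.search('https://example.com/page')
print(match.group(3))   # example.com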
Here is the code. Let me know if you have any questions.
I will publish it soon at github.com/woltob.
import re

import spacy

# Targets the spaCy 1.x API (spacy.load('en'), span.merge()).
nlp = spacy.load('en')
nlp.matcher = None

LABELS = {
    'ENT': 'ENT',
    'PERSON': 'PERSON',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}

pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
single_linebreak_re = re.compile('\n')
double_linebreak_re = re.compile('\n{2,}')
whitespace_re = re.compile(r'[ \t]+')
quote_re = re.compile(r'"|`|´')


def strip_meta(text):
    text = text.replace('per cent', 'percent')
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = pre_format_re.sub('', text)
    text = post_format_re.sub('', text)
    # Protect double line breaks (document boundaries), collapse single ones.
    text = double_linebreak_re.sub('{2break}', text)
    text = single_linebreak_re.sub(' ', text)
    text = text.replace('{2break}', '\n')
    text = whitespace_re.sub(' ', text)
    text = quote_re.sub('', text)
    return text


def transform_doc(doc):
    # Merge named entities and base noun phrases into single tokens.
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    for np in doc.noun_chunks:
        while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
            np = np[1:]
        np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        sentence = []
        if sent.text.strip():
            for w in sent:
                if w.is_space:
                    continue
                w_ = represent_word(w)
                if w_:
                    sentence.append(w_)
            strings.append(' '.join(sentence))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''


def represent_word(word):
    if word.like_url:
        x = url_re.search(word.text.strip().lower())
        if x:
            return x.group(3) + '|URL'
        else:
            return word.text.lower().strip() + '|URL?'
    text = re.sub(r'\s', '_', word.text.strip().lower())
    # The entity label (if present) overrides the part-of-speech tag.
    tag = LABELS.get(word.ent_type_, word.pos_)
    # Discard determiners, punctuation and untagged words, as described above.
    if not tag or tag in ('DET', 'PUNCT'):
        return None
    return text + '|' + tag
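To see the resulting text|TAG format without loading a spaCy model, here is a self-contained sketch of the same formatting logic; format_token and LABELS_EXCERPT are illustrative stand-ins for the script's represent_word and LABELS, not part of the script itself:

import re

# Excerpt of the LABELS mapping, for illustration only.
LABELS_EXCERPT = {'GPE': 'ENT', 'MONEY': 'MONEY'}

def format_token(text, pos, ent_type=''):
    # Whitespace inside merged phrases becomes an underscore, and the
    # entity label (if present) overrides the part-of-speech tag.
    text = re.sub(r'\s', '_', text.strip().lower())
    tag = LABELS_EXCERPT.get(ent_type, pos)
    if tag in ('DET', 'PUNCT'):  # discarded, as in the bullet list above
        return None
    return text + '|' + tag

print(format_token('Saudi Arabia', 'PROPN', 'GPE'))  # saudi_arabia|ENT
print(format_token('money', 'NOUN'))                 # money|NOUN
print(format_token('the', 'DET'))                    # None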
You can visualize the resulting model in TensorBoard via Gensim using this approach: https://github.com/ArdalanM/gensim2tensorboard
You can also adjust this code to match the sense2vec approach more closely (for example, words are lowercased during preprocessing here; just comment that out in the code).
Happy coding, woltob