How to train a sense2vec model

The sense2vec documentation mentions three main files; the first of them is merge_text.py. I have tried several input types (txt, csv, bzipped files), since merge_text.py tries to open bzip2-compressed files.

The file can be found at: https://github.com/spacy-io/sense2vec/blob/master/bin/merge_text.py

What input format does this script require? Also, can anyone suggest how to train the model?

2 answers

I have expanded and adjusted code samples from sense2vec.

You go from this input text:

"As far as Saudi Arabia and its motives, that is very simple also. The Saudis are good at money and arithmetic. Faced with the painful choice of losing money maintaining current production at US$60 per barrel or taking two million barrels per day off the market and losing much more money - it's an easy choice: take the path that is less painful. If there are secondary reasons like hurting US tight oil producers or hurting Iran and Russia, that's great, but it's really just about the money."

To this:

as|ADV far|ADV as|ADP saudi_arabia|ENT and|CCONJ its|ADJ motive|NOUN that|ADJ is|VERB very|ADV simple|ADJ also|ADV saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CCONJ arithmetic|NOUN faced|VERB with|ADP painful_choice|NOUN of|ADP losing|VERB money|NOUN maintaining|VERB current_production|NOUN at|ADP us$|SYM 60|MONEY per|ADP barrel|NOUN or|CCONJ taking|VERB two_million|CARDINAL barrel|NOUN per|ADP day|NOUN off|ADP market|NOUN and|CCONJ losing|VERB much_more_money|NOUN it|PRON 's|VERB easy_choice|NOUN take|VERB path|NOUN that|ADJ is|VERB less|ADV painful|ADJ if|ADP there|ADV are|VERB secondary_reason|NOUN like|ADP hurting|VERB us|ENT tight_oil_producer|NOUN or|CCONJ hurting|VERB iran|ENT and|CCONJ russia|ENT 's|VERB great|ADJ but|CCONJ it|PRON 's|VERB really|ADV just|ADV about|ADP money|NOUN

  • Double line breaks are treated as separate documents.
  • URLs are detected, reduced to their domain.tld, and tagged |URL.
  • Nouns (including nouns that are part of a noun phrase) are lemmatized (e.g. motives becomes motive).
  • Tokens with POS tags such as DET (determiners like the) and PUNCT (punctuation) are discarded.

Here is the code. Let me know if you have any questions.

I will publish it soon at github.com/woltob.

    import re
    import spacy

    # Note: this code uses the spaCy 1.x API ('en' shortcut model,
    # Span.merge() with positional arguments).
    nlp = spacy.load('en')
    nlp.matcher = None

    # Map fine-grained entity labels onto the coarse tags used by sense2vec.
    LABELS = {
        'ENT': 'ENT',
        'PERSON': 'PERSON',
        'NORP': 'ENT',
        'FAC': 'ENT',
        'ORG': 'ENT',
        'GPE': 'ENT',
        'LOC': 'ENT',
        'LAW': 'ENT',
        'PRODUCT': 'ENT',
        'EVENT': 'ENT',
        'WORK_OF_ART': 'ENT',
        'LANGUAGE': 'ENT',
        'DATE': 'DATE',
        'TIME': 'TIME',
        'PERCENT': 'PERCENT',
        'MONEY': 'MONEY',
        'QUANTITY': 'QUANTITY',
        'ORDINAL': 'ORDINAL',
        'CARDINAL': 'CARDINAL'
    }

    pre_format_re = re.compile(r'^[\`\*\~]')
    post_format_re = re.compile(r'[\`\*\~]$')
    url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
    single_linebreak_re = re.compile('\n')
    double_linebreak_re = re.compile('\n{2,}')
    whitespace_re = re.compile(r'[ \t]+')
    quote_re = re.compile(r'"|`|´')

    def strip_meta(text):
        # Normalize the raw text before parsing.
        text = text.replace('per cent', 'percent')
        text = text.replace('&gt;', '>').replace('&lt;', '<')
        text = pre_format_re.sub('', text)
        text = post_format_re.sub('', text)
        # Preserve double line breaks (document boundaries), drop single ones.
        text = double_linebreak_re.sub('{2break}', text)
        text = single_linebreak_re.sub(' ', text)
        text = text.replace('{2break}', '\n')
        text = whitespace_re.sub(' ', text)
        text = quote_re.sub('', text)
        return text

    def transform_doc(doc):
        # Merge named entities and base noun phrases into single tokens.
        for ent in doc.ents:
            ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
        for np in doc.noun_chunks:
            while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
                np = np[1:]
            np.merge(np.root.tag_, np.text, np.root.ent_type_)
        strings = []
        for sent in doc.sents:
            sentence = []
            if sent.text.strip():
                for w in sent:
                    if w.is_space:
                        continue
                    w_ = represent_word(w)
                    if w_:
                        sentence.append(w_)
                strings.append(' '.join(sentence))
        if strings:
            return '\n'.join(strings) + '\n'
        else:
            return ''

    def represent_word(word):
        if word.like_url:
            x = url_re.search(word.text.strip().lower())
            if x:
                return x.group(3) + '|URL'
            else:
                return word.text.lower().strip() + '|URL?'
        text = re.sub(r'\s', '_', word.text.strip().lower())
        tag = LABELS.get(word.ent_type_)
        # Dropping PUNCT such as commas, and DET like "the".
        if tag is None and word.pos_ not in ['PUNCT', 'DET']:
            tag = word.pos_
        elif tag is None:
            return None
        # if not word.pos_:
        #     tag = '?'
        return text + '|' + tag

    corpus = '''
    As far as Saudi Arabia and its motives, that is very simple also. The Saudis
    are good at money and arithmetic. Faced with the painful choice of losing
    money maintaining current production at US$60 per barrel or taking two
    million barrels per day off the market and losing much more money - it's an
    easy choice: take the path that is less painful. If there are secondary
    reasons like hurting US tight oil producers or hurting Iran and Russia,
    that's great, but it's really just about the money.
    '''

    corpus_stripped = strip_meta(corpus)
    doc = nlp(corpus_stripped)

    corpus_ = []
    for word in doc:
        # Only lemmatize NOUN and PROPN.
        if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
            # Keep the first character of the original word, append the rest of
            # the lemma, then re-attach the trailing whitespace if there was any.
            lemma_ = str(word.text[:1] + word.lemma_[1:] + word.text_with_ws[len(word.text):])
            # print(word.text, lemma_)
            corpus_.append(lemma_)
        else:
            # All other words are added unchanged.
            corpus_.append(word.text_with_ws)

    result = transform_doc(nlp(''.join(corpus_)))

    sense2vec_filename = 'text.txt'
    with open(sense2vec_filename, 'w') as file:
        file.write(result)
    print(result)
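To address the training part of the question: the text.txt written above contains one sentence per line as whitespace-separated token|TAG items, so one straightforward option is to train plain word2vec on it with Gensim. This is a minimal sketch under that assumption, not the exact pipeline the sense2vec authors used; it uses Gensim 4.x argument names (older versions take size instead of vector_size), and the file and key names are just the ones from the example above:

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # text.txt (written by the script above) has one sentence per line of
    # whitespace-separated token|TAG items, which LineSentence reads directly.
    sentences = LineSentence('text.txt')

    # min_count=1 only because the toy corpus above is tiny; raise it for real data.
    model = Word2Vec(sentences, vector_size=128, window=5, min_count=1, workers=4)
    model.save('sense2vec_gensim.model')

    # Query similar "senses" (the key must actually occur in your corpus):
    print(model.wv.most_similar('saudi_arabia|ENT'))

The original sense2vec vectors were trained on a full Reddit comment dump, so expect much more useful neighbours once you feed in a large corpus.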

You can visualize your model with Gensim in Tensorboard using this approach: https://github.com/ArdalanM/gensim2tensorboard

I will also adjust this code to work with the sense2vec approach (for example, the words are lowercased in the preprocessing step; if you don't want that, just comment it out in the code).
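For reference, the lowercasing happens in represent_word() above; a minimal sketch of the kind of edit meant here:

    # In represent_word() above, this line lowercases every token:
    text = re.sub(r'\s', '_', word.text.strip().lower())
    # To keep the original casing instead, comment it out and use:
    # text = re.sub(r'\s', '_', word.text.strip())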

Happy coding, woltob


The input file must be in bzip2 format. By default, iter_comments() parses each line as a JSON object (a Reddit comment dump) and yields its 'body' field. To use a plain text file (still bzipped), simply edit merge_text.py as follows:

    def iter_comments(loc):
        with bz2.BZ2File(loc) as file_:
            for i, line in enumerate(file_):
                # Yield the raw line instead of parsing it as a JSON Reddit comment.
                yield line.decode('utf-8', errors='ignore')
                # yield ujson.loads(line)['body']
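With that change the script still expects a bzip2 archive, so a plain-text corpus has to be compressed first. A minimal sketch using only the standard library (corpus.txt is a hypothetical input file, one document per line):

    import bz2

    # Compress a plain-text corpus into the bz2 format that the patched
    # iter_comments() above opens.
    with open('corpus.txt', 'rb') as src, bz2.BZ2File('corpus.txt.bz2', 'wb') as dst:
        for line in src:
            dst.write(line)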
