How to train a sense2vec model

The sense2vec documentation mentions three main files; the first of them is merge_text.py. I have tried several input types (txt, csv, bzipped files), since merge_text.py tries to open bzip2-compressed files.

The file can be found at: https://github.com/spacy-io/sense2vec/blob/master/bin/merge_text.py

What input format does this script require? Also, can anyone suggest how to train the model?

2 answers

I have expanded and adjusted code samples from sense2vec.

You go from this input text:

"As far as Saudi Arabia and its motives, that is very simple also. The Saudis are good at money and arithmetic. Faced with the painful choice of losing money maintaining current production at US$60 per barrel or taking two million barrels per day off the market and losing much more money - it's an easy choice: take the path that is less painful. If there are secondary reasons like hurting US tight oil producers or hurting Iran and Russia, that's great, but it's really just about the money."

To this:

as|ADV far|ADV as|ADP saudi_arabia|ENT and|CCONJ its|ADJ motive|NOUN that|ADJ is|VERB very|ADV simple|ADJ also|ADV saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CCONJ arithmetic|NOUN faced|VERB with|ADP painful_choice|NOUN of|ADP losing|VERB money|NOUN maintaining|VERB current_production|NOUN at|ADP us$|SYM 60|MONEY per|ADP barrel|NOUN or|CCONJ taking|VERB two_million|CARDINAL barrel|NOUN per|ADP day|NOUN off|ADP market|NOUN and|CCONJ losing|VERB much_more_money|NOUN it|PRON 's|VERB easy_choice|NOUN take|VERB path|NOUN that|ADJ is|VERB less|ADV painful|ADJ if|ADP there|ADV are|VERB secondary_reason|NOUN like|ADP hurting|VERB us|ENT tight_oil_producer|NOUN or|CCONJ hurting|VERB iran|ENT and|CCONJ russia|ENT 's|VERB great|ADJ but|CCONJ it|PRON 's|VERB really|ADV just|ADV about|ADP money|NOUN

  • Double line breaks are treated as separate documents.
  • URLs are detected, reduced to their domain.tld, and tagged |URL.
  • Nouns (including nouns that are part of a noun phrase) are lemmatized (e.g. motives becomes motive).
  • Tokens with POS tags such as DET (determiners like the) and PUNCT (punctuation) are discarded.

Here is the code. Let me know if you have any questions.

I will publish it soon at github.com/woltob.

    import re
    import spacy

    # Note: this code uses the spaCy 1.x API ('en' shortcut model,
    # Span.merge() with positional arguments).
    nlp = spacy.load('en')
    nlp.matcher = None

    # Map fine-grained entity labels onto the coarse tags used by sense2vec.
    LABELS = {
        'ENT': 'ENT',
        'PERSON': 'PERSON',
        'NORP': 'ENT',
        'FAC': 'ENT',
        'ORG': 'ENT',
        'GPE': 'ENT',
        'LOC': 'ENT',
        'LAW': 'ENT',
        'PRODUCT': 'ENT',
        'EVENT': 'ENT',
        'WORK_OF_ART': 'ENT',
        'LANGUAGE': 'ENT',
        'DATE': 'DATE',
        'TIME': 'TIME',
        'PERCENT': 'PERCENT',
        'MONEY': 'MONEY',
        'QUANTITY': 'QUANTITY',
        'ORDINAL': 'ORDINAL',
        'CARDINAL': 'CARDINAL'
    }

    pre_format_re = re.compile(r'^[\`\*\~]')
    post_format_re = re.compile(r'[\`\*\~]$')
    url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
    single_linebreak_re = re.compile('\n')
    double_linebreak_re = re.compile('\n{2,}')
    whitespace_re = re.compile(r'[ \t]+')
    quote_re = re.compile(r'"|`|´')

    def strip_meta(text):
        # Normalize the raw text before parsing.
        text = text.replace('per cent', 'percent')
        text = text.replace('&gt;', '>').replace('&lt;', '<')
        text = pre_format_re.sub('', text)
        text = post_format_re.sub('', text)
        # Preserve double line breaks (document boundaries), drop single ones.
        text = double_linebreak_re.sub('{2break}', text)
        text = single_linebreak_re.sub(' ', text)
        text = text.replace('{2break}', '\n')
        text = whitespace_re.sub(' ', text)
        text = quote_re.sub('', text)
        return text

    def transform_doc(doc):
        # Merge named entities and base noun phrases into single tokens.
        for ent in doc.ents:
            ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
        for np in doc.noun_chunks:
            while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
                np = np[1:]
            np.merge(np.root.tag_, np.text, np.root.ent_type_)
        strings = []
        for sent in doc.sents:
            sentence = []
            if sent.text.strip():
                for w in sent:
                    if w.is_space:
                        continue
                    w_ = represent_word(w)
                    if w_:
                        sentence.append(w_)
                strings.append(' '.join(sentence))
        if strings:
            return '\n'.join(strings) + '\n'
        else:
            return ''

    def represent_word(word):
        if word.like_url:
            x = url_re.search(word.text.strip().lower())
            if x:
                return x.group(3) + '|URL'
            else:
                return word.text.lower().strip() + '|URL?'
        text = re.sub(r'\s', '_', word.text.strip().lower())
        tag = LABELS.get(word.ent_type_)
        # Dropping PUNCT such as commas, and DET like "the".
        if tag is None and word.pos_ not in ['PUNCT', 'DET']:
            tag = word.pos_
        elif tag is None:
            return None
        # if not word.pos_:
        #     tag = '?'
        return text + '|' + tag

    corpus = '''
    As far as Saudi Arabia and its motives, that is very simple also. The Saudis
    are good at money and arithmetic. Faced with the painful choice of losing
    money maintaining current production at US$60 per barrel or taking two
    million barrels per day off the market and losing much more money - it's an
    easy choice: take the path that is less painful. If there are secondary
    reasons like hurting US tight oil producers or hurting Iran and Russia,
    that's great, but it's really just about the money.
    '''

    corpus_stripped = strip_meta(corpus)
    doc = nlp(corpus_stripped)

    corpus_ = []
    for word in doc:
        # Only lemmatize NOUN and PROPN.
        if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
            # Keep the first character of the original word, append the rest of
            # the lemma, then re-attach the trailing whitespace if there was any.
            lemma_ = str(word.text[:1] + word.lemma_[1:] + word.text_with_ws[len(word.text):])
            # print(word.text, lemma_)
            corpus_.append(lemma_)
        else:
            # All other words are added unchanged.
            corpus_.append(word.text_with_ws)

    result = transform_doc(nlp(''.join(corpus_)))

    sense2vec_filename = 'text.txt'
    with open(sense2vec_filename, 'w') as file:
        file.write(result)
    print(result)
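To address the training part of the question: the text.txt written above contains one sentence per line as whitespace-separated token|TAG items, so one straightforward option is to train plain word2vec on it with Gensim. This is a minimal sketch under that assumption, not the exact pipeline the sense2vec authors used; it uses Gensim 4.x argument names (older versions take size instead of vector_size), and the file and key names are just the ones from the example above:

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # text.txt (written by the script above) has one sentence per line of
    # whitespace-separated token|TAG items, which LineSentence reads directly.
    sentences = LineSentence('text.txt')

    # min_count=1 only because the toy corpus above is tiny; raise it for real data.
    model = Word2Vec(sentences, vector_size=128, window=5, min_count=1, workers=4)
    model.save('sense2vec_gensim.model')

    # Query similar "senses" (the key must actually occur in your corpus):
    print(model.wv.most_similar('saudi_arabia|ENT'))

The original sense2vec vectors were trained on a full Reddit comment dump, so expect much more useful neighbours once you feed in a large corpus.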

You can visualize your model with Gensim in Tensorboard using this approach: https://github.com/ArdalanM/gensim2tensorboard

I will also adjust this code to work with the sense2vec approach (for example, the words are lowercased in the preprocessing step; if you don't want that, just comment it out in the code).
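For reference, the lowercasing happens in represent_word() above; a minimal sketch of the kind of edit meant here:

    # In represent_word() above, this line lowercases every token:
    text = re.sub(r'\s', '_', word.text.strip().lower())
    # To keep the original casing instead, comment it out and use:
    # text = re.sub(r'\s', '_', word.text.strip())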

Happy coding, woltob


The input file must be in bzip2 format. By default, iter_comments() parses each line as a JSON object (a Reddit comment dump) and yields its 'body' field. To use a plain text file (still bzipped), simply edit merge_text.py as follows:

    def iter_comments(loc):
        with bz2.BZ2File(loc) as file_:
            for i, line in enumerate(file_):
                # Yield the raw line instead of parsing it as a JSON Reddit comment.
                yield line.decode('utf-8', errors='ignore')
                # yield ujson.loads(line)['body']
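With that change the script still expects a bzip2 archive, so a plain-text corpus has to be compressed first. A minimal sketch using only the standard library (corpus.txt is a hypothetical input file, one document per line):

    import bz2

    # Compress a plain-text corpus into the bz2 format that the patched
    # iter_comments() above opens.
    with open('corpus.txt', 'rb') as src, bz2.BZ2File('corpus.txt.bz2', 'wb') as dst:
        for line in src:
            dst.write(line)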
