Using spaCy to replace the "topic" of a sentence

As a little thought experiment, I wrote a function in Python that uses spaCy to find the subject (the "topic") of a news article's title and then replaces it with a noun of my choice. The problem is that it does not work well, and I was hoping it could be improved. I definitely don't fully understand spaCy, and I find the documentation a little hard to follow.

First, the code:

    doc = nlp(thetitle)
    for text in doc:
        # subject would be
        if text.dep_ == "nsubj":
            subject = text.orth_
        # iobj for indirect object
        if text.dep_ == "iobj":
            indirect_object = text.orth_
        # dobj for direct object
        if text.dep_ == "dobj":
            direct_object = text.orth_
    try:
        subject
    except NameError:
        if not thetitle:  # if empty title
            thetitle = "cat"
            subject = "cat"
        else:  # if unknown subject
            try:  # do we have a direct object?
                direct_object
            except NameError:
                try:  # do we have an indirect object?
                    indirect_object
                except NameError:  # still no??
                    subject = random.choice(thetitle.split())
                else:
                    subject = indirect_object
            else:
                subject = direct_object
    else:
        thecat = "cat"  # do nothing here, everything went okay
    newtitle = re.sub(r"\b%s\b" % subject, toreplace, thetitle)
    if (newtitle == thetitle):  # if no replacement happened due to regex
        newtitle = thetitle.replace(subject, toreplace)
    return newtitle

The "cat" lines are filler that do nothing. "thetitle" is a variable holding the title of a random news article that I pull from RSS feeds. "toreplace" is a variable holding the string that replaces whatever word is found.

To use an example:

"Video Games that Should Be Animated TV Shows - Screen Rant"

And here is the displaCy breakdown of it: https://demos.explosion.ai/displacy/?text=Video%20Games%20that%20Should%20Be%20Animated%20TV%20Shows%20-%20Screen%20Rant&model=en&cpu=1&cph=1

The word the code decided to replace turned out to be "that", which is not even a noun in this sentence. It seems to have come from the random-word fallback, because the code could not find a subject, an indirect object, or a direct object. I would hope that in this example it finds something like "Video Games".

I should note that if I drop the last bit (which appears to be the source of the news article) in displaCy: https://demos.explosion.ai/displacy/?text=Video%20Games%20that%20Should%20Be%20Animated%20TV%20Shows&model=en&cpu=1&cph=1 it seems to think "that" is the subject, which is wrong.

What is the best way to parse this? Should I search for nouns first?

python spacy
1 answer

On the code itself: I think the replace_subj function further down is much more readable, because the conditions are explicit and what happens when a condition holds is not buried in a distant else clause. It also handles cases where several tokens carry the same label, such as multiple objects.

To your problem: any natural language processing tool will have a hard time finding the subject (or perhaps rather the topic) of a sentence fragment, because these tools are trained on full sentences. I'm not even sure whether such fragments technically have a subject (I'm not an expert, though). You could try to train your own model, but then you would have to provide labeled sentences, and I don't know whether such data exists for sentence fragments.
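To see what the parser actually assigned to a fragment, it can help to dump the labels for every token. The following is only a minimal sketch: describe_parse and missing_roles are hypothetical helper names, and the role labels are the ones used in the code in this thread (the exact label set depends on the model you load).

```python
def describe_parse(doc):
    """Return (text, dependency label, part of speech, head text) per token."""
    return [(tok.text, tok.dep_, tok.pos_, tok.head.text) for tok in doc]


def missing_roles(doc, roles=("nsubj", "dobj", "iobj")):
    """Return the roles from `roles` that no token in `doc` was labeled with."""
    found = {tok.dep_ for tok in doc}
    return [role for role in roles if role not in found]


# Usage with spaCy (assumes an English pipeline is installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   doc = nlp("Video Games that Should Be Animated TV Shows")
#   for row in describe_parse(doc):
#       print(row)
#   print(missing_roles(doc))
```

If missing_roles returns all three labels for a headline, the original code would end up in its random-word fallback.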

I'm not quite sure what exactly you want to achieve, but looking at the common nouns and pronouns may help: they will probably include the word you want to replace, and the first one that appears is probably the most important.

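    import random
    import re
    from collections import defaultdict

    import spacy


    def replace_subj(sentence, nlp):
        # Handle an empty title before running the parser.
        if not sentence.strip():
            return "cat"

        doc = nlp(sentence)

        # Group token texts by dependency label, so that several tokens
        # with the same label (e.g. two objects) are all kept.
        tokens = defaultdict(list)
        for text in doc:
            tokens[text.dep_].append(text.orth_)

        # Prefer the subject, then the direct object, then the indirect object.
        if "nsubj" in tokens:
            subject = tokens["nsubj"][0]
        elif "dobj" in tokens:
            subject = tokens["dobj"][0]
        elif "iobj" in tokens:
            subject = tokens["iobj"][0]
        else:
            # Nothing found: fall back to a random word of the title.
            subject = random.choice(sentence.split())

        # re.escape keeps regex metacharacters in the word from breaking the pattern.
        return re.sub(r"\b{}\b".format(re.escape(subject)), "cat", sentence)


    if __name__ == "__main__":
        sentence = """Video Games that Should Be Animated TV Shows - Screen Rant"""
        # This was originally spacy.load("en"); newer spaCy releases need
        # a full package name such as "en_core_web_sm".
        nlp = spacy.load("en_core_web_sm")
        print(replace_subj(sentence, nlp))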
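The noun-based idea mentioned earlier in this answer could be sketched with spaCy's noun chunks (doc.noun_chunks), which group a noun or pronoun with its modifiers. This is only a sketch, not a tested solution for headlines: replace_first_noun_chunk is a hypothetical name, and it assumes that the first noun chunk is the phrase you want to replace and that the loaded pipeline includes a parser.

```python
def replace_first_noun_chunk(sentence, nlp, replacement="cat"):
    """Replace the first noun chunk found in `sentence` with `replacement`.

    `nlp` is any callable that returns an object with a `noun_chunks`
    iterable of spans exposing `start_char` and `end_char`
    (a loaded spaCy pipeline with a parser does).
    """
    if not sentence:
        return replacement  # same convention as the empty-title case above

    doc = nlp(sentence)
    first = next(iter(doc.noun_chunks), None)
    if first is None:
        return sentence  # nothing noun-like found; leave the title unchanged

    # Splice by character offsets instead of a regex, so only this one
    # occurrence is replaced and no escaping is needed.
    return sentence[:first.start_char] + replacement + sentence[first.end_char:]


# Usage with spaCy (assumes an English pipeline is installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   print(replace_first_noun_chunk(
#       "Video Games that Should Be Animated TV Shows - Screen Rant", nlp))
```

Splicing by offset replaces the whole phrase rather than a single word, which is closer to the "Video Games" result the question hopes for, but whether the parser returns that chunk first for a given headline still has to be checked case by case.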