As a little thought experiment, I wrote a Python function that uses spaCy to find the subject of a news article title and then replaces it with a noun of my choice. The problem is that it does not work well, and I was hoping it could be improved. I don't really understand spaCy, and the documentation is a little hard to follow.
First, the code:
    doc = nlp(thetitle)
    for text in doc:
        # subject would be
        if text.dep_ == "nsubj":
            subject = text.orth_
        # iobj for indirect object
        if text.dep_ == "iobj":
            indirect_object = text.orth_
        # dobj for direct object
        if text.dep_ == "dobj":
            direct_object = text.orth_
    try:
        subject
    except NameError:
        if not thetitle:  # if empty title
            thetitle = "cat"
            subject = "cat"
        else:  # if unknown subject
            try:  # do we have a direct object?
                direct_object
            except NameError:
                try:  # do we have an indirect object?
                    indirect_object
                except NameError:  # still no??
                    subject = random.choice(thetitle.split())
                else:
                    subject = indirect_object
            else:
                subject = direct_object
    else:
        thecat = "cat"  # do nothing here, everything went okay
    newtitle = re.sub(r"\b%s\b" % subject, toreplace, thetitle)
    if newtitle == thetitle:  # if no replacement happened due to regex
        newtitle = thetitle.replace(subject, toreplace)
    return newtitle
The "cat" lines are filler lines that do nothing. "thetitle" is a variable holding the title of a random news article that I pull from RSS feeds. "toreplace" is a variable holding the string that replaces whatever subject or object is found.
Take an example:
"Video Games that Should Be Animated TV Shows - Screen Rant" And here is the displaCy output for it: https://demos.explosion.ai/displacy/?text=Video%20Games%20that%20Should%20Be%20Animated%20TV%20Shows%20-%20Screen%20Rant&model=en&cpu=1&cph=1
The word that the code decided to replace turned out to be "that", which is not even a noun in this sentence. It seems the code fell back to the random word choice because it could not find a subject, an indirect object, or a direct object. I would hope that in this example it finds something like "Video Games".
I should note that if I take out the last bit (which seems to be the source of the news article) in displaCy: https://demos.explosion.ai/displacy/?text=Video%20Games%20that%20Should%20Be%20Animated%20TV%20Shows&model=en&cpu=1&cph=1 it seems that "that" is the subject, which is wrong.
What is the best way to parse this? Should I search for nouns first?
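To make the "search for nouns first" idea concrete, here is a rough sketch of what I have in mind. It assumes spaCy's `doc.noun_chunks` (which needs a model with a parser) and prefers whole noun phrases over single tokens, skipping pronouns such as "that". The function names `pick_target` and `replace_topic` are made up for illustration, the dependency labels are the ones from my code above, and whether it picks the right phrase still depends on how the model parses a given title.

```python
import random
import re

# Dependency labels tried in order of preference (same labels as the code above).
PREFERRED_DEPS = ("nsubj", "dobj", "iobj")
# Coarse part-of-speech tags counted as "real" nouns (pronouns are excluded).
NOUN_TAGS = ("NOUN", "PROPN")


def pick_target(doc, rng=random):
    """Return the text to replace, preferring whole noun phrases.

    `doc` is expected to behave like a parsed spaCy Doc: iterable over
    tokens with .text/.dep_/.pos_, and exposing .noun_chunks.
    """
    # Keep only noun chunks headed by a noun or proper noun.
    chunks = [c for c in doc.noun_chunks if c.root.pos_ in NOUN_TAGS]
    # 1. A chunk whose head is a subject, direct object, or indirect object.
    for dep in PREFERRED_DEPS:
        for chunk in chunks:
            if chunk.root.dep_ == dep:
                return chunk.text
    # 2. Otherwise the first remaining noun chunk.
    if chunks:
        return chunks[0].text
    # 3. Otherwise the first noun-like token.
    nouns = [t.text for t in doc if t.pos_ in NOUN_TAGS]
    if nouns:
        return nouns[0]
    # 4. Last resort: a random word, as in the original code.
    words = [t.text for t in doc if t.text.strip()]
    return rng.choice(words) if words else None


def replace_topic(doc, title, toreplace):
    """Replace the chosen phrase in `title` with `toreplace`."""
    target = pick_target(doc)
    if target is None:  # empty title: nothing to replace
        return title
    # Escape the phrase so punctuation in it is not read as regex syntax.
    pattern = r"\b%s\b" % re.escape(target)
    newtitle = re.sub(pattern, lambda m: toreplace, title, count=1)
    if newtitle == title:  # regex found nothing, fall back to plain replace
        newtitle = title.replace(target, toreplace, 1)
    return newtitle


# Usage with spaCy (assumes a model such as "en_core_web_sm" is installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   title = "Video Games that Should Be Animated TV Shows"
#   print(replace_topic(nlp(title), title, "cats"))
```

The random choice is kept only as the very last fallback, so a function word like "that" should be picked only when the title contains no nouns at all.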
python spacy
Spacemouse