Goose does use a few predefined elements, which are a good starting point for finding the top node. If none of these "known" elements is found, it starts looking for the top_node, which in the general case is an element containing many p tags. See extractors/content.py for more details.
This article lacks the usual features of an article page, where the content is typically wrapped in an article tag or a div tag with a class or id such as "post-content", "story-body", or "article". Here the content sits in a div tag with id='docText' that contains no paragraphs, so Goose cannot work out which node holds the body.
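The lookup order described above can be sketched roughly as follows. This is not Goose's actual code: find_top_node, matches, and the known_tags parameter are hypothetical names for illustration, the paragraph count is a simplification of whatever scoring Goose really applies, and the input is assumed to be well-formed markup so that the standard-library XML parser can read it.

```python
import xml.etree.ElementTree as ET


def matches(elem, pattern):
    """Return True if elem fits a pattern such as {'attr': 'id', 'value': 'docText'}."""
    if 'tag' in pattern:
        return elem.tag == pattern['tag']
    return elem.get(pattern['attr']) == pattern['value']


def find_top_node(root, known_tags):
    """Simplified two-stage lookup (an illustration, not Goose's implementation)."""
    # Stage 1: return the first element that matches a known pattern.
    for pattern in known_tags:
        for elem in root.iter():
            if matches(elem, pattern):
                return elem

    # Stage 2: fall back to the element with the most direct <p> children.
    best, best_count = None, 0
    for elem in root.iter():
        count = sum(1 for child in elem if child.tag == 'p')
        if count > best_count:
            best, best_count = elem, count
    return best


# A page shaped like the one in the question: the body sits in a div
# with id='docText' and no <p> tags, next to an unrelated block of paragraphs.
page = ET.fromstring(
    "<html><body>"
    "<div id='docText'>Article text without paragraph tags</div>"
    "<div id='related'><p>Link one</p><p>Link two</p></div>"
    "</body></html>"
)

without_hint = find_top_node(page, known_tags=[])
with_hint = find_top_node(page, known_tags=[{'attr': 'id', 'value': 'docText'}])
```

In this sketch, the paragraph-based fallback picks the unrelated block, while a known pattern for id='docText' selects the node that actually holds the article text.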
I suggest adding this entry at the beginning of the KNOWN_ARTICLE_CONTENT_TAGS constant in extractors/content.py:
KNOWN_ARTICLE_CONTENT_TAGS = [
    {'attr': 'id', 'value': 'docText'},
    # ... other paths go here
]
With that entry in place, here is the extracted body:
Chennai, December 19 - The Tamil Nadu government on Monday appointed a unilateral judicial commission of inquiry to examine the causes of Sunday expiring in Chennai's state capital, which killed 42 people and left another 37 wounded.\n\nThe announcement of the formation of the commission came even as family members those who died as a result of the requiem are tormented and excited about the unexpected tragedy.\n\nThe 42 homeless people were trampled to death during the distribution of flood relief supplies to a shelter in the capital Tamil Nadu.\n\nOfficial officials said more than 5,000 people burst when the shelter gates opened, causing a stampede.\n\nChitra, member the victim's family, said there was mismanagement, which led to the tragedy.\u2026
Thiem nguyen