Goose reads an article but returns no content

Question

I am trying to use Goose to read text from .html files (a URL is used in the example here for convenience) [1]. Sometimes, however, it does not return any text. Please help me with this problem.

Goose version used: https://github.com/agolo/python-goose/ (the current version gives some errors).

    from goose import Goose
    from requests import get

    # Fetch the page, then hand the raw HTML to Goose
    response = get('http://www.highbeam.com/doc/1P3-979471971.html')
    extractor = Goose()
    article = extractor.extract(raw_html=response.content)
    text = article.cleaned_text
    print text  # sometimes prints nothing

python web-crawler goose

Abhishek bhatia May 21 '15 at 18:45

1 answer

Thiem nguyen · Accepted Answer · 2015-05-23T04:00:32+0000

Goose actually checks a few predefined elements first, since they are likely to be a good starting point for finding the top node. If none of these "known" elements is found, it starts looking for the top_node, which in the general case is an element containing many p tags. You can read extractors/content.py for more details.

This page lacks many features of a typical article, which is usually wrapped in an article tag or in a div tag with a class or id such as "post-content", "story-body", "article", etc. Here the text sits in a div tag with id='docText' and contains no paragraphs, so Goose cannot tell which node holds the content.
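The two-step lookup described above can be sketched with the standard library alone. This is only an illustration of the idea, not Goose's code: KNOWN_TAGS, TopNodeFinder and find_top_node are hypothetical names, the fallback simply counts direct p children, and the real scoring in extractors/content.py is more involved.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in the question

# Hypothetical stand-in for Goose's list of known content containers.
KNOWN_TAGS = [{'attr': 'id', 'value': 'docText'},
              {'attr': 'class', 'value': 'post-content'}]

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {'br', 'img', 'hr', 'meta', 'link', 'input'}


class TopNodeFinder(HTMLParser):
    """Records every element and how many direct <p> children it has."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.stack = []   # currently open elements
        self.nodes = []   # every element seen, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == 'p' and self.stack:
            self.stack[-1]['p_children'] += 1
        node = {'tag': tag, 'attrs': dict(attrs), 'p_children': 0}
        self.nodes.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (assumes reasonably well-formed HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]['tag'] == tag:
                del self.stack[i:]
                break


def find_top_node(html):
    finder = TopNodeFinder()
    finder.feed(html)
    # Step 1: a known container wins outright, even without paragraphs.
    for known in KNOWN_TAGS:
        for node in finder.nodes:
            value = node['attrs'].get(known['attr']) or ''
            if known['value'] in value.split():
                return node
    # Step 2: otherwise fall back to the element with the most <p> children.
    if not finder.nodes:
        return None
    best = max(finder.nodes, key=lambda n: n['p_children'])
    return best if best['p_children'] > 0 else None


page = ('<html><body><div id="nav"><p>menu</p></div>'
        '<div id="docText">Long text without any paragraphs</div>'
        '</body></html>')
print(find_top_node(page)['attrs']['id'])  # docText
```

If 'docText' were missing from KNOWN_TAGS, the same page would resolve to the nav div, the only element with a p child. That mirrors why a paragraph-free container is easy to miss.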

I suggest you add this entry at the beginning of the KNOWN_ARTICLE_CONTENT_TAGS constant in extractors/content.py:

    KNOWN_ARTICLE_CONTENT_TAGS = [
        {'attr': 'id', 'value': 'docText'},
        # ... other paths go here
    ]

With that entry added, this is the extracted body:

Chennai, December 19 - The Tamil Nadu government on Monday appointed a one-man judicial commission of inquiry to examine the causes of Sunday's stampede in the state capital, Chennai, which killed 42 people and left another 37 wounded.\n\nThe announcement of the formation of the commission came even as family members of those who died in the stampede were anguished and agitated over the unexpected tragedy.\n\nThe 42 homeless people were trampled to death during the distribution of flood relief supplies at a shelter in the Tamil Nadu capital.\n\nOfficials said more than 5,000 people burst in when the shelter gates opened, causing a stampede.\n\nChitra, a member of a victim's family, said there was mismanagement, which led to the tragedy.\u2026
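If editing the installed library file is inconvenient, the same entry could in principle be added at runtime. The sketch below uses a hypothetical helper, register_known_tag, and a stand-in list so that it runs without Goose. Using it against the real library rests on the assumption that the list is importable as goose.extractors.content.KNOWN_ARTICLE_CONTENT_TAGS and is read at extraction time.

```python
def register_known_tag(known_tags, attr, value):
    """Insert {'attr': attr, 'value': value} at the front of known_tags, once."""
    entry = {'attr': attr, 'value': value}
    if entry not in known_tags:
        # Mutate the list in place so code holding a reference to it sees the change.
        known_tags.insert(0, entry)
    return known_tags


# Stand-in list for the demo. With Goose installed, the assumption is that
# the real list could be passed instead, for example:
#   from goose.extractors import content
#   register_known_tag(content.KNOWN_ARTICLE_CONTENT_TAGS, 'id', 'docText')
tags = [{'attr': 'class', 'value': 'post-content'}]
register_known_tag(tags, 'id', 'docText')
register_known_tag(tags, 'id', 'docText')  # second call is a no-op
print(len(tags))
```

The helper has to run before extractor.extract(...) is called. Because it only prepends to the list, the existing entries keep working for ordinary articles.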
