Getting a Facebook-style summary (title, summary, relevant images) using Python

I would like to reproduce the functionality that Facebook uses to parse links. When you submit a link in your Facebook status, their system goes off and extracts a suggested title, a summary, and often one or more relevant images from that page, from which you can choose a thumbnail.

My application should accomplish this using Python, but I am open to any guidance, blog posts, or experiences of other developers that relate to this and can help me figure out how to do it.

I would love to learn from other people before just jumping in.

To be clear, when you provide the URL of a webpage, I want to be able to get:

  • Title: Probably just the <title>, but possibly the <h1>, not sure.
  • A one-paragraph summary of the page.
  • A set of relevant images that could be used as thumbnails. (The hard part is filtering out irrelevant images such as banners or rounded corners.)

I may have to implement it myself, but at least I would like to learn about how other people perform such tasks.

python semantics facebook screen-scraping summary
2 answers

BeautifulSoup is well suited to achieve most of this.

Basically, you just initialize the soup object and then do something like the following to extract what interests you:

    title = soup.findAll('title')
    images = soup.findAll('img')

You can then download each of the images based on its URL using urllib2 (urllib.request in Python 3).
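As a rough illustration of the title and image extraction step, here is a minimal sketch that uses only the standard library's html.parser rather than BeautifulSoup, so it runs without extra installs. The class name TitleAndImageParser, the helper extract_title_and_images, and the example URLs are made up for this sketch. It takes HTML you have already fetched and does no downloading itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleAndImageParser(HTMLParser):
    """Collects the <title> text and absolute <img> URLs from a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title_parts = []
        self.image_urls = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.image_urls.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)

    @property
    def title(self):
        return "".join(self.title_parts).strip()


def extract_title_and_images(html, base_url):
    """Return (title, list of absolute image URLs) for an HTML string."""
    parser = TitleAndImageParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.title, parser.image_urls
```

Resolving each src against the page URL matters because many pages use relative image paths, which cannot be downloaded as they stand.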

The title is pretty simple, but the images can be a little more complicated, since you need to download each one to get the relevant statistics. Perhaps you can filter out most of the irrelevant images based on their size and number of colors? Rounded corners, for example, will usually be small and have only 1-2 colors.
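The size-and-colors idea could be sketched as a small predicate like the one below. The function name and every threshold (min_side, min_colors, max_aspect) are assumptions chosen for illustration, not tested values, and the aspect-ratio check for banners is an extra guess beyond what is described above. How width, height, and color count are obtained (for instance with an imaging library after downloading the file) is left out.

```python
def looks_like_content_image(width, height, color_count,
                             min_side=50, min_colors=3, max_aspect=3.0):
    """Heuristic only: reject tiny, near-monochrome, or banner-shaped images.

    All thresholds are illustrative assumptions and would need tuning.
    """
    if width <= 0 or height <= 0:
        return False
    # Rounded corners and spacer images tend to be very small.
    if width < min_side or height < min_side:
        return False
    # Decorative images often use only one or two colors.
    if color_count < min_colors:
        return False
    # Assumption: very wide or very tall images are likely banners.
    aspect = max(width, height) / min(width, height)
    return aspect <= max_aspect
```

For example, an 8x8 corner image with 2 colors is rejected, while a 400x300 photo with thousands of colors passes.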

As for the page summary, that might be a little trickier, but I have done something like this:

  • I use BeautifulSoup to remove all style, script, form, and head blocks from the HTML using .findAll, then .extract.
  • I grab the remaining text using ''.join(soup.findAll(text=True)).

In your application, perhaps you could use this "text" content as a page summary?
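The two steps above (drop the style, script, form, and head blocks, then join the remaining text) might look roughly like this. Again this is a standard-library sketch using html.parser instead of BeautifulSoup. The names VisibleTextParser, page_text, and summarize are hypothetical, and the 200-character limit is an arbitrary choice.

```python
from html.parser import HTMLParser

# Blocks whose text should not count as page content.
SKIPPED_TAGS = {"style", "script", "form", "head"}


class VisibleTextParser(HTMLParser):
    """Collects text that is not inside any of the skipped blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        words = data.split()
        if self._skip_depth == 0 and words:
            # Collapse runs of whitespace inside each text node.
            self.chunks.append(" ".join(words))


def page_text(html):
    """Return the remaining page text as one space-separated string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def summarize(html, max_chars=200):
    """Truncate the page text to roughly max_chars, cutting at a space."""
    text = page_text(html)
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rsplit(" ", 1)[0] + "..."
```

Plain truncation is a crude stand-in for a real summary. On pages with navigation menus or sidebars, the first characters of text may not come from the main article at all.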

Hope this helps.


Here's the full solution: https://github.com/svven/summary

    >>> import summary
    >>> s = summary.Summary('http://stackoverflow.com/users/76701/ram-rachum')
    >>> s.extract()
    >>> s.title
    u'User Ram Rachum - Stack Overflow'
    >>> s.description
    u'Israeli Python hacker.'
    >>> s.image
    https://www.gravatar.com/avatar/d24c45635a5171615a7cdb936f36daad?s=128&d=identicon&r=PG