Best way to programmatically save a webpage to a static HTML file

The more research I do, the bleaker the outlook becomes.

I am trying to flat-save (static-save) a web page with Python. This means merging all styles into inline style attributes and changing all links to absolute URLs.
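For the absolute-URL half of this, a minimal sketch using only the Python standard library might look like the following. The names `AbsoluteUrlRewriter`, `absolutize`, and `URL_ATTRS` are made up for illustration, and the attribute list is an assumption rather than a complete inventory. It ignores `<base>` tags, `srcset`, and URLs inside CSS, and malformed HTML may not survive the round trip unchanged.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes that commonly carry URLs (illustrative, not exhaustive).
URL_ATTRS = {"href", "src", "action", "poster"}


class AbsoluteUrlRewriter(HTMLParser):
    """Re-emit HTML with URL attributes resolved against a base URL."""

    def __init__(self, base_url):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _render_attrs(self, attrs):
        parts = []
        for name, value in attrs:
            if value is None:  # boolean attribute such as "disabled"
                parts.append(" " + name)
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value)
            # The parser hands us unescaped values, so escape them again.
            parts.append(' %s="%s"' % (name, escape(value, quote=True)))
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        self.out.append("<%s%s>" % (tag, self._render_attrs(attrs)))

    def handle_startendtag(self, tag, attrs):
        self.out.append("<%s%s />" % (tag, self._render_attrs(attrs)))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def absolutize(html_text, base_url):
    """Return html_text with relative URL attributes made absolute."""
    parser = AbsoluteUrlRewriter(base_url)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Calling `absolutize(page_source, page_url)` with the URL the page was fetched from resolves relative links the same way a browser would for that page, since `urljoin` follows the standard relative-reference rules.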

I have tried almost every free conversion website, API and even GitHub library. None of them is impressive. The best Python implementation I could find for inlining styles is https://github.com/davecranwell/inline-styler . I adapted it a bit for Flask, but the generated file is not very good. Here's what it looks like:

[image: screenshot of the generated output]

Obviously, it should look better. Here's what it should look like: http://cl.ly/image/1H3J1O1u3v3d

This seems like an endless struggle with malformed HTML, unrecognized CSS properties, Unicode errors, etc. Does anyone have a suggestion for a better way to do this? I understand that I can go to File -> Save in my local browser, but that is not really viable when I am trying to do this en masse and extract a specific XPath.

It seems like the Evernote Web Clipper uses iframes, but that seems more complicated than I think it should be. At least the clippings look decent in Evernote.

I am interested to know if anyone has any suggestions.

+8
python html css html-parsing
2 answers

It sounds like inline styles may be a deal-breaker for you, but if not, I suggest taking a look at Evernote Web Clipper. The desktop application has an HTML export feature for web clips. The output is a bit dirty, as you would expect with inline styles, but I have found the markup to be a reliable representation of the saved page.

As for inline versus external styles, for something like this I don't see any way around inline styles if you are saving a lot of pages from different sites, where class names would have conflicting style rules.

You mentioned that Web Clipper uses iframes, but I have not found that to be the case for the HTML output. You would probably need to embed the static page in an iframe if you are republishing it on another site (legally, I assume), but otherwise it should not be a problem.

Some automation would certainly help, so that you can go directly from the browser to the HTML output and possibly move the saved images into one repo with updated src links in the HTML. If you end up building something like this, I would like to try it myself.

+2

After some effort, I managed to install a Ruby library that flattens CSS much better than anything else I have used. It is the library behind the very slow web interface at http://premailer.dialect.ca/

Thankfully they released the source on GitHub; it is the best, hands down. https://github.com/alexdunae/premailer

It inlines styles, makes URLs absolute, works with a URL or a string, and can even create plain-text email templates. I am very impressed with this library.
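To make the style-inlining step concrete, here is a rough Python sketch of the idea, since the question asked for Python. It is not how Premailer is implemented and it is far less capable: the names `parse_css`, `matches`, `StyleInliner` and `inline_styles` are hypothetical, only bare `tag`, `.class` and `#id` selectors are handled, and specificity, `@media` blocks, CSS comments and combinators are ignored (rules are simply applied in source order, with any existing `style` attribute placed last so it wins).

```python
import re
from html import escape
from html.parser import HTMLParser


def parse_css(css_text):
    """Split flat 'selector { prop: value }' blocks into (selector, declarations) pairs."""
    rules = []
    for selector_group, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css_text):
        declarations = "; ".join(d.strip() for d in body.split(";") if d.strip())
        for selector in selector_group.split(","):
            rules.append((selector.strip(), declarations))
    return rules


def matches(selector, tag, attrs):
    """Match one simple selector: 'tag', '.class' or '#id' (an assumed subset of CSS)."""
    if selector.startswith("."):
        return selector[1:] in (attrs.get("class") or "").split()
    if selector.startswith("#"):
        return attrs.get("id") == selector[1:]
    return selector.lower() == tag


class StyleInliner(HTMLParser):
    """Re-emit HTML, copying matching CSS declarations into style attributes."""

    def __init__(self, rules):
        super().__init__(convert_charrefs=False)
        self.rules = rules
        self.out = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        matched = [decl for sel, decl in self.rules if matches(sel, tag, attr_map)]
        if matched:
            existing = attr_map.get("style")
            if existing:
                # Existing inline declarations go last so they override the rules.
                matched.append(existing.strip().rstrip(";"))
            attr_map["style"] = "; ".join(matched)
        rendered = "".join(
            " " + k if v is None else ' %s="%s"' % (k, escape(v, quote=True))
            for k, v in attr_map.items()
        )
        self.out.append("<%s%s>" % (tag, rendered))

    def handle_startendtag(self, tag, attrs):
        # Emit self-closing tags such as <br/> as a plain start tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def inline_styles(html_text, css_text):
    """Return html_text with the rules from css_text copied into style attributes."""
    inliner = StyleInliner(parse_css(css_text))
    inliner.feed(html_text)
    inliner.close()
    return "".join(inliner.out)
```

Real pages need a proper CSS parser and selector engine, which is exactly where the malformed-HTML and unrecognized-property problems from the question show up, so treat this only as an illustration of what the flattening step does.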

Nov 2013 update

I ended up writing my own bookmarklet, which works purely on the client side. It is compatible only with WebKit and Firefox. It iterates over each node and adds inline styles, and then sends the flattened HTML to the clippy.in API to be saved in the user's dashboard.

Client bookmark

+2
