Check image URLs using Python-Markdown

On the website I am building, I use Python-Markdown to format news posts. To avoid dead links and HTTP content on an HTTPS page, I require editors to upload all images to the site and then embed them (I patched the editor I use to make it easy to embed these images with standard Markdown syntax).

However, I would like to enforce the no-external-images policy in my code.

One way is to write a regular expression that extracts the image URLs from the Markdown source, or to run the source through the Markdown renderer and use a DOM parser to extract the src attribute of every img tag.
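The render-then-parse option could look roughly like the sketch below. It assumes Python 3 and uses only the standard library; the names `ImgSrcCollector` and `external_images` are made up for illustration, and the HTML string stands in for whatever the Markdown renderer returns. Treating any URL with a scheme or a host as external is also an assumption about the policy.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def external_images(html):
    """Return the image URLs in rendered HTML that point off-site."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    external = []
    for src in collector.sources:
        parts = urlparse(src)
        # A scheme (http:, https:, data:) or a host (//host/...) is not local.
        if parts.scheme or parts.netloc:
            external.append(src)
    return external


# Stand-in for the renderer's output.
rendered = (
    '<p><img alt="Alt text" src="/path/to/img.jpg" />\n'
    '<img alt="Alt text" src="http://remote.com/path/to/img.jpg" /></p>'
)
print(external_images(rendered))
```

Because this checks the rendered HTML, it also catches raw `<img>` tags typed directly into a post, which a regular expression over the Markdown syntax alone would miss.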

However, I am curious whether there is a way to hook into Python-Markdown to extract all image links, or to run custom code (e.g. raising an exception if a link is external) during parsing.

1 answer

One approach is to intercept the <img> node at a lower level, right after Markdown parses the image syntax and builds the node:

    import re

    from markdown import Markdown
    from markdown.inlinepatterns import ImagePattern, IMAGE_LINK_RE

    # matches absolute http/https URLs
    RE_REMOTEIMG = re.compile('^(http|https):.+')


    class CheckImagePattern(ImagePattern):
        def handleMatch(self, m):
            node = ImagePattern.handleMatch(self, m)
            # check 'src' to ensure it is local
            src = node.attrib.get('src')
            if src and RE_REMOTEIMG.match(src):
                print('ILLEGAL: %s' % m.group(9))
                # or alternately you could raise an error immediately
                # raise ValueError("illegal remote url: %s" % m.group(9))
            return node


    DATA = '''
    ![Alt text](/path/to/img.jpg)
    ![Alt text](http://remote.com/path/to/img.jpg)
    '''

    mk = Markdown()
    # patch in the customized image pattern matcher with url checking
    mk.inlinePatterns['image_link'] = CheckImagePattern(IMAGE_LINK_RE, mk)
    result = mk.convert(DATA)
    print(result)

Output:

    ILLEGAL: http://remote.com/path/to/img.jpg
    <p><img alt="Alt text" src="/path/to/img.jpg" />
    <img alt="Alt text" src="http://remote.com/path/to/img.jpg" /></p>
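The code above appears to target the older Python-Markdown 2.x API (`ImagePattern`, a fixed match group for the URL, item assignment on `inlinePatterns`). On Python-Markdown 3.x, a similar check could be sketched as a tree processor registered through an extension. This is an assumption-laden sketch, not the answer's method: the class names and `RemoteImageError` are hypothetical, and the priority of 15 is chosen on the assumption that the built-in inline processor is registered at 20, so that `<img>` elements already exist when the check runs.

```python
from urllib.parse import urlparse

from markdown import Markdown
from markdown.extensions import Extension
from markdown.treeprocessors import Treeprocessor


class RemoteImageError(ValueError):
    """Hypothetical exception raised when a post embeds an external image."""


class LocalImagesOnly(Treeprocessor):
    def run(self, root):
        # By now inline parsing has turned ![alt](url) into <img> elements.
        for img in root.iter('img'):
            src = img.get('src', '')
            parts = urlparse(src)
            # A scheme or a host (including //host/...) means not local.
            if parts.scheme or parts.netloc:
                raise RemoteImageError('illegal remote url: %s' % src)


class LocalImagesExtension(Extension):
    def extendMarkdown(self, md):
        # Priority 15 is an assumption: lower than the built-in 'inline' step.
        md.treeprocessors.register(LocalImagesOnly(md), 'local_images_only', 15)


md = Markdown(extensions=[LocalImagesExtension()])
print(md.convert('![Alt text](/path/to/img.jpg)'))
```

Converting a post that contains `![Alt text](http://remote.com/path/to/img.jpg)` should then raise `RemoteImageError` instead of returning HTML. Raw `<img>` tags written directly in the Markdown source are not part of the element tree at this stage, so they would need a separate check.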