Using PIL to Detect Blank Page Scans

So, I often run huge two-sided scan jobs on the not-so-smart multifunction Canon, which leaves me with an enormous folder of JPEGs. Am I crazy to consider using PIL to crawl a folder of images, detect the scans that are blank pages, and flag them for deletion?

Leaving the folder-crawling and flagging parts aside, I imagine it would go something like this (rough code sketch after the list):

  • Check whether the image is greyscale, since that can't be assumed.
  • If so, detect the dominant range of shades (the background colour).
  • If not, detect the dominant range of shades, restricting the check to light greys.
  • Determine what percentage of the whole image consists of those shades.
  • Try to find a threshold that adequately detects pages with type, writing, or imagery.
  • Possibly check chunks of the image at a time to improve the accuracy of the threshold.
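
A rough, untested sketch of the kind of thing I have in mind (I've collapsed the greyscale/colour branching by just converting to greyscale, and the cutoff numbers are placeholders I'd have to tune):

    from PIL import Image

    def looks_blank(path, white_cutoff=230, blank_fraction=0.99):
        # white_cutoff and blank_fraction are made-up starting values
        img = Image.open(path).convert('L')   # work in greyscale either way
        hist = img.histogram()                # 256 buckets of pixel counts
        light = sum(hist[white_cutoff:])      # pixels at/above background brightness
        return light / float(sum(hist)) >= blank_fraction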

I know this is kind of an edge case, but can anyone with PIL experience give me some pointers?

+5
3 answers

Here is an alternative solution using mahotas and milk.

  • Start by creating two directories, positives/ and negatives/, into which you manually sort a few examples.
  • I assume the rest of the data sits in an unlabeled/ directory.
  • Compute features for all of the images in positives and negatives.
  • Learn a classifier.
  • Use that classifier on the unlabeled images.

In the code below, I used jug to give you the option of running it on multiple processors, but the code also works if you remove every line that mentions TaskGenerator:

    from glob import glob

    import mahotas
    import mahotas.features
    import milk
    from jug import TaskGenerator

    @TaskGenerator
    def features_for(imname):
        # Haralick texture features, averaged over the four directions
        img = mahotas.imread(imname)
        return mahotas.features.haralick(img).mean(0)

    @TaskGenerator
    def learn_model(features, labels):
        learner = milk.defaultclassifier()
        return learner.train(features, labels)

    @TaskGenerator
    def classify(model, features):
        return model.apply(features)

    positives = glob('positives/*.jpg')
    negatives = glob('negatives/*.jpg')
    unlabeled = glob('unlabeled/*.jpg')

    features = list(map(features_for, negatives + positives))
    labels = [0] * len(negatives) + [1] * len(positives)

    model = learn_model(features, labels)
    labeled = [classify(model, features_for(u)) for u in unlabeled]
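
If I remember jug's command line correctly, you would save this as, say, blanks.py (the filename is just an example) and run

    jug execute blanks.py

once per processor you want to use, in separate shells; without jug, just drop the import and the decorators as noted above.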

This uses texture features, which are probably good enough, but you can play with other features in mahotas.features if you like (or try mahotas.surf, but that gets more complicated). In general, I have found it hard to do this kind of classification with the hard thresholds you are looking for unless the scanning is very tightly controlled.

+11

As a first attempt, sort the image folder by file size. If all the scans from one document have the same resolution, the blank pages will certainly produce smaller files than the non-blank ones.
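
A minimal sketch of that first pass, assuming the scans are JPEGs sitting in one folder (the scans/ path is just an example):

    import os
    from glob import glob

    # smallest files first: blank pages should cluster near the top of the listing
    for path in sorted(glob('scans/*.jpg'), key=os.path.getsize):
        print('%10d  %s' % (os.path.getsize(path), path))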

I don't know how many pages you are scanning, but if the number is low enough this could be an easy quick fix.

+4

A few non-PIL suggestions to consider:

Scans of printed or written material will have lots of sharp, high-contrast edges; something like a median filter (to knock down noise) followed by some simple edge detection might do a good job of distinguishing real content from blank pages.
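
A rough PIL version of that idea, with the edge-strength cutoff and fraction being made-up numbers you would have to tune:

    from PIL import Image, ImageFilter

    def has_content(path, edge_cutoff=40, min_edge_fraction=0.005):
        # edge_cutoff and min_edge_fraction are placeholder values
        img = Image.open(path).convert('L')
        img = img.filter(ImageFilter.MedianFilter(3))   # knock down scanner noise
        edges = img.filter(ImageFilter.FIND_EDGES)
        hist = edges.histogram()
        strong = sum(hist[edge_cutoff:])                # strongly responding pixels
        return strong / float(sum(hist)) >= min_edge_fraction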

Testing chunks of the image at a time is useful not only because it can improve your accuracy, but also because it can help you bail out early on many pages. Presumably most of your scans are not blank, so you should begin with a simple check that usually identifies non-blank pages as non-blank; only if it says the page might be blank do you need to look more closely.
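
One way to structure that bail-out, as a sketch; the near-white fraction below is only a stand-in for whichever cheap test you settle on, and the 0.95 thresholds are guesses:

    from PIL import Image

    def fraction_light(img, cutoff=230):
        # stand-in 'cheap' measure: share of near-white pixels
        hist = img.convert('L').histogram()
        return sum(hist[cutoff:]) / float(sum(hist))

    def page_is_blank(img, tiles=4):
        if fraction_light(img) < 0.95:     # most non-blank pages fail here, cheaply
            return False
        w, h = img.size                    # only maybe-blank pages pay for the tile pass
        for i in range(tiles):
            for j in range(tiles):
                box = (i * w // tiles, j * h // tiles,
                       (i + 1) * w // tiles, (j + 1) * h // tiles)
                if fraction_light(img.crop(box)) < 0.95:
                    return False
        return True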

If either the lighting or the page itself is uneven, you might want to start with something like image = image - filter(image), where filter does very broad smoothing. That reduces the need to identify dominant shades, and also copes when the dominant shade isn't quite uniform across the page.
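
In PIL terms that subtraction might look roughly like this, assuming a Pillow-style GaussianBlur and a blur radius that is only a guess:

    from PIL import Image, ImageFilter, ImageChops

    def flatten_background(path, radius=50):
        img = Image.open(path).convert('L')
        # a very wide blur approximates the page background / lighting gradient
        background = img.filter(ImageFilter.GaussianBlur(radius))
        # a blank page comes out nearly uniformly dark after the subtraction
        return ImageChops.difference(img, background)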

+2
