How do I know if PDF pages are color or black and white?

Given a set of PDF files in which some pages are color and the rest are black and white, is there a program that can tell which pages are color and which are black and white? This would be useful, for example, when printing abstracts, so that the color rate is paid only for the color pages. Bonus points for anyone who takes duplex printing into account and sends a black-and-white page to the color printer when it has a color page on the opposite side of the sheet.

+55
colors parsing pdf automation printers
Mar 13 '09 at 4:07
6 answers

This is one of the most interesting questions I've seen! I agree with some of the other posts that rendering to a bitmap and then analyzing the bitmap is the most reliable solution. For simple PDFs, here is a faster but less complete approach:

  1. Parse each PDF page
  2. Look for color directives (g, rg, k, sc, scn, etc.)
  3. Look for embedded images and analyze their colors

My solution below does #1 and half of #2. The other half of #2 would be to follow user-defined colors, which involves looking up the /ColorSpace entries in the page and decoding them. Contact me offline if you are interested, as it is very doable but not in 5 minutes.

First, the main program:

    use CAM::PDF;

    my $infile = shift;
    my $pdf = CAM::PDF->new($infile);

    PAGE:
    for my $p (1 .. $pdf->numPages) {
        my $tree = $pdf->getPageContentTree($p);
        if (!$tree) {
            print "Failed to parse page $p\n";
            next PAGE;
        }

        # Collect every color directive used on this page
        my $colors = $tree->traverse('My::Renderer::FindColors')->{colors};
        my $uncertain = 0;
        for my $color (@{$colors}) {
            my ($name, @rest) = @{$color};
            if ($name eq 'g') {
                # grayscale, nothing to do
            } elsif ($name eq 'rgb') {
                my ($r, $g, $b) = @rest;
                if ($r != $g || $r != $b) {
                    print "Page $p is color\n";
                    next PAGE;
                }
            } elsif ($name eq 'cmyk') {
                my ($c, $m, $y, $k) = @rest;
                if ($c != 0 || $m != 0 || $y != 0) {
                    print "Page $p is color\n";
                    next PAGE;
                }
            } else {
                $uncertain = $name;
            }
        }

        if ($uncertain) {
            print "Page $p has user-defined color ($uncertain), needs more investigation\n";
        } else {
            print "Page $p is grayscale\n";
        }
    }

And then here is a helper renderer that handles the color directives on each page:

    package My::Renderer::FindColors;

    sub new {
        my $pkg = shift;
        return bless { colors => [] }, $pkg;
    }

    sub clone {
        my $self = shift;
        my $pkg = ref $self;
        return bless { colors => $self->{colors}, cs => $self->{cs}, CS => $self->{CS} }, $pkg;
    }

    # Device color operators (fill variants)
    sub rg {
        my ($self, $r, $g, $b) = @_;
        push @{$self->{colors}}, ['rgb', $r, $g, $b];
    }
    sub g {
        my ($self, $gray) = @_;
        push @{$self->{colors}}, ['rgb', $gray, $gray, $gray];
    }
    sub k {
        my ($self, $c, $m, $y, $k) = @_;
        push @{$self->{colors}}, ['cmyk', $c, $m, $y, $k];
    }

    # Remember the current fill (cs) and stroke (CS) color spaces
    sub cs {
        my ($self, $name) = @_;
        $self->{cs} = $name;
    }
    sub CS {
        my ($self, $name) = @_;
        $self->{CS} = $name;
    }

    # Set a color in the current color space
    sub _sc {
        my ($self, $cs, @rest) = @_;
        return if !$cs;    # syntax error
        if ($cs eq 'DeviceRGB') {
            $self->rg(@rest);
        } elsif ($cs eq 'DeviceGray') {
            $self->g(@rest);
        } elsif ($cs eq 'DeviceCMYK') {
            $self->k(@rest);
        } else {
            push @{$self->{colors}}, [$cs, @rest];
        }
    }
    sub sc {
        my ($self, @rest) = @_;
        $self->_sc($self->{cs}, @rest);
    }
    sub SC {
        my ($self, @rest) = @_;
        $self->_sc($self->{CS}, @rest);
    }

    # The remaining variants are handled like their counterparts above
    sub scn { sc(@_); }
    sub SCN { SC(@_); }
    sub RG  { rg(@_); }
    sub G   { g(@_); }
    sub K   { k(@_); }
+27
Mar 15 '09 at 12:28

You can use the ImageMagick identify tool. When used on PDF pages, it first converts the page to a raster image. Whether the page contains color can be tested with the -format "%[colorspace]" option, which for my PDF files printed either Gray or RGB. IMHO identify (or whatever tool it uses in the background, Ghostscript?) chooses the color space depending on the colors that are present.

Example:

 identify -format "%[colorspace]" $FILE.pdf[$PAGE] 

where PAGE is the page number starting from 0, not 1. If no page selection is given, all pages are collapsed into one, which is not what you want.

I wrote the following Bash script, which uses pdfinfo to get the number of pages and then loops over them. It prints the pages that are in color. I also added a feature for double-sided documents, where you may also need the non-colored page on the back of the sheet.

Using the space-separated list that it prints, the color pages can be extracted with pdftk:

 pdftk $FILE cat $PAGELIST output color_${FILE}.pdf 



    #!/bin/bash

    FILE=$1
    PAGES=$(pdfinfo ${FILE} | grep 'Pages:' | sed 's/Pages:\s*//')

    GRAYPAGES=""
    COLORPAGES=""
    DOUBLECOLORPAGES=""

    echo "Pages: $PAGES"
    N=1
    while (test "$N" -le "$PAGES")
    do
        COLORSPACE=$( identify -format "%[colorspace]" "$FILE[$((N-1))]" )
        echo "$N: $COLORSPACE"
        if [[ $COLORSPACE == "Gray" ]]
        then
            GRAYPAGES="$GRAYPAGES $N"
        else
            COLORPAGES="$COLORPAGES $N"
            # For double sided documents also list the page on the other side of the sheet:
            if [[ $((N%2)) -eq 1 ]]
            then
                DOUBLECOLORPAGES="$DOUBLECOLORPAGES $N $((N+1))"
                #N=$((N+1))
            else
                DOUBLECOLORPAGES="$DOUBLECOLORPAGES $((N-1)) $N"
            fi
        fi
        N=$((N+1))
    done

    echo $DOUBLECOLORPAGES
    echo $COLORPAGES
    echo $GRAYPAGES
    #pdftk $FILE cat $COLORPAGES output color_${FILE}.pdf
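The duplex handling can also be thought of per sheet rather than per page. The following is only a sketch under that assumption, not part of the script above: the hypothetical helper `duplex_color_pages` takes the total page count followed by the color page numbers, maps each color page to its sheet (sheet k carries pages 2k-1 and 2k), removes duplicate sheets, and prints both sides of every remaining sheet.

```shell
#!/bin/sh
# Hypothetical helper: duplex_color_pages TOTAL_PAGES COLOR_PAGE...
# Prints, space-separated, every page that sits on a sheet with at least
# one color page. Assumes sheet k carries pages 2k-1 (front) and 2k (back).
duplex_color_pages() {
    last_page=$1
    shift
    for n in "$@"; do
        echo $(( (n + 1) / 2 ))          # sheet index of page n
    done | sort -n -u | while read -r sheet; do
        front=$(( 2 * sheet - 1 ))
        back=$(( 2 * sheet ))
        printf '%s\n' "$front"
        # The last sheet of an odd-length document has no back page.
        if [ "$back" -le "$last_page" ]; then
            printf '%s\n' "$back"
        fi
    done | tr '\n' ' ' | sed 's/ $//'
}
```

For example, `duplex_color_pages 10 3 4 6` lists each of pages 3, 4, 5 and 6 once, even though pages 3 and 4 share a sheet.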
+15
Apr 26 '12 at 22:11

Newer versions of Ghostscript (version 9.05 and later) include a "device" called inkcov. It calculates the ink coverage of each page (not of each image) as Cyan (C), Magenta (M), Yellow (Y) and Black (K) values, where 0.00000 means 0% and 1.00000 means 100% (see Detecting all pages containing color).

For example:

    $ gs -q -o - -sDEVICE=inkcov file.pdf
     0.11264  0.11605  0.11605  0.09364 CMYK OK
     0.11260  0.11601  0.11601  0.09360 CMYK OK

If the CMY values are not 0, the page is color.

To print only the pages that contain color, use this handy one-liner:

 $ gs -o - -sDEVICE=inkcov file.pdf |tail -n +4 |sed '/^Page*/N;s/\n//'|sed -E '/Page [0-9]+ 0.00000 0.00000 0.00000 / d' 
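If bare page numbers are more convenient than the filtered lines, the same rule can be sketched with awk. This assumes the quiet (`-q`) output shown earlier, with exactly one `C M Y K CMYK OK` line per page in page order; the function name `color_pages_from_inkcov` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: reads inkcov output on stdin and prints the 1-based
# numbers of pages whose C, M, or Y coverage is non-zero.
# Assumes one "C M Y K CMYK OK" line per page, in page order (gs -q).
color_pages_from_inkcov() {
    awk '$5 == "CMYK" {
             page++
             if ($1 + 0 > 0 || $2 + 0 > 0 || $3 + 0 > 0) print page
         }'
}
```

It would be used as `gs -q -o - -sDEVICE=inkcov file.pdf | color_pages_from_inkcov`. Pages that use only black ink (K) are treated as not color.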
+11
Feb 06 '15 at 15:52

The script from Martin Scharrer is great. It contains a minor bug: when two color pages directly follow each other, it counts them twice. I fixed that. In addition, the script now counts the pages and also lists the grayscale pages for double-sided printing. It prints the pages comma-separated, so the output can be used directly for printing from a PDF viewer. I have added the code below, but you can also download it here.

Cheers, TimeShift

    #!/bin/bash

    if [ $# -ne 1 ]
    then
        echo "USAGE: This script needs exactly one parameter: the path to the PDF"
        kill -SIGINT $$
    fi

    FILE=$1
    PAGES=$(pdfinfo ${FILE} | grep 'Pages:' | sed 's/Pages:\s*//')

    GRAYPAGES=""
    COLORPAGES=""
    DOUBLECOLORPAGES=""
    DOUBLEGRAYPAGES=""
    OLDGP=""
    DOUBLEPAGE=0
    DPGC=0
    DPCC=0
    SPGC=0
    SPCC=0

    echo "Pages: $PAGES"
    N=1
    while (test "$N" -le "$PAGES")
    do
        COLORSPACE=$( identify -format "%[colorspace]" "$FILE[$((N-1))]" )
        echo "$N: $COLORSPACE"
        if [[ $DOUBLEPAGE -eq -1 ]]
        then
            DOUBLEGRAYPAGES="$OLDGP"
            DPGC=$((DPGC-1))
            DOUBLEPAGE=0
        fi
        if [[ $COLORSPACE == "Gray" ]]
        then
            GRAYPAGES="$GRAYPAGES,$N"
            SPGC=$((SPGC+1))
            if [[ $DOUBLEPAGE -eq 0 ]]
            then
                OLDGP="$DOUBLEGRAYPAGES"
                DOUBLEGRAYPAGES="$DOUBLEGRAYPAGES,$N"
                DPGC=$((DPGC+1))
            else
                DOUBLEPAGE=0
            fi
        else
            COLORPAGES="$COLORPAGES,$N"
            SPCC=$((SPCC+1))
            # For double sided documents also list the page on the other side of the sheet:
            if [[ $((N%2)) -eq 1 ]]
            then
                DOUBLECOLORPAGES="$DOUBLECOLORPAGES,$N,$((N+1))"
                DOUBLEPAGE=$((N+1))
                DPCC=$((DPCC+2))
                #N=$((N+1))
            else
                if [[ $DOUBLEPAGE -eq 0 ]]
                then
                    DOUBLECOLORPAGES="$DOUBLECOLORPAGES,$((N-1)),$N"
                    DPCC=$((DPCC+2))
                    DOUBLEPAGE=-1
                elif [[ $DOUBLEPAGE -gt 0 ]]
                then
                    DOUBLEPAGE=0
                fi
            fi
        fi
        N=$((N+1))
    done

    echo " "
    echo "Double-paged printing:"
    echo " Color($DPCC): ${DOUBLECOLORPAGES:1:${#DOUBLECOLORPAGES}-1}"
    echo " Gray($DPGC): ${DOUBLEGRAYPAGES:1:${#DOUBLEGRAYPAGES}-1}"
    echo " "
    echo "Single-paged printing:"
    echo " Color($SPCC): ${COLORPAGES:1:${#COLORPAGES}-1}"
    echo " Gray($SPGC): ${GRAYPAGES:1:${#GRAYPAGES}-1}"
    #pdftk $FILE cat $COLORPAGES output color_${FILE}.pdf
+3
Nov 14 '12 at 15:30

ImageMagick has built-in image comparison methods.

http://www.imagemagick.org/Usage/compare/#type_general

There are several Perl APIs for ImageMagick, so if you cleverly combine one of them with a PDF-to-image converter, you can build your black-and-white test.

+2
Mar 13 '09 at 4:31

I would try to do it like this, although there may be simpler solutions, and I'm curious to hear them. This is just what I would try:

  • Loop through all pages
  • Extract each page to an image
  • Check the color range of the image

Getting the page count takes little effort in Perl. It is basically a regex:

    (/Type)\s?(/Page)[/>\s]

You just need to count how many times this regular expression matches in the PDF file, minus the number of times you find the string "<>" (empty pages that are not rendered).
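As a rough sketch of the counting step only, the pattern can be handed to grep. The function name `count_page_objects` is made up for this example. It assumes a grep that supports `-a` and `-o` (GNU and BSD grep do), it additionally accepts `/Page` at the end of a line, and it does not attempt the subtraction of empty pages mentioned above. It also cannot see page objects stored inside compressed object streams, so treat the result as an estimate.

```shell
#!/bin/sh
# Hypothetical helper: count_page_objects FILE
# Counts occurrences of "/Type /Page" (but not "/Type /Pages") in the raw
# bytes of a PDF file. Page objects inside compressed streams are not seen.
count_page_objects() {
    grep -a -o -E '/Type[[:space:]]?/Page([/>[:space:]]|$)' "$1" | wc -l | tr -d ' '
}
```

Usage would be `count_page_objects file.pdf`; comparing the result with the `Pages:` line from pdfinfo is a quick sanity check.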

To extract the image, you can use ImageMagick. Or look at this question.

Finally, as for black and white, it depends on whether you mean literally black and white or shades of gray. For black and white, the whole image should contain only black and white pixels. Grayscale is really not my specialty, but I think you could check whether the average values of red, green and blue are close to each other, or whether the original image and a copy converted to grayscale are close to each other.

I hope this gives you some hints to help you move on.
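One way to make the "close to each other" idea concrete is a per-pixel check, which is stricter than comparing channel averages. The sketch below assumes the rendered page has already been dumped as one pixel per line in the form `R G B` (how that dump is produced is left open), and the name `classify_pixels` and the tolerance argument are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: classify_pixels [TOLERANCE]
# Reads one pixel per line as "R G B" numbers on stdin. Prints "gray" when
# the three channels of every pixel differ by at most TOLERANCE (default 0),
# otherwise prints "color".
classify_pixels() {
    awk -v tol="${1:-0}" '
        function abs(x) { return x < 0 ? -x : x }
        NF >= 3 && (abs($1 - $2) > tol || abs($1 - $3) > tol || abs($2 - $3) > tol) {
            found = 1
            exit
        }
        END { print (found ? "color" : "gray") }
    '
}
```

A small non-zero tolerance can help with scanned pages, where gray pixels often carry a slight color cast.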

+2
Mar 13 '09 at 4:34