Scrape PDF data and tables into Excel

Question


I am trying to find a good way to make my data entry more efficient.

What I'm looking for is a way to scrape data from a PDF and enter it into Excel.

In particular, the data I work with comes from grocery store flyers. Currently, we must manually enter each deal in the flyer into the database. Sample flyer: http://weeklyspecials.safeway.com/customer_Frame.jsp?drpStoreID=1551

Ideally, I would have columns for products, prices and predefined options (loyalty cards, coupons, variety choices).

Any help would be appreciated and if I need to be more specific let me know.


excel pdf ocr screen-scraping

Casey saunders Apr 25 '15 at 17:38


1 answer

Kurt pfeifle · Answer 1 · 2015-04-26T13:36:36+0000

Having looked at the specific PDF linked in the OP, I have to say that it does not exactly show a typical table format.

It contains many images inside the "cells", but the cells are not all strictly vertically or horizontally aligned:

Page 6 from the PDF linked in the OP

So this is not even a “nice” table, but a very ugly and inconvenient one to work with ...

Having said that, I have to add:

Retrieving even “good” tables from PDF files is generally extremely difficult ...

Standard PDF files do not contain any hint about the semantics of what they draw on the page: the only distinction the syntax makes is between vector elements (lines, fills, ...), images and text.

Whether a character is part of a table, part of a line of text, or just a lone character in an otherwise empty area is not easy to recognize programmatically by parsing the PDF source code.
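To make the point above concrete, here is a minimal Python sketch of what an extractor has to do once it only has positioned text fragments. The fragment tuples, the `group_into_rows` helper and the tolerance value are all hypothetical illustrations, not part of any PDF library; the sketch assumes the usual PDF convention that y grows upward from the bottom of the page.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Cluster (x, y, text) fragments into rows of cell texts.

    Fragments whose y coordinate lies within y_tolerance of the first
    fragment of the current row are treated as belonging to that row.
    """
    rows = []  # each entry: (y of first fragment in row, [(x, text), ...])
    # Visit fragments from the top of the page down (larger y first).
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order the cells from left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical fragments, in the arbitrary order a PDF might draw them.
fragments = [
    (300.0, 700.2, "2.99"),
    (72.0, 700.0, "Apples"),
    (72.0, 684.1, "Bread"),
    (300.0, 684.0, "1.49"),
]
rows = group_into_rows(fragments)
```

Nothing in the fragments says "this is a table"; the row structure is purely a guess based on coordinates, which is why misaligned cells like those in the flyer are so hard to handle.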

To see why PDF should never be considered a suitable format for hosting structured data that is meant to be extracted, see this article:

Why updating Dollars for Documents was so complicated (ProPublica-Website)

... but it works very well with TabulaPDF!

Having said the above, let me add the following:

Contrary to what I said in my introductory paragraphs, there is an excellent family of open source tools for extracting tabular data from PDF files (as long as they are not scanned pages), and it gets better from week to week: check out TabulaPDF.

Tabula-Extractor is written in Ruby. In the background, it uses PDFBox (which is written in Java) and several other third-party libraries. To run Tabula-Extractor, you need to install JRuby-1.7.

Install Tabula-Extractor

I am using the "bleeding-edge" version of Tabula-Extractor directly from the GitHub source code repository. Getting it to work was very simple, because JRuby-1.7.4_0 is already present on my system:

    mkdir ~/svn-stuff
    cd ~/svn-stuff
    git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor

This Git clone already includes the required libraries, so there is no need to install PDFBox separately. The command line tool is located in the /bin/ subdirectory.

Learning the command line options:

    ~/svn-stuff/git.tabula-extractor/bin/tabula -h

    Tabula helps you extract tables from PDFs

    Usage:
           tabula [options] <pdf_file>
    where [options] are:
      --pages, -p <s>:           Comma separated list of ranges, or all.
                                 Examples: --pages 1-3,5-7, --pages 3 or
                                 --pages all. Default is --pages 1 (default: 1)
      --area, -a <s>:            Portion of the page to analyze
                                 (top,left,bottom,right).
                                 Example: --area 269.875,12.75,790.5,561.
                                 Default is entire page
      --columns, -c <s>:         X coordinates of column boundaries.
                                 Example --columns 10.1,20.2,30.3
      --password, -s <s>:        Password to decrypt document. Default is empty
                                 (default: )
      --guess, -g:               Guess the portion of the page to analyze per page.
      --debug, -d:               Print detected table areas instead of processing.
      --format, -f <s>:          Output format (CSV,TSV,HTML,JSON) (default: CSV)
      --outfile, -o <s>:         Write output to <file> instead of STDOUT
                                 (default: -)
      --spreadsheet, -r:         Force PDF to be extracted using spreadsheet-style
                                 extraction (if there are ruling lines separating
                                 each cell, as in a PDF of an Excel spreadsheet)
      --no-spreadsheet, -n:      Force PDF not to be extracted using
                                 spreadsheet-style extraction (if there are ruling
                                 lines separating each cell, as in a PDF of an
                                 Excel spreadsheet)
      --silent, -i:              Suppress all stderr output.
      --use-line-returns, -u:    Use embedded line returns in cells.
                                 (Only in spreadsheet mode.)
      --version, -v:             Print version and exit
      --help, -h:                Show this message
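For repeated extractions (for example one flyer per week), it may be convenient to assemble the command line from a script. The following Python sketch only uses long options that appear in the help text above; the helper names `build_tabula_command` and `run_tabula`, the default `tabula_bin` value and the file name `flyer.pdf` are assumptions made for illustration.

```python
import subprocess


def build_tabula_command(pdf_path, pages="1", area=None, columns=None,
                         guess=False, fmt="CSV", outfile=None,
                         tabula_bin="tabula"):
    """Return the argument list for one tabula invocation.

    area is (top, left, bottom, right); columns is a sequence of
    x coordinates, matching the --area and --columns help entries.
    """
    cmd = [tabula_bin, "--pages", pages, "--format", fmt]
    if area is not None:
        cmd += ["--area", ",".join(str(v) for v in area)]
    if columns is not None:
        cmd += ["--columns", ",".join(str(v) for v in columns)]
    if guess:
        cmd.append("--guess")
    if outfile is not None:
        cmd += ["--outfile", outfile]
    cmd.append(pdf_path)
    return cmd


def run_tabula(cmd):
    """Run the command and return whatever tabula wrote to STDOUT."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Hypothetical input file; the area values are the ones from the help text.
cmd = build_tabula_command("flyer.pdf", pages="1-3",
                           area=(269.875, 12.75, 790.5, 561))
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces. Pass the full path of the `bin/tabula` script as `tabula_bin` if it is not on the `PATH`.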

Retrieving the table the OP wants

I am not even going to try to extract this ugly table from the OP's monster PDF file. I will leave that as an exercise for those readers who feel adventurous enough ...

Instead, I will demonstrate how to extract a “nice” table. I will take pages 651-653 from the official specification of PDF-1.7. Here are screenshots:

I used this command:

    ~/svn-stuff/git.tabula-extractor/bin/tabula \
        -p 651,652,653 -g -n -u -f CSV \
        ~/Downloads/pdfs/PDF32000_2008.pdf

After importing the created CSV into LibreOffice Calc, the spreadsheet looks like this:

To me, this looks like a perfect extraction of a table that spans 3 different PDF pages. (Even the line breaks used inside the table cells made it into the spreadsheet.)
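If the CSV is post-processed with a script instead of being opened in a spreadsheet program, line breaks inside cells need a real CSV parser rather than a plain split on newlines. A minimal Python sketch, assuming the multi-line cells are quoted in the usual CSV way; the sample text and the `read_table` helper are hypothetical and merely stand in for a file written by tabula:

```python
import csv
import io

# Hypothetical CSV text; the last cell of the second row contains a line break.
sample_csv = (
    "Product,Price,Notes\n"
    'Apples,2.99,"Club card price\nlimit 4"\n'
)


def read_table(text):
    """Parse CSV text into a list of rows, keeping line breaks inside cells."""
    # newline="" leaves line endings untouched so the csv module can
    # decide which ones are inside a quoted cell.
    return list(csv.reader(io.StringIO(text, newline="")))


rows = read_table(sample_csv)
header, first = rows[0], rows[1]
```

Here `first[2]` is a single cell whose value still contains the embedded line break, while the table itself has only two rows.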

Update

Here is an asciinema screencast (which you can also download and replay locally in your Linux / Mac OS X / Unix terminal using the asciinema command-line tool), starring tabula-extractor:
