Parse a PDF in Ruby

Question

I have a folder of PDF documents that all share a specific structure:

[image: example of the PDF structure]

Now I want to be able to parse information from a PDF. Note that paragraphs have different lengths.

Obviously, I am not asking you to solve the problem for me, but I need some guidance on how this can be achieved.

I have used Nokogiri before, and I need something similar for PDF files.

So, the pseudo-result for my example would look like this:

- ItemA
  - Title: ItemA
  - File: 123456789.pdf
  - Image: ImageA.png (the image was stored on disk)
  - Subtitle1: Content for subtitle 1
  - Subtitle2: Content for subtitle 2
  - Subtitle3: Content for subtitle 3
- ItemB
  - [...]

ruby scripting parsing pdf ocr

Besi Jan 24 '15 at 14:16

2 answers

Shweta · Answer 1 · 2015-01-24T15:13:27+0000

pdf-reader is one option. I have used it, but it has problems: sometimes it does not return the text in the proper format.

docsplit is another option. It is worth trying both "pdf-reader" and "docsplit".

Besi · Answer 2 · 2015-01-24T14:32:50+0000

The text of each page can be extracted with pdf-reader:

# gem install pdf-reader
require 'pdf-reader'

reader = PDF::Reader.new('my.pdf')

reader.pages.each do |page|
  puts page.text
end
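
Once the text is available, it still has to be split into the title and subtitle fields from the question. The following is only a rough sketch of one way to do that, not part of pdf-reader: parse_sections and HEADINGS are hypothetical names, and it assumes that the first non-empty line is the title and that each subtitle sits on a line of its own.

```ruby
# Hypothetical list of subtitles; replace with the headings used in your PDFs.
HEADINGS = %w[Subtitle1 Subtitle2 Subtitle3].freeze

# Groups the lines of extracted text under the known headings.
# Assumes the first non-empty line is the title. Lines that appear
# before the first heading (other than the title) are ignored.
def parse_sections(text, headings = HEADINGS)
  lines = text.lines.map(&:strip).reject(&:empty?)
  return {} if lines.empty?

  result = { 'Title' => lines.shift }
  current = nil

  lines.each do |line|
    if headings.include?(line)
      # A known heading starts a new section
      current = line
      result[current] = +''
    elsif current
      # Join wrapped paragraph lines with a single space
      result[current] << ' ' unless result[current].empty?
      result[current] << line
    end
  end

  result
end

sample = <<~TEXT
  ItemA

  Subtitle1
  Content for
  subtitle 1

  Subtitle2
  Content for subtitle 2
TEXT

p parse_sections(sample)
```

With a real document, the input could be something like reader.pages.map(&:text).join("\n"). Because paragraphs have different lengths, the sketch relies on heading names rather than line counts; if the extracted text does not put headings on separate lines, the matching would need to be adapted.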

To extract the images, see the script examples/extract_images.rb that comes with pdf-reader.

For further processing of the extracted images, RMagick or MiniMagick can be used.
