Parse a PDF in Ruby

I have a folder with several PDF documents that all share a specific structure:

[Image: example of the PDF layout]

I want to parse the information out of each PDF. Note that the paragraphs have different lengths.

I am not asking anyone to solve the problem for me, but I would appreciate some guidance on how this can be achieved.

I have used nokogiri before, and I need something similar, but for PDF files.

So, the pseudo-result for my example would look like this:

- ItemA
  - Title: ItemA
  - File: 123456789.pdf
  - Image: ImageA.png (the image was stored on disk)
  - Subtitle1: Content for subtitle 1
  - Subtitle2: Content for subtitle 2
  - Subtitle3: Content for subtitle 3
- ItemB
  - [...]
+4
2 answers

pdf-reader is one option. I have used it, but it has problems: sometimes it does not return the text in the proper format.

Another option is docsplit.

+5

A basic example using pdf-reader:

# gem install pdf-reader
require 'pdf-reader'

reader = PDF::Reader.new('my.pdf')

reader.pages.each do |page|
  puts page.text
end
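
To get from the raw page text to the structure in the question, one possible approach is to split the text on known heading lines. The following is only a rough sketch: the names `HEADINGS`, `parse_item` and `parse_pdf` are hypothetical, and it assumes that the first non-empty line is the title and that every subtitle sits on a line of its own, which may not hold for text that pdf-reader returns in an unexpected layout.

```ruby
# Hypothetical heading labels; replace them with the headings your PDFs use.
HEADINGS = %w[Subtitle1 Subtitle2 Subtitle3].freeze

# Turns the plain text of one document into a hash of fields.
# Assumption: the first non-empty line is the title and each heading
# occupies a line of its own. Lines between the title and the first
# heading are ignored.
def parse_item(text, headings = HEADINGS)
  lines = text.lines.map(&:strip).reject(&:empty?)
  item = { 'Title' => lines.shift }
  current = nil
  lines.each do |line|
    if headings.include?(line)
      # A heading starts a new field.
      current = line
      item[current] = String.new
    elsif current
      # Any other line is appended to the field that is currently open,
      # so paragraphs of any length end up as one string.
      item[current] << ' ' unless item[current].empty?
      item[current] << line
    end
  end
  item
end

# Reads all pages with pdf-reader and parses the combined text.
def parse_pdf(path)
  require 'pdf-reader' # gem install pdf-reader
  text = PDF::Reader.new(path).pages.map(&:text).join("\n")
  parse_item(text).merge('File' => File.basename(path))
end
```

Calling `parse_pdf('123456789.pdf')` would then return a hash with `Title`, `File` and one key per heading that was found. Keeping `parse_item` separate from the file reading makes it easy to try against a plain string first.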

To extract images, take a look at the script examples/extract_images.rb in the pdf-reader repository.
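
As a much reduced illustration of what such a script can do, the sketch below saves only JPEG images, because their stream data can be written to disk unchanged. It is not the code from examples/extract_images.rb. The method name `save_jpegs` is hypothetical, and the sketch assumes that each value returned by `page.xobjects` responds to `hash` and `data`, as PDF::Reader::Stream does. Other image formats need extra conversion work and are skipped here.

```ruby
# Saves the JPEG images found on one page and returns the paths written.
# `page` is expected to behave like PDF::Reader::Page, that is, it must
# respond to #number and #xobjects.
def save_jpegs(page, dir = '.')
  written = []
  page.xobjects.each do |name, stream|
    # XObjects can also be forms; only image XObjects are of interest.
    next unless stream.hash[:Subtype] == :Image
    # DCTDecode marks JPEG data; the filter may be a symbol or an array.
    next unless Array(stream.hash[:Filter]) == [:DCTDecode]

    path = File.join(dir, "page#{page.number}-#{name}.jpg")
    File.binwrite(path, stream.data)
    written << path
  end
  written
end
```

With a reader opened as in the example above, `reader.pages.flat_map { |page| save_jpegs(page, 'images') }` would collect the paths of all saved files, provided the `images` directory already exists.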


+3
