Parsing a pdf file using an interactive content page

Question


Say we have a PDF file with a clickable table of contents (I'm talking about chapters and subsections). How can this file be parsed in C#, and how can an application tell whether a PDF file has sections, a table of contents, etc.?

This is a link to a PDF without an interactive table of contents: https://docs.google.com/open?id=0B1EbI-EMJxmkODE1Mm5WbFpEdXc I could not find a PDF file with an interactive table of contents, but I found a guide for creating one here: http://everythingyoumightneed.blogspot.com/2013/01/how-to-create-pdf-with-clickable-links.html

So my question is: how can an application tell whether a PDF has such a table of contents, and how can it parse the document using the interactive links?

+4

c# pdf pdf-parsing c#-4.0

John demetriou Dec 30 '12 at 20:40


2 answers

Since PDF is a binary format, you will need to use a PDF library, such as PDFlib, to read PDF files.

You can also check CodeProject for some examples, e.g. "Convert PDF to text in C#".

+1

Methodman Dec 30 '12 at 20:45


David van Driessche · Accepted Answer · 2012-12-31T09:19:14+0000

Your problem is no different from trying to figure out where paragraphs and columns are in PDF files; PDF does not usually mark a table of contents page as such. Therefore, even with a PDF library (e.g. iTextSharp, mentioned by mkl) this will not be a trivial task.

With such a library, you can access the pages in the PDF file and the text on those pages. However, if this is a book, for example, the table of contents may be on the first, second, third or x-th page of the PDF file, because other pages appear in front of it (cover, inside cover, copyright, you name it...).

Thus, an algorithm for detecting whether there is a table of contents would have to look for it somewhere in the first x pages of the PDF file. Since there are no standard tags that mark text as belonging to the table of contents, this has to be done by analyzing the formatting of the text on each of those pages.
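As a rough illustration of that kind of text analysis, here is a minimal sketch. It assumes you have already extracted the text of each page with whatever PDF library you use, and it assumes table of contents entries look like "title, dot leaders or wide spacing, page number". The class name, the pattern and the thresholds are all made up for this example and will need tuning for real documents.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TocTextHeuristic
{
    // Assumed shape of a TOC entry: some title text, then dot leaders
    // or a run of whitespace, then a page number at the end of the line,
    // e.g. "1.2 Getting started ........ 17".
    private static readonly Regex TocLine =
        new Regex(@"^\s*\S.*?(\.{3,}|\s{2,})\s*\d+\s*$");

    // True when at least minRatio of the non-empty lines look like TOC entries.
    public static bool LooksLikeToc(IEnumerable<string> lines, double minRatio = 0.5)
    {
        List<string> nonEmpty = lines.Where(l => !string.IsNullOrWhiteSpace(l)).ToList();
        if (nonEmpty.Count == 0) return false;
        int hits = nonEmpty.Count(l => TocLine.IsMatch(l));
        return (double)hits / nonEmpty.Count >= minRatio;
    }

    // Returns the 1-based number of the first page, among the first maxPages,
    // whose text looks like a table of contents, or -1 if none does.
    public static int FindTocPage(IList<string> pageTexts, int maxPages = 10)
    {
        int limit = Math.Min(maxPages, pageTexts.Count);
        for (int i = 0; i < limit; i++)
        {
            string[] lines = pageTexts[i].Split('\n');
            if (LooksLikeToc(lines)) return i + 1;
        }
        return -1;
    }
}
```

A pattern like this will produce false positives (any line that ends in a number after wide spacing matches), so treat the result as a hint, not as proof.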

There are two things that can help (if available):

1) In many PDF files, the entries in the table of contents are, as you say, clickable. Thus, you can look through the PDF file and try to find the first page containing many hyperlinks.

2) In many PDF files, the table of contents is mirrored in the bookmarks. Thus, you can also examine the bookmark structure and see whether you can use it to find out how many chapters the book has.

Keep in mind that both of these features are optional, and even when present they are not standardized.
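To make the two hints concrete, here is a small sketch of the decision logic only. It deliberately does not call any PDF library: it assumes you have already obtained the number of link annotations on each page and the bookmark tree from your library (with iTextSharp, for instance, bookmarks can be read via SimpleBookmark.GetBookmark). The Bookmark class, the method names and the thresholds are hypothetical and exist only for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal model of one bookmark (outline) entry.
public sealed class Bookmark
{
    public string Title { get; private set; }
    public List<Bookmark> Kids { get; private set; }

    public Bookmark(string title)
    {
        Title = title;
        Kids = new List<Bookmark>();
    }
}

public static class TocHints
{
    // Hint 1: 1-based number of the first page, among the first maxPages,
    // that has at least minLinks link annotations; -1 if there is none.
    public static int FirstLinkHeavyPage(IList<int> linksPerPage,
                                         int maxPages = 10, int minLinks = 5)
    {
        int limit = Math.Min(maxPages, linksPerPage.Count);
        for (int i = 0; i < limit; i++)
        {
            if (linksPerPage[i] >= minLinks) return i + 1;
        }
        return -1;
    }

    // Hint 2: number of top-level bookmarks, as a rough guess at the
    // chapter count. This assumes the top level corresponds to chapters,
    // which nothing in the PDF guarantees.
    public static int CountTopLevel(IList<Bookmark> roots)
    {
        return roots == null ? 0 : roots.Count;
    }

    // Total number of bookmark entries, including nested ones.
    public static int CountAll(IEnumerable<Bookmark> nodes)
    {
        int total = 0;
        foreach (Bookmark b in nodes)
        {
            total += 1 + CountAll(b.Kids);
        }
        return total;
    }
}
```

Both methods return "nothing found" values (-1 or 0) for files that lack links or bookmarks, so the caller still needs a fallback such as analyzing the page text.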
