Reading PDF metadata in PHP

I am trying to read metadata attached to arbitrary PDF files: title, author, subject and keywords.

Is there a PHP library, preferably open source, that can read PDF metadata? If so, how do I use it to retrieve the metadata? If not, how can I retrieve the metadata without one?

To be clear, I'm not interested in creating or modifying PDF files or their metadata, and I don't need PDF bodies. I looked at several libraries, including FPDF (which everyone seems to recommend), but it seems to be intended only for creating PDFs and not for extracting metadata.

+7
6 answers

The Zend Framework includes Zend_Pdf, which makes it very simple:

    $pdf = Zend_Pdf::load($pdfPath);
    echo $pdf->properties['Title'] . "\n";
    echo $pdf->properties['Author'] . "\n";

Limitations: it works only with unencrypted files smaller than 16 MB.

+7

I don't know of any libraries, but a simple way to get the same result is to open the file and parse whatever comes after the last "end".

Try opening a PDF file in a text editor; the parser shouldn't take more than five lines.
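
As a rough illustration of that idea, here is a minimal sketch that scans the raw bytes of a PDF for entries such as `/Title (Some title)`. The function name `extractPdfInfo` is made up for this example. It assumes the metadata is stored as plain literal strings in an uncompressed, unencrypted part of the file; hex strings, UTF-16 values, nested unescaped parentheses and compressed object streams are not handled.

```php
<?php
// Hypothetical helper, not part of any library.
// Scans raw PDF bytes for literal-string entries like "/Title (Some title)".
function extractPdfInfo(string $raw): array
{
    $info = [];
    foreach (['Title', 'Author', 'Subject', 'Keywords'] as $key) {
        // "/Key (value)", where the value may contain backslash-escaped characters.
        $pattern = '~/' . $key . '\s*\(((?:\\\\.|[^\\\\()])*)\)~s';
        if (preg_match_all($pattern, $raw, $m) > 0) {
            // Take the last occurrence, since later revisions are appended to the file.
            $value = $m[1][count($m[1]) - 1];
            // Undo only the simplest escapes: \( \) and \\
            $info[$key] = preg_replace('~\\\\([()\\\\])~', '$1', $value);
        }
    }
    return $info;
}
```

It could be called as `extractPdfInfo(file_get_contents('file.pdf'))`; keys that are not found are simply absent from the returned array.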

+6

PDF Parser does exactly what you want, and it's pretty simple:

    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile('document.pdf');
    // getDetails() returns the document metadata as an array
    $details = $pdf->getDetails();

You can try it on the demo page.
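
To narrow the result down to the four fields from the question, something like the following sketch might help. It assumes the details come back as an associative array keyed by names such as `Title` and `Author`, and that a value may be either a string or an array; `pickMetadata` is a hypothetical helper, not part of PDF Parser.

```php
<?php
// Hypothetical helper: keeps only the requested fields from a details array.
function pickMetadata(
    array $details,
    array $keys = ['Title', 'Author', 'Subject', 'Keywords']
): array {
    $picked = [];
    foreach ($keys as $key) {
        if (!isset($details[$key])) {
            continue; // field not present in this document
        }
        $value = $details[$key];
        // Flatten multi-valued entries into one comma-separated string.
        $picked[$key] = is_array($value) ? implode(', ', $value) : (string) $value;
    }
    return $picked;
}
```

With the parser from the answer above, this would be used as `pickMetadata($pdf->getDetails())`.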

+4

I was looking for the same thing today and came across a small PHP class at http://de77.com/ that offers a quick and dirty solution. You can download the class directly. The output is UTF-8 encoded.

The creator says:

Here is a PHP class I wrote that you can use to get the title, author, and page count of any PDF file. It does not use any external application, just pure PHP.

    // basic example
    include 'PDFInfo.php';
    $p = new PDFInfo;
    $p->load('file.pdf');
    echo $p->author;
    echo $p->title;
    echo $p->pages;

It works for me! All thanks exclusively to the creator of the class ... well, maybe just a little thanks to me too for finding the class;)

+3

You can use PDFtk to extract the number of pages:

    // Windows (grep and sed must be available on the PATH)
    $bin = realpath('C:\\pdftk\\bin\\pdftk.exe');
    $cmd = "cmd /c {$bin} {$path} dump_data | grep NumberOfPages | sed 's/[^0-9]*//'";

    // Unix
    $cmd = "pdftk {$path} dump_data | grep NumberOfPages | sed 's/[^0-9]*//'";
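
The same `dump_data` output can also be parsed in PHP instead of being piped through grep and sed, which gives access to the title and author as well. The sketch below assumes the output contains `InfoKey: ...` / `InfoValue: ...` line pairs and a `NumberOfPages: ...` line; the exact format may differ between pdftk versions, and `parsePdftkDump` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: turns pdftk dump_data text into an associative array.
function parsePdftkDump(string $dump): array
{
    $info = [];
    $key = null;
    foreach (preg_split('/\R/', $dump) as $line) {
        if (strpos($line, 'InfoKey: ') === 0) {
            // Remember the key until its value line arrives.
            $key = substr($line, strlen('InfoKey: '));
        } elseif ($key !== null && strpos($line, 'InfoValue: ') === 0) {
            $info[$key] = substr($line, strlen('InfoValue: '));
            $key = null;
        } elseif (strpos($line, 'NumberOfPages: ') === 0) {
            $info['NumberOfPages'] = (int) substr($line, strlen('NumberOfPages: '));
        }
    }
    return $info;
}
```

The input could come from `shell_exec('pdftk ' . escapeshellarg($path) . ' dump_data')`; wrapping the path in `escapeshellarg()` avoids problems with spaces or shell metacharacters in file names.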

If ImageMagick is available, you can also use:

 $cmd = "identify -format %n {$path}"; 

Run it in PHP via shell_exec():

    $res = shell_exec($cmd);

+1

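    <?php
    $sourcefile = "file path";
    // Read the whole PDF into a string
    $stringedPDF = file_get_contents($sourcefile, true);
    // Capture the value that follows "Title ", delimited by (...) or [...]
    if (preg_match('/(?<=Title )\S(?:(?<=\().+?(?=\))|(?<=\[).+?(?=\]))./', $stringedPDF, $title)) {
        echo $all = $title[0];
    }
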
0