How to determine the length (in bytes) of page 1 in a linearized PDF file?

I know that I can "linearize" a PDF file, for example with the Acrobat SDK or with commercial tools. This is also called "optimizing for the web," and it reorganizes the PDF so that page 1 can load as quickly as possible. PDF files served this way display faster because the viewer does not have to wait for the entire file to download.

Update: Based on the answer below, I now understand that the linearized PDF is not just reordered, but also contains metadata about its own structure in the form of a linearization dictionary.

I have an application in which I want to pre-fetch several PDF files (query results) in anticipation that the user will want to see one of them. It would be great if my client could download page 1, and only page 1, for each of the search results. When the user selects one of them, page 1 can be displayed instantly while the rest loads in the background.

I am looking for a general solution that can be used on the server side (Windows or Linux) to pre-process my PDF files so that I can store and serve page 1 and the rest separately. Really, all I need to know is the offset of the last byte in the PDF file that is required to display page 1 correctly. If I have this number, everything else follows.

I looked at the ISO specification for PDF, but the file format seems too complicated to simply work out where page 1 ends. On the other hand, tools that linearize PDF files must almost certainly know where page 1 ends.

I am not interested in the problems associated with serving PDF files in parts to clients; this part has already been solved, since the client is an application, not a browser, and I have full control.

I also don't think it will help to split the PDF with tools like Split AP into a "page 1" PDF and a full PDF. If I do that, I will not be able to trick the client's viewer into thinking that it is dealing with a single PDF file, and there will be a noticeable flicker when I replace the "page 1" PDF with the full PDF.

Any help or pointers appreciated.

Solution (based on Bobrovsky's answer below):

A properly linearized PDF starts with a header line (defined in section 7.5.2 of the PDF specification), such as "%PDF-1.7", followed by a comment line containing at least four binary characters (defined as byte values of 128 or higher). For example:

    %PDF-1.7
    %€€€€

This header is immediately followed by a linearization dictionary (defined in Appendix F of the PDF specification). Example:

    43 0 obj
    << /Linearized 1.0   % Version
       /L 54567          % File length
       /H [475 598]      % Primary hint stream offset and length (part 5)
       /O 45             % Object number of first page's page object (part 6)
       /E 5437           % Offset of end of first page
       /N 11             % Number of pages in document
       /T 52786          % Offset of first entry in main cross-reference table (part 11)
    >>
    endobj

In this example, the first page ends at byte offset 5437. This data structure is simple enough to parse in almost any language. The line "43 0 obj" gives the object number of this dictionary (43) and its generation number (always zero for linearized files). The dictionary itself is enclosed in << and >>, between which are key-value pairs (keys are prefixed with a slash, such as "/E").

And here is a C# method that finds the corresponding number using a regular expression:

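    using System;
    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;

    public int GetPageOneLength(byte[] data)
    {
        // According to the ISO PDF spec: "The linearization parameter dictionary shall be
        // entirely contained within the first 1024 bytes of the PDF file" (p. 679)
        int count = Math.Min(data.Length, 1024); // guard against files shorter than 1024 bytes
        string preamble = Encoding.ASCII.GetString(data, 0, count);

        // Note that the binary section on line 2 of the header will be entirely
        // converted to question marks ('?').
        // Singleline lets '.' match the newlines inside the dictionary; the lazy
        // quantifiers keep the match from running past the first "/E <number>".
        var match = Regex.Match(
            preamble,
            @"<<\s*/Linearized.+?/E\s+(?<offset>\d+).*?>>",
            RegexOptions.Singleline);
        if (!match.Success)
            throw new InvalidDataException("PDF does not have a proper linearization dictionary");

        return int.Parse(match.Groups["offset"].Value);
    }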

Note: Bobrovsky warns that a file may contain a linearization dictionary without actually being linearized correctly (perhaps because of incremental editing?). In my case this is not a problem, since I linearize all the PDF files myself.
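
Once the /E value is known, the pre-processing step described in the question (storing page 1 and the rest separately) could be sketched roughly as follows. This is only an illustration: `PdfSplitter` and `SplitAt` are hypothetical names, and the code assumes that /E is an exclusive end offset, so that bytes 0 through /E - 1 form the first-page portion.

```csharp
using System;

public static class PdfSplitter
{
    // Hypothetical helper: splits the file bytes into the first-page
    // portion [0, pageOneLength) and the remainder [pageOneLength, end).
    // Assumption: pageOneLength is the /E value from the linearization dictionary.
    public static (byte[] Head, byte[] Tail) SplitAt(byte[] data, int pageOneLength)
    {
        if (data == null)
            throw new ArgumentNullException(nameof(data));
        if (pageOneLength < 0 || pageOneLength > data.Length)
            throw new ArgumentOutOfRangeException(nameof(pageOneLength));

        byte[] head = new byte[pageOneLength];
        byte[] tail = new byte[data.Length - pageOneLength];

        Array.Copy(data, 0, head, 0, head.Length);
        Array.Copy(data, pageOneLength, tail, 0, tail.Length);

        return (head, tail);
    }
}
```

The client can then fetch the head part first and append the tail part later, so that the viewer only ever sees one growing file. The value passed as `pageOneLength` would be the number returned by GetPageOneLength above.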

1 answer

The linearization dictionary should help you with this.

The dictionary must contain an E entry, which the specification describes as:

The offset of the end of the first page (end of part 6 in example F.1) relative to the beginning of the file.

Please note that not every file with a linearization dictionary is actually linearized (broken generators, changes made after linearization, etc.). So you cannot rely on this approach unless your files have been verified to be linearized.
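
One inexpensive sanity check, sketched below, is to compare the /L entry (the file length declared in the linearization dictionary) with the actual length of the file; a mismatch typically means the file was changed after it was linearized. This is a minimal sketch, not a full validation: `LinearizationCheck` and `DeclaredLengthMatches` are hypothetical names, and a matching /L does not prove that the rest of the linearization data is correct.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class LinearizationCheck
{
    // Hypothetical helper: returns true when the /L value found in the
    // first 1024 bytes equals the actual length of the file.
    public static bool DeclaredLengthMatches(byte[] data)
    {
        int count = Math.Min(data.Length, 1024);
        string preamble = Encoding.ASCII.GetString(data, 0, count);

        // [^>]*? keeps the search inside the dictionary, assuming there is
        // no '>' character before the /L entry (true for typical output).
        var match = Regex.Match(
            preamble,
            @"<<\s*/Linearized[^>]*?/L\s+(?<length>\d+)");

        return match.Success
            && long.TryParse(match.Groups["length"].Value, out long declared)
            && declared == data.LongLength;
    }
}
```

A file that fails this check should be treated as an ordinary, non-linearized PDF, and its /E value should not be trusted.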

Please see section F.2.2, "Linearization Parameter Dictionary (Part 2)," in the PDF reference for more information on the linearization dictionary.
