Extract text from a PDF based on position in C++

Question


I am trying to extract text from a PDF document based on its coordinates, so I came across two concepts in the Adobe PDF Reference (chapter 5.3):

Text-positioning operators
Text-showing operators

I am currently interested in the positioning operators Td and Tm. When using Td, you have tx and ty relative to the start of the current line, which is clearly visible in the PDF: tx ty Td. I used this to extract text at the coordinates tx and ty. My question is how to extract text from a PDF based on its position when only tx and ty are supplied.

 a b c d e f Tm

This is the "formula" for using Tm. What do the values a to f represent? This is my input for Tm:

 BT /F1 8.88 Tf 0 0 0 rg 0.9998 0 0 1 401.52 448.08 Tm [<0014>-11<0015>-11<0013>-11<000F>-19<0014>-11<0019>] TJ ET

Why does each group of four digits have a leading 00? Is that hex? Should I convert it from hex to int and then to the corresponding character?

This is my input for Td:

 BT 43.20 421.90 Td 0 Tw /C001 10.00 Tf 0.00 Tw <BlablablaTextInHexThatICanProcess>Tj ET

This is much clearer: the coordinates are right there. How do I extract text from a text object positioned with Tm based on simple x and y coordinates? I use C++ and the PoDoFo library.

+5

c++ pdf podofo

AlexandruC May 09, '13 at 13:13


2 answers

Do not underestimate the scope of this task. The text matrix bit is pretty simple. The hard bit is the text itself.

Let's start with your query - why does each group of four have a leading 00?

Well, PDF does not have a single standard text encoding; it has many, many of them. You need to know what the encoding is for the font before you can decode the text.

So in your example:

 BT /F1 8.88 Tf 0 0 0 rg 0.9998 0 0 1 401.52 448.08 Tm [<0014>-11<0015>-11<0013>-11<000F>-19<0014>-11<0019>] TJ ET

The font is the /F1 bit. This is a name defined on the page (or its parents) that refers to the font. You need to look up the font and find out what its encoding is.

Given the contents of your example, I suspect that the encoding is an Identity one and that the four-digit hexadecimal numbers are glyph identifiers in the font. If so, the font should have a ToUnicode entry that lets you look up a glyph identifier and get back a Unicode character.
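
To illustrate the first step, here is a minimal C++ sketch that splits the hex strings of the TJ array from the question into 2-byte codes and looks them up in a code-to-Unicode table. It assumes that the font really uses a 2-byte Identity encoding and that the hex strings contain no whitespace. The function names are made up for this example and are not PoDoFo API, and the table would have to be built from the font's ToUnicode CMap, which is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Split the hex digits found between '<' and '>' into 2-byte codes.
// Assumption: a 2-byte encoding such as Identity-H; other encodings
// use other code lengths.
std::vector<std::uint16_t> decodeHexString(const std::string& hex) {
    assert(hex.size() % 4 == 0 && "expects whole 2-byte codes");
    std::vector<std::uint16_t> codes;
    for (std::size_t i = 0; i + 4 <= hex.size(); i += 4) {
        codes.push_back(static_cast<std::uint16_t>(
            std::stoul(hex.substr(i, 4), nullptr, 16)));
    }
    return codes;
}

// Collect the codes of every <...> string in a TJ array, skipping the
// numeric position adjustments between the strings.
std::vector<std::uint16_t> codesFromTJ(const std::string& tjArray) {
    std::vector<std::uint16_t> all;
    std::size_t pos = 0;
    while ((pos = tjArray.find('<', pos)) != std::string::npos) {
        const std::size_t end = tjArray.find('>', pos);
        if (end == std::string::npos) break;
        const auto codes =
            decodeHexString(tjArray.substr(pos + 1, end - pos - 1));
        all.insert(all.end(), codes.begin(), codes.end());
        pos = end + 1;
    }
    return all;
}

// Map codes to Unicode via a table that stands in for the font's
// ToUnicode data (hypothetical; building it is not shown).
// Codes missing from the table become U+FFFD.
std::u32string toUnicode(const std::vector<std::uint16_t>& codes,
                         const std::map<std::uint16_t, char32_t>& table) {
    std::u32string out;
    for (const auto code : codes) {
        const auto it = table.find(code);
        out.push_back(it != table.end() ? it->second : U'\uFFFD');
    }
    return out;
}
```

For the TJ array above, codesFromTJ yields the six codes 0x0014, 0x0015, 0x0013, 0x000F, 0x0014 and 0x0019; which characters they stand for depends entirely on the font's ToUnicode data.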

Other fonts may or may not have a ToUnicode entry, and if they do not, there are a variety of other ways to extract Unicode text. Different methods can produce different results, so the PDF specification has an entire section on extracting text content, which describes in detail the order in which they should be attempted.

Hopefully your PoDoFo library has methods for this kind of conversion. If not, the task is a difficult one, and I think you should consider some other options. I wrote the text extraction code for our ABCpdf .NET library; it took several months to write and then several years of tuning.

+3

OnceUponATimeInTheWest May 13, '13 at 10:55


mkl · Accepted Answer · 2013-05-10T13:27:32+0000

First of all, when you try to extract text from a PDF based on its position when only tx and ty are supplied, it is not enough to consider only the text matrix (which you set using Tm, as you have already found). You also have to consider the current transformation matrix!

I assume that by position you mean a position given in default user space coordinates:

To avoid device-dependent effects of specifying objects in device space, PDF defines a device-independent coordinate system that always has the same relation to the current page, regardless of the output device on which printing or display occurs. This device-independent coordinate system is called user space.
The user space coordinate system must be initialized to the default state for each page of the document. The CropBox entry in the page dictionary indicates the user space rectangle corresponding to the visible area of the intended output medium (display window or printed page). The positive x axis runs horizontally to the right and the positive y axis vertically upward.
(section 8.3.2.3, ISO 32000-1:2008)

Since we only see the x and y coordinates, we think of a position as a vector (x, y) in R². Internally, however, PDF treats this plane as embedded in R³ with a constant z coordinate of 1, that is, as [x, y, 1]. The reason is that PDF wants to allow many kinds of transformations (translation, rotation, scaling, skewing, ...) while keeping the required mathematical operations to a minimum. With the plane embedded in R³ as [x, y, 1], all these transformations can be expressed as matrix multiplications:
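
Written out in the standard PDF matrix notation (ISO 32000-1, section 8.3), with the six numbers a to f filling the first two columns, such a transformation has the following form:

```latex
\[
\begin{bmatrix} x' & y' & 1 \end{bmatrix}
=
\begin{bmatrix} x & y & 1 \end{bmatrix}
\times
\begin{bmatrix} a & b & 0 \\ c & d & 0 \\ e & f & 1 \end{bmatrix},
\qquad
x' = a x + c y + e, \quad y' = b x + d y + f
\]
```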

Here you already see these numbers a, b, c, d, e and f that you asked about.

Now, before considering the text-specific transformations, you have to take into account manipulations of the current (text-independent) transformation matrix. This matrix is manipulated by the cm operator:

a b c d e f cm: Modify the current transformation matrix (CTM) by concatenating the specified matrix (see 8.3.2, "Coordinate Spaces"). Although the operands specify a matrix, they shall be written as six separate numbers, not as an array.
(clause 8.4.4, ISO 32000-1:2008)

This means that you have to take into account all cm operations currently in effect, that is, all those executed since the start of the page content, except those that have been undone by restoring a previous graphics state (see the q and Q operators, which save and restore the graphics state, section 8.4.2, ISO 32000-1:2008).
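
A minimal C++ sketch of this bookkeeping could look as follows. The type and function names are invented for this example and are not PoDoFo API. It assumes you already have a content stream parser that calls save(), restore() and concat() whenever it encounters q, Q and cm, and it starts from the identity matrix so that the result is expressed in default user space.

```cpp
#include <cassert>
#include <stack>

// Affine matrix [a b 0; c d 0; e f 1], stored as the six PDF operands.
struct Matrix {
    double a, b, c, d, e, f;
};

// Product m x n in the row-vector convention of PDF:
// a point is transformed by m first, then by n.
Matrix multiply(const Matrix& m, const Matrix& n) {
    return {m.a * n.a + m.b * n.c,
            m.a * n.b + m.b * n.d,
            m.c * n.a + m.d * n.c,
            m.c * n.b + m.d * n.d,
            m.e * n.a + m.f * n.c + n.e,
            m.e * n.b + m.f * n.d + n.f};
}

// Tracks the CTM relative to default user space while a content
// stream is processed (hypothetical helper, only q, Q and cm).
class CtmTracker {
public:
    CtmTracker() : ctm_{1, 0, 0, 1, 0, 0} {}

    void save() { saved_.push(ctm_); }  // q

    void restore() {                    // Q
        assert(!saved_.empty() && "unbalanced Q");
        ctm_ = saved_.top();
        saved_.pop();
    }

    // a b c d e f cm: the new matrix is premultiplied onto the CTM.
    void concat(const Matrix& m) { ctm_ = multiply(m, ctm_); }

    const Matrix& ctm() const { return ctm_; }

private:
    Matrix ctm_;
    std::stack<Matrix> saved_;
};
```

A real graphics state contains much more than the CTM, so in practice the stack would hold the whole state; only the part needed for positions is kept here.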

Only now can you consider the text transformation matrices:

At the beginning of a text object, Tm shall be the identity matrix; therefore, the origin of text space shall initially be the same as that of user space. The text-positioning operators, described in Table 108, alter Tm and thereby control the placement of glyphs that are subsequently painted. Also, the text-showing operators, described in Table 109, update Tm (by altering its e and f translation components) to take into account the horizontal or vertical displacement of each glyph painted as well as any character or word-spacing parameters in the text state.
Additionally, within a text object, a conforming reader shall keep track of a text line matrix, Tlm, which captures the value of Tm at the beginning of a line of text. The text-positioning and text-showing operators shall read and set Tlm on specific occasions mentioned in Tables 108 and 109.
(clause 9.4.2, ISO 32000-1:2008)

Thus, inside a text object you have to keep track of the text matrix, which is primarily set using the Tm operator you found (with its operands arranged in a matrix as shown above), but which is also changed as a side effect of the other text-positioning operators and of the text-showing operators.
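
As a rough C++ sketch of this tracking, restricted to BT, ET, Td and Tm: the class below is hypothetical (it is not part of PoDoFo) and deliberately ignores the other text-positioning operators as well as the updates caused by text-showing operators, which a real extractor also has to apply.

```cpp
#include <cassert>

// Affine matrix [a b 0; c d 0; e f 1], stored as the six PDF operands.
struct Matrix {
    double a, b, c, d, e, f;
};

// Product m x n in the row-vector convention of PDF.
Matrix multiply(const Matrix& m, const Matrix& n) {
    return {m.a * n.a + m.b * n.c,
            m.a * n.b + m.b * n.d,
            m.c * n.a + m.d * n.c,
            m.c * n.b + m.d * n.d,
            m.e * n.a + m.f * n.c + n.e,
            m.e * n.b + m.f * n.d + n.f};
}

// Keeps the text matrix Tm and the text line matrix Tlm up to date.
class TextMatrixTracker {
public:
    void beginText() {  // BT: both matrices start as identity
        tm_ = tlm_ = Matrix{1, 0, 0, 1, 0, 0};
        inText_ = true;
    }

    void endText() { inText_ = false; }  // ET

    // tx ty Td: move to the next line, offset from the start of the
    // current line, so the translation is applied to Tlm.
    void moveTd(double tx, double ty) {
        assert(inText_ && "Td outside BT ... ET");
        tlm_ = multiply(Matrix{1, 0, 0, 1, tx, ty}, tlm_);
        tm_ = tlm_;
    }

    // a b c d e f Tm: replace both matrices, no concatenation.
    void setTm(const Matrix& m) {
        assert(inText_ && "Tm outside BT ... ET");
        tm_ = tlm_ = m;
    }

    const Matrix& tm() const { return tm_; }

private:
    Matrix tm_{1, 0, 0, 1, 0, 0};
    Matrix tlm_{1, 0, 0, 1, 0, 0};
    bool inText_ = false;
};
```

Note the difference between the two operators: Td is relative to the start of the current line, while Tm replaces the matrix outright.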

And there are still more parameters that determine the final position of the text: the text state parameters Tfs (text font size), Th (horizontal scaling) and Trise (text rise), cf. clause 9.3.1, ISO 32000-1:2008.

Conceptually, the entire transformation from text space to device space [or in your case to default user space] can be represented by a text rendering matrix, Trm:
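
For reference, ISO 32000-1 (clause 9.4.2) gives this matrix as the product of a parameter matrix built from the text state, the text matrix and the CTM:

```latex
\[
T_{rm} =
\begin{bmatrix} T_{fs} \times T_h & 0 & 0 \\ 0 & T_{fs} & 0 \\ 0 & T_{rise} & 1 \end{bmatrix}
\times T_m \times CTM
\]
```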

Trm is a temporary matrix; conceptually, it is recomputed before each glyph is painted during a text-showing operation.
(clause 9.4.2, ISO 32000-1:2008)

So your coordinates (x, y) are conceptually the result of multiplying text space coordinates by Trm:

[x, y, 1] = [xts, yts, 1] x Trm

where (xts, yts) is (0, 0) at the origin of a glyph. For each glyph drawn, there is a glyph displacement that takes you to the point where the origin of the next glyph will be:
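
For reference, the displacement formulas from clause 9.4.4 of ISO 32000-1 read as follows, where w0 and w1 are the horizontal and vertical glyph displacements in text space units, Tj is the position adjustment given by a number in a TJ array (if any), Tc is the character spacing and Tw is the word spacing:

```latex
\[
t_x = \left( \left( w_0 - \frac{T_j}{1000} \right) \times T_{fs} + T_c + T_w \right) \times T_h,
\qquad
t_y = \left( w_1 - \frac{T_j}{1000} \right) \times T_{fs} + T_c + T_w
\]
```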

The text matrix should be updated with these glyph offset values as follows:
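
In the notation of the specification, the update is a premultiplication of the text matrix by a translation matrix:

```latex
\[
T_m = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ t_x & t_y & 1 \end{bmatrix} \times T_m
\]
```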

(clause 9.4.4, ISO 32000-1:2008)
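
Putting these pieces together, the following C++ sketch computes Trm, maps the glyph origin to user space and advances the text matrix by a given displacement. All names are invented for this illustration (nothing here is PoDoFo API), and it assumes that you already know the current text state, text matrix and CTM at the point where the glyph is shown.

```cpp
// Affine matrix [a b 0; c d 0; e f 1], stored as the six PDF operands.
struct Matrix {
    double a, b, c, d, e, f;
};

struct Point {
    double x, y;
};

// Product m x n in the row-vector convention of PDF.
Matrix multiply(const Matrix& m, const Matrix& n) {
    return {m.a * n.a + m.b * n.c,
            m.a * n.b + m.b * n.d,
            m.c * n.a + m.d * n.c,
            m.c * n.b + m.d * n.d,
            m.e * n.a + m.f * n.c + n.e,
            m.e * n.b + m.f * n.d + n.f};
}

// [x y 1] x m
Point apply(const Matrix& m, const Point& p) {
    return {p.x * m.a + p.y * m.c + m.e,
            p.x * m.b + p.y * m.d + m.f};
}

// Trm = [Tfs*Th 0 0; 0 Tfs 0; 0 Trise 1] x Tm x CTM
Matrix textRenderingMatrix(double tfs, double th, double trise,
                           const Matrix& tm, const Matrix& ctm) {
    const Matrix params{tfs * th, 0, 0, tfs, 0, trise};
    return multiply(multiply(params, tm), ctm);
}

// Tm = [1 0 0; 0 1 0; tx ty 1] x Tm, applied after a glyph is shown.
Matrix advanceTextMatrix(const Matrix& tm, double tx, double ty) {
    return multiply(Matrix{1, 0, 0, 1, tx, ty}, tm);
}
```

With the Tm from the question (0.9998 0 0 1 401.52 448.08), font size 8.88 and, as an assumption, an identity CTM, applying the resulting Trm to the text space point (0, 0) gives (401.52, 448.08) as the origin of the first glyph in default user space.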

I have cited several paragraphs from the current PDF specification, ISO 32000-1:2008. I find this preferable to using the PDF Reference 1.4, which is quite ancient; furthermore, Adobe staff have called it "not normative in nature".

EDIT Some clarification in response to comments

Device space and user space: what is the difference between the two? Does device space mean the space of the printer or video display, and is user space meant to abstract from the peculiarities of each device? Is user space the document page that I see?

Yes, device space is a fixed coordinate system, mainly determined by the properties of the device. And yes, user space is a coordinate system independent of the target device. But no, it is not the "document page that you see", because you always see the page on some device (or after processing by some device).

The user space coordinate system is an independent coordinate system whose point coordinates can be converted into device coordinates using matrix multiplication with the current transformation matrix (CTM).

UserCoords x CTM = DeviceCoords

The user space coordinate system is initialized to a state in which the CropBox entry in the page dictionary specifies the user space rectangle corresponding to the visible area (see above), by initializing the CTM accordingly.

But as the choice of words already indicates ("current transformation matrix", "coordinate system is initialized"), the user space coordinate system is a dynamic, constantly changing coordinate system.

The default user space provides a consistent and dependable starting place for PDF page descriptions regardless of the output device used. If necessary, a PDF content stream may modify user space to be more suitable to its needs by applying the coordinate transformation operator, cm (see 8.4.4, "Graphics State Operators"). Thus, what may appear to be absolute coordinates in a content stream are not absolute with respect to the current page, because they are expressed in a coordinate system that may slide around and shrink or expand. Coordinate system transformation not only enhances device independence but is also a useful tool in its own right.
(section 8.3.2.3, ISO 32000-1:2008)

Thus, when a PDF reader encounters a cm operator whose operands represent some matrix M, the CTM changes:

CTMnew = M x CTMold

and the coordinates present in the following operators are interpreted in accordance with this new CTMnew matrix:

UserCoords x CTMnew = DeviceCoords
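
The order of the factors matters here. The following C++ sketch (names invented for this example) applies two cm operations and maps a user space point through the resulting CTM, so you can compare the result with the product taken in the opposite order.

```cpp
// Affine matrix [a b 0; c d 0; e f 1], stored as the six PDF operands.
struct Matrix {
    double a, b, c, d, e, f;
};

struct Point {
    double x, y;
};

// Product m x n in the row-vector convention of PDF.
Matrix multiply(const Matrix& m, const Matrix& n) {
    return {m.a * n.a + m.b * n.c,
            m.a * n.b + m.b * n.d,
            m.c * n.a + m.d * n.c,
            m.c * n.b + m.d * n.d,
            m.e * n.a + m.f * n.c + n.e,
            m.e * n.b + m.f * n.d + n.f};
}

// UserCoords x CTM = DeviceCoords
Point apply(const Matrix& ctm, const Point& user) {
    return {user.x * ctm.a + user.y * ctm.c + ctm.e,
            user.x * ctm.b + user.y * ctm.d + ctm.f};
}

// CTMnew = M x CTMold: the operand matrix goes on the left.
Matrix concatCm(const Matrix& m, const Matrix& ctmOld) {
    return multiply(m, ctmOld);
}
```

For example, after 2 0 0 2 0 0 cm followed by 1 0 0 1 10 0 cm, the user space point (1, 1) maps to (22, 2): the shift by 10 takes place in the already scaled coordinate system. Multiplying in the opposite order would give (12, 2) instead.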

So the user space coordinate system can now be very different from its previous state: scaled, rotated, skewed, whatever.

The coordinates that matter most to you are the coordinates in the coordinate system to which user space is initialized, i.e. the device coordinate system of a virtual device for which the CTM is initialized as the identity matrix.

Where do text space and glyph space begin and end?

The coordinates of text shall be specified in text space. The transformation from text space to user space shall be defined by the text matrix in combination with several text-related parameters in the graphics state (see 9.4.2, "Text-Positioning Operators").

The text matrix TM is initialized as the identity matrix at the beginning of a text object, but it changes during text operations, most visibly when you use the Tm operator and implicitly when you use others. It is combined with a matrix TR, which contains the text-related parameters font size, horizontal scaling and text rise. See the discussion of Trm above for details. Thus,

DeviceCoords = UserCoords x CTM = TextCoords x TR x TM x CTM

The transformation from glyph space to text space shall be defined by the font matrix. For most types of fonts, this matrix shall be predefined to map 1000 units of glyph space to 1 unit of text space; for Type 3 fonts, the font matrix shall be given explicitly in the font dictionary (see 9.6.5, "Type 3 Fonts").

Thus, this transformation depends on the current font. With FM denoting the font matrix, it enters the chain as follows:

DeviceCoords = GlyphCoords x FM x TR x TM x CTM

You most likely do not want to find the device coordinates of individual glyph outline segments, so glyph space coordinates as such are not of interest to you. Glyph widths, however, have to be interpreted in glyph space. Unless you are dealing with Type 3 fonts, that simply means you have to divide them by 1000 ...

And how do the parameters w0 and w1 evolve while glyphs are drawn? Are they initially (0, 0)?

w0 and w1 denote the horizontal and vertical displacement of a glyph. In horizontal writing mode, w0 is the glyph width converted to text space (i.e., most often just divided by 1000), and w1 is 0. For vertical writing mode, check sections 9.2.4 and 9.7.4.3 of ISO 32000-1:2008.
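
For horizontal writing mode, this can be sketched in C++ as follows. The names are invented for this example, the glyph width has to come from the font's width information (reading it is not shown), and the displacement formula is the one from clause 9.4.4 of ISO 32000-1.

```cpp
#include <cassert>

// Font matrix [a b 0; c d 0; e f 1] mapping glyph space to text space.
struct FontMatrix {
    double a, b, c, d, e, f;
};

// Predefined for most font types: 1000 glyph units = 1 text unit.
// Type 3 fonts specify their own matrix in the font dictionary.
const FontMatrix kDefaultFontMatrix{0.001, 0, 0, 0.001, 0, 0};

// w0: glyph width converted from glyph space to text space.
double widthToTextSpace(double glyphWidth, const FontMatrix& fm) {
    assert(fm.b == 0.0 &&
           "sketch only handles font matrices without rotation or skew");
    return glyphWidth * fm.a;
}

// tx = ((w0 - Tj / 1000) * Tfs + Tc + Tw) * Th
// tj: number from a TJ array (0 if none), tc: character spacing,
// tw: word spacing (only for the single-byte space character),
// th: horizontal scaling as a factor (1.0 means 100%).
double horizontalDisplacement(double w0, double tj, double tfs,
                              double tc, double tw, double th) {
    return ((w0 - tj / 1000.0) * tfs + tc + tw) * th;
}
```

With a hypothetical glyph width of 500 units and the default font matrix, w0 is 0.5, which at font size 10 without extra spacing gives a displacement of 5 text space units. Note that a negative number in a TJ array, such as the -11 in the question, increases the displacement.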

Does text space have the same origin as the first glyph space? And are they updated with the calculated (tx, ty)?

Since glyph space coordinates are simply multiplied by the font matrix to give text space coordinates, and the font matrix in all cases but Type 3 fonts is merely a scaling by 1/1000 (see above), the glyph origin is mapped to the origin of text space.

But tx and ty are used to update the text matrix itself. Thus, the text space coordinate system moves with each glyph, and for each (non-Type 3) glyph, the glyph origin is mapped to the origin of ... a slightly shifted text space coordinate system.
