I am using Apache PDFBox 1.8.9. I have a one-page PDF that contains text, and I want to convert this page to an image. The PDF was created with LibreOffice. I am using the following code:
    PDDocument document = PDDocument.loadNonSeq(new File(filename), null);
    List<PDPage> pdPages = document.getDocumentCatalog().getAllPages();
    int page = 0;
    for (PDPage pdPage : pdPages) {
        ++page;
        // Render the page as an RGB image at 300 DPI
        BufferedImage bim = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
        // Write the rendered page to a PNG file, one file per page
        ImageIOUtil.writeImage(bim, "png", "/home/file" + "-" + page, 300);
    }
    document.close();
The code runs and I get a PNG image. The problem is that the rendered text contains many weird extra characters that make it difficult to read. How can I fix this?
This is the image I get (enlarged):

And this is the same area in a PDF viewer:

The full PDF file can be downloaded at https://yadi.sk/i/iX-KJwlhhXMY2