Apache POI HWPF - problem in converting doc file to pdf

I am currently working on a Java project that uses Apache POI. In this project I want to convert a .doc file to a PDF file. The conversion succeeds, but the text in the PDF comes out without any formatting or style. The PDF is plain black and white, whereas my .doc file is colored and uses different text styles.

This is my code:

    POIFSFileSystem fs = null;
    Document document = new Document();
    try {
        System.out.println("Starting the test");
        fs = new POIFSFileSystem(new FileInputStream("/document/test2.doc"));
        HWPFDocument doc = new HWPFDocument(fs);
        WordExtractor we = new WordExtractor(doc);
        OutputStream file = new FileOutputStream(new File("/document/test.pdf"));
        PdfWriter writer = PdfWriter.getInstance(document, file);
        Range range = doc.getRange();
        document.open();
        writer.setPageEmpty(true);
        document.newPage();
        writer.setPageEmpty(true);
        String[] paragraphs = we.getParagraphText();
        for (int i = 0; i < paragraphs.length; i++) {
            org.apache.poi.hwpf.usermodel.Paragraph pr = range.getParagraph(i);
            // CharacterRun run = pr.getCharacterRun(i);
            // run.setBold(true);
            // run.setCapitalized(true);
            // run.setItalic(true);
            paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
            System.out.println("Length:" + paragraphs[i].length());
            System.out.println("Paragraph" + i + ": " + paragraphs[i].toString());
            // add the paragraph to the document
            document.add(new Paragraph(paragraphs[i]));
        }
        System.out.println("Document testing completed");
    } catch (Exception e) {
        System.out.println("Exception during test");
        e.printStackTrace();
    } finally {
        // close the document
        document.close();
    }

Please help me.

Thanks in advance.

+6
java apache apache-poi hwpf
2 answers

If you look at Apache Tika, there is a good example of reading some style information from an HWPF document. The code in Tika generates HTML based on the content of the HWPF, but you should find that something very similar works for your case.

Tika Class https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java

One thing to note about Word documents is that everything in any one Character Run has the same formatting applied to it. A Paragraph is therefore made up of one or more Character Runs. Some styling is applied to the Paragraph, and other parts are applied to the runs. Depending on what formatting you are interested in, it may be on the paragraph or on the run.

+4

If you use WordExtractor, you will get only text. Try using the CharacterRun class. You will get the style along with the text. Please refer to the sample code.

    Range range = doc.getRange();
    for (int i = 0; i < range.numParagraphs(); i++) {
        org.apache.poi.hwpf.usermodel.Paragraph poiPara = range.getParagraph(i);
        int j = 0;
        while (true) {
            CharacterRun run = poiPara.getCharacterRun(j++);
            System.out.println("Color " + run.getColor());
            System.out.println("Font size " + run.getFontSize());
            System.out.println("Font Name " + run.getFontName());
            System.out.println(run.isBold() + " " + run.isItalic() + " " + run.getUnderlineCode());
            System.out.println("Text is " + run.text());
            if (run.getEndOffset() == poiPara.getEndOffset()) {
                break;
            }
        }
    }
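The values printed above are raw Word properties, so they probably need translating before a PDF library can use them. The following is a minimal sketch of that step in plain Java, with no POI or iText dependency. It assumes that getFontSize() reports half-points and that getColor() returns an index into the legacy 16-color Word palette; both are worth checking against the POI version in use. The class RunStyleSketch and its methods are hypothetical names invented for this illustration.

```java
import java.util.Locale;

/**
 * Hypothetical helper (not part of POI or iText) that translates the raw
 * values read from a CharacterRun into units a PDF library can work with.
 */
final class RunStyleSketch {

    // Assumed legacy Word palette, indexed by the value of getColor().
    // Index 0 means "automatic"; it is treated as black here (assumption).
    private static final int[] WORD_PALETTE = {
        0x000000, // 0: automatic
        0x000000, // 1: black
        0x0000FF, // 2: blue
        0x00FFFF, // 3: cyan
        0x00FF00, // 4: green
        0xFF00FF, // 5: magenta
        0xFF0000, // 6: red
        0xFFFF00, // 7: yellow
        0xFFFFFF, // 8: white
        0x000080, // 9: dark blue
        0x008080, // 10: dark cyan
        0x008000, // 11: dark green
        0x800080, // 12: dark magenta
        0x800000, // 13: dark red
        0x808000, // 14: dark yellow
        0x808080, // 15: dark gray
        0xC0C0C0  // 16: light gray
    };

    /** Converts a size in half-points (assumed unit) to points. */
    static float halfPointsToPoints(int halfPoints) {
        return halfPoints / 2.0f;
    }

    /** Maps a palette index to 0xRRGGBB, falling back to black if out of range. */
    static int colorIndexToRgb(int index) {
        if (index < 0 || index >= WORD_PALETTE.length) {
            return 0x000000;
        }
        return WORD_PALETTE[index];
    }

    /** Builds a short, human-readable summary of one run's style. */
    static String describe(int colorIndex, int sizeHalfPoints, boolean bold, boolean italic) {
        return String.format(Locale.ROOT, "#%06X %.1fpt%s%s",
                colorIndexToRgb(colorIndex),
                halfPointsToPoints(sizeHalfPoints),
                bold ? " bold" : "",
                italic ? " italic" : "");
    }

    public static void main(String[] args) {
        // Stand-in values for run.getColor(), run.getFontSize(), run.isBold(), run.isItalic().
        System.out.println(describe(6, 24, true, false));
    }
}
```

Inside the loop, the converted size, color and bold/italic flags could then be used to build the font for each piece of text added to the PDF, one chunk per run rather than one plain paragraph per Word paragraph.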
+3