Reading Microsoft Word documents in plain text (DOC, DOCX) in Java

I am looking for something in Java that can read Word documents so I can process their text. All I need is the text, nothing out of the ordinary. I know about Apache POI, but it does not support DOCX right now. Is there anything else out there?

+5
4 answers

If you do not need formatting information, images, and all the other fancy stuff, then the task is much simpler: about 5 to 10 lines of code in total.

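  • A DOCX file is just a zip archive. Open it and find the entry named "document.xml"; ZipInputStream is enough for that. (Try renaming a docx to zip and opening it!)
  • Run a SAX parser over that entry and collect the text of every body/p/r/t node. Voila, you have the text!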
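
The two steps above could be sketched roughly as follows, using only the JDK. This is an illustration under a few assumptions, not a complete converter: the class name `DocxText` and the method `extract` are made up for this example, the main part is assumed to be stored at `word/document.xml` (where a standard DOCX package keeps it), and only `w:t` text plus paragraph breaks are handled. Tabs, line breaks inside a run, headers, footnotes and tables get no special treatment.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named only for this sketch.
final class DocxText {

    // Namespace of the WordprocessingML elements (w:body, w:p, w:r, w:t).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Finds the main document part inside the zip and returns its plain text. */
    public static String extract(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    // Copy the entry into memory first, so the XML parser
                    // never gets a chance to close the zip stream itself.
                    ByteArrayOutputStream xml = new ByteArrayOutputStream();
                    byte[] chunk = new byte[8192];
                    int n;
                    while ((n = zip.read(chunk)) != -1) {
                        xml.write(chunk, 0, n);
                    }
                    return parse(xml.toByteArray());
                }
            }
        }
        throw new IOException("word/document.xml not found in archive");
    }

    /** Streams the XML through SAX, keeping only the text inside w:t elements. */
    private static String parse(byte[] xml) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations; a DOCX part should not contain one.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.newSAXParser().parse(new ByteArrayInputStream(xml), new DefaultHandler() {
            private boolean inText;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (W_NS.equals(uri) && local.equals("t")) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if (local.equals("t")) {
                    inText = false;
                } else if (local.equals("p")) {
                    text.append('\n'); // one output line per paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    text.append(ch, start, length);
                }
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DocxText path/to/file.docx
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(extract(in));
        }
    }
}
```

Buffering the entry keeps the zip handling and the XML handling separate, at the cost of holding the whole document part in memory. This approach does nothing for the older binary DOC format, which is not a zip archive.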


+5

Try googling OpenXML4J.

+3

Try Apache POI; it can handle doc, docx, xls, xlsx, ppt, and pptx.

Another production-level solution is OpenOffice in headless mode, which can even be used in a server-side script.

+2