Obviously, this is not an easy task: PDF formatting is much richer than HTML (and you also have to extract the images and link them, etc.).
Plain text extraction is much simpler (although still not trivial).
I see a similar question in the sidebar of your question: Converting PDF to HTML with Python, which points to a library (poppler, which is apparently written in C++, so maybe it can be accessed through JNI / JNA) and to a related question that offers even more answers.
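If binding to poppler through JNI / JNA turns out to be too much work, one alternative (my assumption, not something the linked question covers) is to run poppler's `pdftohtml` command-line tool as a child process. The rough sketch below assumes the poppler utilities are installed and that `pdftohtml` is on the PATH; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: shells out to poppler's pdftohtml instead of binding via JNI/JNA.
final class PdfToHtml {

    // Builds the command line: pdftohtml [options] <PDF-file> <html-file>.
    static List<String> buildCommand(Path pdf, Path htmlOut) {
        return List.of(
                "pdftohtml",       // assumed to be installed and on the PATH
                "-noframes",       // write a single HTML file instead of a frameset
                pdf.toString(),
                htmlOut.toString());
    }

    // Runs the converter and returns its exit code (0 means success).
    static int convert(Path pdf, Path htmlOut) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, htmlOut))
                .inheritIO()       // forward the tool's messages to our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfToHtml <input.pdf> <output.html>");
            System.exit(2);
        }
        System.exit(convert(Path.of(args[0]), Path.of(args[1])));
    }
}
```

How good the resulting HTML is depends entirely on the tool and on the PDF, so check the output on a few of your own documents before relying on it.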
Philho