I have used Tika to extract text from PDFs (it is built on PDFBox), but I think Tika is only really useful when you need to extract text from several different file formats, where its automatic format detection helps a lot.
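To give a rough idea of what "automatic detection" means here, below is a minimal sketch in plain Java that guesses a file type from its leading bytes. It is not Tika's implementation (Tika's own detectors are far more thorough), and the class name `MagicByteSniffer` and its methods are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Tika or any other library.
public class MagicByteSniffer {
    // PDF files begin with the ASCII marker "%PDF-".
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // ZIP containers (which include .docx and .odt) begin with "PK" 0x03 0x04.
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Returns a best-guess media type for the given leading bytes.
    public static String sniff(byte[] head) {
        if (startsWith(head, PDF)) {
            return "application/pdf";
        }
        if (startsWith(head, ZIP)) {
            return "application/zip";
        }
        return "application/octet-stream"; // unknown format
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java MagicByteSniffer <file>");
            return;
        }
        byte[] head = new byte[8];
        int read;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            read = in.read(head); // a single read is enough for a short header
        }
        System.out.println(sniff(Arrays.copyOf(head, Math.max(read, 0))));
    }
}
```

A real detector also has to look inside containers (a .docx is a ZIP file, for example), which is the kind of work a library like Tika does for you.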
If you only need to parse PDFs, I would suggest PDFTextStream, because it is a much better parser than the other APIs (e.g. iText and PDFBox).
With PDFTextStream you can easily get structured text (pages -> blocks -> lines -> text units), and it lets you retrieve related information such as the character encoding, height, and location of each character on the page.
Example:
    import java.io.IOException;

    import com.snowtide.pdf.OutputTarget;
    import com.snowtide.pdf.PDFTextStream;

    public class ExtractTextAllPages {
        public static void main(String[] args) throws IOException {
            String pdfFilePath = args[0];

            // Open the PDF and pipe the text of every page into a buffer.
            PDFTextStream pdfts = new PDFTextStream(pdfFilePath);
            StringBuilder text = new StringBuilder(1024);
            pdfts.pipe(new OutputTarget(text));
            pdfts.close();

            System.out.printf("The text extracted from %s is:%n", pdfFilePath);
            System.out.println(text);
        }
    }