How to build PDFBox for .NET

I saw examples for extracting text from PDF files that use either iTextSharp or PDFBox. PDFBox is apparently the most "reliable" method for extracting text, but it requires many additional steps.

I tried to build the DLL using the instructions found here, but I have no idea how to properly build the necessary files for .NET.

I'm pretty lost. Can someone provide a step-by-step "Including PDFBox in your .NET application for dummies"?

1 answer

I finally got it to work. I have outlined the steps I took to get a working example. I hope someone finds this helpful.

Download Java JDK
Download IKVM 0.42.0.6
Download PDFBox 1.6.0-src.zip

Useful Ant guide.

I renamed the Ant and PDFBox folders to shorten their names and moved them to my C drive.

You need to set up environment variables. (Windows 7) Right-click My Computer -> Properties -> Advanced System Settings -> Environment Variables.

I used the settings below, but yours will differ depending on where you installed Java and where you placed the Ant and PDFBox folders.

 Variable    Value
 ANT_HOME    C:\apache-ant\
 JAVA_HOME   C:\Program Files (x86)\Java\jdk1.7.0_01
 Path        ;C:\apache-ant\bin\ (append a semicolon and the path to the existing value)

Once this is done, enter "ant" in a command window. If everything is configured correctly, you should get the message "build.xml does not exist!".

Edit the build.xml file inside the folder "pdfbox-1.6.0\pdfbox". Find the line that sets the IKVM folder path and replace "." with the path to your IKVM folder.

I moved IKVM to "C:\IKVM", so that is the value I used in place of ".".

Open a command window, cd to "C:\pdfbox-1.6.0\pdfbox", and enter "ant".

... and then a miracle happens.

The pdfbox folder should now contain many new folders. The necessary DLLs are located in the bin folder. I don't know why, but I have "-SNAPSHOT" at the end of all my file names (pdfbox-1.6.0-SNAPSHOT.dll).

IKVM.GNU.Classpath (also referred to as IKVM.OpenJDK.Classpath) no longer exists; it has been modularized since the 0.40 release. It is now available as several IKVM.OpenJDK DLLs, and you only need a few of them.

Create a new C# project in Visual Studio.

Copy these files from the PDFBox bin folder to the bin folder of the Visual C# project:

  pdfbox-1.6.0-SNAPSHOT.dll
 fontbox-1.6.0-SNAPSHOT.dll
 commons-logging.dll

Copy these files from the IKVM bin folder to the bin folder of the Visual C# project:

  IKVM.OpenJDK.Core.dll
 IKVM.OpenJDK.SwingAWT.dll
 IKVM.OpenJDK.Text.dll
 IKVM.OpenJDK.Util.dll
 IKVM.Runtime.dll

Add references to the IKVM DLLs above and build the project.

Add references to the PDFBox DLLs and build the project again.

Now you are ready to write code. The simple example below produced a clean text file from a PDF input.

    using System;
    using System.IO;
    using org.apache.pdfbox.pdmodel;
    using org.apache.pdfbox.util;

    namespace testPDF
    {
        class Program
        {
            static void Main()
            {
                PDFtoText pdf = new PDFtoText();
                string pdfText = pdf.parsePDF(@"C:\Sample.pdf");
                using (StreamWriter writer = new StreamWriter(@"C:\Sample.txt"))
                {
                    writer.Write(pdfText);
                }
            }

            class PDFtoText
            {
                public string parsePDF(string filepath)
                {
                    PDDocument document = PDDocument.load(filepath);
                    PDFTextStripper stripper = new PDFTextStripper();
                    return stripper.getText(document);
                }
            }
        }
    }
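
The example above never closes the PDDocument it loads. Here is a minimal sketch of a variant that releases the document in a finally block and optionally limits extraction to a page range. It assumes the same PDFBox 1.6.0 and IKVM references as above. The names PdfTextExtractor and ExtractText are hypothetical helpers introduced for illustration; they are not part of PDFBox or of the original example. It is meant as a standalone replacement for the program above, not an addition to the same file.

```csharp
using System;
using System.IO;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

namespace testPDF
{
    // Hypothetical helper class, introduced only for this sketch.
    static class PdfTextExtractor
    {
        // Extracts text from firstPage through lastPage (1-based, inclusive).
        public static string ExtractText(string filepath, int firstPage, int lastPage)
        {
            PDDocument document = PDDocument.load(filepath);
            try
            {
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.setStartPage(firstPage);
                stripper.setEndPage(lastPage);
                return stripper.getText(document);
            }
            finally
            {
                // Release the underlying file even if extraction throws.
                document.close();
            }
        }
    }

    class Program
    {
        static void Main()
        {
            // Same sample paths as the original example; pages 1-2 are an arbitrary choice.
            string text = PdfTextExtractor.ExtractText(@"C:\Sample.pdf", 1, 2);
            File.WriteAllText(@"C:\Sample.txt", text);
        }
    }
}
```

Closing the document in finally matters mostly when processing many files in one run, where unclosed documents can keep file handles and memory tied up.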