How to extract data from PDF?

My company receives data from an external company as Excel files, which we import into SQL Server to run reports. They are now switching to PDF. Is there a way to reliably extract the data from the PDFs and insert it into our SQL Server 2008 database?

Does this require writing an application, or is there an automated way to do it?

+6
sql-server-2008 pdf extraction
6 answers

It all depends on how they included the data in the PDF. Generally speaking, there are two possible scenarios here:

  • The data is just text objects in the PDF. You will need a tool to extract the text from the PDF, and then insert it into your database.

  • The data is contained in form fields in the PDF. You will need a tool to extract the data from the form fields and insert it into your database.

Hopefully scenario #2 applies to you, because that is exactly what PDF forms are for. Scenario #1 is a hack that you would only use if you had no other option. Extracting plain text from a PDF file is not as simple or accurate as you might expect.
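To illustrate why scenario #1 is fragile, here is a minimal Python sketch. It assumes some extraction tool has already produced the text below; the sample text, the column layout, and the `parse_rows` helper are all hypothetical and would have to be adapted to whatever your supplier's PDFs actually yield.

```python
import re

# Hypothetical output of a PDF text-extraction tool: columns arrive as
# runs of spaces, and page headers/footers are mixed in with the data.
EXTRACTED_TEXT = """\
Monthly Report            Page 1
ID    Customer        Amount
1001  Acme Corp       250.00
1002  Globex  Inc     1,075.50
Page total            1,325.50
"""

# A data row is: integer id, free-text name, amount with optional thousands commas.
ROW_PATTERN = re.compile(r"^(\d+)\s+(.+?)\s+([\d,]+\.\d{2})$")


def parse_rows(text):
    """Return (id, customer, amount) tuples for lines that look like data rows."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match is None:
            continue  # header, footer, or anything else that does not fit
        row_id, customer, amount = match.groups()
        # Collapse stray internal whitespace left over from the PDF layout.
        customer = " ".join(customer.split())
        rows.append((int(row_id), customer, float(amount.replace(",", ""))))
    return rows


rows = parse_rows(EXTRACTED_TEXT)
```

Everything here depends on the layout staying stable: if the supplier changes a column, wraps a long name onto two lines, or changes the number format, the pattern silently drops or misreads rows.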

If you get a PDF form, all you have to do is map the fields in the PDF form to the corresponding columns in your database and then pull in the data. This process can be fully automated if you write your own application.
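The mapping step could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the field names, column names, table name, and the `build_insert` helper are hypothetical, the form values are hard-coded instead of being read by a PDF library, and an in-memory SQLite database stands in for SQL Server so the sketch runs on its own.

```python
import sqlite3

# Hypothetical mapping from PDF form field names to database column names.
FIELD_TO_COLUMN = {
    "CustomerName": "customer_name",
    "InvoiceNo": "invoice_no",
    "Total": "total",
}


def build_insert(table, form_values):
    """Build a parameterized INSERT from form values, skipping unmapped fields.

    The table and column names come from trusted constants; only the values
    are passed as parameters.
    """
    columns, params = [], []
    for field, column in FIELD_TO_COLUMN.items():
        if field in form_values:
            columns.append(column)
            params.append(form_values[field])
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    return sql, params


# Hypothetical values, as a PDF library might return them from the form fields.
form_values = {
    "CustomerName": "Acme Corp",
    "InvoiceNo": "INV-17",
    "Total": "250.00",
    "Signature": "",  # not in the mapping, so it is ignored
}

# In-memory SQLite stands in for SQL Server here (an assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (customer_name TEXT, invoice_no TEXT, total TEXT)")
sql, params = build_insert("invoices", form_values)
conn.execute(sql, params)
stored = conn.execute("SELECT customer_name, invoice_no, total FROM invoices").fetchall()
```

Against a real SQL Server you would swap in your own driver and connection, and probably convert values such as the total to proper numeric types before inserting.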

Does it require writing an application or is there an automated way to do this?

Yes, both scenarios require either writing an application or purchasing one. If you write your own application, you need to find a third-party PDF library that supports extracting data from form fields or extracting text from PDFs.

+4

As already mentioned, you need to write an application to do this, but ideally you would get the raw data from the external company and avoid processing the PDF at all.

However, if you do need to extract data from PDFs, I have used iText and found it to be very powerful, reliable and, most importantly, free. It is available for Java and .NET (iTextSharp is the .NET version). It lets you manipulate PDF documents programmatically and makes the contents of the PDF available to the application you are writing.

+5

Disclaimer: I am affiliated with the creators of the ByteScout PDF Extractor SDK.

Just wanted to share some additional real-life scenarios for extracting text data from a PDF:

  1. Scanned images with no searchable text: these must be processed by an OCR engine (for example, the free Tesseract engine from Google).
  2. XFA forms: this is a PDF forms technology that is mainly supported by Adobe tools. The data can still be extracted as XML with low-level PDF processing tools such as iTextSharp.
  3. ZUGFeRD PDF files, which are simply PDF documents with a copy of the data attached as an XML file (the attachment can be extracted with tools that read PDF attachments).
  4. Text incorrectly encoded by some PDF generators (it can be recovered using OCR, with some acceptable error rate).
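
For cases 2 and 3, once a PDF tool has pulled the XML out of the file, reading the values is ordinary XML parsing. The Python sketch below assumes that extraction has already happened; the XML payload, its element names, and the `read_invoice` helper are hypothetical. Real XFA and ZUGFeRD data follow their own namespaced schemas, which are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML payload, as it might look after a PDF tool has extracted it.
XML_PAYLOAD = """\
<invoice>
  <number>INV-17</number>
  <customer>Acme Corp</customer>
  <total currency="EUR">250.00</total>
</invoice>
"""


def read_invoice(xml_text):
    """Pull the values of interest out of the (hypothetical) XML payload."""
    root = ET.fromstring(xml_text)
    total = root.find("total")
    return {
        "number": root.findtext("number"),
        "customer": root.findtext("customer"),
        "total": float(total.text),
        "currency": total.get("currency"),
    }


invoice = read_invoice(XML_PAYLOAD)
```

Because the structure is explicit in the XML, this route avoids the guesswork that plain-text extraction involves.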
+3

Using iTextSharp, you can do the following:

    using System;
    using System.Configuration;
    using System.Data.SqlClient;
    using System.IO;
    using System.Text;
    using iTextSharp.text.pdf;

    // Event handler; place it inside your page or form class.
    protected void BtnSubmit_Click(object sender, EventArgs e)
    {
        // Placeholder: replace with the path to the PDF form.
        string filePath = @"GetFilePath";
        StringBuilder sb = new StringBuilder();

        PdfReader reader = new PdfReader(filePath);
        try
        {
            // Reading form fields needs only the reader, so no PdfStamper
            // (and no temporary output file) is created.
            AcroFields form = reader.AcroFields;

            // Placeholder: replace with the name of a field in the PDF form.
            string value = form.GetField("GetFieldIdFromPDF");
            if (value != null)
                sb.Append(value);
        }
        finally
        {
            // Release the file handle held by the reader.
            reader.Close();
        }
    }
+1

I think you will have to write an application for this. This question covers extracting data from a PDF. After that, you can export the data to Excel format to keep your existing import process.
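As a rough illustration of that last step, the Python sketch below writes already-extracted rows as CSV, which Excel can open. Note that CSV is swapped in here in place of a native Excel workbook to keep the example dependency-free; the `rows_to_csv` helper and the sample rows are hypothetical.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialize a header and data rows to CSV text that Excel can open."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows, standing in for data already extracted from the PDF.
csv_text = rows_to_csv(
    ["ID", "Customer", "Amount"],
    [(1001, "Acme Corp", 250.0), (1002, "Globex, Inc", 1075.5)],
)
```

The `csv` module quotes values that contain commas, so a name like "Globex, Inc" stays in one column. If the existing import expects a real .xls or .xlsx file, a spreadsheet library would be needed instead.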

0

Look into "scraping" data from the PDF. I believe Adobe has some tools that let you do this for plain text, but I have not used them.

Honestly, I would try my best to get this data in a raw format from your provider.

0
