Extract text from PDF and Word files

How can I extract plain text from PDF or Word files (stripping bold, images, and other rich formatting) in C#?

+7
c# ms-word pdf
6 answers

You can use the filters (IFilters) designed for and used by the Windows Indexing Service. They extract plain text from different document types, which is what makes searching inside documents possible. They work for Office files, PDF, HTML, and so on: basically any file type that has a filter. The only drawback is that the filters must be installed on the server, so if you do not have direct access to the server, this may not be an option. Some filters come pre-installed with Windows, but others, such as the one for PDF, you need to install yourself. For a C# implementation, see the article "Using IFilter in C#".

+6

PDF

You have various options.

pdftotext:
Download the XPDF utilities. The .zip file contains several command-line tools, one of which is pdftotext(.exe). It can extract all text content from a well-formed PDF file. Run pdftotext -help to see its command-line options.
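
Since the question asks about C#, one way to use pdftotext from code is to launch it as a child process and read its standard output. The following is a minimal sketch rather than a definitive implementation: it assumes pdftotext is on the PATH, and the names PdfToTextRunner, BuildArguments, and ExtractText are made up for illustration. Passing - as the output file asks pdftotext to write the text to standard output, and -layout asks it to keep the physical page layout.

```csharp
using System;
using System.Diagnostics;

public static class PdfToTextRunner
{
    // Builds the command line: optional -layout, the quoted input path,
    // and "-" so that pdftotext writes the text to standard output.
    public static string BuildArguments(string pdfPath, bool keepLayout = false)
    {
        string layout = keepLayout ? "-layout " : "";
        return layout + "\"" + pdfPath + "\" -";
    }

    // Runs pdftotext (assumed to be on the PATH) and returns what it printed.
    public static string ExtractText(string pdfPath, string exePath = "pdftotext")
    {
        var startInfo = new ProcessStartInfo(exePath, BuildArguments(pdfPath))
        {
            RedirectStandardOutput = true,
            UseShellExecute = false, // required for output redirection
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Read before waiting so a full output pipe cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    "pdftotext exited with code " + process.ExitCode);
            return text;
        }
    }
}
```

A call such as PdfToTextRunner.ExtractText(@"C:\docs\input.pdf") would then return the extracted text as a string, provided the tool is installed on the machine.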

Ghostscript:
Install the latest version of Ghostscript (v8.71 at the time of writing). Ghostscript is a PostScript and PDF interpreter. You can also use it to extract text from a PDF:

    gswin32c.exe ^
      -q ^
      -sFONTPATH=c:/windows/fonts ^
      -dNODISPLAY ^
      -dSAFER ^
      -dDELAYBIND ^
      -dWRITESYSTEMDICT ^
      -dSIMPLE ^
      -f ps2ascii.ps ^
      -dFirstPage=3 ^
      -dLastPage=7 ^
      input.pdf ^
      -dQUIET

This will output the text contained in pages 3-7 of input.pdf to standard output. You can redirect this to a file by adding > /path/to/output.txt to the command. (Make sure the PostScript ps2ascii.ps utility is present in your Ghostscript lib subdirectory.)

If you omit -dSIMPLE, the output will try to guess line breaks and word spacing. See the comments inside the ps2ascii.ps file for more details. You can even replace this option with -dCOMPLEX to get additional information about the text formatting.
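
The same Ghostscript command can be driven from C#. The sketch below only mirrors the switches shown in the command above; the names GhostscriptText, BuildArguments, and ExtractText are hypothetical, and it assumes gswin32c.exe is on the PATH and that ps2ascii.ps can be found by Ghostscript.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class GhostscriptText
{
    // Assembles the switches from the command line shown above.
    // Setting simple to false drops -dSIMPLE, as the answer describes.
    public static string BuildArguments(string pdfPath, int firstPage, int lastPage,
                                        bool simple = true)
    {
        var args = new List<string>
        {
            "-q", "-sFONTPATH=c:/windows/fonts", "-dNODISPLAY",
            "-dSAFER", "-dDELAYBIND", "-dWRITESYSTEMDICT"
        };
        if (simple) args.Add("-dSIMPLE");
        args.Add("-f");
        args.Add("ps2ascii.ps");
        args.Add("-dFirstPage=" + firstPage);
        args.Add("-dLastPage=" + lastPage);
        args.Add("\"" + pdfPath + "\"");
        args.Add("-dQUIET");
        return string.Join(" ", args);
    }

    // Runs Ghostscript and returns the text it writes to standard output.
    public static string ExtractText(string pdfPath, int firstPage, int lastPage,
                                     string exePath = "gswin32c.exe")
    {
        var startInfo = new ProcessStartInfo(
            exePath, BuildArguments(pdfPath, firstPage, lastPage))
        {
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (var process = Process.Start(startInfo))
        {
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

Reading standard output in code replaces the > /path/to/output.txt redirection, so the text arrives as a string that can be processed further.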

+4

For Word documents, using the Word object model is the only reliable way, since the Word format is not open and differs from version to version.
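
As a rough illustration of the object-model approach, the sketch below automates Word through late-bound COM calls, so it compiles without a reference to the interop assemblies. It assumes Microsoft Word is installed on the machine running the code; the names WordTextExtractor and ExtractText are invented for this example, and real code would likely need more careful error handling.

```csharp
using System;

public static class WordTextExtractor
{
    // Opens the document read-only in a hidden Word instance and
    // returns the plain text of its main body.
    public static string ExtractText(string docPath)
    {
        Type wordType = Type.GetTypeFromProgID("Word.Application");
        if (wordType == null)
            throw new InvalidOperationException(
                "Microsoft Word does not appear to be installed.");

        dynamic app = Activator.CreateInstance(wordType);
        dynamic doc = null;
        try
        {
            app.Visible = false;
            doc = app.Documents.Open(docPath, ReadOnly: true);
            // Content.Text holds the text with formatting stripped.
            return (string)doc.Content.Text;
        }
        finally
        {
            if (doc != null) doc.Close(false); // close without saving
            app.Quit();
        }
    }
}
```

Because this starts a full Word process for every call, it tends to suit desktop tools or batch jobs better than a busy server.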

0

You might want to look at PDFBox. Here is a link to a CodeProject page that shows how to use it from C#, along with other useful comments.

http://www.codeproject.com/KB/string/pdf2text.aspx

As for Word, the suggestion to use the Word object model is probably the most accurate.

0

The Docotic.Pdf library can be used to extract text from PDF files.

The library can extract both plain text and formatted text. Its API can also return collections of words or characters together with their bounding boxes.

Disclaimer: I work for the vendor of the library.

0