Library for extracting text from different file types (PDF, DOC, DOCX, TXT) in C#

Question

I am building an information retrieval system that searches for text in multiple file formats. I tried the EPocalipse IFilter library, but it throws an exception when reading DOCX files. I then tried the Toxy library, but it throws an exception on DOC files that contain Arabic text. Finally, I tried the TikaOnDotNet library, but it needs Java to run, and I need to deploy the system on a hosting provider whose servers do not have Java installed.

+4

c# text information-retrieval

Alaa M. Tekleh Jul 03 '16 at 0:48

2 answers

Apache Tika is a library that can extract text from a very wide range of file types. It can also extract metadata (if any) from non-text files such as images and videos. Examples of use are shown here.
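For illustration, here is a minimal sketch of calling Tika from C#. It assumes the TikaOnDotNet.TextExtraction NuGet package (the .NET wrapper mentioned in the question) and its `TextExtractor` class; the class name `TextExtractionSketch` and the helper `ExtractText` are made up for this example. Since this is the same package the question refers to, it does not by itself address the hosting concern raised there.

```csharp
using System;
using TikaOnDotNet.TextExtraction; // assumption: NuGet package TikaOnDotNet.TextExtraction

public static class TextExtractionSketch
{
    // Hypothetical helper: returns the plain text Tika extracts from one file.
    public static string ExtractText(string path)
    {
        var extractor = new TextExtractor();
        TextExtractionResult result = extractor.Extract(path);
        return result.Text;
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.Error.WriteLine("usage: TextExtractionSketch <file> [<file> ...]");
            return;
        }

        var extractor = new TextExtractor();
        foreach (string path in args)
        {
            // Tika detects the format (PDF, DOC, DOCX, TXT, ...) from the file itself.
            TextExtractionResult result = extractor.Extract(path);

            Console.WriteLine("File: " + path);
            Console.WriteLine("Content type: " + result.ContentType);

            // Metadata is exposed as key/value pairs, when the file has any.
            foreach (var entry in result.Metadata)
            {
                Console.WriteLine("  " + entry.Key + ": " + entry.Value);
            }

            Console.WriteLine(result.Text);
        }
    }
}
```

The same `Extract` call is used for every format, so an indexer can pass it each file path without branching on the extension.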

+2

Debasis Jul 03 '16 at 10:48
