Library for extracting text from different file types (PDF, DOC, DOCX, TXT) in C#

I am building an information retrieval system that searches for text in multiple file formats. I tried the EPocalipse IFilter library, but it throws an exception when reading DOCX files. I then tried the Toxy library, but it throws an exception on DOC files that contain Arabic text. Finally, I tried the TikaOnDotNet library, but it needs Java to run, and I need to deploy the system on a hosting provider that does not have Java installed on the server.

2 answers

Apache Tika is a library that can extract text from a very wide range of file types. It can even extract metadata (if any) from non-text files such as images and videos. Examples of use are shown here.
