How to index Word 2003, 2007, and 2010 documents using Lucene.NET

I am writing a custom Lucene.NET indexer for MS Word documents. It should handle the three most recent versions of Word: 2010, 2007, and 2003.

The plan is to use the VSTO interop assemblies installed with VS2010 to extract the text content from the documents.

Is there a better way to index Word documents? Does this mean that I will have to install all three versions of Word on the server? Or just Word 2010?

Tools / Environment:

  • Lucene.NET 2.3.1.3
  • VS2010 / .NET 3.5
  • Windows 2008 / IIS 7

Note: for more information on how to implement this, see "Sitecore Text Search in PDF or Word documents".

1 answer

You can use IFilter plugins to extract the text content of the documents and then index it. The interface originated as part of Microsoft Indexing Service, but it is generally available to any application that needs to index documents.

I looked into this technology a couple of years ago, and as far as I remember the filters for Office documents were either built into Windows or could be installed separately from the full Office suite, but I could be wrong here.

See the IFilter documentation on MSDN. P/Invoke declarations for IFilter are available on pinvoke.net.
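To show where the extracted text would go, here is a minimal sketch of the indexing side only. It assumes the Lucene.NET 2.3.x API (later releases rename some of these members, so check your version). The `TextExtractor` delegate, the `WordIndexer` class, and the field names `path` and `content` are hypothetical and not part of Lucene.NET or IFilter; you would plug your own IFilter-based extraction in behind the delegate.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class WordIndexer
{
    // Hypothetical hook: returns the plain text of a document.
    // An IFilter-based implementation would go behind this delegate.
    public delegate string TextExtractor(string path);

    public static void IndexFolder(string docFolder, string indexFolder, TextExtractor extractText)
    {
        // 'true' creates a new index, overwriting any existing one in indexFolder.
        IndexWriter writer = new IndexWriter(indexFolder, new StandardAnalyzer(), true);
        try
        {
            foreach (string path in Directory.GetFiles(docFolder))
            {
                // Only pick up Word files; the installed filter decides what it can read.
                string ext = Path.GetExtension(path).ToLowerInvariant();
                if (ext != ".doc" && ext != ".docx")
                    continue;

                Document doc = new Document();
                // Store the path untokenized so a hit can be mapped back to its file.
                doc.Add(new Field("path", path, Field.Store.YES, Field.Index.UN_TOKENIZED));
                // Tokenize the body for full-text search without storing it in the index.
                doc.Add(new Field("content", extractText(path), Field.Store.NO, Field.Index.TOKENIZED));
                writer.AddDocument(doc);
            }
            writer.Optimize();
        }
        finally
        {
            // Always release the index lock, even if extraction fails.
            writer.Close();
        }
    }
}
```

Keeping extraction behind a delegate means the indexing code does not care whether the text comes from IFilter, Office interop, or something else, so you can swap the extractor without touching the index logic.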

