Working with a very large number of files

I am currently working on a research project that involves indexing a large number of files (240,000); they are mainly HTML, XML, DOC, XLS, ZIP, RAR, PDF, and plain text, with file sizes ranging from a few kilobytes to over 100 MB.

After extracting all the ZIP and RAR archives, I end up with about a million files in total.

I am using Visual Studio 2010, C#, and .NET 4.0 with the TPL Dataflow library and the Async CTP V3. To extract text from these files I use Apache Tika (converted with IKVM), and I use Lucene.Net 2.9.4 as the indexer. I would like to take advantage of the new TPL Dataflow library and asynchronous programming.

I have a few questions:

  • Do I get performance benefits if I use the TPL? This is basically an I/O-bound process, and as I understand it, the TPL offers little benefit for heavily I/O-bound workloads.

  • Would a producer/consumer approach be the best way to handle this type of file processing, or are there other patterns that would be better? I was thinking of one producer with multiple consumers using blocking collections.

  • Is the TPL Dataflow library suited to this type of process? It seems that TPL Dataflow is best used in some kind of messaging system...

  • Should I use asynchronous programming or stick with synchronous code in this case?
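For concreteness, the producer/consumer approach mentioned in the second question could be sketched with `BlockingCollection<T>`, which is what "blocking collections" refers to in .NET 4.0. This is only an illustrative sketch: the `C:\data` path is a placeholder, and `IndexFile` is a hypothetical stand-in for the real Tika extraction and Lucene indexing work.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class ProducerConsumerSketch
{
    static void Main()
    {
        // A bounded queue keeps the producer from getting arbitrarily far
        // ahead of the consumers, which keeps memory usage under control.
        using (var queue = new BlockingCollection<string>(boundedCapacity: 1000))
        {
            // One producer: enumerate files lazily and feed the queue.
            var producer = Task.Factory.StartNew(() =>
            {
                foreach (var path in Directory.EnumerateFiles(
                    @"C:\data", "*", SearchOption.AllDirectories))
                {
                    queue.Add(path); // blocks when the queue is full
                }
                queue.CompleteAdding(); // signal that no more items are coming
            });

            // Several consumers: GetConsumingEnumerable() blocks until an item
            // is available and exits cleanly after CompleteAdding().
            var consumers = new Task[4];
            for (int i = 0; i < consumers.Length; i++)
            {
                consumers[i] = Task.Factory.StartNew(() =>
                {
                    foreach (var path in queue.GetConsumingEnumerable())
                        IndexFile(path);
                });
            }

            producer.Wait();
            Task.WaitAll(consumers);
        }
    }

    static void IndexFile(string path)
    {
        // Hypothetical placeholder: run Tika text extraction here and add
        // the resulting document to the Lucene.Net index.
        Console.WriteLine("indexed: " + path);
    }
}
```

The number of consumers bounds how many files are processed at once, independently of how fast the producer enumerates the directory tree.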

c# file-io task-parallel-library tpl-dataflow async-ctp
2 answers

async / await definitely helps when dealing with external resources - typically web requests, file-system access, or database operations. The interesting part of your problem is that you need to satisfy several requirements at the same time:

  • consume as little CPU time as possible (this is where async / await helps)
  • perform several operations concurrently, in parallel
  • control the number of tasks that are started (!). If you do not take this into account, you will probably exhaust the thread pool when working with many files.
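The third requirement, limiting concurrency, is commonly handled by gating the async operations with a semaphore. A minimal sketch follows; note that `SemaphoreSlim.WaitAsync` is a .NET 4.5 API, so on .NET 4.0 with the Async CTP you would need an async-compatible semaphore, but the throttling idea is the same. `ProcessFileAsync` and the `Task.Delay` stand in for the real per-file work.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ThrottleSketch
{
    // Allow at most 10 files to be processed concurrently.
    static readonly SemaphoreSlim Gate = new SemaphoreSlim(10);

    static async Task ProcessFileAsync(string path)
    {
        await Gate.WaitAsync(); // wait for a free slot
        try
        {
            // Placeholder for the real work: read the file, run Tika,
            // add the document to the Lucene index.
            await Task.Delay(10);
            Console.WriteLine("done: " + path);
        }
        finally
        {
            Gate.Release(); // always release, even if processing throws
        }
    }

    static void Main()
    {
        // Start all tasks eagerly; the semaphore ensures that at most
        // 10 of them are actually doing work at any given moment.
        var tasks = Enumerable.Range(0, 100)
            .Select(i => ProcessFileAsync("file" + i))
            .ToArray();
        Task.WaitAll(tasks);
    }
}
```

Because the tasks are I/O-bound, the threads that would otherwise block are returned to the pool while each awaited operation is in flight.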

You can take a look at a small project I posted on GitHub:

Parallel tree walker

It can efficiently enumerate any number of files in a directory structure. You can define an async operation to execute on each file (in your case, the indexing) while controlling the maximum number of files processed at the same time.

For example:

    await TreeWalker.WalkAsync(root, new TreeWalkerOptions
    {
        MaxDegreeOfParallelism = 10,
        ProcessElementAsync = async (element) =>
        {
            var el = element as FileSystemElement;
            var path = el.Path;
            var isDirectory = el.IsDirectory;
            await DoStuffAsync(el);
        }
    });

(even if you cannot use the tool directly as a DLL, you can still find useful examples in the source code)
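Since the question also asks about TPL Dataflow: an `ActionBlock<T>` is a reasonable fit even outside messaging scenarios, because it is essentially a producer/consumer queue with built-in throttling and back-pressure. A hedged sketch, where the `C:\data` path and the indexing body are placeholders for your own code:

```csharp
using System;
using System.IO;
using System.Threading.Tasks.Dataflow;

class DataflowSketch
{
    static void Main()
    {
        var indexBlock = new ActionBlock<string>(
            path =>
            {
                // Placeholder for Tika extraction + Lucene indexing.
                Console.WriteLine("indexing: " + path);
            },
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = 8, // files processed at once
                BoundedCapacity = 1000      // queue cannot grow without limit
            });

        // SendAsync honors BoundedCapacity; waiting on it blocks the
        // producer whenever the block's input queue is full.
        foreach (var path in Directory.EnumerateFiles(
            @"C:\data", "*", SearchOption.AllDirectories))
        {
            indexBlock.SendAsync(path).Wait();
        }

        indexBlock.Complete();        // no more input
        indexBlock.Completion.Wait(); // wait for queued files to finish
    }
}
```

Compared with hand-rolled `BlockingCollection` code, the block handles the consumer loop, the degree of parallelism, and completion signaling for you.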


You can use "Everything Search". The SDK is open source and includes a C# example. It is the fastest way to index files on Windows that I have seen.

From the FAQ :

1.2 How long does it take to index my files?

Everything uses file and folder names and usually takes a few seconds to build its database. A fresh installation of Windows XP SP2 (about 20,000 files) takes about 1 second to index. 1,000,000 files take about 1 minute.

I'm not sure whether you can use the TPL with it, though.

