I am currently working on a research project that involves indexing a large number of files (about 240,000); they are mainly html, xml, doc, xls, zip, rar, pdf and plain text, with sizes ranging from a few kilobytes to over 100 MB.
When I extract all the zip and rar archives, the total comes to about a million files.
I am using Visual Studio 2010, C# and .NET 4.0 with the TPL Dataflow library and the Async CTP V3. To extract text from these files I use Apache Tika (converted with IKVM), and I use Lucene.Net 2.9.4 as the indexer. I would like to make use of the new TPL Dataflow library and asynchronous programming.
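For reference, the per-file work currently looks roughly like this. It is only a sketch: the org.apache.tika.Tika facade call through the IKVM-converted assembly and the field names are my own simplification, so treat the exact signatures as assumptions rather than the real project code.

    using Lucene.Net.Documents;
    using Lucene.Net.Index;

    public static class FileIndexer
    {
        // IKVM-converted Tika facade: detects the format (html, xml, doc, xls, pdf, ...)
        // and returns extracted plain text. The parseToString(java.io.File) overload is
        // an assumption about the converted API surface.
        private static readonly org.apache.tika.Tika Tika = new org.apache.tika.Tika();

        public static Document ExtractToDocument(string path)
        {
            string text = Tika.parseToString(new java.io.File(path));

            var doc = new Document();
            doc.Add(new Field("path", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.Add(new Field("content", text, Field.Store.NO, Field.Index.ANALYZED));
            return doc;
        }

        public static void Index(IndexWriter writer, Document doc)
        {
            // Lucene's IndexWriter is thread-safe, so several workers can add documents concurrently.
            writer.AddDocument(doc);
        }
    }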
I have a few questions:
Do I get any performance benefit from the TPL? This is basically an I/O-bound process and, as I understand it, the TPL is not very useful when the work is I/O heavy.
Would a producer/consumer approach be the best way to handle this type of file processing, or are there other models that would work better? I was thinking of one producer with multiple consumers using blocking collections, something like the sketch below.
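This is roughly the single-producer / multiple-consumer layout I had in mind, using BlockingCollection&lt;T&gt; and plain .NET 4.0 tasks. FileIndexer.ExtractToDocument is the sketch from above, and the bounded capacity and worker count are just guesses:

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;
    using Lucene.Net.Index;

    public static class ProducerConsumerIndexer
    {
        public static void Run(string rootFolder, IndexWriter writer)
        {
            // Bounded queue so the producer cannot race too far ahead of the consumers.
            var queue = new BlockingCollection<string>(1000);

            var producer = Task.Factory.StartNew(() =>
            {
                foreach (var path in Directory.EnumerateFiles(rootFolder, "*", SearchOption.AllDirectories))
                    queue.Add(path);          // blocks when the queue is full
                queue.CompleteAdding();       // tells the consumers no more work is coming
            });

            var consumers = new Task[Environment.ProcessorCount];
            for (int i = 0; i < consumers.Length; i++)
            {
                consumers[i] = Task.Factory.StartNew(() =>
                {
                    foreach (var path in queue.GetConsumingEnumerable())
                        writer.AddDocument(FileIndexer.ExtractToDocument(path));
                }, TaskCreationOptions.LongRunning);
            }

            producer.Wait();
            Task.WaitAll(consumers);
            writer.Commit();
        }
    }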
Would the TPL Dataflow library be a good fit for this type of processing? It seems that TPL Dataflow is best suited to some kind of messaging system... A sketch of the pipeline I have in mind follows below.
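This is how I picture the same work as a TPL Dataflow pipeline: a parallel TransformBlock doing the Tika extraction, feeding an ActionBlock that writes to the index. I wrote it from memory against the CTP bits, so option names may differ slightly in other versions; completion is propagated by hand because I am not sure the CTP LinkTo can do it for me.

    using System;
    using System.IO;
    using System.Threading.Tasks.Dataflow;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;

    public static class DataflowIndexer
    {
        public static void Run(string rootFolder, IndexWriter writer)
        {
            // Extraction is CPU + I/O bound, so let several files be processed at once.
            var extract = new TransformBlock<string, Document>(
                path => FileIndexer.ExtractToDocument(path),
                new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

            // Funnel all documents into a single indexing block.
            var index = new ActionBlock<Document>(
                doc => writer.AddDocument(doc),
                new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });

            extract.LinkTo(index);

            // Propagate completion manually from the extraction block to the indexing block.
            extract.Completion.ContinueWith(_ => index.Complete());

            foreach (var path in Directory.EnumerateFiles(rootFolder, "*", SearchOption.AllDirectories))
                extract.Post(path);

            extract.Complete();
            index.Completion.Wait();
            writer.Commit();
        }
    }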
Should I use asynchronous programming, or stick with synchronous processing in this case?
c# file-io task-parallel-library tpl-dataflow async-ctp
Martijn