How can I set an expression on the FileSpec property of a Foreach File enumerator?

I am trying to create an SSIS package to process files from a directory that contains many years' worth of files. All files are named numerically, so to avoid processing everything, I want to pass SSIS a minimum number and enumerate only the files whose name (converted to a number) is above that minimum.

I tried letting the ForEach File loop enumerate everything and then excluding files in a Script task, but with hundreds of thousands of files this is far too slow to be usable.

The FileSpec property allows you to specify a file mask to determine which files you want in the collection, but I cannot figure out how to specify an expression to make this work, since it is essentially a string.

If the component had some kind of expression that basically says "Should I enumerate? Yes / No", it would be perfect. I experimented with the expression below, but cannot find a property to apply it to.

(DT_I4)REPLACE(SUBSTRING(@[User::ActiveFilePath], FINDSTRING(@[User::ActiveFilePath], "\\", 7) + 1, 100), ".txt", "") > @[User::MinIndexId] ? "True" : "False"

+6
3 answers

From researching how the ForEach loop works in SSIS (with a view to building my own to solve the problem), it seems that it enumerates the whole file collection first, before any mask is applied (as far as I could tell, anyway). It's hard to say exactly what happens without seeing the underlying code of the ForEach loop, but it appears to work this way, which leads to poor performance when dealing with more than 100k files.

Although @Siva's solution is fantastically detailed and definitely an improvement over my initial approach, it is essentially the same process, except that it uses an Expression task rather than a Script task to test the file name (which does seem to offer some improvement).

So I decided to take a completely different approach: instead of using a file-based ForEach loop, I enumerate the collection myself in a Script task, apply my filtering logic, and then iterate over the remaining results. This is what I did:

[Image: sample control flow showing a Script Task that enumerates the files, feeding into a Foreach From Variable Enumerator]

In my Script task, I run the DirectoryInfo.EnumerateFiles method on a background task. EnumerateFiles is the recommended method for large file collections because it streams results as they are found, rather than waiting for the entire collection to be built before any logic can be applied.

Here is the code:

    // Requires: using System.Collections.Generic; using System.IO;
    public void Main()
    {
        string sourceDir = Dts.Variables["SourceDirectory"].Value.ToString();
        int minJobId = (int)Dts.Variables["MinIndexId"].Value;

        // Enumerate the files lazily (EnumerateFiles) so that filtering starts
        // before the whole directory listing has been built.
        List<string> activeFiles = new List<string>();
        System.Threading.Tasks.Task listTask = System.Threading.Tasks.Task.Factory.StartNew(() =>
        {
            DirectoryInfo dir = new DirectoryInfo(sourceDir);
            foreach (FileInfo file in dir.EnumerateFiles("*.txt"))
            {
                // The file name without its extension is the numeric job id.
                // Names that are not numeric are skipped instead of throwing.
                string name = Path.GetFileNameWithoutExtension(file.Name);
                int jobId;
                if (int.TryParse(name, out jobId) && jobId > minJobId)
                    activeFiles.Add(file.FullName);
            }
        });

        // Wait here for completion
        listTask.Wait();

        Dts.Variables["ActiveFilenames"].Value = activeFiles;
        Dts.TaskResult = (int)ScriptResults.Success;
    }

So I enumerate the collection, applying my logic as each file is discovered and immediately adding the path of every matching file to my output list. Once that is complete, I assign the list to an SSIS Object variable named ActiveFilenames, which I use as the collection for my ForEach loop.
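The filtering step described above can also be tried outside SSIS. The following is only a minimal sketch, assuming the same naming scheme (numeric name plus a .txt extension); the class and method names (NumericFileFilter, TryGetId, FilesAbove) are hypothetical and not part of the original package.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the original package.
public static class NumericFileFilter
{
    // Returns true and the numeric id when the file name (without its
    // extension) parses as an integer, e.g. "1042.txt" gives 1042.
    public static bool TryGetId(string path, out int id)
    {
        return int.TryParse(Path.GetFileNameWithoutExtension(path), out id);
    }

    // Lazily yields the full paths of *.txt files whose numeric name exceeds minId.
    public static IEnumerable<string> FilesAbove(string directory, int minId)
    {
        foreach (string path in Directory.EnumerateFiles(directory, "*.txt"))
        {
            int id;
            if (TryGetId(path, out id) && id > minId)
                yield return path;
        }
    }

    public static void Main()
    {
        // Build a throwaway directory so the sketch runs anywhere.
        string dir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
        Directory.CreateDirectory(dir);
        foreach (string name in new[] { "5.txt", "10.txt", "15.txt", "notes.txt" })
            File.WriteAllText(Path.Combine(dir, name), "");

        // Prints the names of 10.txt and 15.txt; the order depends on the file system.
        foreach (string path in FilesAbove(dir, 9))
            Console.WriteLine(Path.GetFileName(path));

        Directory.Delete(dir, true);
    }
}
```

Because FilesAbove is an iterator, nothing is read from disk until the caller starts looping, and each file is tested as soon as it is discovered.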

I configured the ForEach loop to use the Foreach From Variable Enumerator, which now iterates over a much smaller collection (a post-filter List<string>, compared with what I can only assume is an unfiltered List<FileInfo> or something similar inside the built-in Foreach File Enumerator).

Thus, the tasks inside my loop can be dedicated to processing the data, since the files have already been filtered before they reach the loop. Although this does not look very different from my original package or Siva's example, in production (for this particular case, anyway) filtering the collection up front and enumerating it asynchronously seems to give a significant performance improvement over using the built-in Foreach File Enumerator.

I am going to continue exploring the ForEach loop container and see if I can replicate this logic in a custom component. If I get this working, I will post a link in the comments.

+2

Here is one way to achieve this. You can use the Expression Task together with the Foreach Loop Container to match the numerical values of the file names. Here is an example that illustrates how to do this. The example uses SSIS 2012.

This may not be very efficient, but it is one way to do this.

Suppose there is a folder with a bunch of files named in the format YYYYMMDD. The folder contains a file for the first day of each month since 1921, such as 19210101, 19210201, 19210301, and so on, all the way to the current month, 20121101. This adds up to 1,103 files.

Let's say the requirement is to loop only through the files created since June 1948. This means the SSIS package should loop only through files whose names are greater than 19480601.

[Image: Files]

In the SSIS package, create the following three parameters. It is better to use parameters for these values, because they can then be configured per environment.

  • ExtensionToMatch - This parameter of data type String holds the extension that the package has to loop through. It supplies the value for the FileSpec variable, which is used in the Foreach Loop container.

  • FolderToEnumerate - This parameter of data type String stores the path of the folder containing the files to loop through.

  • MinIndexId - This parameter of data type Int32 holds the minimum numerical value; only files whose names are above it should be processed.

[Image: Parameters]

Create the following four variables to help us loop through the files.

  • ActiveFilePath - This variable of data type String holds the file name as the Foreach Loop container goes through each file in the folder. This variable is used in the expression of another variable. To avoid an error, set it to a non-empty value, for example: 1.

  • FileCount - This is a dummy variable of data type Int32, used in this example to show how many files the Foreach Loop container processes.

  • FileSpec - This variable of data type String holds the file pattern to loop through. Set the expression of this variable to the value shown below. The expression uses the extension specified in the parameters. If no extension is given, it falls back to *.* and loops through all files.

"*" + (@[$Package::ExtensionToMatch] == "" ? ".*" : @[$Package::ExtensionToMatch])

  • ProcessThisFile - This variable of data type Boolean indicates whether a particular file meets the criteria or not.
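To make the FileSpec expression above easier to read, here is a rough C# equivalent of the same conditional. It is only an illustration: the class and method names are hypothetical, and it assumes the ExtensionToMatch parameter includes the leading dot (for example ".txt"), which the expression implies but the answer does not state.

```csharp
using System;

// Hypothetical illustration of the FileSpec expression; not part of the package.
public static class FileSpecSketch
{
    // Mirrors: "*" + (ExtensionToMatch == "" ? ".*" : ExtensionToMatch)
    // Assumption: the extension is passed with its leading dot, e.g. ".txt".
    public static string Build(string extensionToMatch)
    {
        return "*" + (extensionToMatch == "" ? ".*" : extensionToMatch);
    }

    public static void Main()
    {
        Console.WriteLine(Build(".txt")); // prints *.txt
        Console.WriteLine(Build(""));     // prints *.*
    }
}
```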

[Image: Variables]

Configure the package as shown below. The Foreach Loop container loops through all the files matching the pattern specified in the FileSpec variable. The expression specified in the Expression task is evaluated at run time and populates the ProcessThisFile variable. That variable is then used in the precedence constraint to determine whether to process the file or not.

The Script task inside the Foreach Loop container increments the FileCount variable by 1 for each file that successfully matches the expression.

The Script task outside the Foreach Loop container simply displays how many files were processed by the loop.

[Image: Control flow]

Configure the Foreach loop container to cycle through the folder using the parameter and files using the variable.

[Image: Foreach loop collection]

Store the file name in the ActiveFilePath variable as the loop goes through each file.

[Image: Foreach Loop Variable Mappings]

In the Expression task, set the expression to the following value. The expression converts the file name without the extension to a number and then checks whether it is greater than the value specified in the MinIndexId parameter.

@[User::ProcessThisFile] = (DT_BOOL)((DT_I4)(REPLACE(@[User::ActiveFilePath], @[User::FileSpec], "")) > @[$Package::MinIndexId] ? 1 : 0)

[Image: Expression task]

Right-click the Precedence constraint and configure it to use the ProcessThisFile variable in the expression. This tells the package to process the file only if it meets the condition specified in the expression task.

@[User::ProcessThisFile]

[Image: Precedence constraint]

In the first Script task, I have the User::FileCount variable set in ReadWriteVariables and the following C# code in the Script task. It increments the counter for each file that successfully matches the condition.

    public void Main()
    {
        // Increment the counter for each file that passed the precedence constraint.
        Dts.Variables["User::FileCount"].Value =
            Convert.ToInt32(Dts.Variables["User::FileCount"].Value) + 1;
        Dts.TaskResult = (int)ScriptResults.Success;
    }

In the second Script task, I have the User::FileCount variable set in ReadOnlyVariables and the following C# code in the Script task. It simply displays the total number of files processed.

    // Requires: using System.Windows.Forms;
    public void Main()
    {
        // Show how many files matched the condition inside the loop.
        MessageBox.Show(String.Format("Total files looped through: {0}",
            Dts.Variables["User::FileCount"].Value));
        Dts.TaskResult = (int)ScriptResults.Success;
    }

When the package runs with MinIndexId set to 19480601 (exclusive), it outputs the value 773.

[Image: Output 1]

When the package runs with MinIndexId set to 20111201 (exclusive), it outputs the value 11.
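The counts quoted in this example can be cross-checked without SSIS. The sketch below is an independent arithmetic check, not part of the package: it generates the first-of-month names from January 1921 to November 2012 as numbers and counts those above each MinIndexId. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical cross-check of the file counts; not part of the package.
public static class MonthlyFileCount
{
    // First-of-month file names from January 1921 to November 2012, as numbers.
    public static List<int> MonthlyIds()
    {
        var ids = new List<int>();
        for (DateTime d = new DateTime(1921, 1, 1); d <= new DateTime(2012, 11, 1); d = d.AddMonths(1))
            ids.Add(d.Year * 10000 + d.Month * 100 + d.Day);
        return ids;
    }

    // Number of files whose numeric name is strictly greater than minIndexId.
    public static int CountAbove(int minIndexId)
    {
        return MonthlyIds().Count(id => id > minIndexId);
    }

    public static void Main()
    {
        Console.WriteLine(MonthlyIds().Count);   // prints 1103
        Console.WriteLine(CountAbove(19480601)); // prints 773
        Console.WriteLine(CountAbove(20111201)); // prints 11
    }
}
```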

Hope this helps.

[Image: Output 2]

+12

The best you can do is use FileSpec to specify a mask, as you said. You can at least put part of the criteria into it, such as files starting with "201" to cover 2010, 2011, and 2012. Then, in some other task, you can filter out the ones you do not want to process (for example, 2010).
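As a rough illustration of this two-stage idea outside SSIS, the sketch below lets the file system apply a coarse "201*" mask and then drops the unwanted 2010 files in a second step. The class and method names are hypothetical, and the search pattern plays the role that FileSpec plays in the Foreach File enumerator.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical illustration of "coarse mask first, fine filter second".
public static class MaskThenFilter
{
    // Stage 2: drop paths whose file name starts with an unwanted prefix.
    public static IEnumerable<string> ExcludePrefix(IEnumerable<string> paths, string unwantedPrefix)
    {
        return paths.Where(p => !Path.GetFileName(p).StartsWith(unwantedPrefix, StringComparison.Ordinal));
    }

    // Stage 1 (mask "201*") followed by stage 2 (exclude "2010").
    public static IEnumerable<string> SelectFiles(string directory)
    {
        return ExcludePrefix(Directory.EnumerateFiles(directory, "201*"), "2010");
    }

    public static void Main()
    {
        // Build a throwaway directory so the sketch runs anywhere.
        string dir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
        Directory.CreateDirectory(dir);
        foreach (string name in new[] { "20091201.txt", "20100101.txt", "20110101.txt", "20120101.txt" })
            File.WriteAllText(Path.Combine(dir, name), "");

        // Lists only the 2011 and 2012 files.
        foreach (string path in SelectFiles(dir).OrderBy(p => p, StringComparer.Ordinal))
            Console.WriteLine(Path.GetFileName(path));

        Directory.Delete(dir, true);
    }
}
```

The mask only narrows the listing; the exact cut-off still has to be applied by the second stage.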

+1
