Improve file sorting performance by extension

Given an array of file names, the easiest way to sort it by file extension is:

Array.Sort(fileNames, (x, y) => Path.GetExtension(x).CompareTo(Path.GetExtension(y))); 

The problem is that a very long list (~800k entries) takes a long time to sort this way; sorting by the whole file name is faster by a couple of seconds!

In theory there is a way to optimize this: instead of calling Path.GetExtension() and comparing the newly created extension strings, we can provide a comparison that compares the existing file-name strings starting at LastIndexOf('.'), without creating new strings.

Now suppose I have found LastIndexOf('.'). I want to reuse .NET's own StringComparer and apply it only to the part of the string after LastIndexOf('.'), to preserve all culture-sensitive behavior. I have not found a way to do this.

Any ideas?

Edit:

Using tanascius's idea of the char.CompareTo() method, I came up with my Uber-Fast-File-Extension-Comparer. It now sorts by extension 3 times faster! It is even faster than every method that uses Path.GetExtension() in some way. What do you think?

Edit 2:

I found that this implementation does not take culture into account, because char.CompareTo() is not culture-sensitive, so it is not an ideal solution.

Any ideas?

    public static int CompareExtensions(string filePath1, string filePath2)
    {
        if (filePath1 == null && filePath2 == null)
        {
            return 0;
        }
        else if (filePath1 == null)
        {
            return -1;
        }
        else if (filePath2 == null)
        {
            return 1;
        }

        int i = filePath1.LastIndexOf('.');
        int j = filePath2.LastIndexOf('.');

        if (i == -1)
        {
            i = filePath1.Length;
        }
        else
        {
            i++;
        }

        if (j == -1)
        {
            j = filePath2.Length;
        }
        else
        {
            j++;
        }

        for (; i < filePath1.Length && j < filePath2.Length; i++, j++)
        {
            int compareResults = filePath1[i].CompareTo(filePath2[j]);

            if (compareResults != 0)
            {
                return compareResults;
            }
        }

        if (i >= filePath1.Length && j >= filePath2.Length)
        {
            return 0;
        }
        else if (i >= filePath1.Length)
        {
            return -1;
        }
        else
        {
            return 1;
        }
    }
4 answers

Create a new array containing each file name in ext.restofpath form (or some kind of pair/tuple that sorts by extension by default without further transformation). Sort it, then convert back.

This is faster because instead of extracting the extension many times for each element (a comparison sort performs on the order of N log N comparisons), you extract it only once per element (and then move it back once).
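One possible way to sketch this "extract once, then sort" idea is the Array.Sort overload that takes a separate key array. The class and method names below (ExtensionSort, SortByExtension) are hypothetical, and StringComparer.CurrentCulture is only an assumed choice of comparer; the sort is not stable, so files sharing an extension may end up in any order.

```csharp
using System;
using System.IO;

public static class ExtensionSort
{
    // Sorts fileNames in place by extension, extracting each extension exactly once.
    // Hypothetical helper for illustration; not part of the original answer.
    public static void SortByExtension(string[] fileNames)
    {
        // Build the key array: one Path.GetExtension call per file name.
        string[] extensions = new string[fileNames.Length];
        for (int i = 0; i < fileNames.Length; i++)
        {
            extensions[i] = Path.GetExtension(fileNames[i]);
        }

        // Array.Sort(keys, items, comparer) orders the keys and applies the
        // same reordering to the items array.
        Array.Sort(extensions, fileNames, StringComparer.CurrentCulture);
    }
}
```

Calling ExtensionSort.SortByExtension(fileNames) reorders the original array, so no "convert back" step is needed; the cost is one extra array of extension strings.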


You can write a comparer that compares the extensions character by character. char also has a CompareTo() method.

Basically, you loop until at least one string has no characters left or one CompareTo() call returns a value != 0.

EDIT: in response to OP changes

The efficiency of your comparison method can be improved considerably; see the code below. In addition, I added the line

 string.Compare( filePath1[i].ToString(), filePath2[j].ToString(), m_CultureInfo, m_CompareOptions ); 

to enable the use of CultureInfo and CompareOptions. However, this slows things down by roughly a factor of 2 compared with the version using plain char.CompareTo(). But according to my own SO question, this seems to be the way to go.

    public sealed class ExtensionComparer : IComparer<string>
    {
        private readonly CultureInfo m_CultureInfo;
        private readonly CompareOptions m_CompareOptions;

        public ExtensionComparer() : this( CultureInfo.CurrentUICulture, CompareOptions.None ) {}

        public ExtensionComparer( CultureInfo cultureInfo, CompareOptions compareOptions )
        {
            m_CultureInfo = cultureInfo;
            m_CompareOptions = compareOptions;
        }

        public int Compare( string filePath1, string filePath2 )
        {
            if( filePath1 == null || filePath2 == null )
            {
                if( filePath1 != null )
                {
                    return 1;
                }
                if( filePath2 != null )
                {
                    return -1;
                }
                return 0;
            }

            var i = filePath1.LastIndexOf( '.' ) + 1;
            var j = filePath2.LastIndexOf( '.' ) + 1;

            if( i == 0 || j == 0 )
            {
                if( i != 0 )
                {
                    return 1;
                }
                return j != 0 ? -1 : 0;
            }

            while( true )
            {
                if( i == filePath1.Length || j == filePath2.Length )
                {
                    if( i != filePath1.Length )
                    {
                        return 1;
                    }
                    return j != filePath2.Length ? -1 : 0;
                }

                var compareResults = string.Compare( filePath1[i].ToString(), filePath2[j].ToString(), m_CultureInfo, m_CompareOptions );
                //var compareResults = filePath1[i].CompareTo( filePath2[j] );

                if( compareResults != 0 )
                {
                    return compareResults;
                }

                i++;
                j++;
            }
        }
    }

Usage:

 fileNames1.Sort( new ExtensionComparer( CultureInfo.GetCultureInfo( "sv-SE" ), CompareOptions.StringSort ) ); 
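As a possible alternative to comparing one character at a time, the framework's CompareInfo.Compare overload that takes offsets and lengths can compare the two extension ranges in place, without allocating substrings or per-character strings. The following is only a sketch: SubstringExtensionComparer is a hypothetical name, and treating a path without a dot as having an empty extension is an assumption. Because culture-sensitive comparison looks at whole character sequences, its results may differ from a character-by-character loop.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical comparer for illustration: compares only the text after the last '.'.
public sealed class SubstringExtensionComparer : IComparer<string>
{
    private readonly CompareInfo m_CompareInfo;
    private readonly CompareOptions m_CompareOptions;

    public SubstringExtensionComparer(CultureInfo cultureInfo, CompareOptions compareOptions)
    {
        m_CompareInfo = cultureInfo.CompareInfo;
        m_CompareOptions = compareOptions;
    }

    public int Compare(string filePath1, string filePath2)
    {
        // Nulls sort before non-null values.
        if (filePath1 == null || filePath2 == null)
        {
            if (filePath1 != null) return 1;
            if (filePath2 != null) return -1;
            return 0;
        }

        int i = ExtensionStart(filePath1);
        int j = ExtensionStart(filePath2);

        // Culture-aware comparison of the two ranges, evaluated in place.
        return m_CompareInfo.Compare(
            filePath1, i, filePath1.Length - i,
            filePath2, j, filePath2.Length - j,
            m_CompareOptions);
    }

    // Index of the first character after the last '.', or the string length
    // when there is no dot (assumption: such a path has an empty extension).
    private static int ExtensionStart(string path)
    {
        int dot = path.LastIndexOf('.');
        return dot == -1 ? path.Length : dot + 1;
    }
}
```

It can be passed to Sort the same way as ExtensionComparer above. Like the original code, it looks only at the last '.', so a dot inside a directory name of an extensionless file would be misread as the start of an extension.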

Not the most memory-efficient approach, but the fastest according to my tests:

    SortedDictionary<string, List<string>> dic = new SortedDictionary<string, List<string>>();

    foreach (string fileName in fileNames)
    {
        string extension = Path.GetExtension(fileName);
        List<string> list;

        if (!dic.TryGetValue(extension, out list))
        {
            list = new List<string>();
            dic.Add(extension, list);
        }

        list.Add(fileName);
    }

    string[] arr = dic.Values.SelectMany(v => v).ToArray();

A mini-test with 800k randomly generated 8.3 file names:

Sorting items with LINQ to Objects ... 00:00:04.4592595

Sorting items with SortedDictionary ... 00:00:02.4405325

Sorting items with Array.Sort ... 00:00:06.6464205


The main problem is that you call Path.GetExtension several times for each path. If a quicksort is doing the sorting, you can expect Path.GetExtension to be called anywhere from log(n) to n times for each element, where n is the number of elements in your list. So you want to cache the results of the Path.GetExtension calls.

If you are using LINQ, I would suggest something like this:

    filenames.Select(n => new { name = n, ext = Path.GetExtension(n) })
             .OrderBy(t => t.ext)
             .Select(t => t.name)
             .ToArray();

This ensures that Path.GetExtension is called only once for each file name.
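To make the difference in call counts concrete, here is a rough sketch that counts how often the extension is extracted under each approach. ExtensionCallCounter and its members are hypothetical names introduced only for this illustration, and the exact count for the comparison-based sort depends on the runtime's sort implementation.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper that counts extension extractions; not thread-safe.
public static class ExtensionCallCounter
{
    public static int Calls;

    private static string CountedExtension(string path)
    {
        Calls++;
        return Path.GetExtension(path);
    }

    // Sorts a copy with a comparison delegate: the extension is extracted
    // twice per comparison, so each element is processed many times.
    public static int CountWithComparison(string[] fileNames)
    {
        Calls = 0;
        string[] copy = (string[])fileNames.Clone();
        Array.Sort(copy, (x, y) => string.Compare(
            CountedExtension(x), CountedExtension(y), StringComparison.CurrentCulture));
        return Calls;
    }

    // Sorts with OrderBy: LINQ to Objects evaluates the key selector once
    // per element before sorting.
    public static int CountWithOrderBy(string[] fileNames)
    {
        Calls = 0;
        fileNames.OrderBy(n => CountedExtension(n)).ToArray();
        return Calls;
    }
}
```

For an array of n names, CountWithOrderBy should return n, while CountWithComparison returns a noticeably larger number that grows faster than n.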
