How to combine items in a List <string> to efficiently create new items
I have a case where I have an object name and a bunch of file names. I need to map the correct file name to the object. A file name can contain numbers and words separated by a hyphen (-) or underscore (_). I can control neither the file name nor the name of the object. For example:
10-11-12_001_002_003_13001_13002_this_is_an_example.svg The name of the object in this case is just a string representing a number
10001 I need to return true or false if the file name matches the name of the object. The different segments of the file name may coincide individually or with any combination of the two segments. In the above example, this should be true for the following cases (not every true case, just examples):
10001 10002 10003 11001 11002 11003 12001 12002 12003 13001 13002 And we must return false for this case (among others):
13003 So far I have come to the following:
public bool IsMatch(string filename, string objectname) { var namesegments = GetNameSegments(filename); var match = namesegments.Contains(objectname); return match; } public static List<string> GetNameSegments(string filename) { var segments = filename.Split('_', '-').ToList(); var newSegments = new List<string>(); foreach (var segment in segments) { foreach (var segment2 in segments) { if (segment == segment2) continue; var newToken = segment + segment2; newSegments.Add(newToken); } } return segments.Concat(newSegments).ToList(); } One or two segments together can make a match, and thatβs enough. Three or more segments should not be considered.
This works so far, but is there a better way to do this, perhaps not nested foreach loops?
Efficiency is very much dependent on the business problem you are trying to solve. Not knowing the full context / use, it is difficult to determine the most effective solution. What works for one situation will not always work for others.
I will always advocate writing working code, and then solve any performance problems later in the line (or throw more tin into the problem, since it is usually cheaper!) If you have specific performance problems, please tell us more ...
I'm going to go on a limb and say (hopefully) that you are only going to match the file name with the name of the object once per execution. If this case I consider, this approach will be approximately the fastest. In the case where you map a single file name to multiple object names, the obvious choice is to create a sort index and map to what you have already done, although I would consider different types of collections depending on your expected execution / use.
public static bool IsMatch(string filename, string objectName) { var segments = filename.Split('-', '_'); for (int i = 0; i < segments.Length; i++) { if (string.Equals(segments[i], objectName)) return true; for (int ii = 0; ii < segments.Length; ii++) { if (ii == i) continue; if (string.Equals($"{segments[i]}{segments[ii]}", objectName)) return true; } } return false; } First: do not modify debugged, working, efficient code for no reason. Your decision looks good.
However, we can make some improvements to your decision.
public static List<string> GetNameSegments(string filename) Creating an output list leads to implementation restrictions that are not required by the caller. This should be an IEnumerable<String> . In particular, since the caller in this case only cares about the first match.
var segments = filename.Split('_', '-').ToList(); Why ToList ? The list supports an array. You already have an array. Just use an array.
Since you no longer need to create a list, we can convert your two-loop solution into an iterator block:
public static IEnumerable<string> GetNameSegments(string filename) { var segments = filename.Split('_', '-'); foreach (var segment in segments) yield return segment; foreach (var s1 in segments) foreach (var s2 in segments) if (s1 != s2) yield return s1 + s2; } Much nicer. Alternatively, we might notice that this has a request structure and just returns the request:
public static IEnumerable<string> GetNameSegments(string filename) { var q1= filename.Split('_', '-'); var q2 = from s1 in q1 from s2 in q1 where s1 != s2 select s1 + s2; return q1.Concat(q2); } Again, much nicer in this form.
Now let's talk about efficiency. As often happens, we can achieve greater efficiency by increasing complexity. This code looks to be fast enough . Your example consists of nine segments. Suppose nine or ten are typical. Our solutions still consider first ten or so singletons, and then a hundred or so combinations. It's nothing; this code is probably fine. But what if we had thousands of segments and considered millions of possibilities?
In this case, we must restructure the algorithm. One possibility would be this general solution:
public bool IsMatch(HashSet<string> segments, string name) { if (segments.Contains(name)) return true; var q = from s1 in segments where name.StartsWith(s1) let s2 = name.Substring(s1.Length) where s1 != s2 where segments.Contains(s2) select 1; // Dummy. All we care about is if there is one. return q.Any(); } Your original solution is quadratic in the number of segments. It is linear; we rely on a standing order containing the operation. (This assumes, of course, that string operations are constant time, because the strings are short. If it is not, then we have another kettle of fish to fry.)
How else can winnings be obtained in the asymptotic case?
If we had the property that the collection was not a hash set, but rather a sorted list, we could do even better; we could binary search the list to find the beginning and end of the range of possible prefix matches, and then pour the list into a hashset to make suffix matches. This is still linear, but may have a lower constant coefficient.
If we become aware that the target line is small compared to the number of segments, we can attack the problem from the other end. Generate all possible combinations of sections of the target line and check if both halves are in the segment . The problem with this solution is that it is quadratic in memory usage in row size. So what we want to do is build a special hash on character sequences and use it to populate the hash table, rather than the standard string hash. I am sure you will see how the solution will go from there. I will not state the details.
If you are ready to use the MoreLINQ NuGet package, then this might be worth considering:
public static HashSet<string> GetNameSegments(string filename) { var segments = filename.Split(new char[] {'_', '-'}, StringSplitOptions.RemoveEmptyEntries).ToList(); var matches = segments .Cartesian(segments, (x, y) => x == y ? null : x + y) .Where(z => z != null) .Concat(segments); return new HashSet<string>(matches); } StringSplitOptions.RemoveEmptyEntries handles adjacent delimiters (e.g. -). Cartesian roughly equivalent to existing nested loops. Where - delete empty entries (i.e. if x == y ). Concat same as your existing Concat . Using a HashSet makes IsMatch calls faster (in IsMatch ).