How to boil a list down to the least common strings?

Question

I have a HashSet<string> into which I load vulgar words for filtering purposes. The problem is that the list contains "Fu" as well as the fully spelled-out word. I want to reduce the list so that it contains only "Fu" and none of the longer forms of the word.

In other words, I want to remove every string in the list that contains another element of the list as a substring.

How can I do it?

I have the following, where excludedWords is the original HashSet, but it does not fully work:

    HashSet<string> copy = new HashSet<string>(excludedWords);
    foreach (string w in copy)
    {
        foreach (string s in copy)
        {
            if (w.Contains(s) && w.Length > s.Length)
            {
                result.Remove(w);
            }
        }
    }

+4

c#

MAW74656 Oct 3 '11 at 20:28

5 answers

Here is one way:

    filter.RemoveAll(a => filter.Any(b => b != a && a.Contains(b)));

This assumes filter is a List<string> that has already been populated with the words.

Edit: I didn't see that you wanted Contains rather than StartsWith, so I made the necessary change.
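Since the question uses a HashSet<string> rather than a list, the same idea can be sketched with HashSet<T>.RemoveWhere. This is only a minimal sketch: the class and method names are made up for illustration, the snapshot array is an added precaution so the predicate never reads the set while it is being modified, and it assumes the set contains no null entries.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class WordSetReducer
{
    // Removes every word that contains another, different word from the same set.
    // Assumption: the set holds no null entries.
    public static void RemoveSuperstrings(HashSet<string> words)
    {
        // Snapshot first, so the predicate below never enumerates the set mid-removal.
        string[] snapshot = words.ToArray();

        // The Length check skips an empty string, which every word would "contain".
        words.RemoveWhere(w => snapshot.Any(s => s.Length > 0 && s != w && w.Contains(s)));
    }
}
```

Called on a set holding "Fu", "Fuzz", "bar", "rebar" and "baz", this should leave "Fu", "bar" and "baz". Note that Contains is case-sensitive here, so "fuzz" would not be treated as containing "Fu".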

+1

deepee1 Oct 3 '11 at 20:47

Assuming you just want to throw away the longer values, you can use a custom IEqualityComparer<string> implementation to build a new set.

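    // Requires: using System; using System.Collections.Generic; using System.Diagnostics;
    private class ShortestSubStringComparer : IComparer<string>, IEqualityComparer<string>
    {
        public int Compare(string x, string y)
        {
            if (x == null) return (y == null) ? 0 : -1;
            if (y == null) return 1;
            Debug.Assert(x != null && y != null);
            if (this.Equals(x, y)) return x.Length.CompareTo(y.Length);
            return StringComparer.CurrentCulture.Compare(x, y);
        }

        public bool Equals(string x, string y)
        {
            // Handle nulls first so StartsWith is never called with a null argument.
            if (x == null || y == null) return x == null && y == null;
            // Two strings count as "equal" when one is a prefix of the other.
            return x.StartsWith(y) || y.StartsWith(x);
        }

        public int GetHashCode(string obj)
        {
            // Caveat: a string and its prefix usually have different hash codes,
            // so hash-based grouping (e.g. GroupBy) may never call Equals on them.
            return obj.GetHashCode();
        }
    }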

Your function can then use GroupBy to group the values and select the first ordered element of each group, as follows:

    public HashSet<string> FindShortestSubString(HashSet<string> set)
    {
        var comparer = new ShortestSubStringComparer();
        return new HashSet<string>(
            set.GroupBy(e => e, comparer)
               .Select(g => g.OrderBy(e => e, comparer).First()));
    }

Or maybe Min can do the trick (which means you don't need an implementation of IComparer<string>):

    public HashSet<string> FindShortestSubString(HashSet<string> set)
    {
        var comparer = new ShortestSubStringComparer();
        return new HashSet<string>(
            set.GroupBy(e => e, comparer)
               .Select(g => g.Min(e => e)));
    }

+1

Reddog Oct 3 '11 at 20:50

I would advise against this type of filtering. You may save a few CPU cycles, but you will get unforeseen consequences that can really confuse your users (or just drive them crazy).

For example, suppose this is a list of vulgar words ...

    foo
    bar
    foohead
    foolery

You want to filter all of these words out of some content. To be efficient, you remove foohead and foolery and just filter on the substring foo.

You will now also filter out harmless words that contain foo (food, for example) but were never on your original list of vulgar words.
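The difference can be made concrete with a small sketch that contrasts masking a bare substring with masking a whole word. The class and method names are hypothetical, and the `\b` word-boundary approach is only one possible mitigation, not something taken from the original list-reduction code.

```csharp
using System.Text.RegularExpressions;

public static class FilterComparison
{
    // Masks every occurrence of the fragment, even inside longer, harmless words.
    public static string MaskSubstring(string text, string fragment)
    {
        return Regex.Replace(text, Regex.Escape(fragment), "***", RegexOptions.IgnoreCase);
    }

    // Masks the word only where it stands alone, using \b word boundaries.
    public static string MaskWholeWord(string text, string word)
    {
        return Regex.Replace(text, @"\b" + Regex.Escape(word) + @"\b", "***", RegexOptions.IgnoreCase);
    }
}
```

With the input "foo food", MaskSubstring also hits the harmless word and returns "*** ***d", while MaskWholeWord returns "*** food". The whole-word version, however, needs foohead and foolery to stay on the list, which is exactly what the reduction would remove.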

This reminds me of a recent Daily WTF article (the second screenshot):

http://thedailywtf.com/Articles/Progree-of-enail-Status.aspx

+1

Bzink Oct 3 '11 at 20:50

You can use regular expressions. This is in VB, but I'm sure you can convert it.

Example:

    Imports System.Text.RegularExpressions

    Public Class Form1
        Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
            ' Placeholder: assign whatever the user has entered.
            Dim userInput As String = ""
            Dim InputString As String
            InputString = Regex.Replace(userInput, "fu", "**")
        End Sub
    End Class
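A rough C# conversion of the same idea might look like the sketch below. It goes slightly beyond the VB snippet in ways that are assumptions rather than part of the answer: it builds one alternation pattern from the whole word list, escapes each word with Regex.Escape, matches case-insensitively, and masks each hit with as many asterisks as it has characters. The WordMasker class and Mask method are names invented for the illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordMasker
{
    // Replaces every occurrence of any listed word with asterisks of the same length.
    public static string Mask(string input, IEnumerable<string> words)
    {
        // Escape each word so regex metacharacters in the list are matched literally.
        string pattern = string.Join("|", words.Select(w => Regex.Escape(w)));

        return Regex.Replace(input, pattern, m => new string('*', m.Length), RegexOptions.IgnoreCase);
    }
}
```

For example, WordMasker.Mask("Fubar", new[] { "fu" }) should yield "**bar".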

0

user959631 Oct 6 '11 at 14:00

Jon newmuis · Accepted Answer · 2011-10-03T20:35:17+0000

You need to compare each word in the set with every other (distinct) word in the set. You can do this as follows (although I'm sure this is not the most efficient method by any means):

    string[] strings = { "a", "aa", "aaa", "b", "bb", "bbb", "c", "cc", "ccc" };
    List<string> results = new List<string>(strings);

    foreach (string str1 in strings)
    {
        foreach (string str2 in strings)
        {
            if (str1 != str2)
            {
                if (str2.Contains(str1))
                {
                    results.Remove(str2);
                }
            }
        }
    }

    return results;
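One way the nested loop could be tightened is to process the words from shortest to longest and compare each word only against the words already kept. The following is a sketch under stated assumptions, not the answer's own method: the ShortestWordFilter class is hypothetical, the comparison is made case-insensitive (the original Contains call is case-sensitive), and null or empty entries are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ShortestWordFilter
{
    // Returns the words that do not contain any other kept word (ignoring case).
    public static List<string> KeepShortest(IEnumerable<string> words)
    {
        var kept = new List<string>();

        // Shortest first: a word can only contain words that are no longer than itself.
        foreach (string word in words.Where(w => !string.IsNullOrEmpty(w)).OrderBy(w => w.Length))
        {
            // Checking kept words is enough: any dropped shorter word itself contains a kept one.
            bool containsKept = kept.Any(k => word.IndexOf(k, StringComparison.OrdinalIgnoreCase) >= 0);
            if (!containsKept)
            {
                kept.Add(word);
            }
        }

        return kept;
    }
}
```

For the sample array used above, this should return "a", "b" and "c". Because the comparison ignores case, entries that differ only in case (such as "Fu" and "fu") collapse to whichever comes first in the input.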
