Create an array of unique combinations from an array of strings

I am writing something that takes a block of text and breaks it into possible database queries that can be used to find similar blocks of text. (something similar to a list of similar questions) when I print this) Main process:

  • Remove stop words from text
  • Removing Special Characters
  • From the remaining text, create an array of unique "stems"
  • Create an array of possible combinations of an array of stems (where I got stuck ... sort of)

Here is what I still have:

//baseList starts with an empty array //candList starts with the array of unique stems //target is where the arrays of unique combinations are stored function createUniqueCombos(baseList,candList,target){ for(var i=0;i<candList.length;i++){ //copy the base List var newList = baseList.slice(0); //add the candidate list item to the base list copy newList.push(candList[i]); //add the new array to the target array target.push(newList); //re-call function using new array as baseList //and remaining candidates as candList var nextCandList = candList.slice(i + 1); createUniqueCombos(newList,nextCandList,target); } } 

This works, but on blocks with text larger than 25 words, it resets my browser. I understand that mathematically there can be a huge number of possible combinations. I'd like to know:

  • Is there a more efficient way to do this?
  • How can I determine the length of the min / max array?
+7
source share
3 answers

I think your logic is fundamentally wrong due to how many combinations you create.

The approach that I will take will be:

  • Separate the text into separate words (we will call this variable split_words )
  • Removing Special Characters
  • Delete short / common words (and, or, I, a); either do it in length, or more reasonably blacklisted words
  • Enter a table (e.g. blocks ) that has block_id and word columns
  • Have a SQL query like

     SELECT block_id FROM blocks WHERE word IN (split_words) GROUP BY block_id ORDER BY COUNT(*) DESC 

and then you will have a list of block_ids , which is ordered depending on how many words generally have blocks.

+1
source

Found this previous question: Algorithm for finding articles with similar text

One of the answers provided a link to an article suggesting to find how many contiguous pairs of characters are contained in both lines. [ http://www.catalysoft.com/articles/StrikeAMatch.html ]

An example is in Java, but I'm sure you can easily port JS:

 /** @return an array of adjacent letter pairs contained in the input string */ private static String[] letterPairs(String str) { int numPairs = str.length()-1; String[] pairs = new String[numPairs]; for (int i=0; i<numPairs; i++) { pairs[i] = str.substring(i,i+2); } return pairs; } /** @return an ArrayList of 2-character Strings. */ private static ArrayList wordLetterPairs(String str) { ArrayList allPairs = new ArrayList(); // Tokenize the string and put the tokens/words into an array String[] words = str.split("\\s"); // For each word for (int w=0; w < words.length; w++) { // Find the pairs of characters String[] pairsInWord = letterPairs(words[w]); for (int p=0; p < pairsInWord.length; p++) { allPairs.add(pairsInWord[p]); } } return allPairs; } /** @return lexical similarity value in the range [0,1] */ public static double compareStrings(String str1, String str2) { ArrayList pairs1 = wordLetterPairs(str1.toUpperCase()); ArrayList pairs2 = wordLetterPairs(str2.toUpperCase()); int intersection = 0; int union = pairs1.size() + pairs2.size(); for (int i=0; i<pairs1.size(); i++) { Object pair1=pairs1.get(i); for(int j=0; j<pairs2.size(); j++) { Object pair2=pairs2.get(j); if (pair1.equals(pair2)) { intersection++; pairs2.remove(j); break; } } } return (2.0*intersection)/union; } 
+1
source

Your problem can be easily solved with my class of binomial coefficients . Take a look at the code from my answer to a somewhat related issue. I donโ€™t know, it would be nice to port C # code to SQL-stored proc. It would probably be easier to port it to java or js and call the stored procs from this code.

0
source

All Articles