Unordered regex matching

It seems like this should be a very simple regex case, but I can't work out how to do it.

I would like to write a regular expression that checks whether every word in a specific list appears in a document, in any order, together with at least one word from a second list.

In pseudo-logic, the check would be: if allOfTheseWords are in this text and atLeastOneOfTheseWords is in this text, return true.

Example
I am looking for (john and barbara) with (happy or sad). The order does not matter.

"Happy birthday john from barbara" => VALID
"Happy birthday john" => INVALID

I just can't work out how to combine the AND part and the OR part. Any help will be appreciated!

+4
6 answers

You really don't want to use a regular expression for this unless the text is very small, which, from your description, I doubt.

A simple solution is to put all the words of the text into a HashSet, after which checking whether a word is present becomes very quick and easy.
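As a rough illustration of the HashSet idea, here is a minimal sketch. The class and method names (`WordSetCheck`, `tokenize`, `matches`) are made up for this example, and it assumes that words are separated by non-alphanumeric characters and that the search words are given in lower case.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class WordSetCheck {
    // Lower-cases the text and splits it on runs of characters that are
    // neither letters nor digits; empty tokens are skipped.
    static Set<String> tokenize(String text) {
        Set<String> words = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // True if every word in allOf is present and at least one word in anyOf is present.
    // Assumption: the words in allOf and anyOf are already lower case.
    static boolean matches(String text, List<String> allOf, List<String> anyOf) {
        Set<String> words = tokenize(text);
        if (!words.containsAll(allOf)) {
            return false;
        }
        for (String word : anyOf) {
            if (words.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> allOf = Arrays.asList("john", "barbara");
        List<String> anyOf = Arrays.asList("happy", "sad");
        System.out.println(matches("Happy birthday john from barbara", allOf, anyOf)); // true
        System.out.println(matches("Happy birthday john", allOf, anyOf));              // false
    }
}
```

The text is scanned once to build the set, and each lookup afterwards takes constant time on average, so the cost does not grow with the number of orderings the way a single regex does.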

+4

If you want to do this with a regex, I would try positive lookaheads:

 // searching for (john and barbara) with (happy or sad)
 "^(?=.*\bjohn\b)(?=.*\bbarbara\b).*\b(happy|sad)\b"

Performance should be comparable to performing a full text search for each word in the allOfTheseWords group separately.
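To try the lookahead pattern out, here is a small sketch in Java. The class name `LookaheadCheck` is made up for this example, and the two flags are my own additions rather than part of the pattern above: `CASE_INSENSITIVE` so that "Happy" in the sample text matches `happy`, and `DOTALL` so that `.*` can cross line breaks in a multi-line document.

```java
import java.util.regex.Pattern;

class LookaheadCheck {
    // Each (?=...) lookahead asserts that a required word occurs somewhere
    // after the start of the input; the tail then requires happy or sad.
    static final Pattern PATTERN = Pattern.compile(
            "^(?=.*\\bjohn\\b)(?=.*\\bbarbara\\b).*\\b(happy|sad)\\b",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static boolean matches(String text) {
        // find() rather than matches(): the pattern is anchored at the start
        // only and does not need to consume the whole input.
        return PATTERN.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("Happy birthday john from barbara")); // true
        System.out.println(matches("Happy birthday john"));              // false
    }
}
```

Because the lookaheads do not consume input, the required words can appear in any order, and adding another required word only adds one more `(?=...)` group.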

+3

If you really need a single regex, it will be very large and very slow because of backtracking. For your specific example, (John AND Barbara) AND (Happy OR Sad), it would start like this:

 \bJohn\b.*?\bBarbara\b.*?\bHappy\b|\bJohn\b.*?\bBarbara\b.*?\bSad\b|...... 

Ultimately, you would need to put every ordering into the regular expression. Something like:

 JBH, JBS, JHB, JSB, HJB, SJB, BJH, BJS, BHJ, BSJ, HBJ, SBJ 

Again, the backtracking would be prohibitive, as would the explosion in the number of cases. Stay away from regular expressions here.

+1

For your example, a regular expression like this would work:

Regex

 (?:happy|sad).*?john.*?barbara|
 (?:happy|sad).*?barbara.*?john|
 barbara.*?john.*?(?:happy|sad)|
 john.*?barbara.*?(?:happy|sad)|
 barbara.*?(?:happy|sad).*?john|
 john.*?(?:happy|sad).*?barbara

Output

 happy birthday john from barbara => Matched
 Happy birthday john => Not matched

As mentioned in other answers, regex may not be entirely appropriate here.

+1

It may be possible to do this with a regex, but it would be so unwieldy that you are better off using a different method (for example a HashSet, as suggested in other answers).

One way to do this with a regular expression is to compute all the permutations of the words you are looking for, and then write a regular expression that lists all of these permutations. With 2 words there are 2 permutations, as in (.*foo.*bar.*)|(.*bar.*foo.*) (plus word boundaries); with 3 words there are 6 permutations, and pretty soon the number of permutations would be larger than your input file.
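To make the growth concrete, here is a hedged sketch that generates such a permutation regex. The names (`PermutationRegex`, `alternatives`, `build`) are made up for this example, it only covers the "all of these words" part, and it assumes the words are plain letters with no regex metacharacters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PermutationRegex {
    // Recursively collects every ordering of the remaining words (n! in total).
    static void permute(List<String> remaining, List<String> prefix, List<List<String>> out) {
        if (remaining.isEmpty()) {
            out.add(new ArrayList<>(prefix));
            return;
        }
        for (int i = 0; i < remaining.size(); i++) {
            List<String> rest = new ArrayList<>(remaining);
            String chosen = rest.remove(i);
            prefix.add(chosen);
            permute(rest, prefix, out);
            prefix.remove(prefix.size() - 1);
        }
    }

    // One alternative per ordering, e.g. \bfoo\b.*?\bbar\b for the ordering [foo, bar].
    static List<String> alternatives(List<String> words) {
        List<List<String>> orderings = new ArrayList<>();
        permute(words, new ArrayList<>(), orderings);
        List<String> result = new ArrayList<>();
        for (List<String> ordering : orderings) {
            List<String> parts = new ArrayList<>();
            for (String word : ordering) {
                parts.add("\\b" + word + "\\b");
            }
            result.add(String.join(".*?", parts));
        }
        return result;
    }

    // Joins all alternatives into a single regex.
    static String build(List<String> words) {
        return String.join("|", alternatives(words));
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("foo", "bar")));
        // Number of alternatives for 5 words: 5! = 120
        System.out.println(alternatives(Arrays.asList("a", "b", "c", "d", "e")).size());
    }
}
```

Each extra required word multiplies the number of alternatives by the new word count, which is why this approach stops being practical after a handful of words.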

0

If your data is relatively static and you plan to search a lot, using Apache Lucene will give you better performance.

Using information retrieval techniques, you first index all your documents or sentences and then search for your words. In your example you would search for "+(+john +barbara) +(sad happy)" [or "(john AND barbara) AND (sad OR happy)"].

This approach takes some time at indexing, but searching will be much faster than any regex or HashSet approach (since you don't have to iterate over all the documents).
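To show why an index avoids scanning every document, here is a toy inverted index in plain Java. This is not Lucene's API, only an illustration of the idea behind it; the class `TinyInvertedIndex` and its methods are made up for this example, and the search words are assumed to be lower case.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TinyInvertedIndex {
    // Maps each word to the ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Indexing step: done once per document.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    private Set<Integer> docsFor(String word) {
        return postings.getOrDefault(word, Collections.emptySet());
    }

    // Ids of documents containing every word in allOf and at least one word in anyOf.
    Set<Integer> search(List<String> allOf, List<String> anyOf) {
        Set<Integer> result = new TreeSet<>();
        for (String word : anyOf) {
            result.addAll(docsFor(word));    // union over the OR group
        }
        for (String word : allOf) {
            result.retainAll(docsFor(word)); // intersection with each required word
        }
        return result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Happy birthday john from barbara");
        index.add(2, "Happy birthday john");
        System.out.println(index.search(Arrays.asList("john", "barbara"),
                                        Arrays.asList("happy", "sad"))); // [1]
    }
}
```

A query only touches the posting sets of the words it mentions, so its cost depends on how many documents contain those words rather than on the total size of the collection.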

0
