Regular expression for items listed in plain English

Question


This is a kind of contrived example, but I'm trying to find a general principle here.

Given phrases written in English that list items in this form:

    I have a cat
    I have a cat and a dog
    I have a cat, a dog, and a guinea pig
    I have a cat, a dog, a guinea pig, and a snake

Is it possible to use a regular expression to get all the elements, no matter how many there are? Please note that items may contain multiple words.

Obviously, if I have only one item, I can use I have a (.+), and if there are exactly two, I have a (.+) and a (.+) works.

But things get complicated if I want to match more than one form. To extract the list items from the first two examples, I thought this would work: I have a (.*)(?: and a (.*))? It works on the first phrase, telling me that I have cat and null, but for the second it tells me that I have cat and a dog and null. Things only get worse when I try to cover even more forms.
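The behavior described above can be reproduced in a few lines. This is a minimal sketch in Python rather than Java; the helper name groups is made up for illustration, and using a full match is an assumption about how the pattern was applied.

```python
import re

# The pattern from the question: a greedy first group, then an optional
# non-capturing group that holds the second capture.
PATTERN = re.compile(r'I have a (.*)(?: and a (.*))?')

def groups(sentence):
    """Return the captured groups for a full match, or None (hypothetical helper)."""
    match = PATTERN.fullmatch(sentence)
    return match.groups() if match else None

print(groups('I have a cat'))            # ('cat', None)
print(groups('I have a cat and a dog'))  # ('cat and a dog', None)
```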

Is there a way to use regular expressions for this purpose? It seems pretty simple, and I don't understand why my regular expression that matches two-item lists works, but the one meant to match one- or two-item lists doesn't.


java regex

codebreaker Aug 1 '14 at 18:31


4 answers

alfasin · Answer 1 · 2014-08-01T18:43:16+0000

You can use a non-capturing group as the delimiter (either a comma or the end of the line):

    ' a (.*?)(?:,|$)'

An example in Python:

    import re

    line = 'I have a cat, a dog, a guinea pig, and a snake'
    # Lazily capture everything after " a " up to the next comma or the end of the line.
    mat = re.findall(r' a (.*?)(?:,|$)', line)
    print(mat)  # ['cat', 'dog', 'guinea pig', 'snake']
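One thing to check with this pattern is the two-item form, which has no comma: on 'I have a cat and a dog' it returns a single item, 'cat and a dog'. The sketch below is one possible variant, not part of the original answer. It assumes that items never contain the standalone word "and", and it stops each item with a lookahead instead of consuming the delimiter. The names ITEM_RE and items are hypothetical.

```python
import re

# Assumed variant: an item starts after "a " or "an " and ends just before
# a comma, the word "and" preceded by a space, or the end of the string.
ITEM_RE = re.compile(r'\ban? (.+?)(?=,| and\b|$)')

def items(sentence):
    """Return the list items found in the sentence (hypothetical helper)."""
    return ITEM_RE.findall(sentence)

print(items('I have a cat'))
print(items('I have a cat and a dog'))
print(items('I have a cat, a dog, and a guinea pig'))
print(items('I have a cat, a dog, a guinea pig, and a snake'))
```

Because the lookahead does not consume the delimiter, the search for the next item resumes right where the previous item ended.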

Santa · Answer 2 · 2014-08-01T18:48:28+0000

I would use regex splitting for this. It does assume a sentence format that exactly matches your input data set:

    >>> import re
    >>> SPLIT_REGEX = r', |I have|and|, and'
    >>> for sample in ('I have a cat',
    ...                'I have a cat and a dog',
    ...                'I have a cat, a dog, and a guinea pig',
    ...                'I have a cat, a dog, a guinea pig, and a snake'):
    ...     print([x.strip() for x in re.split(SPLIT_REGEX, sample) if x.strip()])
    ...
    ['a cat']
    ['a cat', 'a dog']
    ['a cat', 'a dog', 'a guinea pig']
    ['a cat', 'a dog', 'a guinea pig', 'a snake']
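Because the bare alternative and also matches inside words, an item such as "a panda" would be split in two. The following is a sketch of a stricter splitter, assuming the same sentence format; the pattern and the names STRICT_SPLIT and split_items are illustrative and not part of the original answer.

```python
import re

# Assumed variant: "I have" and "and" only count as delimiters when they
# are whole words; a comma directly before "and" is absorbed with it.
STRICT_SPLIT = re.compile(r'\bI have\b|,?\s*\band\b|,')

def split_items(sentence):
    """Split a sentence into its 'a ...' items (hypothetical helper)."""
    return [part.strip() for part in STRICT_SPLIT.split(sentence) if part.strip()]

print(split_items('I have a cat, a dog, and a guinea pig'))
print(split_items('I have a panda'))
```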

Casimir et Hippolyte · Answer 3 · 2014-08-01T19:37:12+0000

What you can do is use the \G anchor with the find method:

    (?:\G(?!\A)(?:,? and|,)|\bI have) an? ((?>[b-z]+|\Ba|a(?!nd\b))+(?> (?>[b-z]+|\Ba|a(?!nd\b))+)*)

or more simply:

    (?:\G(?!\A)(?:,? and|,)|\bI have) an? ((?!and\b)[a-z]+(?> (?!and\b)[a-z]+)*)

\G matches the position in the string where the previous match ended. The pattern has two entry points. On the first match, the second entry point (\bI have) is used; on every following match, the first entry point is used, which only allows contiguous results.

Note: \G matches the position after the last match, but it also matches at the start of the string. (?!\A) is there to rule out that case.
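Python's built-in re module has no \G anchor, so the pattern above cannot be used there as written. The sketch below approximates the same contiguity guarantee in two steps: first check that the whole sentence has the expected shape, then pull out the items. It is an assumption-laden illustration, not a translation of the answer. It uses plain non-capturing groups instead of atomic groups, it is limited to lowercase letters like the original, and the names ITEM, SENTENCE, ARTICLE_ITEM and list_items are made up.

```python
import re

# One list item: one or more lowercase words, none of which is the word "and".
ITEM = r'(?!and\b)[a-z]+(?: (?!and\b)[a-z]+)*'

# The whole sentence: "I have", a first item, then any number of further
# items, each introduced by ",", ", and" or " and".
SENTENCE = re.compile(rf'I have an? {ITEM}(?:(?:,? and|,) an? {ITEM})*')

# A single item together with its article; only the item is captured.
ARTICLE_ITEM = re.compile(rf'\ban? ({ITEM})')

def list_items(sentence):
    """Return the items, or [] if the sentence is not a well-formed list."""
    if not SENTENCE.fullmatch(sentence):
        return []
    return ARTICLE_ITEM.findall(sentence)

print(list_items('I have a cat, a dog, a guinea pig, and a snake'))
print(list_items('You own a cat'))
```

The up-front full match plays the role of \G here: items are only extracted when every part of the sentence is accounted for.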


jawee · Answer 4 · 2014-08-02T11:28:01+0000

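Here is a Java implementation that uses a positive lookahead. See below:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ListItems {
        public static void main(String[] args) {
            String str0 = "I have a cat";
            String str1 = "I have a cat and a dog";
            String str2 = "I have a cat, a dog, and a guinea pig";
            String str3 = "I have a cat, a dog, a guinea pig, and a snake";
            // Match "a <item>" lazily, stopping before a comma, the end of a line, or "and".
            String regexp = "(?m)\\ba\\s+.*?(?=(?:,|$|and))";
            Pattern pMod = Pattern.compile(regexp);
            Matcher mMod = pMod.matcher(str3);
            while (mMod.find()) {
                System.out.println(mMod.group(0));
            }
        }
    }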

Output for str3:

    a cat
    a dog
    a guinea pig
    a snake

If the item can be introduced by "a", "an", or "one", the regular expression can be (?m)\\b(one|an|a)\\s+.*?(?=(?:,|$|and))
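One caveat with a bare and in the lookahead is that it also matches inside words, so an item such as "a panda" is cut off after "a p". The sketch below, written in Python syntax (single backslashes instead of Java's doubled ones), compares the pattern above with an assumed variant that requires "and" to be a whole word and captures the item without its article. The names ORIGINAL, BOUNDED, original_items and bounded_items are illustrative only.

```python
import re

# The pattern from the answer, in Python syntax.
ORIGINAL = re.compile(r'(?m)\b(?:one|an|a)\s+.*?(?=(?:,|$|and))')

# Assumed variant: "and" must be a whole word, whitespace before the
# delimiter is left out, and only the item itself is captured.
BOUNDED = re.compile(r'(?m)\b(?:one|an|a)\s+(.*?)(?=\s*(?:,|$|\band\b))')

def original_items(sentence):
    """Whole matches (article included) of the answer's pattern."""
    return [m.group(0) for m in ORIGINAL.finditer(sentence)]

def bounded_items(sentence):
    """Captured items of the word-bounded variant."""
    return BOUNDED.findall(sentence)

print(original_items('I have a panda'))  # ['a p']
print(bounded_items('I have one cat, an owl and a panda'))  # ['cat', 'owl', 'panda']
```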

(?m) turns on the MULTILINE flag. In multiline mode, the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default, these expressions only match at the beginning and the end of the entire input sequence.
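The effect of the flag is easy to see on a two-line input. The sketch below uses Python, where the inline (?m) flag behaves analogously for this case; the sample text and variable names are made up for illustration.

```python
import re

text = 'I have a cat\nI have a dog'

# Without MULTILINE, $ matches only at the end of the whole input here.
default_hits = re.findall(r'\w+$', text)

# With (?m), $ also matches just before each line terminator.
multiline_hits = re.findall(r'(?m)\w+$', text)

print(default_hits)    # ['dog']
print(multiline_hits)  # ['cat', 'dog']
```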
