Regex for Gmail Search

I am trying to find a regex for a Gmail-like search, i.e.:

name:Joe surname:(Foo Bar) 

... as in this topic. But with a slight difference: if there is text without a key:, it is split off as well, so:

 foo:(hello world) bar:(-{bad things}) some text to search 

will return:

 [ "foo:(hello world)", "bar:(-{bad things})", "some text to search" ] 
+4
7 answers

It is not possible to capture everything you need with a single regex. The problem is that there is no reliable way to capture text that does not contain a keyword.

However, if we first capture and save all the text that belongs to keywords, and then replace every match of that same regular expression with an empty string, what is left is the free-text search string!

  • Capture the keywords and their text with the following regular expression (see it in RegExr):

      ([a-zA-Z]+:(?:\([^)]+?\)|[^( ]+))
  • Then, on the full search string, replace every match of the same regular expression with an empty string. The resulting string is the text without keywords. Something like:

      searchtext = Regex.Replace(searchtext, @"[a-zA-Z]+:(?:\([^)]+?\)|[^( ]+)", "");
    
  • Trim the spaces at the beginning and end of the search text.

  • Collapse runs of two or more spaces in the search text (this can be done with a regex replace, replacing them with a single space):

      searchtext = Regex.Replace(searchtext, @" {2,}", " ");
                                               ^ - notice the space :)
    
  • ????

  • PROFIT !!!

You could also strip the whitespace as part of the regex replace in the second step, but when working with regular expressions I prefer to keep them as clean as possible.
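The steps above can be sketched in Python as follows. This is only an illustration: the keyword pattern is the one from this answer, while the function name split_query is made up for the example.

```python
import re

# Keyword pattern from the answer: letters, a colon, then either a
# parenthesized group or a run of characters up to the next space or "(".
KEYWORD_RE = re.compile(r"[a-zA-Z]+:(?:\([^)]+?\)|[^( ]+)")


def split_query(query):
    """Return (keyword tokens, remaining free text) for a search string."""
    # Step 1: capture and save every key:value token.
    keywords = KEYWORD_RE.findall(query)
    # Step 2: replace the same matches with an empty string.
    remainder = KEYWORD_RE.sub("", query)
    # Steps 3 and 4: collapse runs of spaces to one space, then trim.
    free_text = re.sub(r" {2,}", " ", remainder).strip()
    return keywords, free_text


keywords, free_text = split_query(
    "foo:(hello world) bar:(-{bad things}) some text to search"
)
print(keywords)
print(free_text)
```

Note that a value containing a closing parenthesis inside the brackets would end the token early, since the pattern stops at the first ")".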

+3

The problem with going down the regex route is whitespace. A really complex regex might cope, but with a simple one you will find that keyword values in your search queries cannot contain spaces, for example:

Works: site:mysite user:john
Fails: site:"my awesome site" user:john

This fails because the tokenization is based on spaces. So if supporting spaces is a requirement, read on...

I would recommend either using the Lucene.NET built-in parser to give you tokens, or using a grammar and parser like GoldParser, Irony or Antlr.

This may sound far too long and complicated for what you want, but having written a GoldParser grammar for exactly this, I found it quite easy once the grammar is done. Here is an example of the grammar:

    "Name"         = 'Spruce Search Grammar'
    "Version"      = '1.1'
    "About"        = 'The search grammar for Spruce TFS MVC frontend'

    "Start Symbol" = <Query>

    ! -------------------------------------------------
    ! Character Sets
    ! -------------------------------------------------
    {Valid} = {All Valid} - ['-'] - ['OR'] - {Whitespace} - [':'] - ["] - ['']
    {Quoted} = {All Valid} - ["] - ['']

    ! -------------------------------------------------
    ! Terminals
    ! -------------------------------------------------
    AnyChar = {Valid}+
    Or = 'OR'
    Negate = ['-']
    StringLiteral = '' {Quoted}+ '' | '"' {Quoted}+ '"'

    ! -- Field-specific terms
    Project = 'project' ':'
    ...
    CreatedOn = 'created-on' ':'
    ResolvedOn = 'resolved-on' ':'

    ! -------------------------------------------------
    ! Rules
    ! -------------------------------------------------

    ! The grammar starts below
    <Query> ::= <Query> <Keywords>
              | <Keywords>

    <SingleWord> ::= AnyChar

    <Keywords> ::= <SingleWord>
                 | <QuotedString>
                 | <Or>
                 | <Negate>
                 | <FieldTerms>

    <Or> ::= <Or> <SingleWord>
           | Or Negate
           | Or <SingleWord>
           | Or <QuotedString>

    <Negate> ::= <Negate> Negate <SingleWord>
               | <Negate> Negate <QuotedString>
               | Negate <SingleWord>
               | Negate <QuotedString>

    <QuotedString> ::= StringLiteral

    <FieldTerms> ::= <FieldTerms> Project
                   | <FieldTerms> Description
                   | <FieldTerms> State
                   | <FieldTerms> Type
                   | <FieldTerms> Area
                   | <FieldTerms> Iteration
                   | <FieldTerms> AssignedTo
                   | <FieldTerms> ResolvedBy
                   | <FieldTerms> ResolvedOn
                   | <FieldTerms> CreatedOn
                   | Project
                   | <Description>
                   | State
                   | Type
                   | Area
                   | Iteration
                   | CreatedBy
                   | AssignedTo
                   | ResolvedBy
                   | CreatedOn
                   | ResolvedOn

    <Description> ::= <Description> Description
                    | <Description> Description StringLiteral
                    | Description
                    | Description StringLiteral

This gives you search support for something like:

resolved-by:john project:"amazing tfs project"

If you look at the <Keywords> rule, you will see that it expects a single word, an OR, a quoted string, a negation (NOT), or a field term. The hard part comes when a definition becomes recursive, which is what you see in <Description>.
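To make the token kinds concrete, here is a rough Python sketch of a tokenizer for the terminals in the grammar. It is not GoldParser and not the code from the project: it only classifies tokens and does no parsing, the names FIELDS, TOKEN_RE and tokenize are invented for the example, and the field list contains only the three names visible in the excerpt.

```python
import re

# Only the field names visible in the grammar excerpt; the real list is longer.
FIELDS = ("project", "created-on", "resolved-on")
FIELD_ALT = "|".join(re.escape(name) for name in FIELDS)

TOKEN_RE = re.compile(
    r"""
      (?P<FIELD>(?:%s):)              # field term such as project:
    | (?P<STRING>"[^"]+"|'[^']+')     # StringLiteral, either quote style
    | (?P<OR>OR\b)                    # Or
    | (?P<NEGATE>-)                   # Negate
    | (?P<WORD>[^\s:"'-]+)            # AnyChar: a single unquoted word
    """ % FIELD_ALT,
    re.VERBOSE,
)


def tokenize(query):
    """Return a list of (token kind, token text) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(query)]


for kind, text in tokenize('john project:"amazing tfs project" -bug OR feature'):
    print(kind, text)
```

A real parser would then apply the rules (<Or>, <Negate>, <FieldTerms>) to this token stream, which is the part a parser generator produces for you.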

The syntax is called EBNF, which describes the format of your language. With it you can write anything from a simple search query parser to an entire computer language. The way GoldParser parses tokens will restrict you, because it looks ahead for tokens (LALR), so languages such as HTML and wiki syntax will break any grammar you try to write, since those formats do not force you to close tags/tokens. Antlr gives you LL(*), which is more forgiving of missing start tags/tokens, but that is not something you need to worry about for a search query parser.

The code folder for my grammar and C# code is in this project.

QueryParser is the class that parses the search string; the grammar file is the .grm file; the 2 MB file is how GoldParser optimizes your grammar, basically creating its own table of possibilities. Calitha is the C# library for GoldParser, and it is fairly easy to implement. Without writing an even longer answer it is hard to describe exactly how this is done, but it is fairly straightforward once you have compiled the grammar. GoldParser has a very intuitive IDE for writing grammars and a huge set of existing ones, such as SQL, C#, Java and even a Perl regex grammar, I believe.

This is not a one-hour quick fix like a regular expression; it is closer to 2-3 days. However, you will learn the "proper" way of parsing.

+4

You can take a look at this question .

It contains the following Regex example:

 ^((?!hede).)*$ 

According to the author of that answer: "The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'."

So you should be able to combine the information from the topic you posted with the regex snippet above to solve your problem.

Hope this helps!
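As a small Python illustration of how that pattern behaves: NO_HEDE is the quoted pattern unchanged, while NO_KEY is an assumed adaptation (not taken from the linked answer) that rejects any string in which a key such as "foo:" starts.

```python
import re

# The quoted pattern: no position in the string may start the substring "hede".
NO_HEDE = re.compile(r"^((?!hede).)*$")

# Assumed adaptation for this question: no position may start a "key:" token,
# so only key-free text matches.
NO_KEY = re.compile(r"^((?!\w+:).)*$")

for text in ("some text to search", "bar:(-{bad things}) some text"):
    print(repr(text), bool(NO_KEY.match(text)))
```

On its own this only tells you whether a piece of text is free of keys; it still has to be combined with a pattern that extracts the key tokens.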

0

This may work for you.

In Java:

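 // Requires java.util.regex.Pattern, java.util.regex.Matcher and android.util.Log
 Pattern p = Pattern.compile("(\\w+:(\\(.*?\\))|.+)\\s*");
 Matcher m = p.matcher("foo:(hello world) bar:(-{bad things}) some text to search");
 while (m.find()) {
     Log.v("REGEX", m.group(1));
 }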

It produces:

05-25 15:21:06.242: V/REGEX(18203): foo:(hello world)
05-25 15:21:08.061: V/REGEX(18203): bar:(-{bad things})
05-25 15:21:09.761: V/REGEX(18203): some text to search

The regular expression works as long as the tags come first and the free text comes last.
For tags, you can also get the content using m.group(2).
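The ordering restriction is easy to see in a short Python sketch that uses the same pattern (the helper name tokens is made up for the example). When free text comes first, the .+ alternative swallows the rest of the string, tags included.

```python
import re

# Same pattern as the Java snippet, written as a Python raw string.
TAG_OR_REST = re.compile(r"(\w+:(\(.*?\))|.+)\s*")


def tokens(query):
    """Return group 1 of every match, as the Java loop logs it."""
    return [m.group(1) for m in TAG_OR_REST.finditer(query)]


# Tags first, free text last: three separate tokens.
print(tokens("foo:(hello world) bar:(-{bad things}) some text to search"))

# Free text first: a single token containing the whole string.
print(tokens("some text foo:(hello world)"))
```

The tag alternative also requires parentheses, so an unbracketed tag such as name:Joe falls through to .+ as well.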

0

A simple approach here is to match the string with this pattern:

 \w+:(?:\([^)]*\)|\S+)|\S+ 

This will match:

  • \w+: - a key.
  • (?:...) - followed by...
    • \([^)]*\) - parentheses
    • | - or
    • \S+ - some characters that are not spaces.
  • |\S+ - or just match a single word.

Note that this pattern breaks the free-text words into separate matches. If you really cannot live with that, you can use something like |(?:\S+(\s+(?!\w*:)[^\s:]+)*) in place of the last |\S+ .
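Both variants can be compared with a few lines of Python. The patterns are the ones given above; the names SIMPLE, GROUPED and matches are just labels for this sketch.

```python
import re

# The basic pattern: key tokens, otherwise one match per word.
SIMPLE = re.compile(r"\w+:(?:\([^)]*\)|\S+)|\S+")

# The variant that keeps consecutive free-text words together.
GROUPED = re.compile(r"\w+:(?:\([^)]*\)|\S+)|(?:\S+(\s+(?!\w*:)[^\s:]+)*)")


def matches(pattern, query):
    # finditer + group(0) is used because GROUPED contains a capturing group,
    # which would change what findall returns.
    return [m.group(0) for m in pattern.finditer(query)]


query = "foo:(hello world) bar:(-{bad things}) some text to search"
print(matches(SIMPLE, query))
print(matches(GROUPED, query))
```

With the grouped variant, free text that appears before a key is also kept as one match, because the lookahead stops the run just before the next key.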

Working example: http://ideone.com/bExFd

0

Another option, a little more robust:
Here we can use a somewhat advanced feature of .NET patterns: they keep all captures of all groups. This is a useful feature for building a full parser. Here I have included some other search features, such as quoted strings and operators (OR or the range operator .., for example):

    \A
    (?>
        \s                      # skip over spaces.
        |
        (?<Key>\w+):            # Key:
        (?:                     # followed by:
           \(
           (?<KeyValue>[^)]*)   # Parentheses
           \)
           |                    # or
           (?<KeyValue>\S+)     # a single word
        )
        |
        (?<Operator>OR|AND|-|\+|\.\.)
        |
        ""(?<Term>[^""]*)""     # quoted term
        |
        (?<Term>\w+)            # just a word
        |
        (?<Invalid>.)           # Any other character isn't valid
    )*
    \z

Now you can easily get all tokens and their positions (you can also zip the Key and KeyValue captures to pair them up):

    Regex queryParser = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);
    Match m = queryParser.Match(query); // single match!
    // ...
    var terms = m.Groups["Term"].Captures;
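
For comparison, here is a rough Python sketch of the same token set. Python's re module keeps only the last capture of a repeated group, so instead of one match with capture collections this version walks the string with finditer, which is a different technique from the .NET one above. Python also forbids duplicate group names, so the group names ParenValue, WordValue, Quoted and Word are stand-ins chosen for this example.

```python
import re

TOKEN_RE = re.compile(
    r"""
      (?P<Key>\w+):                       # Key:
      (?: \( (?P<ParenValue>[^)]*) \)     #   value in parentheses
        | (?P<WordValue>\S+) )            #   or a single word
    | (?P<Operator>OR|AND|-|\+|\.\.)
    | "(?P<Quoted>[^"]*)"                 # quoted term
    | (?P<Word>\w+)                       # just a word
    | (?P<Invalid>\S)                     # any other non-space character
    """,
    re.VERBOSE,
)

GROUP_NAMES = ("Key", "ParenValue", "WordValue", "Operator",
               "Quoted", "Word", "Invalid")


def tokenize(query):
    """Return (group name, text, start position) for every captured group."""
    tokens = []
    for m in TOKEN_RE.finditer(query):
        for name in GROUP_NAMES:
            if m.group(name) is not None:
                tokens.append((name, m.group(name), m.start(name)))
    return tokens


for token in tokenize('foo:(hello world) -bar "a b" x'):
    print(token)
```

A Key entry is always followed directly by its ParenValue or WordValue entry, which gives the same pairing as zipping the two capture collections.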

Working example: http://ideone.com/B7tln

0

You do not have to solve this problem with a single regular expression. You can reuse the answer you linked to, which you said partially works.

The last element of the array is the only one that needs to be fixed.

Using your example, you first got:

 [ "foo:(hello world)", "bar:(-{bad things}) some text to search" ] 

The last element must be split into the text up to and including the first closing bracket, and the text that follows it. You then replace the last element with the first part and append the second part to the array.

 [ "foo:(hello world)", "bar:(-{bad things})", "some text to search" ] 

The following JavaScript sketch should explain how this can be done:

    var array = query.split(/\s+(?=\w+:)/); // query holds the search string
    var lastPosition = array.length - 1;
    var lastElem = array[lastPosition]; // May contain text without a key

    // Key is followed by an opening bracket: the value runs up to and
    // including the first closing bracket.
    var parts = lastElem.match(/^([^:]*:\([^)]*\))\s*(.*)$/);

    // Otherwise the value is a single word: it ends at the first white space.
    if (!parts) {
        parts = lastElem.match(/^([^:]*:\S*)\s+(.*)$/);
    }

    // White space after the key's value is consumed by the match, so it does
    // not end up at the front of the text without a key.
    if (parts && parts[2] !== "") {
        array[lastPosition] = parts[1]; // Correct array entry for last key
        array.push(parts[2]);           // Append text without key
    }

I have assumed that brackets are required whenever the text following a key contains white space.

0

Source: https://habr.com/ru/post/1412552/

