The problem you'll run into going down the regex route is spaces. It may be doable with a really complex regex, but with a simple regular expression you'll find your search queries can't contain spaces in keyword values, for example:
```
Works:  website: mysite user: john
Breaks: site: "my awesome site": john
```

The second query fails because the tokenization is space-based. So if space support is a requirement, read on...
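For illustration, here is a quick Python sketch (not code from the question or any real project) of both the failure and how quickly the regex escalates once quoted values are involved:

```python
import re

query = 'site:"my awesome site" user:john'

# Naive space-based tokenization rips the quoted phrase apart:
naive = query.split(" ")
print(naive)   # ['site:"my', 'awesome', 'site"', 'user:john']

# To keep the quoted phrase together you already need a hairier pattern
# that treats a quoted string as a single token:
tokens = re.findall(r'(\w+):("[^"]*"|\S+)', query)
print(tokens)  # [('site', '"my awesome site"'), ('user', 'john')]
```

Add OR, negation and nesting on top of that and the regex becomes unmaintainable, which is where a grammar starts to pay off.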
I would recommend either using Lucene.NET's built-in parser to give you your tokens, or using a grammar plus a parser such as GoldParser, Irony or ANTLR.
This may sound long-winded and over-complicated for what you're after, but writing a GoldParser grammar to do what you want is actually quite easy once you get used to the format. Here is an example grammar:
"Name" = 'Spruce Search Grammar' "Version" = '1.1' "About" = 'The search grammar for Spruce TFS MVC frontend' "Start Symbol" = <Query> ! ------------------------------------------------- ! Character Sets ! ------------------------------------------------- {Valid} = {All Valid} - ['-'] - ['OR'] - {Whitespace} - [':'] - ["] - [''] {Quoted} = {All Valid} - ["] - [''] ! ------------------------------------------------- ! Terminals ! ------------------------------------------------- AnyChar = {Valid}+ Or = 'OR' Negate = ['-'] StringLiteral = '' {Quoted}+ '' | '"' {Quoted}+ '"' ! -- Field-specific terms Project = 'project' ':' ... CreatedOn = 'created-on' ':' ResolvedOn = 'resolved-on' ':' ! ------------------------------------------------- ! Rules ! ------------------------------------------------- ! The grammar starts below <Query> ::= <Query> <Keywords> | <Keywords> <SingleWord> ::= AnyChar <Keywords> ::= <SingleWord> | <QuotedString> | <Or> | <Negate> | <FieldTerms> <Or> ::= <Or> <SingleWord> | Or Negate | Or <SingleWord> | Or <QuotedString> <Negate> ::= <Negate> Negate <SingleWord> | <Negate> Negate <QuotedString> | Negate <SingleWord> | Negate <QuotedString> <QuotedString> ::= StringLiteral <FieldTerms> ::= <FieldTerms> Project | <FieldTerms> Description | <FieldTerms> State | <FieldTerms> Type | <FieldTerms> Area | <FieldTerms> Iteration | <FieldTerms> AssignedTo | <FieldTerms> ResolvedBy | <FieldTerms> ResolvedOn | <FieldTerms> CreatedOn | Project | <Description> | State | Type | Area | Iteration | CreatedBy | AssignedTo | ResolvedBy | CreatedOn | ResolvedOn <Description> ::= <Description> Description | <Description> Description StringLiteral | Description | Description StringLiteral
This gives you search support for something like:

```
Allowed: john project: "amazing tfs project"
```
If you look at the `<Keywords>` rule, you'll see that it expects a single word, an OR, a quoted string, or a negation (NOT). The tricky part comes when the definition turns recursive, which is what you see with the `<Description>` rule.
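To make the payoff concrete, here is a rough hand-rolled Python sketch (purely illustrative; it is not the GoldParser-generated code, and it ignores the OR and negation rules entirely) of the kind of structured output the grammar hands you back for that query:

```python
import re

# Tokenizer loosely mirroring the grammar's terminals: quoted strings,
# field prefixes like project:, the OR keyword, '-' negation, bare words.
TOKEN = re.compile(r'"([^"]*)"|(\w[\w-]*):|(OR)\b|(-)|([^\s:"-][^\s:"]*)')

def parse(query):
    terms, field = [], None
    for quoted, fld, or_kw, neg, word in TOKEN.findall(query):
        if fld:
            field = fld                      # remember 'project:' etc.
        elif quoted or word:
            value = quoted or word
            terms.append((field or "keyword", value))
            field = None
        # OR / negation handling omitted for brevity
    return terms

print(parse('john project:"amazing tfs project"'))
# → [('keyword', 'john'), ('project', 'amazing tfs project')]
```

A generated LALR parser gives you the same idea, except the grammar file is the single source of truth and the recursion and precedence come for free.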
The syntax is called EBNF, and it describes the format of your language. You can write something as simple as a search query parser with it, or an entire programming language. GoldParser is limited in how far ahead it looks when reading tokens (it's LALR), so formats such as HTML and wiki syntax will break any grammar you try to write for them, because those formats don't force you to close your tags/tokens. ANTLR gives you LL(*), which is more forgiving about missing start/end tags or tokens, but that's nothing you need to worry about for parsing a search query.
My grammar and the C# code that uses it can be found in this project.
QueryParser is the class that parses the search string, the grammar is the .grm file, and the 2MB file is how GoldParser optimises your grammar: it basically pre-computes its own parse tables. Calitha is the C# library for GoldParser, and it's straightforward to use. It's hard to describe the whole process without writing an even longer answer, but it's fairly simple once you've compiled the grammar. GoldParser also has a very intuitive IDE for writing grammars, plus a huge set of existing grammars such as SQL, C#, Java and even Perl regex, I believe.
This is not the 1-hour quick fix you might get with a regular expression; it's closer to 2-3 days. However, you will learn the “correct” way of parsing.