Scalable regex for English numerals

I am trying to create a regular expression that recognizes English numerals, such as one, nineteen, twenty, one hundred twenty two, etc., up to the millions. I want to reuse some parts of the regex, so it is built up in parts, for example:

// replace <TAG> with the content of the variable
ONE_DIGIT = (?:one|two|three|four|five|six|seven|eight|nine)
TEEN = (?:ten|eleven|twelve|(?:thir|four|fif|six|seven|eigh|nine)teen)
TWO_DIGITS = (?:(?:twen|thir|for|fif|six|seven|eigh|nine)ty(?:\s+<ONE_DIGIT>)?|<TEEN>)
// HUNDREDS, et cetera

I was wondering whether someone has already done the same (and would like to share), as these regular expressions get quite long, and it is possible that they contain something they should not, or are missing something. In addition, I want them to be as efficient as possible, so I look forward to any optimization tips. I use the Java regex engine, but any regex flavor is acceptable.
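
For illustration, the part-by-part construction described above could be sketched in Java with plain string concatenation in place of the <TAG> substitution. This is only a sketch under assumptions: the class name NumberRegexParts, the extra parts BELOW_HUNDRED and BELOW_THOUSAND, and the helper isNumber are hypothetical names that do not appear in the question, and the sketch stops at 999 rather than going up to the millions.

```java
import java.util.regex.Pattern;

// Hypothetical helper class: composes the number regex from reusable parts.
class NumberRegexParts {
    static final String ONE_DIGIT =
        "(?:one|two|three|four|five|six|seven|eight|nine)";
    static final String TEEN =
        "(?:ten|eleven|twelve|(?:thir|four|fif|six|seven|eigh|nine)teen)";
    // A tens word, optionally followed by a unit ("twenty two"), or a teen.
    static final String TWO_DIGITS =
        "(?:(?:twen|thir|for|fif|six|seven|eigh|nine)ty(?:\\s+" + ONE_DIGIT + ")?|" + TEEN + ")";
    // Assumed extension: anything from one to ninety nine.
    static final String BELOW_HUNDRED = "(?:" + TWO_DIGITS + "|" + ONE_DIGIT + ")";
    // Assumed extension: anything from one to nine hundred ninety nine.
    static final String BELOW_THOUSAND =
        "(?:" + ONE_DIGIT + "\\s+hundred(?:\\s+" + BELOW_HUNDRED + ")?|" + BELOW_HUNDRED + ")";

    // Compile once and reuse; compiling is the expensive step.
    static final Pattern NUMBER =
        Pattern.compile("\\b" + BELOW_THOUSAND + "\\b", Pattern.CASE_INSENSITIVE);

    // True if the whole string is a number phrase below one thousand.
    static boolean isNumber(String s) {
        return NUMBER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNumber("one hundred twenty two"));
        System.out.println(isNumber("twenty ten"));
    }
}
```

Thousands and millions would follow the same pattern, wrapping BELOW_THOUSAND in one more level for each scale word.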

+5
4 answers

See Perl's Lingua::EN::Words2Nums and Lingua::EN::FindNumber.

In particular, the source code for Lingua::EN::FindNumber contains:

# This is from Lingua::EN::Words2Nums, after being thrown through
# Regex::PreSuf
my $numbers =
    qr/((?:b(?:akers?dozen|illi(?:ard|on))|centillion|d(?:ecilli(?:ard|on)|ozen|u(?:o(?:decilli(?:ard|on)|vigintillion)|vigintillion))|e(?:ight(?:een|ieth|[yh])?|leven(?:ty(?:first|one))?|s)|f(?:i(?:ft(?:een|ieth|[yh])|rst|ve)|o(?:rt(?:ieth|y)|ur(?:t(?:ieth|[yh]))?))|g(?:oogol(?:plex)?|ross)|hundred|mi(?:l(?:ion|li(?:ard|on))|nus)|n(?:aught|egative|in(?:et(?:ieth|y)|t(?:een|[yh])|e)|o(?:nilli(?:ard|on)|ught|vem(?:dec|vigint)illion))|o(?:ct(?:illi(?:ard|on)|o(?:dec|vigint)illion)|ne)|qu(?:a(?:drilli(?:ard|on)|ttuor(?:decilli(?:ard|on)|vigintillion))|in(?:decilli(?:ard|on)|tilli(?:ard|on)|vigintillion))|s(?:core|e(?:cond|pt(?:en(?:dec|vigint)illion|illi(?:ard|on))|ven(?:t(?:ieth|y))?|x(?:decillion|tilli(?:ard|on)|vigintillion))|ix(?:t(?:ieth|y))?)|t(?:ee?n|h(?:ir(?:t(?:een|ieth|y)|d)|ousand|ree)|r(?:e(?:decilli(?:ard|on)|vigintillion)|i(?:gintillion|lli(?:ard|on)))|w(?:e(?:l(?:fth|ve)|nt(?:ieth|y))|o)|h)|un(?:decilli(?:ard|on)|vigintillion)|vigintillion|zero|s))/i;

This code is subject to the Perl Artistic License.

Regex::PreSuf builds such a factored regex from a word list, for example:

#!/usr/bin/perl

use strict;
use warnings;

use Regex::PreSuf;

my %singledigit = (
    one    => 1,
    two    => 2,
    three  => 3,
    four   => 4,
    five   => 5,
    six    => 6,
    seven  => 7,
    eight  => 8,
    nine   => 9,
);

my $singledigit = presuf(keys %singledigit);

print $singledigit, "\n";

my $text = "one two three four five six seven eight nine";

$text =~ s/($singledigit)/$singledigit{$1}/g;

print $text, "\n";

Output:

C:\Temp> cvb
(?:eight|f(?:ive|our)|nine|one|s(?:even|ix)|t(?:hree|wo))
1 2 3 4 5 6 7 8 9
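
Since the question targets Java, here is a rough Java counterpart of the Perl substitution above. It is a sketch, not a port of any library: the class name SingleDigitReplace and the method replaceDigits are hypothetical, the \b word boundaries are an addition that the Perl example does not have, and Map.of and Matcher.replaceAll with a function argument both assume Java 9 or later.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical class: replaces single-digit number words with digits.
class SingleDigitReplace {
    static final Map<String, Integer> SINGLE_DIGIT = Map.of(
        "one", 1, "two", 2, "three", 3,
        "four", 4, "five", 5, "six", 6,
        "seven", 7, "eight", 8, "nine", 9);

    // The factored alternation printed by the Perl script, wrapped in
    // word boundaries (an addition) so "someone" is left untouched.
    static final Pattern WORD = Pattern.compile(
        "\\b(?:eight|f(?:ive|our)|nine|one|s(?:even|ix)|t(?:hree|wo))\\b");

    static String replaceDigits(String text) {
        // Look up each matched word in the map, like $singledigit{$1} in Perl.
        return WORD.matcher(text)
                   .replaceAll(m -> String.valueOf(SINGLE_DIGIT.get(m.group())));
    }

    public static void main(String[] args) {
        System.out.println(replaceDigits("one two three four five six seven eight nine"));
    }
}
```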


+8

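Perl has several modules that generate optimized regexes (the results use only standard syntax, so they should also work in Java): Regexp::Assemble, Regexp::List, Regexp::Optimizer and Regex::PreSuf. See also http://groups.google.com/group/perl.perl5.porters/msg/132877aee7542015: as of perl 5.10, perl itself compiles a list of |'d literal alternatives into a trie.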

+3

Regex is a really bad way to do this. Personally, I would simply build a small map of all known number words and search with that: look up each word, and when you find a match, check whether the words next to it also match, continuing until you have the whole number.
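
A minimal Java sketch of the map-based approach just described, under stated assumptions: the class name NumberWordScanner and the method findNumbers are hypothetical, the vocabulary stops at "million", and a run of adjacent number words is simply accumulated without checking that it is well-formed English (so "two three" would come out as 5).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical class: finds runs of number words and computes their values.
class NumberWordScanner {
    static final Map<String, Long> SMALL = new HashMap<>(); // 1..19 and the tens
    static final Map<String, Long> SCALE = new HashMap<>(); // thousand, million

    static {
        String[] ones = {"one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
            "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
        for (int i = 0; i < ones.length; i++) SMALL.put(ones[i], (long) (i + 1));
        String[] tens = {"twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety"};
        for (int i = 0; i < tens.length; i++) SMALL.put(tens[i], (i + 2) * 10L);
        SCALE.put("thousand", 1_000L);
        SCALE.put("million", 1_000_000L);
    }

    // Returns the value of every maximal run of number words in the text.
    static List<Long> findNumbers(String text) {
        List<Long> found = new ArrayList<>();
        long total = 0;     // value of the completed thousand/million groups
        long current = 0;   // value of the group being read (below 1000)
        boolean inNumber = false;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (SMALL.containsKey(word)) {
                current += SMALL.get(word);
                inNumber = true;
            } else if (word.equals("hundred") && inNumber) {
                current *= 100;
            } else if (SCALE.containsKey(word) && inNumber) {
                total += current * SCALE.get(word);
                current = 0;
            } else {
                // Not a number word: close the current run, if any.
                if (inNumber) found.add(total + current);
                total = 0;
                current = 0;
                inNumber = false;
            }
        }
        if (inNumber) found.add(total + current);
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findNumbers("I have one hundred twenty two apples"));
    }
}
```

Unlike the regex, this yields the numeric value directly, at the cost of needing extra checks if ill-formed sequences must be rejected.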

0