Scalable regex for English digits

Question


I am trying to create a regular expression that recognizes English numerals, such as one, nineteen, twenty, one hundred twenty two, and so on, up to the millions. I want to reuse some parts of the regex, so I build it in pieces, for example:

// replace <TAG> with the content of the variable
ONE_DIGIT = (?:one|two|three|four|five|six|seven|eight|nine)
TEEN = (?:ten|eleven|twelve|(?:thir|four|fif|six|seven|eigh|nine)teen)
TWO_DIGITS = (?:(?:twen|thir|for|fif|six|seven|eigh|nine)ty(?:\s+<ONE_DIGIT>)?|<TEEN>)
// HUNDREDS, et cetera
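Since the target engine is Java, the tag substitution described above could be sketched roughly as follows. The class and method names (`NumberRegexParts`, `define`, `twoDigits`) are made up for illustration and are not from the question; the fragments are the three shown above, with `four` in the teen list so that it matches "fourteen". Word boundaries and case-insensitivity are extra assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper; all names here are illustrative assumptions.
class NumberRegexParts {
    // Named fragments, kept in definition order.
    private final Map<String, String> parts = new LinkedHashMap<>();

    // Stores a fragment after replacing every <NAME> tag with the
    // already-defined fragment of that name (plain text replacement).
    void define(String name, String template) {
        String expanded = template;
        for (Map.Entry<String, String> e : parts.entrySet()) {
            expanded = expanded.replace("<" + e.getKey() + ">", e.getValue());
        }
        parts.put(name, expanded);
    }

    String get(String name) {
        return parts.get(name);
    }

    // Builds the TWO_DIGITS pattern (ten through ninety nine) from its parts.
    static Pattern twoDigits() {
        NumberRegexParts p = new NumberRegexParts();
        p.define("ONE_DIGIT", "(?:one|two|three|four|five|six|seven|eight|nine)");
        p.define("TEEN", "(?:ten|eleven|twelve|(?:thir|four|fif|six|seven|eigh|nine)teen)");
        p.define("TWO_DIGITS",
                "(?:(?:twen|thir|for|fif|six|seven|eigh|nine)ty(?:\\s+<ONE_DIGIT>)?|<TEEN>)");
        // Assumption: match whole words only, ignoring case.
        return Pattern.compile("\\b" + p.get("TWO_DIGITS") + "\\b", Pattern.CASE_INSENSITIVE);
    }
}
```

Calling `NumberRegexParts.twoDigits().matcher(text).find()` then locates phrases such as "twenty two" inside a longer text. Because fragments are expanded when they are defined, each one can only reference fragments defined before it.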

I was wondering whether someone has already done this (and would be willing to share), since these regular expressions are quite long and may well contain something they should not, or be missing something. I also want them to be as efficient as possible, so I look forward to any optimization tips. I am using the Java regex engine, but any regex flavor is acceptable.

+5

java regex perl

João Silva Aug 13 '09 at 3:12


4 answers

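Perl has several modules that generate optimized regular expressions (which should be usable in Java as well): Regexp::Assemble, Regexp::List, Regexp::Optimizer, and Regex::PreSuf. See also http://groups.google.com/group/perl.perl5.porters/msg/132877aee7542015; as of perl 5.10, perl itself optimizes |'d alternations of literal strings into a trie.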

+3

ysth Aug 13 '09 at 5:18

[The text of this answer is not recoverable from the source; the surviving fragments mention Java regex and other regex flavors (Perl, awk).]

0

Rob Jones Aug 13 '09 at 3:34

Regex is a really bad way to do this. Personally, I would simply build a small map of all the known number words and search with that: look up each word, and when you find a match, check whether the words next to it also match, continuing until you have the whole number.
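A minimal Java sketch of that map-based idea, under some assumptions of mine: words are separated by non-letters, only "hundred", "thousand", and "million" act as multipliers, and every maximal run of known number words is treated as one number. The class name `WordNumberFinder` and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the word-map approach; names are illustrative.
class WordNumberFinder {
    static final Map<String, Long> UNITS = new HashMap<>();   // zero .. ninety
    static final Map<String, Long> SCALES = new HashMap<>();  // thousand, million

    static {
        String[] small = {"zero", "one", "two", "three", "four", "five", "six",
                "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen",
                "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
        for (int i = 0; i < small.length; i++) UNITS.put(small[i], (long) i);
        String[] tens = {"twenty", "thirty", "forty", "fifty", "sixty",
                "seventy", "eighty", "ninety"};
        for (int i = 0; i < tens.length; i++) UNITS.put(tens[i], (i + 2) * 10L);
        SCALES.put("thousand", 1_000L);
        SCALES.put("million", 1_000_000L);
    }

    // Returns the value of every maximal run of number words found in the text.
    static List<Long> findNumbers(String text) {
        List<Long> found = new ArrayList<>();
        long total = 0;   // value of completed thousand/million groups
        long group = 0;   // value of the group currently being read
        boolean inNumber = false;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (UNITS.containsKey(word)) {
                group += UNITS.get(word);
                inNumber = true;
            } else if (inNumber && word.equals("hundred")) {
                group *= 100;
            } else if (inNumber && SCALES.containsKey(word)) {
                total += group * SCALES.get(word);
                group = 0;
            } else {
                // Any other word ends the current run, if there is one.
                if (inNumber) found.add(total + group);
                total = 0;
                group = 0;
                inNumber = false;
            }
        }
        if (inNumber) found.add(total + group);
        return found;
    }
}
```

For example, `WordNumberFinder.findNumbers("I have one hundred twenty two apples and nineteen pears")` returns `[122, 19]`. The sketch does not validate word order, so an ungrammatical run such as "twenty one two" is simply summed; a stricter version would need to track which word classes may follow each other.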

0

Noon silk Aug 13 '09 at 5:22


Sinan Ünür · Accepted Answer · 2009-08-13T03:26:47+0000

See the Perl modules Lingua::EN::Words2Nums and Lingua::EN::FindNumber.

In particular, the source code for Lingua::EN::FindNumber contains:

# This is from Lingua::EN::Words2Nums, after being thrown through
# Regex::PreSuf
my $numbers =
    qr/((?:b(?:akers?dozen|illi(?:ard|on))|centillion|d(?:ecilli(?:ard|on)|ozen|u(?:o(?:decilli(?:ard|on)|vigintillion)|vigintillion))|e(?:ight(?:een|ieth|[yh])?|leven(?:ty(?:first|one))?|s)|f(?:i(?:ft(?:een|ieth|[yh])|rst|ve)|o(?:rt(?:ieth|y)|ur(?:t(?:ieth|[yh]))?))|g(?:oogol(?:plex)?|ross)|hundred|mi(?:l(?:ion|li(?:ard|on))|nus)|n(?:aught|egative|in(?:et(?:ieth|y)|t(?:een|[yh])|e)|o(?:nilli(?:ard|on)|ught|vem(?:dec|vigint)illion))|o(?:ct(?:illi(?:ard|on)|o(?:dec|vigint)illion)|ne)|qu(?:a(?:drilli(?:ard|on)|ttuor(?:decilli(?:ard|on)|vigintillion))|in(?:decilli(?:ard|on)|tilli(?:ard|on)|vigintillion))|s(?:core|e(?:cond|pt(?:en(?:dec|vigint)illion|illi(?:ard|on))|ven(?:t(?:ieth|y))?|x(?:decillion|tilli(?:ard|on)|vigintillion))|ix(?:t(?:ieth|y))?)|t(?:ee?n|h(?:ir(?:t(?:een|ieth|y)|d)|ousand|ree)|r(?:e(?:decilli(?:ard|on)|vigintillion)|i(?:gintillion|lli(?:ard|on)))|w(?:e(?:l(?:fth|ve)|nt(?:ieth|y))|o)|h)|un(?:decilli(?:ard|on)|vigintillion)|vigintillion|zero|s))/i;

This code is subject to the Perl Artistic License.

You can use Regex::PreSuf to build such a regex automatically:

#!/usr/bin/perl

use strict;
use warnings;

use Regex::PreSuf;

my %singledigit = (
    one    => 1,
    two    => 2,
    three  => 3,
    four   => 4,
    five   => 5,
    six    => 6,
    seven  => 7,
    eight  => 8,
    nine   => 9,
);

my $singledigit = presuf(keys %singledigit);

print $singledigit, "\n";

my $text = "one two three four five six seven eight nine";

$text =~ s/($singledigit)/$singledigit{$1}/g;

print $text, "\n";

Output:

C:\Temp> cvb
(?:eight|f(?:ive|our)|nine|one|s(?:even|ix)|t(?:hree|wo))
1 2 3 4 5 6 7 8 9
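For readers on the Java side, the prefix factoring seen in that output can be approximated with a small trie. This is only a sketch of the general technique, not a port of Regex::PreSuf: it factors common prefixes only, and it assumes every word consists solely of letters, so nothing needs escaping. The class name `TrieRegex` is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: builds a prefix-factored alternation from a word list.
// Assumes words contain only letters (no regex metacharacters).
class TrieRegex {
    // Children sorted by character so the generated regex is deterministic.
    private final Map<Character, TrieRegex> children = new TreeMap<>();
    private boolean terminal; // true if some word ends at this node

    void add(String word) {
        TrieRegex node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieRegex());
        }
        node.terminal = true;
    }

    // Emits a regex fragment matching every suffix stored below this node.
    String toRegex() {
        if (children.isEmpty()) {
            return ""; // leaf: nothing more to match
        }
        List<String> alternatives = new ArrayList<>();
        for (Map.Entry<Character, TrieRegex> e : children.entrySet()) {
            alternatives.add(e.getKey() + e.getValue().toRegex());
        }
        String joined = String.join("|", alternatives);
        if (terminal) {
            // A word ends here, so whatever follows is optional.
            return "(?:" + joined + ")?";
        }
        // A single continuation needs no group.
        return alternatives.size() == 1 ? joined : "(?:" + joined + ")";
    }
}
```

Adding the nine words one through nine and calling `toRegex()` gives `(?:eight|f(?:ive|our)|nine|one|s(?:even|ix)|t(?:hree|wo))`, the same pattern as the Regex::PreSuf output above. Words that are prefixes of other words are handled by the optional group, so six, sixty, and sixteen become `six(?:t(?:een|y))?`.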

