How to catch roman numbers inside a string?

Question


I want to catch Roman numerals inside a string (numbers below 80 are good enough). I found a good basis for it in "How do you match only real Roman numerals to regex?". The problem is that it deals with whole strings. I have not yet found a way to detect Roman numerals inside a string, because nothing is mandatory: every group can be optional. So far I have tried something like this:

    my $x = ' some text I-LXIII iv more ';

    if ( $x =~ s/\b(
                    (
                        (XC|XL|L?X{0,3})    # first group 10-90
                      |
                        (IX|IV|V?I{0,3})    # second group 1-9
                    )+
                  )
                  \b/>$1</xgi ) {           # mark every occurrence
        say $x;
    }

    __END__
    ><some>< ><text>< ><>I<><-><>LXIII<>< ><>iv<>< ><more><

    desired output:
     some text >I<->LXIII< >iv< more

So it also matches at the bare word boundaries, because all groups are optional. How can I get this done? How can I make one of these two groups mandatory when there is no way to tell which one it should be? Other approaches to catching Roman numerals are welcome too.

+6

regex perl

wk Oct 18 '12 at 8:12


2 answers

You can use the Roman module from CPAN:

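    use feature 'say';
    use Roman;

    my $x = ' some text I-LXIII VII XCVI IIIXII iv more ';

    # Mark a candidate word only if isroman() accepts it; otherwise put it back unchanged
    if ( $x =~ s/\b
                 ( [IVXLC]+ )
                 \b
                /isroman($1) ? ">$1<" : $1/exgi ) {
        say $x;
    }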

Output:

 some text >I<->LXIII< >VII< >XCVI< IIIXII >iv< more

+4

Toto Oct 18 '12 at 9:41


Borodin · Accepted Answer · 2012-10-18T11:53:32+0000

Here Perl fails us by lacking the \< and \> constructs (start-of-word and end-of-word boundaries) that are available elsewhere. A pattern such as \b...\b will match even if the ... consumes none of the target string, because the second \b will happily match the same start-of-word boundary a second time.

However, an end-of-word boundary is simply (?<=\w)(?!\w), so we can use that instead.
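The empty-match problem can be seen in isolation. The following is a minimal sketch rather than part of the original answer: the sample string 'ab cd', the helper match_offsets and the names $start_word and $end_word are assumptions made for illustration. It lists the offsets at which \bX?\b matches nothing at all, next to the offsets matched by the one-sided boundaries built from lookarounds.

```perl
use strict;
use warnings;
use feature 'say';

# Return the offsets at which a pattern matches in a string.
sub match_offsets {
    my ($re, $str) = @_;
    my @offsets;
    while ( $str =~ /$re/g ) {
        push @offsets, pos($str);
    }
    return @offsets;
}

my $text = 'ab cd';

# X? matches nothing here, so both \b assert the very same boundary.
my @empty = match_offsets( qr/\bX?\b/, $text );

# Hypothetical names for the one-sided boundaries.
my $start_word = qr/(?<!\w)(?=\w)/;
my $end_word   = qr/(?<=\w)(?!\w)/;

my @starts = match_offsets( $start_word, $text );
my @ends   = match_offsets( $end_word,   $text );

say "empty matches at: @empty";    # empty matches at: 0 2 3 5
say "word starts at:   @starts";   # word starts at:   0 3
say "word ends at:     @ends";     # word ends at:     2 5
```

Every word boundary yields an empty match for the two-sided pattern, whereas the lookaround versions separate starts from ends.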

This program will do what you want. It does a look-ahead for a string of potential Roman characters enclosed in word boundaries (so we must be at a start-of-word boundary), and then checks for a legal Roman number that is not followed by a word character (so now we are at an end-of-word boundary).

Note that I have reversed your >...< markers to <...>, as they were confusing me.

    use strict;
    use warnings;
    use feature 'say';

    my $x = ' some text I-LXIII iv more ';

    if ( $x =~ s{
        (?= \b [CLXVI]+ \b )
        (
          (?:XC|XL|L?X{0,3})?
          (?:IX|IV|V?I{0,3})?
        )
        (?!\w)
      }
      {<$1>}xgi ) {
        say $x;
    }

Output:

 some text <I>-<LXIII> <iv> more
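If the numerals are wanted as a list rather than marked in place, the same pattern can be compiled with qr// and applied in list context. This is only a sketch, assuming the pattern is reused unchanged; the variable name $roman and the second sample string are hypothetical. Because the pattern covers values up to 99 only, a lone C passes the look-ahead but is never matched.

```perl
use strict;
use warnings;
use feature 'say';

# The pattern from the program above, compiled once so it can be reused.
my $roman = qr{
    (?= \b [CLXVI]+ \b )        # a whole word of Roman characters starts here
    (
      (?:XC|XL|L?X{0,3})?       # tens: 10-90
      (?:IX|IV|V?I{0,3})?       # units: 1-9
    )
    (?!\w)                      # the numeral must use up the whole word
}xi;

# In list context, /g returns the contents of the capture group for every match.
my @found = ' some text I-LXIII iv more ' =~ /$roman/g;
say join ',', @found;           # I,LXIII,iv

# Malformed or out-of-range words are skipped.
my @strict = 'C XCIX IIII mix XIIX' =~ /$roman/g;
say join ',', @strict;          # XCIX
```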
