How to replace a pattern of repeated characters / words only at the beginning of a string?

Question

Note that this question is in the context of Julia, and therefore (as far as I know) PCRE.

Suppose you had a string like this:

"sssppaaasspaapppssss"

and you wanted to individually match the repeated characters at the end of the string (in this case, the four "s" characters - that is, so that matchall gives ["s", "s", "s", "s"], not ["ssss"]). This is easy:

 r"(.)(?=\1*$)"

This is almost trivial (and easy to use - replace("hell",r"(.)(?=\1*$)","k") will give "hekk", and replace("hello",r"(.)(?=\1*$)","k") will give "hellk"). And it can be generalized to repeated patterns by swapping out the dot for something more complex:

 r"(\S+)(?=( \1)*$)"

which, for example, will independently match the last three instances of "abc" in "abc abc defg abc h abc abc abc" .
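For readers on a current Julia, here is a minimal sketch of the two end-anchored patterns above. It assumes Julia 1.x, where matchall is no longer available (eachmatch is used instead) and replace takes a pattern => replacement pair; the variable names are only illustrative.

```julia
# Patterns and inputs are the ones from the question; the names are illustrative.
tail_chars = r"(.)(?=\1*$)"
tail_words = r"(\S+)(?=( \1)*$)"

# Each trailing repeat is reported as its own match, not as one combined run.
tail_matches = [m.match for m in eachmatch(tail_chars, "sssppaaasspaapppssss")]
println(tail_matches)    # four separate "s" matches

println(replace("hell", tail_chars => "k"))     # hekk
println(replace("hello", tail_chars => "k"))    # hellk

# Word version: only the trailing run of "abc" is replaced.
tail_replaced = replace("abc abc defg abc h abc abc abc", tail_words => "hello")
println(tail_replaced)   # abc abc defg abc h hello hello hello
```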

Which then leads to the question ... how would you match a repeated character or pattern at the beginning of a string? Specifically, using a regex in the same way it is used above.

An obvious approach would be to reverse the direction of the above regex, as r"(?<=^\1*)(.)" - but PCRE / Julia doesn't allow lookbehinds to have variable length (except where each alternative is fixed-length, as in (?<=ab|cde) ), and thus throws an error. The next thought is to use "\K", with something along the lines of r"^\1*\K(.)", but this only manages to match the first character (presumably because it "advances" after matching it, and no longer matches the caret).
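Both failed attempts can be reproduced with a short sketch. It assumes Julia 1.x, where Regex(...) compiles the pattern as soon as it is constructed, so a rejected pattern surfaces as an exception; the variable names are illustrative and the exact error text depends on the bundled PCRE version, so it is not checked here.

```julia
# The lookbehind form is rejected when the pattern is compiled.
lookbehind_rejected = try
    Regex("(?<=^\\1*)(.)")
    false
catch
    true
end
println(lookbehind_rejected)

# The \K form compiles, but ^ can only succeed at the very start of the
# string, so at most one match is possible.
lead_k = r"^\1*\K(.)"
first_only = [m.match for m in eachmatch(lead_k, "sssppaaasspaapppssss")]
println(first_only)
```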

For clarity: I'm looking for a regular expression that, for example, will result in

 replace("abc abc defg abc h abc abc abc",<regex here>,"hello")

producing

 "hello hello defg abc h abc abc abc"

As you can see, it replaces each "abc" from the start with "hello", but only up to the first non-match. The reverse pattern, which I gave above, does this at the other end of the string:

 replace("abc abc defg abc h abc abc abc",r"(\S+)(?=( \1)*$)","hello")

produces

 "abc abc defg abc h hello hello hello"

+8

regex pcre lookbehind regex-lookarounds julia-lang

Glen o Jul 19 '15 at 15:31

2 answers

For engines like PCRE, there is unfortunately no way to do this without a variable-length lookbehind.

A pure regex solution is impossible.
There is no \G trick that can accomplish this.

Here is why the \G anchor won't work.

With the anchor, the only guarantee you have is that the previous match
had a lookahead that was checked to be equal
to the current match.

As a result, you can only globally match N-1 of the duplicates from the beginning.

Here is the proof:

Regex:

  # (?:\G([a-c]+)(?=\1))

  (?:
       \G
       ( [a-c]+ )          # (1)
       (?= \1 )
  )

Input:

abcabcabcbca

Output:

  ** Grp 0 -  ( pos 0 , len 3 )
  abc
  ** Grp 1 -  ( pos 0 , len 3 )
  abc
  ------------
  ** Grp 0 -  ( pos 3 , len 3 )
  abc
  ** Grp 1 -  ( pos 3 , len 3 )
  abc

Conclusion:

Even though you know from the previous lookahead that the Nth one is there,
the Nth can't be matched without satisfying the condition of the current lookahead.
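The same experiment can be sketched in Julia 1.x with eachmatch. The character class is assumed to be [a-c], since the input contains "b", and the variable names are illustrative. Note that Julia offsets are 1-based, so positions 0 and 3 in the output above show up as 1 and 4.

```julia
# Every match must start where the previous one ended (\G) and must be
# followed by another copy of itself (the lookahead).
strict = r"\G([a-c]+)(?=\1)"
hits = [(m.offset, m.match) for m in eachmatch(strict, "abcabcabcbca")]
println(hits)

# The third "abc" (offset 7) is not reported: nothing after it repeats it,
# so its lookahead fails and \G stops the scan there.
println(length(hits))    # 2
```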

Sorry, and good luck!
Let me know if you find a pure regex solution.

+4

sln Jul 22 '15 at 10:49

Casimir et Hippolyte · Accepted Answer · 2015-07-19T17:32:47+0000

You can use the \G anchor, which matches either at the position after the previous match or at the beginning of the string. This ensures that the results are contiguous, from the beginning of the string to the last occurrence:

 \G(\S+)( (?=\1 ))?

or, to be able to match right up to the end of the string:

 \G(\S+)( (?=\1(?: |\z)))?
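Here is how the two patterns might be used from Julia; a minimal sketch, assuming Julia 1.x and its pattern => replacement form of replace. Because a match may include the separating space, the replacement has to put it back; swap is a hypothetical helper written for this example, not part of the answer.

```julia
lead_words      = r"\G(\S+)( (?=\1 ))?"
lead_words_full = r"\G(\S+)( (?=\1(?: |\z)))?"

# Hypothetical helper: replace() hands it the matched text, and it keeps
# the trailing space whenever the match consumed one.
swap(m) = endswith(m, " ") ? "hello " : "hello"

leading = replace("abc abc defg abc h abc abc abc", lead_words_full => swap)
println(leading)       # hello hello defg abc h abc abc abc

# When the whole string is one run, only the \z variant reaches the last word.
partial_run = replace("abc abc abc", lead_words => swap)
full_run    = replace("abc abc abc", lead_words_full => swap)
println(partial_run)   # hello hello abc
println(full_run)      # hello hello hello
```

As with the end-anchored version in the question, the first word is always matched, even when nothing repeats it.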
