How do I specify an optional capture group in this RegEx?

How can I fix this RegEx to optionally capture the file extension?

I am trying to match a string with an optional component, but something seems to be wrong. (The corresponding lines are taken from the printer log.)


My RegEx (.NET Flavor) is as follows:

 .*(header_\d{10,11}_).*(_.*_\d{8}).*(\.\w{3,4}).*
 -------------------------------------------
 .*                 # Ignore some garbage in the front
 (header_           # Match the start of the file name,
 \d{10,11}_)        # including the ID (10 - 11 digits)
 .*                 # Ignore the type code in the middle
 (_.*_\d{8})        # Match some random characters, then an 8-digit date
 .*                 # Ignore anything between this and the file extension
 (\.\w{3,4})        # Match the file extension, 3 or 4 characters long
 .*                 # Ignore the rest of the string


I expect this to match strings like:

 str1 = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"
 str2 = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
 str3 = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"


Where capture groups return something like:

 $1 = header_0000000602_
 $2 = _mc2e1nrobr1a3s55niyrrqvy_20081212
 $3 = .doc


Here $3 may be empty if no file extension is found. $3 is the optional part, as you can see in str3 above.

If I add "?" to the end of the third capture group, "(\.\w{3,4})?", the RegEx no longer captures $3 for any string. If I add "+" instead, "(\.\w{3,4})+", the RegEx no longer matches str3 at all, which is to be expected.

I feel that adding "?" at the end of the third capture group is the right thing to do, but it does not work as I expect. I am probably being too naive with the ".*" sections that I use to ignore parts of the string.


Doesn't work as expected:

 .*(header_\d*_).*(_.*_.{8}).*(\.\w{3,4})?.* 
7 answers

One possibility is that the second-to-last .* is greedy. You can try making it lazy:

 .*(header_\d*_).*(_.*_.{8}).*?(\.\w{3,4})?.*
                              ^ Added that

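That is not right, though: the lazy .*? is satisfied by matching nothing, so the optional group is simply skipped. The following will match the input you gave, but it assumes that the first . it encounters is the start of the file extension: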

 .*(header_\d*_).*(_.*_.{8})[^\.]*(\.\w{3,4})?.* 

Edit: Remove the escaping that I had in the second regex.
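
To see the difference between the two patterns above, here is a minimal sketch. It is written in Python rather than .NET (my assumption: the constructs used here behave the same in both flavors, except that Python reports an unmatched group as None where .NET gives an empty string). The names `lazy`, `negated` and `extension` are only illustrative.

```python
import re

# Two of the sample strings from the question (with and without an extension).
samples = [
    "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]",
    "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]",
]

# First attempt: lazy ".*?" in front of the optional extension group.
lazy = re.compile(r".*(header_\d*_).*(_.*_.{8}).*?(\.\w{3,4})?.*")
# Second attempt: "[^\.]*" can never run past the first period.
negated = re.compile(r".*(header_\d*_).*(_.*_.{8})[^\.]*(\.\w{3,4})?.*")

def extension(pattern, text):
    """Return capture group 3 (the extension), or None if it did not take part."""
    m = pattern.match(text)
    return m.group(3) if m else None

lazy_results = [extension(lazy, s) for s in samples]
negated_results = [extension(negated, s) for s in samples]

print(lazy_results)     # the lazy version never captures the extension
print(negated_results)  # the negated class captures it when it is present
```

Both patterns match both strings; only the negated character class actually fills the third group.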


I believe the problem is the third .*, the one you annotated above with "Ignore anything between this and the file extension." It is greedy, so it will match ANYTHING. When you make the extension pattern optional, the third .* matches all the way to the end of the line, which is allowed. Assuming there will NEVER be a '.' character in that extraneous bit, you can replace .* with [^.]*, and the rest should hopefully work once you restore the ? that you had to remove.
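
One way to check this is to wrap the third .* in a diagnostic capture group and look at what it consumed. The sketch below does that in Python (an assumption: Python's re module stands in for .NET here, and the extra group is mine, not part of the original pattern, so the extension becomes group 4).

```python
import re

text = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"

# Extension required: the greedy (.*) is forced to backtrack until "\.\w{3,4}" fits.
required = re.compile(r".*(header_\d*_).*(_.*_.{8})(.*)(\.\w{3,4}).*")
# Extension optional: nothing forces (.*) to give anything back.
optional = re.compile(r".*(header_\d*_).*(_.*_.{8})(.*)(\.\w{3,4})?.*")

m_required = required.match(text)
m_optional = optional.match(text)

# (text consumed by the third ".*", captured extension)
required_groups = (m_required.group(3), m_required.group(4))
optional_groups = (m_optional.group(3), m_optional.group(4))

print(required_groups)
print(optional_groups)
```

With the required extension the diagnostic group holds only "[1]"; with the optional one it holds everything up to the end of the line, so there is nothing left for the extension group.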


Well, .* is probably the wrong way to start a regex: it matches 0 or more ( * ) single characters of anything ( . ), which means your entire file name can be consumed by that alone. If you leave it out, the regex will start matching when it reaches header, which is what you want. You could also replace it with \b, which matches a word boundary (\w matches a word character, not a word break). I also suggest using a tool like Regex Coach so you can step through the pattern and see exactly what goes wrong and what your capture groups contain.
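
A minimal sketch of dropping the leading .*, again in Python as a stand-in for .NET (an assumption on my part). Note that .NET's Regex.Match scans the input for the first match, which corresponds to Python's `search`; Python's `match` only tries at the start of the string. The pattern combines this suggestion with the [^.]* fix from the other answers.

```python
import re

text = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"

# No leading or trailing ".*": the engine finds "header_" by scanning.
unanchored = re.compile(r"(header_\d*_).*(_.*_\d{8})[^.]*(\.\w{3,4})?")

found = unanchored.search(text)        # scans forward, like .NET's Regex.Match
at_start_only = unanchored.match(text)  # only tries position 0, so it fails here

print(found.start() == text.index("header_"))
print(found.group(1), found.group(2), found.group(3))
print(at_start_only)
```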


After your second capture group, specify that you only want to match characters that are not a period, then match your extension.

 ".*(header_\d{10,11}_).*(_.*_\d{8})[^.]*(\.\w{3,4})?" 

Here is a corrected pattern:

 .*?(header_\d*_).*?(_.*_.{8})[^.]*(\.\w{3,4})?.*
 -------------------------------------------
 .*?                # Prevent a greedy match
 (header_           #
 \d{10,11}_)        #
 .*?                # Prevent a greedy match
 (_.*_\d{8})        #
 [^.]*              # Take everything that is NOT a period
 (\.\w{3,4})        # Match the extension
 .*                 #

The implicit assumption is that the first period after the date digits is the start of the file extension. The following string does not satisfy that assumption, so its extension will not be captured correctly:

 string unmatched = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].foobar.txt" 
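
For illustration, here is what the pattern does with that string, sketched in Python (assumed here to behave like .NET for these constructs). The string still matches, but the third group stops at the first period.

```python
import re

pattern = re.compile(r".*?(header_\d*_).*?(_.*_.{8})[^.]*(\.\w{3,4})?.*")
unmatched = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].foobar.txt"

m = pattern.match(unmatched)
captured_extension = m.group(3)

# "[^.]*" stops at the first period, and "\w{3,4}" then takes at most four characters.
print(captured_extension)  # .foob
```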

Also, when extracting groups in .NET, make sure your code looks like this:

 regex.Match(string_to_match).Groups[1].Value
 regex.Match(string_to_match).Groups[2].Value
 regex.Match(string_to_match).Groups[3].Value

not this:

 // index 0 is the entire match, not the first capture group
 regex.Match(string_to_match).Groups[0].Value
 regex.Match(string_to_match).Groups[1].Value
 regex.Match(string_to_match).Groups[2].Value

This is what confused me at first.
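
The same indexing convention can be sketched in Python, where `m.group(0)` plays the role of `Groups[0]` (this is a Python analogue I am assuming for illustration, not the .NET API itself):

```python
import re

text = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"
m = re.match(r".*?(header_\d*_).*?(_.*_.{8})[^.]*(\.\w{3,4})?.*", text)

# Index 0 is the whole match; because the pattern starts and ends with ".*",
# that happens to be the entire input string.
whole = m.group(0)

# The capture groups proper start at index 1.
captures = [m.group(i) for i in (1, 2, 3)]

print(whole == text)
print(captures)
```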


This works for the given examples:

 ^.*?(?<header>\d+)_.*?_(?<date>\d{8}).*?(?:\.(?<ext>\w{3,4}))?[\w\s\[\]]*$ 

I assume that the text "header" and the random characters between it and the date are not important, so they are not captured by this regular expression. I also used .NET's named capture feature for clarity, but keep in mind that this syntax is not supported in every other RegEx flavor.

If the text after the file name contains any non-alphanumeric characters other than [ and ] (or whitespace), the pattern will need to be revised.
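
As an example of the flavor difference, Python spells named groups `(?P<name>...)` instead of `(?<name>...)`. The sketch below is my port of the pattern above to that syntax, not something the .NET engine needs:

```python
import re

# Same pattern as above, with Python's (?P<name>...) named-group syntax.
pattern = re.compile(
    r"^.*?(?P<header>\d+)_.*?_(?P<date>\d{8}).*?(?:\.(?P<ext>\w{3,4}))?[\w\s\[\]]*$"
)

with_ext = pattern.match(
    "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
)
without_ext = pattern.match(
    "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"
)

print(with_ext["header"], with_ext["date"], with_ext["ext"])
print(without_ext["date"], without_ext["ext"])  # ext did not take part in the match
```

Note that the period sits outside the named group here, so `ext` holds the extension without the leading dot.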


Here is what works for the strings you posted:

 ^.*(?<header>header_\d{10,11})_.*(?<date>_[a-z0-9]+_\d{8})(\[\d+\])(?<ext>(\.[a-zA-Z0-9]{3,4})?).* 

Replacement:

 Header: $1 Date: $2 Extension: $4 

I did not use the named groups in the substitution because I could not figure out how to get TextMate to do this, but the named groups were useful for forcing capture.

