Update: It looks like the fields are actually separated by tabs, not spaces. If that is guaranteed, just split on \t.
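For what it's worth, here is a minimal sketch of the tab-splitting approach. The sample line is built with join so the tabs are explicit; this assumes exactly one tab between fields and no tabs inside a field, which is my assumption and not something I have verified against your real file.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Assumption: fields are separated by exactly one tab each.
my $line = join "\t",
    'StartProgram', '1', '""C:\Program Files\ABC\ABC XYZ""',
    'CleanProgramTimeout', '1', '30';

# Spaces inside the quoted path are harmless because only tabs separate fields.
my @fields = split /\t/, $line;

print "$_\n" for @fields;
```

With tab-separated input, the quoted path stays in one piece without any special handling of the quotes.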
First, let's see why (".*?"|\S+) doesn't work. Look at the ".*?" alternative: it matches a double quote, then zero or more characters (as few as possible), then another double quote. The field that gives you trouble is ""C:\Program Files\ABC\ABC XYZ"". The "" at the start of this field already satisfies ".*?" by itself (the "" at the end would too), because "" is zero characters surrounded by double quotes. The rest of the path is then picked up piecemeal by \S+.
It is better to match as specifically as possible than to split. If you have a configuration file with directives in a fixed format, write a regular expression that follows that format as closely as possible.
If you do not want the quotation marks in the captured value, move them outside the capturing parentheses.
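To see the failure concretely, here is a small sketch that applies the original pattern with a global match to the sample line (space-separated here, as in the examples in this answer) and prints each token in brackets.

```perl
#!/usr/bin/perl

use strict;
use warnings;

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

# In list context, a global match returns every capture in turn.
my @tokens = $s =~ /(".*?"|\S+)/g;

print "[$_]\n" for @tokens;
```

This yields nine tokens instead of six: the third is a bare "", and the path comes out in three pieces (C:\Program, Files\ABC\ABC and XYZ"").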
    #!/usr/bin/perl

    use strict;
    use warnings;

    my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

    my @parts = $s =~ m{\A(\w+) ([0-9]) (""[^"]+"") (\w+) ([0-9]) ([0-9]{2})};

    use Data::Dumper;
    print Dumper \@parts;
Output:
    $VAR1 = [
              'StartProgram',
              '1',
              '""C:\\Program Files\\ABC\\ABC XYZ""',
              'CleanProgramTimeout',
              '1',
              '30'
            ];

In the same vein, here is a more involved script:
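If the surrounding quotes are not wanted in the result, the same pattern can be adjusted so that the doubled quotes sit outside the capturing group. This is only a sketch on the same sample line, with single spaces between fields as above.

```perl
#!/usr/bin/perl

use strict;
use warnings;

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

# The "" pairs are still required to match, but they are no longer captured.
my @parts = $s =~ m{\A(\w+) ([0-9]) ""([^"]+)"" (\w+) ([0-9]) ([0-9]{2})};

print "$parts[2]\n";
```

The third element is then the bare path, C:\Program Files\ABC\ABC XYZ, and the other five fields are unchanged.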
    #!/usr/bin/perl

    use strict;
    use warnings;

    use Data::Dumper;

    my @strings = split /\n/, <<'EO_TEXT';
    StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30
    StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30
    EO_TEXT

    my $re = qr{
        (?<directive>StartProgram)\s+
        (?<instance>[0-9][0-9]?)\s+
        (?<path>"".+?""|\S+)\s+
        (?<timeout_directive>CleanProgramTimeout)\s+
        (?<timeout_instance>[0-9][0-9]?)\s+
        (?<timeout_seconds>[0-9]{2})
    }x;

    for (@strings) {
        if ( $_ =~ $re ) {
            print Dumper \%+;
        }
    }
Output:
    $VAR1 = {
              'timeout_directive' => 'CleanProgramTimeout',
              'timeout_seconds' => '30',
              'path' => '""C:\\Program Files\\ABC\\ABC XYZ""',
              'directive' => 'StartProgram',
              'timeout_instance' => '1',
              'instance' => '1'
            };
    $VAR1 = {
              'timeout_directive' => 'CleanProgramTimeout',
              'timeout_seconds' => '30',
              'path' => 'c:\\opt\\perl',
              'directive' => 'StartProgram',
              'timeout_instance' => '1',
              'instance' => '1'
            };
Update: I could not get Text::Balanced or Text::ParseWords to parse this correctly. I suspect the problem is the doubled quotes that delimit the substring that should not be split. The following code is my best (not very good) attempt at solving the general problem by splitting and then selectively reassembling parts of the string.
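    #!/usr/bin/perl

    use strict;
    use warnings;

    use Data::Dumper;

    my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};
    my $t = q{StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30};

    print Dumper parse_line($s);
    print Dumper parse_line($t);

    sub parse_line {
        my ($line) = @_;

        # Capture the separators too, so a quoted field can be glued back verbatim.
        my @parts = split /(\s+)/, $line;

        my @real_parts;

        for (my $i = 0; $i < @parts; $i += 1) {
            # Ordinary piece: keep it unless it is a whitespace separator.
            unless ( $parts[$i] =~ /^""/ ) {
                push @real_parts, $parts[$i] if $parts[$i] =~ /\S/;
                next;
            }

            # Piece opening with "": append pieces until one ends with "".
            # Stop at the end of the line if the quotes are never closed.
            my $part;
            do {
                $part .= $parts[$i++];
            } until ($part =~ /""$/ or $i >= @parts);

            # $i now sits on the separator after the field; the loop's
            # own increment skips it.
            push @real_parts, $part;
        }

        return \@real_parts;
    }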