Reading a file in Perl with $INPUT_RECORD_SEPARATOR as a regular expression

I am looking for a way to read from a file handle line by line (and then run a function on each line) with the following twist: what I want to treat as a “line” may be terminated by varying characters, not just the single character I could assign to $/ . I know that $INPUT_RECORD_SEPARATOR (aka $/ ) supports neither regular expressions nor a list of characters to be treated as line terminators, and this is exactly where my problem lies.

My file handle is the output of another process (a pipe). I therefore cannot seek in it, and its full content is not available up front but is produced bit by bit as the process runs. I want to be able to attach things like a timestamp to each “line” the process produces, using the function I call handler in my examples. Each line should be processed immediately after the program emits it.
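For illustration, handler could be something as simple as the following sketch (the timestamp format and the stripping of the terminator are just examples, not part of my actual requirements):

    use POSIX qw(strftime);

    # Example handler: prefix each received "line" with a timestamp.
    sub handler {
        my ($line) = @_;
        $line =~ s/[\r\n]+\z//;   # drop the terminator for display purposes
        printf "[%s] %s\n", strftime("%H:%M:%S", localtime), $line;
    }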

Unfortunately, I can only come up with either a method that calls the handler function immediately but seems terribly inefficient, or a method that uses a buffer but only results in “batched” calls to the handler function and therefore, for example, produces incorrect timestamps.

In fact, in my particular case the regular expression would be very simple: just /\n|\r/ . So for this specific task I do not even need full regular expression support, merely the ability to treat more than one character as a line terminator. But $/ does not support that either.
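As far as I understand, $/ is always interpreted as a literal string, so the closest I can get is something like the following, which is not what I need because it only matches that exact byte sequence:

    local $/ = "\r\n";        # readline now splits on the literal two-byte CRLF only
    # local $/ = qr/\r|\n/;   # does NOT work: the pattern is stringified and
    #                         # used as a literal separator, not as a regex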

Is there an effective way to solve this problem in Perl?

Here is some quick pseudo-Perl code to demonstrate my two approaches:

Read the input file handle one byte at a time

It would look something like this:

 my $acc = ""; while (read($fd, my $b, 1)) { $acc .= $b; if ($acc =~ /someregex$/) { handler($acc); $acc = ""; } } 

The advantage here is that the handler is called immediately once enough bytes have been read. The downside is that we do a string concatenation and a regex check for every single byte we read from $fd .

Read the input file handle in blocks of X bytes at a time

It would look something like this:

 my $acc = ""; while (read($fd, my $b, $bufsize)) { if ($b =~ /someregex/) { my @parts = split /someregex/, $b; # for brevity lets assume we always get more than 2 parts... my $first = shift @parts; handler(acc . $first); my $last = pop @parts; foreach my $part (@parts) { handler($part); } $acc = $last; } } 

The advantage is that this is more efficient, because we only run the regex check once per $bufsize bytes. The downside is that a call to the handler may have to wait until $bufsize bytes have been read.

2 answers

Setting $INPUT_RECORD_SEPARATOR to a regular expression would not help, because Perl's readline also uses buffered IO. The trick is to use your second approach, but with unbuffered sysread instead of read . If you sysread from a pipe, the call returns as soon as any data is available, even if the buffer cannot be filled completely (at least on Unix).
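A minimal sketch of that behaviour, assuming a Unix system and using a throwaway shell command as the child process (the command and the 1000-byte request size are just placeholders):

    # Read from a child process and report how much each sysread call returns.
    open(my $fh, '-|', 'sh', '-c', 'printf "one\n"; sleep 1; printf "two\rthree\n"')
        or die "cannot start child: $!";

    while (1) {
        my $n = sysread($fh, my $buf, 1000);
        die "sysread failed: $!" unless defined $n;
        last if $n == 0;                  # EOF: the child closed its end
        # On a pipe, sysread returns as soon as *some* data is there,
        # so $n is usually much smaller than the requested 1000 bytes.
        printf "got %d byte(s): %s\n", $n, $buf;
    }
    close($fh);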


nwellnhof's suggestion allowed me to implement a solution to this problem:

 my $acc = ""; while (1) { my $ret = sysread($fh, my $buf, 1000); if ($ret == 0) { last; } # we split with a capturing group so that we also retain which line # terminator was used # a negative limit is used to also produce trailing empty fields if # required my @parts = split /(\r|\n)/, $buf, -1; my $numparts = scalar @parts; if ($numparts == 1) { # line terminator was not found $acc .= $buf; } elsif ($numparts >= 3) { # first match needs special treatment as it needs to be # concatenated with $acc my $first = shift @parts; my $term = shift @parts; handler($acc . $first . $term); my $last = pop @parts; for (my $i = 0; $i < $numparts - 3; $i+=2) { handler($parts[$i] . $parts[$i+1]); } # the last part is put into the accumulator. This might # just be the empty string if $buf ended in a line # terminator $acc = $last; } } # if the output didn't end with a linebreak, handle the rest if ($acc ne "") { handler($acc); } 

My tests show that sysread does indeed return before 1000 characters have been read if there is a pause in the input stream. The code above correctly reassembles messages longer than 1000 bytes and correctly splits buffers that are shorter or that contain multiple terminators.

Please shout if you spot an error in the code above.

