Perl script slows down as it progresses

I wrote a Perl script that compiles displays so they can be viewed by the user. There are thousands of these files (DSET files) to compile, and the process takes a long time (4-5 hours). The displays are compiled by an external executable that I have no insight into.

To speed things up, we decided to run several instances of this executable in parallel, hoping for a dramatic performance improvement.

With 16 threads the gain is significant: the run now takes about 1 hour instead of 4-5. But a problem remains: as the script progresses, each invocation of the executable takes longer and longer.

I ran a test over about 1000 DSET files and tracked the runtime of the external compiler as the Perl script progressed. Below is a plot of compile time versus script progression.

[Performance plot: compile time per DSET vs. script progression]

As you can see, when the script starts, it takes about 4 seconds to launch the executable, compile a DSET, and shut the executable down. Once the script has processed roughly 500 DSETs, the time to compile each subsequent DSET begins to climb, and by the time the script nears the end, some DSET files take as long as 12 seconds!

The following is an example of the function that each thread executes:

    # Build the displays
    sub fgbuilder {

        my ($tmp_ddldir, $outdir, $eset_files, $image_files) = @_;

        # Get environment variables
        my $executable = $ENV{fgbuilder_executable};
        my $version    = $ENV{fgbuilder_version   };

        # Create the necessary directories
        my $tmp_imagedir = "$tmp_ddldir\\images";
        my $tmp_outdir   = "$tmp_ddldir\\compiled";
        make_path($tmp_ddldir, $tmp_imagedir, $tmp_outdir);

        # Copy the necessary files
        map { copy($_, $tmp_ddldir  ) } @{$eset_files };
        map { copy($_, $tmp_imagedir) } @{$image_files};

        # Take the next DSET off of the queue
        while (my $dset_file = $QUEUE->dequeue()) {

            # Copy the DSET to the thread ddldir
            copy($dset_file, $tmp_ddldir);

            # Get the DSET name
            my $dset          = basename($dset_file);
            my $tmp_dset_file = "$tmp_ddldir\\$dset";

            # Build the displays in the DSET
            my $start = time;
            system $executable,
                   '-compile' ,
                   '-dset'    , $dset        ,
                   '-ddldir'  , $tmp_ddldir  ,
                   '-imagedir', $tmp_imagedir,
                   '-outdir'  , $tmp_outdir  ,
                   '-version' , $version     ;
            my $end     = time;
            my $elapsed = $end - $start;

            # Log the elapsed time for this DSET
            $SEMAPHORE->down();
            open my $fh, '>>', "$ENV{fgbuilder_errordir}\\test.csv";
            print {$fh} "$PROGRESS,$elapsed\n";
            close $fh;
            $SEMAPHORE->up();

            # Remove the temporary DSET file
            unlink $tmp_dset_file;

            # Move all output files to the outdir
            recursive_move($tmp_outdir, $outdir);

            # Update the progress
            { lock $PROGRESS; $PROGRESS++; }
            my $percent = $PROGRESS / $QUEUE_SIZE * 100;
            { local $| = 1; printf "\rBuilding displays ... %.2f%%", $percent; }
        }

        return;
    }

Each pass through the loop spawns a new instance of the display-building executable, waits for it to finish, and lets that instance exit (which should release any memory it used and prevent the kind of problem I am seeing).

Sixteen of these threads run in parallel. Each one pulls a DSET off the queue, compiles it, and moves the compiled display to the output directory, then takes the next DSET from the queue and repeats until the queue is exhausted.
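
For reference, the code that sets up the queue and spawns the threads is not shown above. Stripped down, the driver looks roughly like this; the environment variables other than the ones used in fgbuilder(), the .dset extension, and the temp-directory layout are placeholders rather than the real names:

    use strict;
    use warnings;
    use threads;
    use threads::shared;
    use Thread::Queue;
    use Thread::Semaphore;

    # Shared globals used by fgbuilder(); declared before the sub so it sees them under strict
    our $QUEUE     = Thread::Queue->new();       # work queue of DSET file paths
    our $SEMAPHORE = Thread::Semaphore->new(1);  # serializes writes to the timing CSV
    our $PROGRESS;  share($PROGRESS);  $PROGRESS = 0;   # DSETs completed so far
    our $QUEUE_SIZE;

    # Gather the inputs (directory and extension are placeholders)
    my $dset_dir = $ENV{fgbuilder_dsetdir};
    opendir my $dh, $dset_dir or die "Cannot open $dset_dir: $!";
    my @dset_files = map { "$dset_dir\\$_" } grep { /\.dset$/i } readdir $dh;
    closedir $dh;
    my @eset_files  = ();   # built elsewhere in the real script
    my @image_files = ();   # built elsewhere in the real script

    $QUEUE_SIZE = scalar @dset_files;
    $QUEUE->enqueue(@dset_files);
    $QUEUE->end();   # needs Thread::Queue 3.01+; dequeue() then returns undef once drained

    # Spawn 16 workers, each with its own temporary ddldir
    my @workers = map {
        threads->create(\&fgbuilder, "$ENV{fgbuilder_tmpdir}\\thread_$_",
                        $ENV{fgbuilder_outdir}, \@eset_files, \@image_files)
    } 1 .. 16;

    $_->join() for @workers;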

I've been scratching my head all day trying to figure out why it slows down. RAM usage stays stable and does not grow during the run, and CPU usage never comes close to being maxed out. Any help or insight into what is happening here is appreciated.


EDIT

I wrote a test script to check the theory that the problem is a disk I/O caching issue. The test keeps the same main body as the old script but replaces the call to the executable with a task of my own.

Here is what I replaced the executable call with:

    # Convert the file to hex (multiple times so it takes longer! :D)
    my @hex_lines = ();
    open my $ascii_fh, '<', $tmp_dset_file;
    while (my $line = <$ascii_fh>) {
        my $hex_line = unpack 'H*', $line;
        $hex_line = unpack 'H*', $hex_line;
        $hex_line = unpack 'H*', $hex_line;
        $hex_line = unpack 'H*', $hex_line;
        $hex_line = unpack 'H*', $hex_line;
        $hex_line = unpack 'H*', $hex_line;
        $hex_line = unpack 'H*', $hex_line;
        $hex_line = unpack 'H*', $hex_line;
        $hex_line = unpack 'H*', $hex_line;
        push @hex_lines, $hex_line;
    }
    close $ascii_fh;

    # Print to output files
    make_path($tmp_outdir);

    open my $hex_fh, '>', "$tmp_outdir\\$dset" or die "Failed to open file: $!";
    print {$hex_fh} @hex_lines;
    close $hex_fh;

    open $hex_fh, '>', "$tmp_outdir\\2$dset" or die "Failed to open file: $!";
    print {$hex_fh} @hex_lines;
    close $hex_fh;

    open $hex_fh, '>', "$tmp_outdir\\3$dset" or die "Failed to open file: $!";
    print {$hex_fh} @hex_lines;
    close $hex_fh;

    open $hex_fh, '>', "$tmp_outdir\\4$dset" or die "Failed to open file: $!";
    print {$hex_fh} @hex_lines;
    close $hex_fh;

    open $hex_fh, '>', "$tmp_outdir\\5$dset" or die "Failed to open file: $!";
    print {$hex_fh} @hex_lines;
    close $hex_fh;

    open $hex_fh, '>', "$tmp_outdir\\6$dset" or die "Failed to open file: $!";
    print {$hex_fh} @hex_lines;
    close $hex_fh;

Instead of calling the executable to compile the DSET, I open each DSET as a text file, do some simple processing, and write several files to disk (several, because the real executable also writes more than one output file per DSET it processes). I then timed the processing and plotted the results.

Here are my results:

[Plot: processing time vs. script progression]

I do believe part of the problem in the other script is disk I/O, but as you can see here, with the disk I/O load I created deliberately, the increase in processing time is not gradual: there is a sharp jump, after which the results become quite unpredictable.

The previous script also writes a lot of files and shows some of the same unpredictability, so I have no doubt the problem is caused at least in part by disk I/O. But that still does not explain why the increase in processing time there is gradual and at what appears to be a constant rate.

I believe there is some other factor at play that we are not accounting for.

1 answer

I think you simply have a disk fragmentation problem. With multiple threads constantly creating and deleting files of varying sizes, the free disk space eventually becomes heavily fragmented. You don't say which operating system you are running on, but given the backslash paths I would guess Windows.

The reason you cannot reproduce this with your test tool is probably the behavior of your external compiler: it most likely creates its output file and then grows it through many writes, with varying delays between them. When several threads do that at once, the resulting files end up scattered across the disk, especially when disk usage is already fairly high, say above 70%. Your test, by contrast, writes each output file in one go, which effectively serializes file creation and avoids interleaved writes and the fragmentation they cause.
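
If you want to reproduce that effect in your test script, you could make the hex-dump task grow each output file through many small, flushed writes with irregular pauses, so that several threads interleave their writes on the same disk. A rough sketch, reusing the variables from your test code (the write pattern here is only my guess at what the compiler does, not something verified):

    use IO::Handle;                  # for $fh->flush()
    use Time::HiRes qw(usleep);

    open my $out_fh, '>', "$tmp_outdir\\$dset" or die "Failed to open file: $!";
    for my $hex_line (@hex_lines) {
        print {$out_fh} $hex_line;        # small incremental write
        $out_fh->flush();                 # hand each chunk to the OS right away
        usleep( 1_000 + int rand 5_000 ); # irregular gap lets other threads cut in
    }
    close $out_fh;

With that pattern running in 16 threads, I would expect fragmentation, and therefore the slowdown, to build up much more gradually, closer to what you see with the real compiler.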

Possible solutions:

  • Defragment the disk. Simply copying the compiled files to another partition/disk, deleting them, and copying them back should be enough.
  • Run the external compiler on several different independent partitions so the threads' output does not interleave and fragment (see the sketch after this list).
  • Make sure your file system has 50% or more free space.
  • Use an operating system whose file system is less prone to fragmentation, e.g. Linux.
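
For the second option, the thread-spawning code could assign each worker its own scratch partition, something like the sketch below. The drive letters and directory names are made up; the other arguments are whatever your script already passes to fgbuilder().

    use threads;

    # Assumed spare partitions -- adjust to what the machine actually has
    my @scratch_drives = ('D:', 'E:', 'F:', 'G:');

    my @workers = map {
        my $drive = $scratch_drives[ ($_ - 1) % @scratch_drives ];
        # Per-thread temporary ddldir on its own partition; $outdir, @eset_files
        # and @image_files are the same values the existing script already uses
        threads->create(\&fgbuilder, "$drive\\fgb_tmp\\thread_$_",
                        $outdir, \@eset_files, \@image_files);
    } 1 .. 16;

    $_->join() for @workers;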