Efficiently search through very large files with PHP to extract a block

Recently I have had a serious headache analyzing metadata from video files, and part of the problem turned out to be that video production software vendors neglect the various standards (or at least differ in how they interpret them), among other issues.

As a result, I need to be able to scan through very large video (and image) files in various formats, containers and codecs and dig up metadata. I already have FFmpeg, ExifTool, Imagick and Exiv2 for processing different types of metadata in different types of files, and various other options to fill in the remaining gaps (please do not suggest libraries or other tools, I have tried all of them :)).

Now I need to scan large files (up to 2 GB each) for the XMP block (which is usually written into movie files by the Adobe suite and some other software). I wrote a function to do this, but I suspect it can be improved.

function extractBlockReverse($file, $searchStart, $searchEnd) {
    $handle = fopen($file, "rb"); // binary-safe mode
    if (!$handle) {
        throw new Exception('file not found: ' . $file);
    }
    $startLen = strlen($searchStart);
    $endLen   = strlen($searchEnd);
    $output   = '';
    $target   = '';
    $length   = 0;
    $finished = false;
    // Walk backwards one byte at a time from the end of the file.
    // $pos starts at -1: an offset of 0 from SEEK_END points past the last byte.
    for ($pos = -1; $length < 10000 && !$finished && fseek($handle, $pos, SEEK_END) !== -1; $pos--) {
        $currChar = fgetc($handle);
        if ($output !== '') {
            // End tag already found: prepend bytes until the start tag appears.
            $output = $currChar . $output;
            $length++;
            $target = $currChar . substr($target, 0, $startLen - 1);
            $finished = ($target == $searchStart);
        } else {
            // Still looking for the end tag.
            $target = $currChar . substr($target, 0, $endLen - 1);
            if ($target == $searchEnd) {
                $output = $target;
                $length += $endLen;
                $target = '';
            }
        }
    }
    fclose($handle);
    return $finished ? $output : false; // false if the block was not found
}

echo extractBlockReverse("very_large_video_file.mov", '<x:xmpmeta', '</x:xmpmeta>');

This works at the moment, but I would really like to get the most out of PHP here without punishing my server, so I am wondering whether there is a better way to do this (or code tricks that would improve it), since this approach seems a bit heavy-handed for something as simple as finding a couple of markers and pulling out what lies between them.

+4
2 answers

You can use one of the fast string search algorithms - for example, Knuth-Morris-Pratt or Boyer-Moore - to find the positions of the start and end tags, and then read all the data between them.

You should measure their performance, though: with such short search patterns, the constant factor of the chosen algorithm may turn out not to be good enough to be worth it.
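As a concrete sketch (my code, not the answerer's): in PHP you rarely need to implement KMP or Boyer-Moore by hand, because strpos() already runs an optimized substring search in C. Reading the file in large chunks and calling strpos() on each chunk, with a small overlap carried over so a tag straddling a chunk boundary is not missed, is usually the practical version of this idea:

```php
<?php
// Sketch: chunked forward search for a byte string in a large file.
// Returns the absolute byte offset of the first match, or -1 if not found.
function findInFile(string $file, string $needle, int $chunkSize = 1 << 20): int
{
    $handle = fopen($file, 'rb');
    if ($handle === false) {
        throw new Exception("cannot open file: $file");
    }
    $overlap = strlen($needle) - 1; // bytes to carry over between chunks
    $offset  = 0;                   // total bytes read so far
    $carry   = '';
    while (($chunk = fread($handle, $chunkSize)) !== false && $chunk !== '') {
        $haystack = $carry . $chunk;
        $pos = strpos($haystack, $needle);
        if ($pos !== false) {
            fclose($handle);
            // $haystack starts $overlap-or-fewer bytes before $offset.
            return $offset - strlen($carry) + $pos;
        }
        $carry = $overlap > 0 ? substr($haystack, -$overlap) : '';
        $offset += strlen($chunk);
    }
    fclose($handle);
    return -1;
}
```

With a 1 MB chunk size this does a few thousand reads on a 2 GB file instead of two billion fgetc() calls, and the per-byte work is done inside strpos() rather than in PHP userland.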

+3

With large files, I think the most important optimization is NOT to search for the string everywhere. I do not believe that a video or image will ever have an XMP block in the middle - or if there is one, it is probably garbage.

Well, it is possible - TIFF can do it, and so can JPEG and PNG; so why not video formats? But in real applications, small metadata blocks such as XMP are usually stored at the end of the file; less commonly they are stored at the beginning.

In addition, I think most XMP blocks will not be too large (even though Adobe routinely pads them so that they can "almost always" be updated quickly in place).

So my first attempt would be to extract, say, the first 100 KB and the last 100 KB of the file, then scan those two blocks for <x:xmpmeta ... </x:xmpmeta>.

If the search fails, you can still fall back to an exhaustive scan; but when it succeeds, it returns on the order of ten thousand times faster. Conversely, even if this "trick" succeeded only once in a thousand cases, it would still be worthwhile.
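A minimal sketch of this head/tail heuristic (the function name and window size are my own; the `<x:xmpmeta` markers are taken from the question):

```php
<?php
// Sketch: look for the XMP packet only in the first and last $window bytes.
// Returns the packet as a string, or null so the caller can fall back to a
// full scan.
function extractXmpFast(string $file, int $window = 100 * 1024): ?string
{
    $size = filesize($file);
    $handle = fopen($file, 'rb');
    if ($handle === false || $size === false) {
        throw new Exception("cannot open file: $file");
    }
    // Head window (the whole file, if it is smaller than the window).
    $head = fread($handle, min($window, $size));
    $tail = '';
    if ($size > $window) {
        fseek($handle, -$window, SEEK_END);
        $tail = fread($handle, $window);
    }
    fclose($handle);
    // Tail first: XMP is usually written at the end of the file.
    foreach ([$tail, $head] as $block) {
        $start = strpos($block, '<x:xmpmeta');
        if ($start !== false) {
            $end = strpos($block, '</x:xmpmeta>', $start);
            if ($end !== false) {
                return substr($block, $start, $end - $start + strlen('</x:xmpmeta>'));
            }
        }
    }
    return null; // not in head or tail - fall back to an exhaustive scan
}
```

This reads at most 200 KB of a 2 GB file, so the happy path costs two seeks and two buffered reads regardless of file size.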

+1
