Finding a sequence of bytes in a binary file in PHP?

I want to find a specific sequence of bytes in a binary file using PHP. I wrote the sequence in hexadecimal to avoid typing long runs of 0s and 1s. The search sequence is 0x4749524f. This is the working solution I have so far:

    $mysequence = "4749524f";
    $f = fopen($filename, "r") or die("Unable to open file!");
    while (!feof($f)) {
        $seq = fread($f, 4);
        if (bin2hex($seq) == $mysequence) {
            echo "found!";
            break;
        } else if (!feof($f)) {
            fseek($f, -3, SEEK_CUR);
        }
    }

The algorithm is simple:

  • Read 4 bytes
  • Check if they match the sequence
  • If they are equal → found! Stop execution.
  • If they are not equal, and I'm not at the end of the file, seek back 3 bytes and repeat from step 1.

Why do I seek back 3 bytes? Consider a file with these contents:

 0000 4749 524f 0000 01b0 0013 

If I did not seek back 3 bytes, I would read 0000 4749 in the first iteration, 524f 0000 in the second and 01b0 0013 in the third. The sequence straddles the first two reads, so it would be skipped.
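The effect can be reproduced in memory on the 12 sample bytes above. This is only an illustrative sketch: the variable names are made up for the example, and str_split() stands in for consecutive 4-byte fread() calls without seeking back.

```php
<?php
$data   = hex2bin('00004749524f000001b00013'); // the 12 sample bytes
$needle = hex2bin('4749524f');                 // the sequence being searched for

// Non-overlapping 4-byte reads: the match straddles two chunks, so no chunk equals it.
$chunks = str_split($data, 4);
$foundInChunks = in_array($needle, $chunks, true);

// A 4-byte window that advances one byte at a time (same effect as seeking back 3 bytes).
$foundAt = false;
for ($i = 0; $i + 4 <= strlen($data); $i++) {
    if (substr($data, $i, 4) === $needle) {
        $foundAt = $i;
        break;
    }
}

var_dump($foundInChunks); // bool(false)
var_dump($foundAt);       // int(2)
```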

Problem: it is slow as hell... The application will have to work with files up to 50 MB in size, so this search will take forever.

Is there an optimized function in PHP that would do the job? Is there a faster (not as stupid as mine) way to do this?

+5
2 answers

Reading from disk is slow, and you cannot count on disk caching to hide that; caching is up to the OS. Instead, do your own "caching", as it were: read a large block of bytes at a time, something like 1 MB or more, which reduces the number of disk reads, and then search that block in memory. When you read the next 1 MB, be sure to prepend the last 3 bytes of the previous block, so that a match straddling the boundary is not missed. Search each block in turn until the sequence is found. The actual size of each read should be a balance between RAM usage and the number of disk reads.
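A minimal sketch of that approach, under some assumptions: the helper name findBytesInFile, its parameters and the 1 MiB default block size are hypothetical, chosen only for illustration. It carries the last n-1 bytes of each block over to the next one and returns the byte offset of the first match.

```php
<?php
/**
 * Hypothetical helper (not part of PHP): search a file for a raw byte sequence.
 * Returns the offset of the first match as an int, or false if there is none.
 */
function findBytesInFile(string $filename, string $needle, int $chunkSize = 1048576)
{
    if ($needle === '') {
        return false; // nothing meaningful to search for
    }
    $f = fopen($filename, 'rb');
    if ($f === false) {
        return false;
    }

    $overlap = strlen($needle) - 1; // bytes carried over between blocks
    $carry   = '';                  // tail of the previous block
    $base    = 0;                   // file offset of the first byte of $carry

    while (!feof($f)) {
        $chunk = fread($f, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $buffer = $carry . $chunk;
        $pos = strpos($buffer, $needle);
        if ($pos !== false) {
            fclose($f);
            return $base + $pos;
        }
        // Keep only the last n-1 bytes; they may be the start of a match.
        $keep  = min($overlap, strlen($buffer));
        $carry = $keep > 0 ? substr($buffer, -$keep) : '';
        $base += strlen($buffer) - $keep;
    }

    fclose($f);
    return false;
}
```

Called as findBytesInFile($filename, hex2bin("4749524f")), it would return the match offset or false. Compare the result with === false, since a match at offset 0 is a valid result.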

+1

First of all, your $mysequence does not change during the search, so you can call hex2bin($mysequence) once and compare the result directly with $seq, instead of calling bin2hex() on every read.
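As a small illustration of that point (the sample value assigned to $seq is an assumption standing in for what fread($f, 4) would return):

```php
<?php
$mysequence  = '4749524f';
$searchBytes = hex2bin($mysequence); // converted once, outside any loop

// Assumed sample data: the 4 bytes a call to fread($f, 4) might return.
$seq = "\x47\x49\x52\x4f";

// Raw byte comparison; no bin2hex() call is needed per read.
$matches = ($seq === $searchBytes);
var_dump($matches); // bool(true)
```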

As for making the search itself faster, you can read the file in large buffers and search each buffer for the string. A larger buffer means a faster search, but more memory is required. A quick draft of how the code might look:

    $mysequence = "4749524f";
    $searchBytes = hex2bin($mysequence);
    $crossing = 1 - strlen($searchBytes); // -(length - 1); see below
    $seq = '';
    $buflen = 10000;
    $f = fopen($filename, "rb") or die("Unable to open file!");
    while (!feof($f)) {
        $seq .= fread($f, $buflen);
        if (strpos($seq, $searchBytes) === false) { // strict comparison here: strpos() can return 0!
            // keep the last n-1 bytes, because they may be the beginning of the required sequence
            $seq = substr($seq, $crossing);
        } else {
            echo "found!";
            break;
        }
    }
    fclose($f);
    unset($seq); // no need to keep this in memory any more
+3
