How to determine if a string has been compressed?

How can I determine whether a string was compressed with gzcompress? Is comparing the string sizes before and after calling gzuncompress the right way to do this?

2 answers

A string and a compressed string are both just sequences of bytes, and nothing inherently distinguishes one sequence of bytes from another. Whether a blob of bytes is in a compressed format is something you need to know from accompanying metadata.

If you really need to guess programmatically, there are a few things you can try:

  • Try uncompressing the string and see whether the uncompress operation succeeds. If it fails, the bytes probably do not represent a compressed string.
  • Check for obviously "strange" bytes, such as those below 0x20. These bytes are not commonly used in plain text. There is no real guarantee that they occur in a compressed string, though.
  • Use mb_check_encoding to find out whether the string is valid in the encoding you expect it to be in. If it is not, it may be compressed (or you checked the wrong encoding). The caveat is that almost any sequence of bytes is valid in almost every single-byte encoding, so this only works for multi-byte encodings.
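
The first two heuristics could be sketched roughly as follows, assuming the data was produced by gzcompress (zlib format), as in the question. The names try_gzuncompress and has_control_bytes are hypothetical helpers made up for this illustration, not PHP built-ins.

```php
<?php
/**
 * Hypothetical helper: try to uncompress, returning null on failure.
 */
function try_gzuncompress(string $data): ?string
{
    // gzuncompress() emits a warning and returns false on invalid input;
    // the warning is suppressed because failure is an expected outcome here.
    $result = @gzuncompress($data);
    return $result === false ? null : $result;
}

/**
 * Hypothetical helper: true if the string contains bytes below 0x20
 * other than tab, line feed and carriage return.
 */
function has_control_bytes(string $data): bool
{
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $data) === 1;
}

$packed = gzcompress("some plain text");
$plain  = try_gzuncompress($packed);        // the original text
$failed = try_gzuncompress("plain text");   // null: not valid zlib data
```

Both checks remain guesses: a successful uncompress is strong evidence, while the control-byte test can miss short compressed strings.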

PRE:
I think that if you send a request, you can immediately look in $http_response_header to see whether one of the elements in the array is a variation of Content-Encoding: gzip. But this is LAME!
There is a much better method.
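
For reference, the header check just mentioned might look something like this. It is only a sketch: has_gzip_content_encoding is a hypothetical name, and it assumes the array holds raw header lines in the shape of $http_response_header.

```php
<?php
/**
 * Hypothetical helper: scan raw response header lines for
 * "Content-Encoding: gzip" (or the legacy "x-gzip").
 */
function has_gzip_content_encoding(array $headers): bool
{
    foreach ($headers as $line) {
        // Header names are case-insensitive and spacing varies, so match loosely.
        if (preg_match('/^Content-Encoding:\s*(x-)?gzip\b/i', $line) === 1) {
            return true;
        }
    }
    return false;
}
```

After a file_get_contents() call on an http:// URL, it could be called as has_gzip_content_encoding($http_response_header). It only recognises gzip when it is the first listed coding.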

Here's the HOW ...

Check if it has gzip. Like a BOSS!

According to the gzip RFC (RFC 1952):

The gzip content header is as follows:

    +---+---+---+---+---+---+---+---+---+---+
    |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
    +---+---+---+---+---+---+---+---+---+---+

ID1 and ID2 identify the content as gzip. CM is the compression method; the value 8 denotes deflate, which is the method gzip commonly uses with all web servers.

Oh, and they have fixed values:

  • The value of ID1 is "\x1f"
  • The value of ID2 is "\x8b"
  • The value of CM is "\x08" (or simply 8)
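
To look at these fields directly, the fixed 10-byte header can be decoded with unpack(). This is a minimal sketch; parse_gzip_header is a hypothetical helper name, and the array keys are chosen here only to mirror the diagram above.

```php
<?php
/**
 * Hypothetical helper: decode the fixed 10-byte gzip header into named fields.
 * Returns null when fewer than 10 bytes are available.
 */
function parse_gzip_header(string $data): ?array
{
    if (strlen($data) < 10) {
        return null;
    }
    // C = unsigned byte, V = unsigned 32-bit little-endian integer (MTIME).
    return unpack('Cid1/Cid2/Ccm/Cflg/Vmtime/Cxfl/Cos', substr($data, 0, 10));
}

$header = parse_gzip_header(gzencode("hello"));
printf("ID1=%02x ID2=%02x CM=%d\n", $header['id1'], $header['id2'], $header['cm']);
```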

almost there:

$is_gzip = 0 === mb_strpos($mystery_string , "\x1f" . "\x8b" . "\x08");
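
A byte-oriented variant of the same check, which does not depend on any mbstring encoding setting, might look like this. It is a sketch only, and looks_like_gzip is a hypothetical name.

```php
<?php
/**
 * Hypothetical helper: true if the data starts with the gzip
 * signature bytes ID1, ID2 and CM = 8 (deflate).
 */
function looks_like_gzip(string $data): bool
{
    // strncmp() compares raw bytes, so no character encoding is involved.
    return strncmp($data, "\x1f\x8b\x08", 3) === 0;
}

$gz   = looks_like_gzip(gzencode("hello"));    // gzip format: signature present
$zlib = looks_like_gzip(gzcompress("hello"));  // zlib format: no gzip signature
```

Note that gzcompress produces zlib-format data, which does not begin with these bytes, so this signature only identifies gzip-format data such as the output of gzencode or a gzip-encoded HTTP response.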

Working example

    <?php
    /** @link https://gist.github.com/eladkarako/d8f3addf4e3be92bae96#file-checking_gzip_like_a_boss-php */

    // Initialise the environment for UTF-8 plain-text output.
    date_default_timezone_set("Asia/Jerusalem");
    while (ob_get_level() > 0) ob_end_flush();
    mb_language("uni");
    @mb_internal_encoding('UTF-8');
    setlocale(LC_ALL, 'en_US.UTF-8');
    header('Time-Zone: Asia/Jerusalem');
    header('Charset: UTF-8');
    header('Content-Type: text/plain; charset=UTF-8');
    header('Access-Control-Allow-Origin: *');

    function get($url, $cookie = '') {
      // Collect the client and server addresses for the X-Forwarded-For header.
      $forwarded_for = implode(', ', array_unique(array_filter(array_map(function ($item) {
        return filter_input(INPUT_SERVER, $item, FILTER_SANITIZE_SPECIAL_CHARS);
      }, ['HTTP_X_FORWARDED_FOR', 'REMOTE_ADDR', 'HTTP_CLIENT_IP', 'SERVER_ADDR', 'REMOTE_ADDR']), function ($item) {
        return null !== $item;
      })));

      $html = @file_get_contents($url, false, stream_context_create([
        'http' => [
          'method' => "GET",
          'header' => implode("\r\n", [''
            , 'Pragma: no-cache'
            , 'Cache-Control: no-cache'
            , 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2310.0 Safari/537.36'
            , 'DNT: 1'
            , 'Accept-Language: en-US,en;q=0.8'
            , 'Accept: text/plain'
            , 'X-Forwarded-For: ' . $forwarded_for
            , 'Referer: http://eladkarako.com'
            , 'Connection: close'
            , 'Cookie: ' . $cookie
            , 'Accept-Encoding: gzip'
          ])
        ]]));

      // The gzip signature (ID1, ID2, CM) must sit at offset 0.
      $is_gzip = 0 === mb_strpos($html, "\x1f" . "\x8b" . "\x08", 0, "US-ASCII");

      // zlib_decode() detects the format itself; its second parameter is a
      // maximum output length, not an encoding constant, so it is omitted.
      return $is_gzip ? zlib_decode($html) : $html;
    }

    $html = get('http://www.pogdesign.co.uk/cat/');
    echo $html;

What is worth mentioning here?

  • Start by initializing the PHP engine to use UTF-8 (since we really don't know whether the web server will return gzip content).
  • Providing the Accept-Encoding: gzip header tells the web server that it may return gzip-compressed content.
  • Detect gzip content by looking for the three header bytes at offset 0 (the multibyte function must be called with the US-ASCII encoding).
  • Finally, return plain output; decoding is easy with the ZLIB functions.