String compression complexity

I have a large theoretical string (104 characters) for creating a database that returns the results measured in petabytes. I don’t have that kind of processing power, so I would like to filter out low complexity strings from the database.

My grammar is a modified form of the English alphabet without numerical characters. I read about Kolmogorov complexity and how it is theoretically impossible to calculate, but I just need something basic in C # using compression.

Using these two links

  • How to measure string complexity?
  • How to determine the size of a string and compress it

I came up with this:

MemoryStream ms = new MemoryStream(); GZipStream gzip2 = new GZipStream(ms, CompressionMode.Compress, true); byte[] raw = Encoding.UTF8.GetBytes(element); gzip2.Write(raw, 0, raw.Length); gzip2.Close(); byte[] zipped = ms.ToArray(); // as a BLOB string smallstring = Convert.ToString(zipped); // as a string // store zipped or base64 byte[] raw2 = Encoding.UTF8.GetBytes(smallstring); int startsize = raw.Length; int finishsize = raw2.Length; double percent = Convert.ToDouble(finishsize) / Convert.ToDouble(startsize); if (percent > .75) { ///output } 

my first element:

HHHFHHFFHHFHHFFHHFHHHFFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFH

and it shrinks to a value of 13 , but this other set of chats

mlcllltlgvalvcgvpamdipqtkqdlelpklagtwhsmamatnnislmatlkaplrvhitsllptpednleivlhrwennscvekkvlgektenpkkfkinytvaneatlldtdydnflflclqdqmqlylflllvlfllvlfllvlf

also rated as 13 . There is a mistake, but I do not know how to fix it.

+3
string c # complexity-theory compression
source share
3 answers

Your error is the next part in which you convert the array to a string:

 byte[] zipped = ms.ToArray(); // as a BLOB string smallstring = Convert.ToString(zipped); // as a string // store zipped or base64 byte[] raw2 = Encoding.UTF8.GetBytes(smallstring); 

Calling Convert.ToString() in the array returns some debug output, in this case the string System.Byte[] . (See the following example in ideone .)

You must directly compare the lengths of the uncompressed and compressed byte array:

 int startsize = raw.Length; int finishsize = zipped.Length; 
+2
source share

Here is the code I used

 /// <summary> /// Defines an interface to calculate relevant /// to the input complexity of a string /// </summary> public interface IStringComplexity { double GetCompressionRatio(string input); double GetRelevantComplexity(double min, double max, double current); } 

And the class that implements it

 public class GZipStringComplexity : IStringComplexity { public double GetCompressionRatio(string input) { if (string.IsNullOrEmpty(input)) throw new ArgumentNullException(); byte[] inputBytes = Encoding.UTF8.GetBytes(input); byte[] compressed; using (MemoryStream outStream = new MemoryStream()) { using (var zipStream = new GZipStream( outStream, CompressionMode.Compress)) { using (var memoryStream = new MemoryStream(inputBytes)) { memoryStream.CopyTo(zipStream); } } compressed = outStream.ToArray(); } return (double)inputBytes.Length / compressed.Length; } /// <summary> /// Returns relevant complexity of a string on a scale [0..1], /// where <value>0</value> has very low complexity /// and <value>1</value> has maximum complexity /// </summary> /// <param name="min">minimum compression ratio observed</param> /// <param name="max">maximum compression ratio observed</param> /// <param name="current">the value of compression ration /// for which complexity is being calculated</param> /// <returns>A relative complexity of a string</returns> public double GetRelevantComplexity(double min, double max, double current) { return 1 - current / (max - min); } } 

Here is how you can use it.

 class Program { static void Main(string[] args) { IStringComplexity c = new GZipStringComplexity(); string input1 = "HHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFH"; string input2 = "mlcllltlgvalvcgvpamdipqtkqdlelpklagtwhsmamatnnislmatlkaplrvhitsllptpednleivlhrwennscvekkvlgektenpkkfkinytvaneatlldtdydnflflclqdtttpiqsmmcqylarvlveddeimqgfirafrplprhlwylldlkqmeepcrf"; string inputMax = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"; double ratio1 = c.GetCompressionRatio(input1); //2.9714285714285715 double ratio2 = c.GetCompressionRatio(input2); //1.3138686131386861 double ratioMax = c.GetCompressionRatio(inputMax); //7.5 double complexity1 = c.GetRelevantComplexity(1, ratioMax, ratio1); // ~ 0.54 double complexity2 = c.GetRelevantComplexity(1, ratioMax, ratio2); // ~ 0.80 } } 

Some additional information I found helpful.

You can try using LZMA, LZMA2 or PPMD ​​from the 7zip library . They are relatively easy to set up and provide you with an interface in which you can implement several compression algorithms. I found that these algorithms perform much better compression than GZip, but if you set the compression ratio on a scale, it doesn't really matter.

If you need a normalized value, for example, from 0 to 1, you will need to first calculate the compression ratio for all sequences. This is because you cannot be sure how maximum compression is possible.

+1
source share

Of course it will work. While you are simply comparing sizes, it really doesn't matter which compression algorithm you use. Your main concern is simply to keep track of the amount of processing power you use to compress the strings.

0
source share

All Articles