Shorten already short string in Java

I am looking for a way to shorten an already short string as much as possible.

The string is a hostname:port combination and may look like "my-domain.se:2121" or "123.211.80.4:2122".

I know that regular compression is largely out of the question for strings this short, due to overhead and lack of repetition, but I have an idea of how to do this.

Since the alphabet is limited to 39 characters ([a-z], [0-9], and - : .), each character can be encoded in 6 bits. This reduces the length by 25% compared to ASCII. So my suggestion goes something like this:

  • Encode the string into an array of bytes, using some custom encoding.
  • Decode the array of bytes into a UTF-8 or ASCII string (this string will obviously not make sense).

And then reverse the process to get the original string back.

So to my questions:

  • Could this work?
  • Is there a better way?
  • How?
6 answers

You can encode the string as base 40, which is more compact than base 64. This lets you fit 12 such tokens into a 64-bit long. The 40th token can be an end-of-string token to give you the length (since the data will no longer be a whole number of bytes).
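To illustrate the claim about 12 tokens per 64-bit value, here is a minimal sketch. The class name `Base40Long` and its methods are hypothetical, not part of the answer. It assumes token 0 is reserved as padding, and it relies on unsigned arithmetic because 40^12 is larger than `Long.MAX_VALUE` but still smaller than 2^64.

```java
final class Base40Long {
    // Index 0 is reserved as padding / end-of-string; real characters are 1..39.
    static final String ALPHABET = "\0abcdefghijklmnopqrstuvwxyz0123456789-:.";
    static final int BASE = 40;
    static final int MAX_TOKENS = 12; // 40^12 < 2^64

    // Packs up to 12 characters into one long, first character in the lowest digit.
    static long pack(String s) {
        if (s.length() > MAX_TOKENS) {
            throw new IllegalArgumentException("at most " + MAX_TOKENS + " characters");
        }
        long value = 0;
        for (int i = s.length() - 1; i >= 0; i--) {
            int idx = ALPHABET.indexOf(s.charAt(i));
            if (idx <= 0) {
                throw new IllegalArgumentException("unsupported character: " + s.charAt(i));
            }
            // May exceed Long.MAX_VALUE and look negative, but never exceeds 2^64 - 1.
            value = value * BASE + idx;
        }
        return value;
    }

    // Reverses pack() by peeling off base-40 digits with unsigned division.
    static String unpack(long value) {
        StringBuilder sb = new StringBuilder();
        while (value != 0) {
            int idx = (int) Long.remainderUnsigned(value, BASE);
            if (idx == 0) {
                throw new IllegalArgumentException("not a value produced by pack()");
            }
            sb.append(ALPHABET.charAt(idx));
            value = Long.divideUnsigned(value, BASE);
        }
        return sb.toString();
    }
}
```

A longer address would be split into 12-character chunks, each stored in 8 bytes, which works out to the same 2 bytes per 3 characters as the encoder in this answer.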

If you use arithmetic coding, the result could be much smaller, but you would need a table of frequencies for each token (built from a long list of possible examples).

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.Arrays;

    class Encoder {
        public static final int BASE = 40;
        StringBuilder chars = new StringBuilder(BASE);
        byte[] index = new byte[256];

        {
            // Token 0 is reserved as the end-of-string marker.
            chars.append('\0');
            for (char ch = 'a'; ch <= 'z'; ch++) chars.append(ch);
            for (char ch = '0'; ch <= '9'; ch++) chars.append(ch);
            chars.append("-:.");
            // Map each supported character to its token; -1 marks unsupported characters.
            Arrays.fill(index, (byte) -1);
            for (byte i = 0; i < chars.length(); i++) index[chars.charAt(i)] = i;
        }

        public byte[] encode(String address) {
            try {
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                DataOutputStream dos = new DataOutputStream(baos);
                // Three tokens fit in one 16-bit char because 40^3 = 64000 < 65536.
                for (int i = 0; i < address.length(); i += 3) {
                    switch (Math.min(3, address.length() - i)) {
                        case 1: // last one.
                            byte b = index[address.charAt(i)];
                            dos.writeByte(b);
                            break;
                        case 2:
                            char ch = (char) ((index[address.charAt(i + 1)]) * 40 + index[address.charAt(i)]);
                            dos.writeChar(ch);
                            break;
                        case 3:
                            char ch2 = (char) ((index[address.charAt(i + 2)] * 40 + index[address.charAt(i + 1)]) * 40 + index[address.charAt(i)]);
                            dos.writeChar(ch2);
                            break;
                    }
                }
                return baos.toByteArray();
            } catch (IOException e) {
                throw new AssertionError(e);
            }
        }

        public static void main(String[] args) {
            Encoder encoder = new Encoder();
            for (String s : "twitter.com:2122,123.211.80.4:2122,my-domain.se:2121,www.stackoverflow.com:80".split(",")) {
                System.out.println(s + " (" + s.length() + " chars) encoded is " + encoder.encode(s).length + " bytes.");
            }
        }
    }

prints

    twitter.com:2122 (16 chars) encoded is 11 bytes.
    123.211.80.4:2122 (17 chars) encoded is 12 bytes.
    my-domain.se:2121 (17 chars) encoded is 12 bytes.
    www.stackoverflow.com:80 (24 chars) encoded is 16 bytes.

I leave decoding as an exercise. ;)
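For readers who want to check their solution to that exercise, one possible decoder is sketched below. This is not the answerer's code: `Base40Codec` is a hypothetical name, and its `encode` is a compact re-implementation that is assumed to produce the same byte layout as the encoder above (three characters per big-endian 16-bit value, a single trailing character as one byte). The decoder relies on real characters never using token 0, so a 16-bit value below 1600 can only be a two-character group.

```java
import java.io.ByteArrayOutputStream;

final class Base40Codec {
    // Token 0 is reserved, so every real character maps to 1..39.
    static final String ALPHABET = "\0abcdefghijklmnopqrstuvwxyz0123456789-:.";

    static int token(char c) {
        int idx = ALPHABET.indexOf(c);
        if (idx <= 0) throw new IllegalArgumentException("unsupported character: " + c);
        return idx;
    }

    // 3 chars -> 2 bytes, 2 trailing chars -> 2 bytes, 1 trailing char -> 1 byte.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i += 3) {
            int n = Math.min(3, s.length() - i);
            int v = 0;
            // The first character of the group ends up in the lowest base-40 digit.
            for (int j = n - 1; j >= 0; j--) v = v * 40 + token(s.charAt(i + j));
            if (n > 1) out.write(v >>> 8); // high byte first, like DataOutputStream.writeChar
            out.write(v);                  // write(int) keeps only the low 8 bits
        }
        return out.toByteArray();
    }

    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        for (; i + 1 < data.length; i += 2) {
            int v = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            int first = v % 40, second = v / 40 % 40, third = v / 1600;
            if (first == 0 || second == 0 || third >= 40) {
                throw new IllegalArgumentException("corrupt input at byte " + i);
            }
            sb.append(ALPHABET.charAt(first)).append(ALPHABET.charAt(second));
            // A zero third digit means the group held only two characters.
            if (third != 0) sb.append(ALPHABET.charAt(third));
        }
        if (i < data.length) {
            // An odd byte count means the last group was a single character.
            int last = data[i] & 0xFF;
            if (last == 0 || last >= 40) throw new IllegalArgumentException("corrupt final byte");
            sb.append(ALPHABET.charAt(last));
        }
        return sb.toString();
    }
}
```

With this layout, `Base40Codec.decode(Base40Codec.encode("my-domain.se:2121"))` returns the original string, and the encoded form is 12 bytes long, matching the sizes printed above.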


First of all, IP addresses are designed to fit in 4 bytes and port numbers in 2. The ASCII representation is intended only for humans, so there is no point in compressing it.
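A minimal sketch of that 6-byte form, assuming the input is always a dotted-quad IPv4 address followed by a colon and a port. The class `Ipv4Port` and its method names are hypothetical, and the parsing is done by hand so that no DNS lookup is involved.

```java
final class Ipv4Port {
    // Packs "a.b.c.d:port" into 4 address bytes followed by a big-endian 2-byte port.
    static byte[] pack(String hostPort) {
        int colon = hostPort.lastIndexOf(':');
        if (colon < 0) throw new IllegalArgumentException("missing port: " + hostPort);
        String[] octets = hostPort.substring(0, colon).split("\\.");
        int port = Integer.parseInt(hostPort.substring(colon + 1));
        if (octets.length != 4 || port < 0 || port > 65535) {
            throw new IllegalArgumentException("not an IPv4:port string: " + hostPort);
        }
        byte[] out = new byte[6];
        for (int i = 0; i < 4; i++) {
            int octet = Integer.parseInt(octets[i]);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + octets[i]);
            }
            out[i] = (byte) octet;
        }
        out[4] = (byte) (port >>> 8);
        out[5] = (byte) port;
        return out;
    }

    // Rebuilds the text form; "& 0xFF" undoes Java's signed byte interpretation.
    static String unpack(byte[] b) {
        if (b.length != 6) throw new IllegalArgumentException("expected 6 bytes");
        int port = ((b[4] & 0xFF) << 8) | (b[5] & 0xFF);
        return (b[0] & 0xFF) + "." + (b[1] & 0xFF) + "." + (b[2] & 0xFF) + "." + (b[3] & 0xFF) + ":" + port;
    }
}
```

For "123.211.80.4:2122" this stores 6 bytes instead of 17 characters.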

Your idea of compressing domain name strings is doable.


Well, in your case, I would use an algorithm specialized for your use case. Recognize that you can store something other than strings. For an IPv4:port address, you would have a class that holds 6 bytes: 4 for the address and 2 for the port. Another type would handle alphanumeric hostnames. The port is always stored in two bytes. The hostname part itself may also have specialized support, for example for .com. Thus, the class hierarchy could be:

             HostPort
                |
           +----+--------+
           |             |
         IPv4      HostnamePort
                         |
                DotComHostnamePort

    public interface HostPort extends CharSequence { }

    public class HostPorts {
        public static HostPort parse(String hostPort) { ... }
    }

In this case, DotComHostnamePort lets you drop .com from the hostname and save 4 characters or bytes, depending on whether hostnames are stored in Punycode or in UTF-16 form.


The first two bytes can hold the port number. If you always start with a fixed-length port number, you do not need to include the : delimiter. Instead, use a bit indicating whether an IP address (see Karl Bielefeld's answer) or a hostname follows.
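A rough sketch of such a layout follows. The answer does not say where the flag bit lives, so this example assumes a whole tag byte after the port for simplicity. The class `PortFirst`, the tag values, and storing the hostname as plain ASCII bytes are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

final class PortFirst {
    static final byte TAG_IPV4 = 0;     // 4 raw address bytes follow
    static final byte TAG_HOSTNAME = 1; // ASCII hostname bytes follow

    // Layout: [port high][port low][tag][payload...], with no ':' stored.
    static byte[] encode(String hostPort) {
        int colon = hostPort.lastIndexOf(':');
        if (colon < 0) throw new IllegalArgumentException("missing port: " + hostPort);
        String host = hostPort.substring(0, colon);
        int port = Integer.parseInt(hostPort.substring(colon + 1));
        if (port < 0 || port > 65535) throw new IllegalArgumentException("bad port: " + port);

        boolean ipv4 = host.matches("(\\d{1,3}\\.){3}\\d{1,3}");
        byte[] payload;
        if (ipv4) {
            String[] parts = host.split("\\.");
            payload = new byte[4];
            for (int i = 0; i < 4; i++) {
                int octet = Integer.parseInt(parts[i]);
                if (octet > 255) throw new IllegalArgumentException("bad octet: " + octet);
                payload[i] = (byte) octet;
            }
        } else {
            payload = host.getBytes(StandardCharsets.US_ASCII);
        }

        byte[] out = new byte[3 + payload.length];
        out[0] = (byte) (port >>> 8);
        out[1] = (byte) port;
        out[2] = ipv4 ? TAG_IPV4 : TAG_HOSTNAME;
        System.arraycopy(payload, 0, out, 3, payload.length);
        return out;
    }

    static String decode(byte[] data) {
        int port = ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
        String host;
        if (data[2] == TAG_IPV4) {
            host = (data[3] & 0xFF) + "." + (data[4] & 0xFF) + "."
                 + (data[5] & 0xFF) + "." + (data[6] & 0xFF);
        } else {
            host = new String(data, 3, data.length - 3, StandardCharsets.US_ASCII);
        }
        return host + ":" + port;
    }
}
```

The hostname payload could in turn be passed through a denser encoding such as the base-40 scheme from the first answer.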


You can encode them using CDC display code. This 6-bit encoding was used in the old days, when bits were scarce and programmers were nervous.


What you propose is similar to Base64 encoding/decoding, and there may be some mileage in looking at some of those implementations (Base64 uses 6 bits per character).

As a starter, if you use the Apache Commons Codec Base64 class:

    String x = new String(Base64.decodeBase64("my-domain.se:2121".getBytes()));
    String y = new String(Base64.encodeBase64(x.getBytes()));
    System.out.println("x = " + x);
    System.out.println("y = " + y);

It will shorten your string by a few characters. This obviously does not work, though, because you do not end up where you started.
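For comparison, here is a small sketch of Base64 used in its intended direction, bytes to text. It uses the JDK's `java.util.Base64` (Java 8+) rather than the Apache library from the snippet above, and the class name `Base64Growth` is hypothetical. It shows that the round trip is lossless, but the encoded form is longer than the input, since every 3 input bytes become 4 output characters.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64Growth {
    // Encodes the ASCII bytes of the string as Base64 text.
    static String encode(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.US_ASCII));
    }

    // Decodes Base64 text back into the original ASCII string.
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String original = "my-domain.se:2121";
        String encoded = encode(original);
        // 17 input bytes are padded to 6 groups of 3, giving 6 * 4 = 24 characters.
        System.out.println(original.length() + " chars -> " + encoded.length() + " chars");
        System.out.println(decode(encoded).equals(original)); // prints "true"
    }
}
```

So Base64 itself is the wrong tool for shrinking the string, although its 6-bits-per-symbol packing logic is a useful model for a custom encoder.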
