Efficiently compress strings of 10-1000 characters in Java?

I need to compress strings (written in a known but variable language) of 10 to 1000 characters into individual UDP packets.

What compression algorithms available in Java are good for this task?

Are there open source Java libraries available?

+7
4 answers

"It depends".

I would start with the main candidates: LZMA ("7-zip"), deflate (plain; zlib: deflate plus a small wrapper; gzip: deflate plus a slightly bigger wrapper; zip: deflate plus even more wrapper), bzip2 (I doubt it would do so well here, it works best with a relatively large window), perhaps even one of the other LZ* branches such as LZS, which has an RFC for IP traffic compression, but ...

... do some empirical analysis of compression/throughput using several different approaches. Java has GZIPOutputStream ("deflate in a gzip wrapper") and DeflaterOutputStream ("plain deflate", recommended over the gzip or zip "wrappers" here), and there are Java LZMA implementations (you just need the compressor, not the container), so all of these should be trivial to mock up.
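To make the wrapper differences concrete, here is a minimal sketch (my own illustration, not from the answer) that runs the same payload through raw deflate, the zlib wrapper, and the gzip wrapper using only the JDK's java.util.zip classes:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;

public class WrapperOverhead {

    // Run one Deflater over the whole input and return the produced bytes.
    static byte[] deflateWith(Deflater deflater, byte[] input) {
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length + 64]; // plenty for small packets
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    // "plain deflate": no wrapper at all (nowrap = true)
    static byte[] raw(byte[] input) {
        return deflateWith(new Deflater(Deflater.DEFAULT_COMPRESSION, true), input);
    }

    // zlib: deflate + 2-byte header + 4-byte Adler-32 trailer
    static byte[] zlib(byte[] input) {
        return deflateWith(new Deflater(), input);
    }

    // gzip: deflate + 10-byte header + 8-byte CRC-32/length trailer
    static byte[] gzip(byte[] input) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(baos)) {
            gz.write(input);
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] packet = "name=value&name=value&name=value".getBytes(StandardCharsets.UTF_8);
        System.out.println("original: " + packet.length
                + "  raw: " + raw(packet).length
                + "  zlib: " + zlib(packet).length
                + "  gzip: " + gzip(packet).length);
    }
}
```

For a small packet the zlib wrapper costs 6 extra bytes and the gzip wrapper 18, which is exactly why "plain deflate" matters at these sizes.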

If there is redundancy between packets, it could perhaps be exploited, for example by building shared dictionaries, custom Huffman tables, or simply pre-seeding the "window" of one of the algorithms, but packet loss and "compressibility" probably need to be taken into account. Going down this route adds much more complexity. More ideas for helping the compressor can be found in SO: How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?
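A sketch of the shared-dictionary idea with the JDK's Deflater/Inflater setDictionary (the dictionary contents here are hypothetical; in practice you would build it from the redundancy in real traffic):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionary {
    // Hypothetical shared dictionary: substrings that frequently occur in the packets.
    // Both sides must agree on it out of band.
    static final byte[] DICT =
            "GET POST HTTP/1.1 Content-Type: application/json Host: ".getBytes(StandardCharsets.UTF_8);

    static byte[] compress(byte[] input, byte[] dict) {
        Deflater deflater = new Deflater(); // zlib wrapper, so FDICT is signalled
        if (dict != null) {
            deflater.setDictionary(dict);
        }
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length + 64];
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    static byte[] decompress(byte[] compressed, byte[] dict, int maxLen) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] buf = new byte[maxLen];
        int n = inflater.inflate(buf);
        if (n == 0 && inflater.needsDictionary()) { // zlib signals the dictionary is required
            inflater.setDictionary(dict);
            n = inflater.inflate(buf);
        }
        inflater.end();
        return Arrays.copyOf(buf, n);
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] msg = "POST /api HTTP/1.1 Content-Type: application/json".getBytes(StandardCharsets.UTF_8);
        System.out.println("with dictionary: " + compress(msg, DICT).length
                + ", without: " + compress(msg, null).length);
    }
}
```

The dictionary itself is never transmitted, which is where the savings come from; but it also means both ends must stay in sync, one more reason this route adds complexity.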

In addition, the protocol should have a simple fallback to zero compression, because some data (especially small or random data) may be practically incompressible, or may "compress" to a larger size (zlib actually has such a guard built in, but it also has its wrapper overhead, so very small data would be better encoded separately). The wrapper overhead of the compressed formats, for example gzip or zip, should also be taken into account at these sizes. This is especially important for string data of less than ~100 characters.
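A minimal sketch of such a zero-compression fallback, assuming a one-byte token at the front of each packet (the token values are my own convention, not part of any standard):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class MaybeCompressed {
    static final byte RAW = 0, DEFLATED = 1;

    // Try raw deflate; keep whichever payload is smaller. The first byte of the
    // packet is the token telling the receiver whether to decompress.
    static byte[] encode(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION, true); // no wrapper
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length + 64];
        int n = deflater.deflate(buf);
        boolean complete = deflater.finished();
        deflater.end();

        if (complete && n < input.length) {
            byte[] out = new byte[n + 1];
            out[0] = DEFLATED;
            System.arraycopy(buf, 0, out, 1, n);
            return out;
        }
        byte[] out = new byte[input.length + 1];
        out[0] = RAW;
        System.arraycopy(input, 0, out, 1, input.length);
        return out;
    }

    static byte[] decode(byte[] packet, int maxLen) throws DataFormatException {
        byte[] body = Arrays.copyOfRange(packet, 1, packet.length);
        if (packet[0] == RAW) {
            return body;
        }
        Inflater inflater = new Inflater(true);
        // a raw-deflate Inflater needs one extra "dummy" byte of input (see its Javadoc)
        inflater.setInput(Arrays.copyOf(body, body.length + 1));
        byte[] buf = new byte[maxLen];
        int n = inflater.inflate(buf);
        inflater.end();
        return Arrays.copyOf(buf, n);
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] text = "abcabcabcabcabcabcabcabcabcabc".getBytes(StandardCharsets.UTF_8);
        byte[] packet = encode(text);
        System.out.println(text.length + " bytes -> " + packet.length + " byte packet, token=" + packet[0]);
    }
}
```

The one-byte token costs almost nothing, and the receiver never has to guess: incompressible packets go through untouched.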

Happy coding.


Another thing to keep in mind is the encoding used to turn the characters into the output stream. I would start with UTF-8, but that may not always be the best choice.
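As a quick sanity check on that point, the byte count depends on both the encoding and the script being written; a small sketch comparing UTF-8 and UTF-16 sizes (the sample strings are my own):

```java
import java.nio.charset.StandardCharsets;

public class EncodingSize {
    static int utf8Len(String s)  { return s.getBytes(StandardCharsets.UTF_8).length; }
    static int utf16Len(String s) { return s.getBytes(StandardCharsets.UTF_16BE).length; } // BE: no BOM

    public static void main(String[] args) {
        String ascii = "plain English text";
        String cjk = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"; // Japanese sample
        System.out.println("ascii  UTF-8: " + utf8Len(ascii) + ", UTF-16: " + utf16Len(ascii));
        System.out.println("cjk    UTF-8: " + utf8Len(cjk) + ", UTF-16: " + utf16Len(cjk));
    }
}
```

For mostly-ASCII text UTF-8 is half the size of UTF-16 before compression even starts; for CJK-heavy text the relationship flips, so the "well-known but variable language" in the question genuinely matters here.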


See SO: Best compression algorithm for short text strings, which suggests SMAZ, but I do not know how that algorithm would handle unicode/binary data.


Also consider that not all deflate (or other format) implementations are created equal. I am not talking about standard Java deflate vs. third parties (say, JZlib) in terms of performance for small data, but consider Small Payload Compression [.NET], which shows rather negative numbers for the "same" compression format. The article also puts it nicely:

... it is usually most beneficial to compress anyway and determine which payload (the compressed or the uncompressed one) is the smallest, and include a small token to indicate whether decompression is required.

My final conclusion: always test with real data and measure the benefits, or you may be in for a little surprise in the end!

Happy coding. Actually this time.

+8

The simplest approach would be to layer a GZIPOutputStream on top of a ByteArrayOutputStream, since it is built into the JDK:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
GZIPOutputStream zos = new GZIPOutputStream(baos);
zos.write(someText.getBytes(StandardCharsets.UTF_8)); // be explicit about the charset
zos.close(); // finishes and flushes the gzip stream
byte[] udpBuffer = baos.toByteArray();

There may be other algorithms that do a better job, but I would try this first to see whether it meets your needs, since it requires no additional jars and does a pretty good job.
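For completeness, the receiving side can reverse this with GZIPInputStream; a minimal round-trip sketch (readAllBytes needs Java 9+):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream zos = new GZIPOutputStream(baos)) {
            zos.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return baos.toByteArray();
    }

    static String decompress(byte[] udpBuffer) throws IOException {
        try (GZIPInputStream zis = new GZIPInputStream(new ByteArrayInputStream(udpBuffer))) {
            return new String(zis.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "hello hello hello hello";
        System.out.println(decompress(compress(text))); // prints the original text
    }
}
```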

+5

Most standard compression algorithms do not work well with small amounts of data. There is often a header and a checksum, and it takes time for the compression to get going, that is, for it to build a data dictionary based on the data it has seen so far.

For this reason, you may find that

  • small packets may be smaller, or the same size, when left uncompressed;
  • a simpler application/protocol-specific compression scheme works better;
  • you need to provide a pre-built data dictionary to the compression algorithm and cut the headers down as much as possible.

I usually use the second option for small data packets.

+5

A good compression algorithm for short strings/URLs is an LZW implementation; it is in Java and can easily be ported to a GWT client: https://code.google.com/p/lzwj/source/browse/src/main/java/by/dev/madhead/lzwj/compress/LZW.java

Some comments:

  • use a 9-bit codeword length for short strings (though you can experiment to see what works best). The resulting ratio ranges from 1 (very short strings compress no smaller than the original) down to 0.5 (longer strings).
  • in the GWT client case, for other codeword lengths the input/output handling had to be adjusted to work byte by byte, to avoid errors when buffering a bit sequence into a long, which is emulated for JS.

I use it for compact encoding of URL parameters in a GWT client, together with base64 encoding and AutoBean serialization to JSON.

upd: a base64 implementation is here: http://www.source-code.biz/base64coder/java . You have to change it to make it URL-safe, i.e. replace the following characters:

'+' → '-', '/' → '~', '=' → '_'
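A small sketch of that substitution, using the answer's own mapping (which differs from the standard URL-safe alphabet):

```java
import java.util.Base64;

public class UrlSafeBase64 {
    // The answer's custom mapping: '+' -> '-', '/' -> '~', '=' -> '_'.
    static String toUrlSafe(String base64) {
        return base64.replace('+', '-').replace('/', '~').replace('=', '_');
    }

    static String fromUrlSafe(String urlSafe) {
        return urlSafe.replace('-', '+').replace('~', '/').replace('_', '=');
    }

    public static void main(String[] args) {
        String b64 = Base64.getEncoder().encodeToString(new byte[]{(byte) 0xfb, (byte) 0xff, 0x00});
        System.out.println(b64 + " -> " + toUrlSafe(b64));
    }
}
```

On Java 8+ there is also java.util.Base64.getUrlEncoder() built in, which uses the standard '-' and '_' substitutions instead; the custom mapping above is only needed if you must stay byte-compatible with the linked Base64Coder approach.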
+1
