Parsing a known length UTF-8 string in Common Lisp one byte at a time

I am writing a program in Common Lisp for editing binary files created by Minecraft, which use the NBT format, documented here: http://minecraft.gamepedia.com/NBT_format?cookieSetup=true (I know that such tools already exist, for example NBTEditor and MCEdit, but none of them are written in Common Lisp, and I thought this project would be a good learning exercise).

So far, one of the things that I have not been able to implement on my own is a function for reading a UTF-8 string of known length that contains characters represented by more than one octet (that is, characters outside ASCII). In the NBT format, each string is encoded in UTF-8 and preceded by a short (two-octet) integer n indicating the length of the string. Therefore, assuming that only ASCII characters are present in the string, I can simply read a sequence of n octets from the stream and convert them to a string, like this:

 (defun read-utf-8-string (string-length byte-stream)
   (let ((seq (make-array string-length
                          :element-type '(unsigned-byte 8)
                          :fill-pointer t)))
     (setf (fill-pointer seq) (read-sequence seq byte-stream))
     (flexi-streams:octets-to-string seq :external-format :utf-8)))

But if one or more characters have a character code greater than 255, it is encoded in two or more bytes, as shown in this example:

 (flexi-streams:string-to-octets "wife" :external-format :utf-8)
 ==> #(119 105 102 101)

 (flexi-streams:string-to-octets "жена" :external-format :utf-8)
 ==> #(208 182 208 181 208 189 208 176)

Both strings have the same number of characters, but each character of the Russian word is encoded using twice as many octets, so the encoded string is twice as large as the English one. Knowing the length of the string in characters therefore does not help when using read-sequence. Even if the size of the string in octets were known, there would still be no way to tell which octets should be decoded individually and which should be grouped together for conversion. So instead of rolling my own function, I tried to find a way to have either the implementation (Clozure CL) or an external library do this for me. Unfortunately, this was also problematic, because my parser relies on using the same file stream for all reading functions, for example:

 (with-open-file (stream "test.dat" :direction :input
                         :element-type '(unsigned-byte 8))
   ;; Read entire contents of NBT file from stream here
   )

which commits me to :element-type '(unsigned-byte 8) and therefore prevents me from specifying a character encoding and using read-char (or an equivalent) like this:

 (with-open-file (stream "test.dat" :external-format :utf-8) ...) 

:element-type must remain '(unsigned-byte 8) so that I can read and write integers and floats of various sizes. To avoid having to convert octet sequences to strings manually, I first wondered whether there was a way to change the stream's element type to character while the file is open, which led me to this discussion: https://groups.google.com/forum/#!searchin/comp.lang.lisp/binary$20write$20read/comp.lang.lisp/N0IESNPSPCU/Qmcvtk0HkC0J Apparently, some CL implementations, such as SBCL, use bivalent streams by default, so both read-byte and read-char can be used on the same stream; if I took that approach, I would still need to specify an :external-format for the stream (:utf-8), even though that format should only apply when reading characters, not when reading raw bytes.

I used several flexi-streams functions in the examples above for brevity, but so far my code uses only built-in stream types, and I have not used flexi streams themselves yet. Is this a good use case for flexi streams? Having an extra layer of abstraction that lets me read raw bytes and UTF-8 characters interchangeably from the same stream would be ideal.
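For reference, here is a minimal sketch of what such a wrapping might look like with flexi-streams (the file name test.dat is taken from the example above; whether this fits the rest of the parser is exactly the question):

```lisp
;; Load flexi-streams, e.g. via Quicklisp.
(ql:quickload :flexi-streams)

(with-open-file (raw "test.dat" :direction :input
                     :element-type '(unsigned-byte 8))
  ;; Wrap the binary stream in a flexi stream. The external format
  ;; applies only to character reads; read-byte still returns raw octets
  ;; from the same underlying stream.
  (let ((stream (flexi-streams:make-flexi-stream raw :external-format :utf-8)))
    (list (read-byte stream)      ; one raw octet
          (read-char stream))))   ; one decoded UTF-8 character
```

flexi-streams also provides with-input-from-sequence, which is convenient for testing such readers against in-memory octet vectors instead of files.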

Any advice from those familiar with flexi streams (or other appropriate approaches) will be greatly appreciated.

Thanks.

1 answer

Here is what I wrote:

First, we want to know how many bytes the encoding of a character occupies, based on its first byte.

 (defun utf-8-number-of-bytes (first-byte)
   "Returns the length of the UTF-8 encoding in number of bytes,
 based on the first byte. The length is a number between 1 and 4."
   (declare (fixnum first-byte))
   (cond ((= 0       (ldb (byte 1 7) first-byte)) 1)
         ((= #b110   (ldb (byte 3 5) first-byte)) 2)
         ((= #b1110  (ldb (byte 4 4) first-byte)) 3)
         ((= #b11110 (ldb (byte 5 3) first-byte)) 4)
         (t (error "unknown number of utf-8 bytes for ~a" first-byte))))
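A quick sanity check of the function above against the leading-bit patterns (119 and 208 are the first bytes of "w" and "ж" from the question):

```lisp
(utf-8-number-of-bytes 119) ; => 1  (0xxxxxxx, plain ASCII "w")
(utf-8-number-of-bytes 208) ; => 2  (110xxxxx, starts the encoding of "ж")
(utf-8-number-of-bytes 226) ; => 3  (1110xxxx, starts a three-byte sequence)
(utf-8-number-of-bytes 240) ; => 4  (11110xxx, starts a four-byte sequence)
```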

Then we decode:

 (defun utf-8-decode-unicode-character-code-from-stream (stream)
   "Decodes byte values, read from a binary byte stream, which describe
 a character encoded using UTF-8. Returns the character code and the
 number of bytes read."
   (let* ((first-byte (read-byte stream))
          (number-of-bytes (utf-8-number-of-bytes first-byte)))
     (declare (fixnum first-byte number-of-bytes))
     ;; UTF-8-NUMBER-OF-BYTES only returns values between 1 and 4, and
     ;; ECASE itself signals an error for any other key, so no explicit
     ;; error clause is needed here.
     (ecase number-of-bytes
       (1 (values (ldb (byte 7 0) first-byte) 1))
       (2 (values (logior (ash (ldb (byte 5 0) first-byte) 6)
                          (ldb (byte 6 0) (read-byte stream)))
                  2))
       (3 (values (logior (ash (ldb (byte 4 0) first-byte) 12)
                          (ash (ldb (byte 6 0) (read-byte stream)) 6)
                          (ldb (byte 6 0) (read-byte stream)))
                  3))
       (4 (values (logior (ash (ldb (byte 3 0) first-byte) 18)
                          (ash (ldb (byte 6 0) (read-byte stream)) 12)
                          (ash (ldb (byte 6 0) (read-byte stream)) 6)
                          (ldb (byte 6 0) (read-byte stream)))
                  4)))))

You know how many characters there are: N characters. You can allocate a string of N characters (which, in a Lisp with Unicode support, can hold any character). Then you call the function N times, convert each returned code to a character, and put it into the string.
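Putting that together, a sketch of such a reader might look like this (the name read-utf-8-string echoes the question; it assumes the two functions above are loaded and, per this answer, that the known length counts characters):

```lisp
(defun read-utf-8-string (character-count stream)
  "Read CHARACTER-COUNT UTF-8 encoded characters from the binary
STREAM and return them as a string."
  (let ((string (make-string character-count)))
    (dotimes (i character-count string)
      (setf (char string i)
            (code-char
             (utf-8-decode-unicode-character-code-from-stream stream))))))
```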

