I am writing a program in Common Lisp for editing Minecraft binary files that use the NBT format, documented here: http://minecraft.gamepedia.com/NBT_format (I know that such tools already exist, for example NBTEditor and MCEdit, but none of them are written in Common Lisp, and I thought this project would be a good exercise for learning).
So far, one of the things I have not been able to implement on my own is a function for reading a UTF-8 string of known length that contains characters represented by more than one octet (that is, characters beyond ASCII). In the NBT format, each string is encoded in UTF-8 and preceded by a two-octet short integer n indicating the string's length. Therefore, assuming that only ASCII characters are present in the string, I can simply read a sequence of n octets from the stream and convert it to a string, using this:
(defun read-utf-8-string (string-length byte-stream)
  (let ((seq (make-array string-length
                         :element-type '(unsigned-byte 8)
                         :fill-pointer t)))
    (setf (fill-pointer seq) (read-sequence seq byte-stream))
    (flexi-streams:octets-to-string seq :external-format :utf-8)))
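For completeness, here is how I read the two-octet length prefix itself; the prefix is big-endian in NBT, and the helper name read-u2 is my own:

```lisp
;; Read the unsigned big-endian two-octet length prefix that precedes
;; each NBT string (the helper name read-u2 is my own).
(defun read-u2 (byte-stream)
  (let ((high (read-byte byte-stream)))
    (logior (ash high 8) (read-byte byte-stream))))

;; Usage: (read-utf-8-string (read-u2 stream) stream)
```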
But if one or more characters have a character code greater than 127, each such character is encoded in two or more octets, as shown in this example:
(flexi-streams:string-to-octets "wife" :external-format :utf-8) ==> #(119 105 102 101)
(flexi-streams:string-to-octets "жена" :external-format :utf-8) ==> #(208 182 208 181 208 189 208 176)
Both words have the same number of characters, but each character of the Russian word is encoded with twice as many octets, so the octet sequence is twice as long as the English one. Knowing the length of the string in characters therefore does not help when calling read-sequence. Even if the size of the string in octets (i.e., the number of octets needed to encode it) were known, there would still be no way to tell which of those octets should be converted to a character individually and which should be grouped together for conversion. So, instead of rolling my own function, I tried to find a way to get either my implementation (Clozure CL) or an external library to do this for me. Unfortunately, that was also problematic, because my parser relies on using the same file stream for all of its reading functions, for example:
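To illustrate the grouping rule I would otherwise have to implement by hand: the leading octet of each UTF-8 character determines how many octets belong to it. A minimal sketch (it assumes valid input, so continuation octets never appear in the leading position):

```lisp
;; Number of octets in the UTF-8 character whose first octet is
;; LEADING-OCTET. Continuation octets (#x80-#xBF) are assumed never
;; to appear in the leading position.
(defun utf-8-char-length (leading-octet)
  (cond ((< leading-octet #x80) 1)    ; 0xxxxxxx: ASCII, one octet
        ((< leading-octet #xE0) 2)    ; 110xxxxx + 1 continuation octet
        ((< leading-octet #xF0) 3)    ; 1110xxxx + 2 continuation octets
        (t 4)))                       ; 11110xxx + 3 continuation octets
```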
(with-open-file (stream "test.dat"
                        :direction :input
                        :element-type '(unsigned-byte 8))
  ;; Read entire contents of NBT file from stream here
  )
which limits me to :element-type '(unsigned-byte 8) and therefore prevents me from specifying a character encoding and using read-char (or an equivalent) as follows:
(with-open-file (stream "test.dat" :external-format :utf-8) ...)
:element-type must be '(unsigned-byte 8) so that I can read and write integers and floats of various sizes. To avoid having to convert octet sequences to strings manually, I first wondered whether there was a way to change the element type to character while the file is open, which led me to a discussion here: https://groups.google.com/forum/#!searchin/comp.lang.lisp/binary$20write$20read/comp.lang.lisp/N0IESNPSPCU/Qmcvtk0HkC0J Apparently, some CL implementations, such as SBCL, use bivalent streams by default, so both read-byte and read-char can be used on the same stream; if I took this approach, I would still need to specify an :external-format for the stream (:utf-8), even though that format should only apply when reading characters, not when reading raw octets.
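If I understand that thread correctly, the SBCL-specific version would look something like the following (untested; :element-type :default is an SBCL extension that produces a bivalent stream):

```lisp
;; Untested sketch, SBCL only: a bivalent stream accepts both
;; read-byte and read-char.
(with-open-file (stream "test.dat"
                        :direction :input
                        :element-type :default   ; SBCL extension
                        :external-format :utf-8)
  (let ((len (logior (ash (read-byte stream) 8)  ; raw octets...
                     (read-byte stream))))
    (declare (ignorable len))
    ;; ...while read-char decodes UTF-8 from the same stream:
    (read-char stream)))
```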
I used several flexi-streams functions in the examples above for brevity, but so far my code uses only built-in stream types, and I have not used flexi streams themselves yet. Is this a good use case for flexi-streams? An extra layer of abstraction that would let me read raw octets and UTF-8 characters interchangeably from the same stream would be ideal.
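Concretely, what I am imagining is something like this (untested), where flexi-streams:make-flexi-stream wraps the binary file stream and read-byte/read-char are mixed on the wrapper:

```lisp
;; Untested sketch: wrap the '(unsigned-byte 8) file stream in a flexi
;; stream so that raw octets and UTF-8 characters can be read
;; interchangeably from it.
(with-open-file (raw "test.dat"
                     :direction :input
                     :element-type '(unsigned-byte 8))
  (let ((stream (flexi-streams:make-flexi-stream
                 raw :external-format :utf-8)))
    (list (read-byte stream)      ; a raw octet (e.g., a tag byte)
          (read-char stream))))   ; a decoded UTF-8 character
```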
Any advice from those familiar with flexi-streams (or other appropriate approaches) would be greatly appreciated.
Thanks.