First of all, note that no function in the Lua string library knows anything about Unicode or multibyte encodings (source: Programming in Lua, 3rd edition). As far as Lua is concerned, strings are just sequences of bytes. Which bytes make up a character in a UTF-8 encoded string is up to you. Therefore string.len gives you the number of bytes, not the number of characters, and string.sub gives you a substring of bytes, not a substring of characters.
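To make the byte-versus-character difference concrete, here is a small sketch. The sample string is my own choice, written with decimal escapes so it does not depend on how the source file is saved.

```lua
-- "h" followed by U+00E9 (e with acute accent), which UTF-8 encodes as bytes 195 169.
local s = "h\195\169"

print(string.len(s))              -- 3: bytes, not the 2 characters a reader sees
print(#s)                         -- 3: the length operator counts bytes too

-- Taking "the second character" by index cuts the two-byte sequence in half.
local half = string.sub(s, 2, 2)
print(string.byte(half))          -- 195: only the lead byte, not valid UTF-8 on its own

-- Both bytes have to be taken together to get the whole character.
local whole = string.sub(s, 2, 3)
print(string.byte(whole, 1, -1))  -- 195 169
```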
Some basics of UTF-8:
If you need to brush up on the concepts behind Unicode, you should check out this article.
UTF-8 is one possible (and very important) encoding of Unicode, and probably the one you are dealing with. Unlike UTF-32 (always 4 bytes per character) and UTF-16 (2 or 4 bytes), it uses a variable number of bytes from 1 to 4 to encode each character. In particular, the ASCII characters 0 to 127 are represented by a single byte, so ASCII strings can be correctly interpreted as UTF-8 (and vice versa, if you use only these 128 characters). All other characters begin with a byte in the range 194 to 244, which signals that more bytes are required to complete the character. This range is further subdivided, so that you can tell from this byte whether 1, 2 or 3 bytes follow. These additional bytes are called continuation bytes and are guaranteed to come only from the range 128 to 191. Therefore, by looking at a single byte, we know where it sits within a character:
- If it is in [0,127], it is a single-byte character (ASCII).
- If it is in [128,191], it is part of a longer character and is meaningless on its own.
- If it is in [194,244], it marks the beginning of a longer character (and tells us how long that character is).
This information is enough to count the characters, split the UTF-8 string into characters and do all kinds of manipulations with UTF-8.
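As a rough sketch of how these byte ranges can be used, the snippet below classifies single bytes and counts characters by skipping continuation bytes. The names byteRole and countChars are illustrative, not library functions, and the code assumes the input is already valid UTF-8.

```lua
-- Classify one byte value (0-255) by its role in a UTF-8 sequence.
-- byteRole and countChars are hypothetical helper names.
local function byteRole(b)
  if b <= 127 then return "ascii" end
  if b <= 191 then return "continuation" end
  if b >= 194 and b <= 244 then return "lead" end
  return "invalid"  -- 192, 193 and 245-255 never occur in valid UTF-8
end

-- Count characters by counting every byte that is not a continuation byte.
-- Assumes valid UTF-8; no validation is attempted.
local function countChars(s)
  local n = 0
  for i = 1, #s do
    if byteRole(string.byte(s, i)) ~= "continuation" then
      n = n + 1
    end
  end
  return n
end

-- "a", U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
local sample = "a\195\169\226\130\172\240\159\152\128"
print(#sample)             -- 10 (bytes)
print(countChars(sample))  -- 4 (characters)
```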
Some pattern matching basics:
For this task, we need a few constructs from Lua patterns:
- [...] is a character class that matches a single character (or rather, a single byte) out of those listed inside the class. For instance, [abc] matches either a, b or c. You can define ranges with a hyphen, so [\33-\127], for example, matches any one of the bytes 33 through 127. Note that \127 is an escape sequence that you can use in any Lua string (not just in patterns) to specify a byte by its numeric value instead of the corresponding ASCII character. For example, "\97" is the same string as "a".
- You can negate a character class by starting it with ^, so that it matches a single byte that is not part of the class.
- * repeats the previous token 0 or more times, as often as possible.
- $ is an anchor. If it is the last character of the pattern, the pattern will only match at the end of the string.
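Each of these constructs can be tried in isolation. The inputs below are made-up examples that only use ASCII, so the results are easy to check by eye.

```lua
-- A class matches exactly one byte out of the listed ones.
print(string.match("xbz", "[abc]"))         -- b

-- A decimal escape names a byte by value: "\97" is the same string as "a".
print("\97" == "a")                         -- true

-- A range inside a class; "\48-\57" is the same as "0-9".
print(string.match("abc123", "[\48-\57]"))  -- 1

-- "^" as the first thing inside a class negates it.
print(string.match("aab", "[^a]"))          -- b

-- "*" repeats the preceding item as often as possible (possibly zero times).
print(string.match("aaab", "a*"))           -- aaa

-- "$" at the end of the pattern anchors the match to the end of the string.
print(string.match("ab12cd34", "[0-9]*$"))  -- 34
```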
Putting all of this together, your problem boils down to a single line:
local function lastChar(s) return string.match(s, "[^\128-\191][\128-\191]*$") end
This matches one byte that is not a UTF-8 continuation byte (i.e. either a single-byte character or a byte that marks the beginning of a longer character). It then matches any number of continuation bytes (this cannot run past the current character, because the next character would start with a non-continuation byte), followed by the end of the string ($). The result is exactly the bytes that make up the last character of the string. It gives the desired result for all 4 of your examples.
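Here is a quick check of that function on a few inputs. The sample strings are my own stand-ins (one per encoded length), not the examples from the question, and they are written with decimal escapes so the byte counts are visible.

```lua
local function lastChar(s)
  return string.match(s, "[^\128-\191][\128-\191]*$")
end

print(#lastChar("abc"))                  -- 1: plain ASCII
print(#lastChar("caf\195\169"))          -- 2: ends in U+00E9
print(#lastChar("5 \226\130\172"))       -- 3: ends in U+20AC
print(#lastChar("ok \240\159\152\128"))  -- 4: ends in U+1F600
print(lastChar(""))                      -- nil: an empty string has no last character
```

Note the nil result for the empty string: callers that index or concatenate the result may want to handle that case.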
Equivalently, you can use gsub to remove the last character from your string:
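function deleteLastCharacter(s) return (string.gsub(s, "[^\128-\191][\128-\191]*$", "")) end
It is the same match, but instead of returning the matched substring we replace it with "" (i.e. delete it) and return the modified string. The extra parentheses around the call drop the second value that string.gsub returns (the number of substitutions made).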
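One detail worth keeping in mind is that string.gsub returns two values, the new string and the number of substitutions. The sketch below shows both, along with a wrapper that keeps only the string. The name dropLast and the sample input are mine, chosen for illustration.

```lua
local pattern = "[^\128-\191][\128-\191]*$"
local s = "caf\195\169"  -- "caf" followed by U+00E9

-- Called directly, gsub hands back the new string and a substitution count.
local stripped, count = string.gsub(s, pattern, "")
print(stripped)  -- caf
print(count)     -- 1

-- Wrapping the call in parentheses keeps only the first result.
local function dropLast(str)
  return (string.gsub(str, pattern, ""))
end

print(dropLast(s))                -- caf
print(select("#", dropLast(s)))   -- 1: exactly one value is returned
print(#dropLast(""))              -- 0: an empty string stays empty
```

The single-value form is safer when the result is passed straight into another function, since a stray count would otherwise show up as an extra argument.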