The built-in string library treats Lua strings as byte arrays. An alternative that works on multibyte (Unicode) characters is the unicode library (slnunicode) that originated in the Selene project. Its main selling point is that it can be used as a drop-in replacement for the string library, making most string operations "magically" Unicode-capable.
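To make the byte-oriented behaviour concrete, here is a minimal sketch that uses only the standard string library. The variable name `eta` is just an illustration; the string holds the two-byte UTF-8 encoding of the Greek capital letter Η (U+0397).

```lua
-- "\206\151" is the UTF-8 encoding of the Greek capital letter Eta.
local eta = "\206\151"

-- The length operator counts bytes, not characters.
local byte_count = #eta

-- string.byte exposes the two individual bytes of the single character.
local b1, b2 = eta:byte (1, -1)

-- string.sub also works on bytes, so this cuts the character in half
-- and leaves a lone lead byte that is not valid UTF-8 on its own.
local half = eta:sub (1, 1)

print (byte_count, b1, b2, #half)
```

This is why a character-aware splitter has to group lead bytes with their continuation bytes instead of indexing the string directly.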
If you prefer not to add that third-party dependency, your task can easily be done with LPeg. Here is an example splitter:
    local lpeg = require "lpeg"

    local C, Ct, R  = lpeg.C, lpeg.Ct, lpeg.R
    local lpegmatch = lpeg.match

    local split_utf8 do
      local utf8_x = R"\128\191"                             -- continuation byte
      local utf8_1 = R"\000\127"                             -- 1-byte sequence (ASCII)
      local utf8_2 = R"\194\223" * utf8_x                    -- 2-byte sequence
      local utf8_3 = R"\224\239" * utf8_x * utf8_x           -- 3-byte sequence
      local utf8_4 = R"\240\244" * utf8_x * utf8_x * utf8_x  -- 4-byte sequence
      local utf8   = utf8_1 + utf8_2 + utf8_3 + utf8_4

      -- Capture every character into a table; "* -1" requires the match to
      -- reach the end of the string, so invalid input yields nil.
      local split  = Ct (C (utf8)^0) * -1

      split_utf8 = function (str)
        str = str and tostring (str)
        if not str then return end
        return lpegmatch (split, str)
      end
    end
This snippet defines a function split_utf8() that returns a table of UTF-8 characters (as Lua strings), or nil if the string is not a valid UTF-8 sequence. You can run this test code:
    local tests = {
      en = [[Lua (/ˈluːə/ LOO-ə, from Portuguese: lua [ˈlu.(w)ɐ] meaning moon; ]]
        .. [[explicitly not "LUA"[1]) is a lightweight multi-paradigm programming ]]
        .. [[language designed as a scripting language with "extensible ]]
        .. [[semantics" as a primary goal.]],
      ru = [[Lua ([́], . «») — , ]]
        .. [[ Tecgraf ]]
        .. [[--.]],
      gr = [[Η Lua είναι μια ελαφρή προστακτική γλώσσα προγραμματισμού, που ]]
        .. [[σχεδιάστηκε σαν γλώσσα σεναρίων με κύριο σκοπό τη δυνατότητα ]]
        .. [[επέκτασης της σημασιολογίας της.]],
      XX = ">\255< invalid"  -- byte 255 never occurs in valid UTF-8
    }

    -------------------------------------------------------------------------------

    local limit = 14  -- print at most this many characters per sample

    for lang, str in next, tests do
      io.write "\n"
      -- #str is the length in bytes
      io.write (string.format ("<%s %3d> ->", lang, #str))
      local chars = split_utf8 (str)
      if not chars then
        io.write " INVALID!"
      else
        -- #chars is the length in characters
        io.write (string.format (" <%3d>", #chars))
        for i = 1, #chars > limit and limit or #chars do
          io.write (string.format (" %q", chars [i]))
        end
      end
    end

    io.write "\n"
By the way, building a table with LPeg is significantly faster than calling table.insert() repeatedly. Here are the timings for splitting the whole of Gogol's Dead Souls (in Russian, 1023814 bytes raw, 571395 UTF-8 characters) on my machine:
    library     method               time in ms
    string      table.insert()       380
    string      t [#t + 1] = c       310
    string      gmatch & for loop    280
    slnunicode  table.insert()       220
    slnunicode  t [#t + 1] = c       200
    slnunicode  gmatch & for loop    170
    lpeg        Ct (C (...))          70
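For comparison, a splitter built only on the standard string library might look like the sketch below. This is an assumption about what such a baseline could look like, not the code that was benchmarked above: the function name `split_gmatch` and the pattern are illustrative. The pattern takes one byte that is not a continuation byte, followed by any continuation bytes.

```lua
-- Hypothetical baseline, not the benchmarked code: split with string.gmatch
-- and append each character with t [#t + 1] = c.
local function split_gmatch (str)
  local t = {}
  -- One non-continuation byte, then zero or more continuation bytes
  -- (\128-\191). Stray continuation bytes at the start are skipped.
  for c in str:gmatch ("[^\128-\191][\128-\191]*") do
    t [#t + 1] = c
  end
  return t
end

-- Usage: Greek capital Eta (two bytes) followed by " Lua".
local chars = split_gmatch ("\206\151 Lua")
print (#chars)
```

Unlike split_utf8(), this version does no validation: malformed input is grouped by the same rule rather than rejected, so it is only suitable when the input is already known to be valid UTF-8.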