Extract the first letter of a UTF-8 string using Lua

Is there a way to extract the first letter of a UTF-8 encoded string using Lua?

Lua's string.sub works on bytes, not characters, so it does not handle UTF-8 correctly: string.sub("ÆØÅ", 2, 2) returns the second byte of "Æ", which is displayed as "?", not "Ø".

Is there a relatively simple UTF-8 parsing algorithm that I could apply to a string byte by byte, for the sole purpose of getting the first letter of the string, be it a Chinese character or an A?

Or is this way too complicated, requiring a huge library, etc.?

2 answers

You can easily extract the first letter from a UTF-8 encoded string with the following code:

    function firstLetter(str)
      return str:match("[%z\1-\127\194-\244][\128-\191]*")
    end

This works because a UTF-8 encoded code point begins with a lead byte in the range 0 to 127 or 194 to 244, followed by zero or more continuation bytes in the range 128 to 191.
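To make these byte ranges concrete, here is a small sketch that classifies each byte of the string from the question. The helper name byteKind is hypothetical, and the string is written with decimal escapes so the example does not depend on the encoding of the source file.

```lua
-- "ÆØÅ" spelled out as UTF-8 bytes: Æ = 195 134, Ø = 195 152, Å = 195 133.
local str = "\195\134\195\152\195\133"

-- Hypothetical helper: classify a byte by the ranges used in the pattern.
local function byteKind(b)
  if b <= 127 then
    return "ascii"          -- a complete one-byte character
  elseif b <= 191 then
    return "continuation"   -- 128..191, never starts a character
  elseif b >= 194 and b <= 244 then
    return "lead"           -- starts a multi-byte character
  else
    return "invalid"        -- 192, 193 and 245..255 never occur in UTF-8
  end
end

-- Print position, byte value and kind for every byte of the string.
for i = 1, #str do
  local b = str:byte(i)
  print(i, b, byteKind(b))
end
```

The string is six bytes long, and every odd position holds a lead byte followed by one continuation byte, which is why string.sub(str, 2, 2) lands in the middle of a character.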

You can even iterate over UTF-8 code points in the same way:

    for code in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
      print(code)
    end

Note that both examples return a string value for each letter, not the numeric Unicode code point.
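If the numeric code point is needed instead, the matched bytes can be decoded by hand. The following is only a sketch: the function name firstCodePoint is hypothetical, and it assumes the input is valid UTF-8, so it does no validation of overlong or truncated sequences.

```lua
-- Hypothetical helper: decode the first UTF-8 sequence of `str` into its
-- numeric code point. Returns nil when nothing matches (e.g. empty string).
local function firstCodePoint(str)
  local ch = str:match("[%z\1-\127\194-\244][\128-\191]*")
  if not ch then return nil end

  local lead = ch:byte(1)
  if lead < 128 then return lead end  -- one-byte (ASCII) character

  -- Strip the length marker bits from the lead byte:
  -- 110xxxxx (2 bytes), 1110xxxx (3 bytes), 11110xxx (4 bytes).
  local code
  if lead >= 240 then
    code = lead - 240
  elseif lead >= 224 then
    code = lead - 224
  else
    code = lead - 192
  end

  -- Each continuation byte 10xxxxxx contributes six more bits.
  for i = 2, #ch do
    code = code * 64 + (ch:byte(i) - 128)
  end
  return code
end

print(firstCodePoint("\195\134\195\152\195\133"))  -- "ÆØÅ", first is U+00C6
print(firstCodePoint("A"))
```

Only arithmetic is used here rather than bitwise operators, so the sketch does not rely on any particular Lua version's bit library.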


Lua 5.3 provides a built-in utf8 library.

You can use utf8.codes to get each code point, and then use utf8.char to get the character:

    local str = "ÆØÅ"
    for _, c in utf8.codes(str) do
      print(utf8.char(c))
    end

This also works:

    local str = "ÆØÅ"
    for w in str:gmatch(utf8.charpattern) do
      print(w)
    end

Here utf8.charpattern is just the string "[\0-\x7F\xC2-\xF4][\x80-\xBF]*", a pattern that matches exactly one UTF-8 byte sequence, assuming the subject is a valid UTF-8 string.
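For the original question (only the first letter), the same library can be used without a loop. This is a minimal sketch that assumes Lua 5.3 or later; the function names firstChar and firstCode are hypothetical.

```lua
-- Requires the utf8 library (Lua 5.3+).

-- Hypothetical helper: return the first character as a string, or nil if
-- the string is empty or does not start with a UTF-8 lead byte.
local function firstChar(str)
  -- "^" anchors the match so only a sequence at the very start is accepted.
  return str:match("^" .. utf8.charpattern)
end

-- Hypothetical helper: return the first character as a numeric code point.
local function firstCode(str)
  -- utf8.codepoint raises an error for an out-of-range position,
  -- so the empty string is handled separately.
  if str == "" then return nil end
  return utf8.codepoint(str, 1)
end

local str = "\195\134\195\152\195\133"  -- "ÆØÅ" as explicit UTF-8 bytes
print(firstChar(str))
print(firstCode(str))
```

Note that utf8.codepoint raises an error on invalid byte sequences, so wrapping the call in pcall may be appropriate for untrusted input.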
