Detect if last character is not multibyte in Lua

First question here. What is the easiest way in Lua to determine whether the last character in a string is multibyte? Or, alternatively, what is the easiest way to remove the last character from a string?

Here are examples of valid strings and what I want the function output to be

    hello there --- result should be: hello ther
    anñ --- result should be: an
    כראע --- result should be: כרא
    ㅎㄹㅇㅇㅅ --- result should be: ㅎㄹㅇㅇ

I need something like

    function lastCharacter(string)
        --- some code which will extract the last character only ---
        return lastChar
    end

or if it's easier

    function deleteLastCharacter(string)
        --- some code which will output the string minus the last character ---
        return newString
    end

This is what I have tried so far:

    local function lastChar(string)
        local stringLength = string.len(string)
        local lastc = string.sub(string, stringLength, stringLength)
        if lastc is a multibyte character then   -- pseudocode: the missing test
            local wordTable = {}
            for word in string:gmatch("[\33-\127\192-\255]+[\128-\191]*") do
                wordTable[#wordTable+1] = word
            end
            lastc = wordTable[#wordTable]
        end
        return lastc
    end
3 answers

First of all, note that no function in Lua's string library knows anything about Unicode or multibyte encodings (source: Programming in Lua, 3rd edition). As far as Lua is concerned, strings are just sequences of bytes. If you use UTF-8 encoded strings, it is up to you to work out which bytes make up a character. Therefore string.len gives you the number of bytes, not the number of characters, and string.sub gives you a substring of bytes, not a substring of characters.
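To make the byte-versus-character distinction concrete, here is a minimal sketch. The string is written with decimal escapes (an assumption made so the example does not depend on the encoding of the source file); ñ is the two bytes 195, 177 in UTF-8.

```lua
-- "anñ" spelled with explicit bytes: a, n, then 195 177 for ñ
local s = "an\195\177"

print(string.len(s))            -- 4: four bytes, although there are three characters
print(#s)                       -- 4: the length operator counts bytes as well
print(string.byte(s, 1, -1))    -- 97  110  195  177

-- string.sub cuts between bytes, so this keeps only the first half of ñ
local broken = string.sub(s, 1, 3)
print(#broken, string.byte(broken, 3))   -- 3  195
```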

Some basics of UTF-8:

If you need to brush up on the basic concepts of Unicode, you should check out this article.

UTF-8 is one possible (and very important) encoding of Unicode, and probably the one you are dealing with. Unlike UTF-32, which is fixed-width, and UTF-16, which uses two or four bytes, UTF-8 uses between 1 and 4 bytes to encode each character. In particular, the ASCII characters 0 to 127 are each represented by a single byte, so ASCII strings are valid UTF-8 (and vice versa, if only those 128 characters are used). Every other character begins with a byte in the range 194 to 244, which signals that more bytes are needed to encode the complete character. This range is further subdivided so that you can tell from this byte whether 1, 2 or 3 bytes follow. These additional bytes are called continuation bytes and are guaranteed to come only from the range 128 to 191. Therefore, looking at a single byte, we know where it sits within a character:

  • If it is in [0,127] , it is a single-byte character (ASCII)
  • If it is in [128,191] , it is part of a longer character and is meaningless in itself
  • If it is in [194,244] , it marks the beginning of a longer character (and tells us how long this character is)

This information is enough to count the characters, split the UTF-8 string into characters and do all kinds of manipulations with UTF-8.
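As an illustration of the byte ranges, the following sketch classifies single bytes and counts characters by skipping continuation bytes. The names byteKind and utf8Length are hypothetical, lead bytes are taken to be 194 to 244 as stated in the text, and the input is assumed to be valid UTF-8.

```lua
-- Classify one byte value: ASCII, continuation byte, or lead byte.
local function byteKind(b)
  if b <= 127 then
    return "ascii"
  elseif b <= 191 then
    return "continuation"
  elseif b >= 194 and b <= 244 then
    return "lead"
  end
  return "invalid"   -- 192, 193 and 245..255 never occur in valid UTF-8
end

-- Count characters by counting every byte that is not a continuation byte.
-- Assumes valid UTF-8; no validation is attempted here.
local function utf8Length(s)
  local count = 0
  for i = 1, #s do
    if byteKind(string.byte(s, i)) ~= "continuation" then
      count = count + 1
    end
  end
  return count
end

print(utf8Length("hello there"))   -- 11
print(utf8Length("an\195\177"))    -- 3 (a, n, ñ), although the string has 4 bytes
```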

Some pattern-matching basics:

For this task, we need a few constructs from Lua's pattern syntax:

[...] is a character class that matches one character (or rather, one byte) out of those listed inside the class. For instance, [abc] matches either a , b or c . You can define ranges with a hyphen. Therefore, [\33-\127] , for example, matches any one of the bytes 33 through 127 . Note that \127 is an escape sequence that you can use in any Lua string (not just in patterns) to specify a byte by its numeric value instead of the corresponding ASCII character. For example, "a" is the same string as "\97" .

You can negate a character class by starting it with ^ (so that it matches a single byte that is not part of the class).

* repeats the previous token 0 or more times (arbitrarily many times - as often as possible).

$ is an anchor. If it is the last character of the pattern, the pattern will only match at the end of the string.
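Each of these constructs can be tried in isolation. The following is a small sketch on a sample string written with byte escapes (ñ is the bytes 195, 177); the sample string is chosen for illustration only.

```lua
local s = "an\195\177"   -- "anñ"

-- A class with a range matches one byte: here the first non-ASCII byte.
local firstHigh = string.byte(string.match(s, "[\128-\255]"))
print(firstHigh)                        -- 195

-- A negated class: the first byte that is not "a".
local notA = string.match(s, "[^a]")
print(notA)                             -- n

-- * repeats the class as often as possible: the leading run of ASCII bytes.
local asciiRun = string.match(s, "[\1-\127]*")
print(asciiRun)                         -- an

-- $ anchors the match at the end of the string: only the final byte fits.
local lastByte = string.byte(string.match(s, "[\128-\191]$"))
print(lastByte)                         -- 177

-- Decimal escapes work in any string literal, not only in patterns.
print("a" == "\97")                     -- true
```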

Putting all of this together ...

... your problem boils down to a single line:

    local function lastChar(s)
        return string.match(s, "[^\128-\191][\128-\191]*$")
    end

This matches a byte that is not a UTF-8 continuation byte (i.e. either a single-byte character or the byte that marks the beginning of a longer character). It then matches an arbitrary number of continuation bytes (this cannot run past the current character because of the chosen range), followed by the end of the string ( $ ). Therefore, this gives you all the bytes that make up the last character of the string. It produces the desired result for all 4 of your examples.
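To answer the question in the title, the length of the returned value can be inspected. This is only a sketch: lastCharIsMultibyte is a hypothetical helper name, and the test strings use byte escapes so that the example does not depend on the file encoding.

```lua
local function lastChar(s)
  return string.match(s, "[^\128-\191][\128-\191]*$")
end

-- Hypothetical helper: true when the last character needs more than one byte.
local function lastCharIsMultibyte(s)
  local c = lastChar(s)
  return c ~= nil and #c > 1
end

print(lastChar("hello there"))              -- e
print(lastCharIsMultibyte("hello there"))   -- false
print(#lastChar("an\195\177"))              -- 2 (the two bytes of ñ)
print(lastCharIsMultibyte("an\195\177"))    -- true
print(lastChar(""))                         -- nil: there is nothing to match
```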

Equivalently, you can use gsub to remove the last character from your string:

    function deleteLastCharacter(s)
        -- the extra parentheses drop gsub's second return value (the match count)
        return (string.gsub(s, "[^\128-\191][\128-\191]*$", ""))
    end

Same match, but instead of returning a matched substring, we replace it with "" (i.e. delete it) and return the modified string.
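One detail worth keeping in mind is that string.gsub returns two values: the resulting string and the number of substitutions made. The sketch below contrasts a bare return with a parenthesized one; deleteRaw is a hypothetical name used only for the comparison.

```lua
local pattern = "[^\128-\191][\128-\191]*$"

-- Without parentheses, both results of gsub are passed on to the caller.
local function deleteRaw(s)
  return string.gsub(s, pattern, "")
end

-- Parentheses around the call truncate the result to the first value.
local function deleteLastCharacter(s)
  return (string.gsub(s, pattern, ""))
end

print(deleteRaw("an\195\177"))             -- an   1
print(deleteLastCharacter("an\195\177"))   -- an
print(select("#", deleteRaw("abc")))       -- 2
```

The extra value matters as soon as the result is passed straight into another call such as print or table.insert.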


Here is another way to do this; it shows how to iterate over the characters of a UTF-8 string:

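    function butlast (str)
        local i, j, k = 1, 0, -1
        while true do
            -- s, e: first and last byte position of the next character
            local s, e = string.find(str, ".[\128-\191]*", i)
            if s then
                k = j        -- end of the previous character
                j = e        -- end of the current character
                i = e + 1
            else
                break
            end
        end
        return string.sub(str, 1, k)
    end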

Example usage:

    > return butlast"כראע"
    כרא
    > return butlast"ㅎㄹㅇㅇㅅ"
    ㅎㄹㅇㅇ
    > return butlast"anñ"
    an
    > return butlast"hello there"
    hello ther
    >

Adapting prapin's solution from here:

    function lastCharacter(str)
        return str:match("[%z\1-\127\194-\244][\128-\191]*$")
    end

You can then take the length of the returned value to find out whether the last character is multibyte or not; you can also remove it from the string using the gsub function:

    function deleteLastCharacter(str)
        -- make sure to add "()" around gsub to force it to return only one value
        return (str:gsub("[%z\1-\127\194-\244][\128-\191]*$", ""))
    end

    for _, str in pairs{"hello there", "anñ", "כראע"} do
        print(str, " -->-- ", deleteLastCharacter(str))
    end

Please note that these patterns only work with valid UTF-8 strings. If your input may be invalid, you may need to apply more complex logic.
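As a rough idea of what such extra logic might look like, the sketch below accepts the last character only if its first byte announces exactly as many bytes as are actually present. The names expectedLength and lastCharacterChecked are hypothetical, and this is not a full validator: it inspects only the last character and does not reject overlong encodings or surrogate code points.

```lua
-- Total length of a character as announced by its first byte,
-- or nil for a byte that cannot start a character.
local function expectedLength(b)
  if b <= 127 then return 1
  elseif b >= 194 and b <= 223 then return 2
  elseif b >= 224 and b <= 239 then return 3
  elseif b >= 240 and b <= 244 then return 4
  end
  return nil
end

-- Returns the last character if its byte count matches its lead byte,
-- otherwise nil (stray continuation bytes, truncated character, bad lead).
local function lastCharacterChecked(str)
  local c = str:match("[^\128-\191][\128-\191]*$")
  if c and expectedLength(c:byte(1)) == #c then
    return c
  end
  return nil
end

print(#lastCharacterChecked("an\195\177"))   -- 2: a well-formed ñ
print(lastCharacterChecked("a\128"))         -- nil: stray continuation byte
print(lastCharacterChecked("\227\133"))      -- nil: truncated 3-byte character
```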

