Lua - read one UTF-8 character from a file

Question

Is it possible to read one UTF-8 character from a file?

file:read(1) returns weird characters instead when I print the result.

function firstLetter(str) return str:match("[%z\1-\127\194-\244][\128-\191]*") end

The function returns the first UTF-8 character of the string str. I need to read one UTF-8 character in the same way, but from an input file (I do not want to read the whole file into memory with file:read("*all")).

The question is pretty similar to this post: Extract the first letter of a UTF-8 string using Lua

file lua encoding utf-8 character-encoding

Hrablicky Apr 24 '15 at 19:45

3 answers

Egor skriptunoff · Answer 1 · 2015-04-24T20:46:28+0000

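 function read_utf8_char(file)
   local c1 = file:read(1)
   if not c1 then return nil end  -- end of file
   -- ctr ends up as the number of continuation bytes (0 for ASCII)
   local ctr, c = -1, math.max(c1:byte(), 128)
   repeat
     ctr = ctr + 1
     c = (c - 128)*2
   until c < 128
   -- file:read returns nil at end of file, even for a count of 0
   return c1..(file:read(ctr) or "")
 end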

Paul kulchenko · Answer 2 · 2015-04-24T20:48:53+0000

You need to read bytes so that the string you are working with always holds four or more of them (which lets you apply the logic from the answer you are referring to). If, after matching and removing one UTF-8 character, the remaining length is len, then you read 4-len more bytes from the file.
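The buffering idea above can be sketched as follows. This is only an illustration under stated assumptions: `make_utf8_reader` is a hypothetical helper name, the pattern is the one from the question anchored at the start of the buffer, and returning an invalid leading byte on its own is a choice made here, not something the answer specifies.

```lua
-- Sketch of a buffered reader: keep up to 4 bytes in `buf` so that one
-- complete UTF-8 character (at most 4 bytes) is always available to match.
-- `make_utf8_reader` is a hypothetical name, not taken from the answers.
function make_utf8_reader(file)
  local buf = ""
  return function()
    -- Top the buffer up to 4 bytes; file:read returns nil at end of file.
    if #buf < 4 then
      buf = buf .. (file:read(4 - #buf) or "")
    end
    if buf == "" then return nil end
    -- Same pattern as firstLetter in the question, anchored at the start.
    local ch = buf:match("^[%z\1-\127\194-\244][\128-\191]*")
    if not ch then
      -- Assumption: a byte that cannot start a character is returned alone.
      ch = buf:sub(1, 1)
    end
    -- Drop the character just returned; the rest stays buffered.
    buf = buf:sub(#ch + 1)
    return ch
  end
end
```

The returned closure can be used directly as an iterator, for example `for ch in make_utf8_reader(file) do print(ch) end`, where `file` is any handle opened for reading.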

ZeroBrane Studio replaces invalid UTF-8 characters with the [SYN] character when printing to the Output panel (as shown in the screenshot). This blog post describes the logic for detecting invalid UTF-8 characters (in Lua) and how ZeroBrane Studio handles them.

hugomg · Answer 3 · 2016-08-12T00:12:28+0000

In UTF-8, the number of bytes used to encode a character is determined by the first byte of that character, according to the following table (taken from RFC 3629):

 Char. number range  |        UTF-8 octet sequence
    (hexadecimal)    |              (binary)
 --------------------+---------------------------------------------
 0000 0000-0000 007F | 0xxxxxxx
 0000 0080-0000 07FF | 110xxxxx 10xxxxxx
 0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
 0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

If the most significant bit of the first byte is "0", the character consists of only one byte. If the most significant bits are "110", the character has 2 bytes, and so on.

Then you can read one byte from the file and determine how many remaining bytes you need to read for the full UTF-8 character:

 function get_one_utf8_character(file)
   local c1 = file:read(1)
   if not c1 then return nil end
   local ncont
   if     c1:match("[\000-\127]") then ncont = 0
   elseif c1:match("[\192-\223]") then ncont = 1
   elseif c1:match("[\224-\239]") then ncont = 2
   elseif c1:match("[\240-\247]") then ncont = 3
   else
     return nil, "invalid leading byte"
   end
   local bytes = { c1 }
   for i=1,ncont do
     local ci = file:read(1)
     if not (ci and ci:match("[\128-\191]")) then
       return nil, "expected continuation byte"
     end
     bytes[#bytes+1] = ci
   end
   return table.concat(bytes)
 end
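The leading-byte ranges in the function above follow the bit patterns of the table, so they also accept \192, \193 and \245-\247, which RFC 3629 says never appear in valid UTF-8 (the first two would only start overlong encodings, the others would encode values above U+10FFFF). A stricter classification could be sketched like this; `continuation_count` is a hypothetical helper name, and comparing byte values numerically instead of using patterns is a choice made here, partly because Lua 5.1 patterns cannot contain an embedded zero byte.

```lua
-- Hypothetical helper: number of continuation bytes implied by a leading
-- byte, or nil if the byte cannot start a UTF-8 sequence under RFC 3629.
function continuation_count(lead)
  local b = lead:byte()
  if b <= 0x7F then return 0 end                -- 0xxxxxxx (ASCII)
  if b >= 0xC2 and b <= 0xDF then return 1 end  -- 110xxxxx, excluding C0 and C1
  if b >= 0xE0 and b <= 0xEF then return 2 end  -- 1110xxxx
  if b >= 0xF0 and b <= 0xF4 then return 3 end  -- 11110xxx, up to U+10FFFF
  return nil                                    -- continuation or invalid byte
end
```

This only checks the leading byte. A full validator would also have to restrict the second byte after \224, \237, \240 and \244, as described in the RFC.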
