Determine character position in UTF NSString from byte offset (was: SQLite offsets() and encoding problem)

Short story: I have a UTF NSString and a byte offset. I want to know the corresponding character offset. How can I do that?

The long story follows, if you dare:

According to the SQLite documentation for offsets(), the function returns the byte offset of the term within its column. I indexed some text, and I use this offset to point to a specific section of the text when I show the results.

The main problem is that, using this byte offset, I cannot reliably locate the term. Sometimes it points to the right place, sometimes it is 3 or 4 characters away from it.

My table is very simple:

CREATE VIRTUAL TABLE t1 USING fts4(file, body, page); 

If I run a query, for example:

 SELECT page, body, offsets(t1) from t1 where body match 'and'; 

I get:

    ...........
    502|1 0 427 3
    505|1 0 370 3 1 0 1307 3 1 0 1768 3
    506|1 0 10 3 1 0 1861 3 1 0 2521 3
    ...........

As an example, if I go to character 427 of the body, I do not land on 'and' but 2 or 3 characters away from it. The same happens if I go to 370, whereas if I go to 10 I get the right position.
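For reference, the SQLite FTS documentation describes each match in the offsets() output as a group of four integers: column number, term number, byte offset, and size in bytes. With the table above, column 1 is body, since file is column 0. A minimal C sketch of reading such a string follows; the names fts_match and parse_fts_offsets are hypothetical and not part of SQLite.

```c
#include <stddef.h>
#include <stdlib.h>

/* One match reported by offsets(). Hypothetical type, not part of SQLite. */
typedef struct {
    long column;      /* column index, 0 = leftmost column of the FTS table */
    long term;        /* index of the matching term within the query */
    long byte_offset; /* where the match starts in the column, in bytes */
    long byte_length; /* size of the match, in bytes */
} fts_match;

/* Parses up to `max` four-integer groups from `text`.
 * Returns the number of complete groups read. */
size_t parse_fts_offsets(const char *text, fts_match *out, size_t max)
{
    size_t count = 0;
    const char *p = text;

    while (count < max) {
        long v[4];
        int i;
        for (i = 0; i < 4; i++) {
            char *end;
            v[i] = strtol(p, &end, 10); /* skips leading whitespace */
            if (end == p) {
                break;                  /* no more integers to read */
            }
            p = end;
        }
        if (i < 4) {
            break;                      /* incomplete group: stop */
        }
        out[count].column = v[0];
        out[count].term = v[1];
        out[count].byte_offset = v[2];
        out[count].byte_length = v[3];
        count++;
    }
    return count;
}
```

Parsing the offsets for page 505 this way gives three matches in column 1, at byte offsets 370, 1307 and 1768, each 3 bytes long.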

Where am I mistaken?

3 answers

See the SQLite FTS3 docs and you will notice that the offsets and lengths are in bytes, not characters.

You must apply the offset and length before decoding the bytes into a character string in order to display the correct position. The offset coming from SQLite counts every byte of a multibyte character, while you are using that offset to count characters.

Your indexed text probably contains 3 or 4 characters that take two bytes each, hence the off-by-3-or-4 problem.
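As a rough illustration of applying the offset and length to the raw bytes, here is a minimal C sketch. It assumes the column value is available as a NUL-terminated UTF-8 buffer (for example from sqlite3_column_text); copy_match_bytes is a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>

/* Copies `length` bytes starting at `byte_offset` from the UTF-8 buffer
 * `utf8` into `out` and NUL-terminates the result.
 * Returns 0 on success, -1 if the range is out of bounds or `out` is
 * too small. Hypothetical helper, shown only to illustrate the idea. */
int copy_match_bytes(const char *utf8, size_t byte_offset, size_t length,
                     char *out, size_t out_size)
{
    size_t total = strlen(utf8);

    if (byte_offset > total || length > total - byte_offset) {
        return -1; /* range runs past the end of the buffer */
    }
    if (length + 1 > out_size) {
        return -1; /* no room for the bytes plus the terminator */
    }
    memcpy(out, utf8 + byte_offset, length);
    out[length] = '\0';
    return 0;
}
```

In the UTF-8 text "café and tea" the term 'and' starts at byte 6 because 'é' takes two bytes, so slicing bytes 6 to 8 yields "and", whereas character index 6 would land on the 'n'.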


As per @metatation's answer, the offset is in bytes, not characters. The text in your database is probably UTF-8 encoded Unicode, in which case any single character that cannot be represented in ASCII is represented by several bytes. Examples of non-ASCII characters include accented characters (à, ö, etc.), smart quotes, and characters from non-Latin character sets (Greek, Cyrillic, most Asian character sets, etc.).

If the bytes in the SQLite database are UTF-8 encoded Unicode strings, you can work out the true Unicode character offset for a given byte offset, for example:

    NSUInteger characterOffsetForByteOffsetInUTF8String(NSUInteger byteOffset, const char *string) {
        /*
         * UTF-8 represents ASCII characters in a single byte. Characters with a code
         * point from U+0080 upwards are represented as multiple bytes. The first byte
         * always has the two most significant bits set (ie 11xxxxxx). All subsequent
         * bytes have the most significant bit set, the next most significant bit unset
         * (ie 10xxxxxx).
         *
         * We use that here to determine character offsets. We step through the first
         * `byteOffset` bytes of `string`, incrementing the character offset result
         * every time we come across a byte that doesn't match 10xxxxxx, ie where
         * (byte & 11000000) != 10000000
         *
         * See also: http://en.wikipedia.org/wiki/UTF-8#Description
         */
        NSUInteger characterOffset = 0;
        for (NSUInteger i = 0; i < byteOffset; i++) {
            char c = string[i];
            if ((c & 0xc0) != 0x80) {
                characterOffset++;
            }
        }
        return characterOffset;
    }

Caveat: if you use the character offset to index into an NSString, remember that NSString uses UTF-16 under the hood, so characters with a Unicode code point above U+FFFF are represented by a pair of 16-bit values (a surrogate pair). You will not usually come across this in textual content, but if you care about particularly obscure character sets or some of the non-textual characters that Unicode can represent, such as emoji, this algorithm will need enhancing to cater for them.
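One possible enhancement, sketched in plain C under the assumption that the input is valid UTF-8: count UTF-16 code units instead of code points. A lead byte of 0xF0 or above starts a four-byte sequence, which encodes a code point above U+FFFF and therefore takes two UTF-16 units. The function name is hypothetical, and this is not taken from the project mentioned here.

```c
#include <stddef.h>

/* Returns the UTF-16 code unit offset that corresponds to `byte_offset`
 * in the NUL-terminated, valid UTF-8 string `utf8`. Assumes `byte_offset`
 * falls on a character boundary. Stops early at the terminator. */
size_t utf16_offset_for_utf8_byte_offset(const char *utf8, size_t byte_offset)
{
    size_t units = 0;

    for (size_t i = 0; i < byte_offset && utf8[i] != '\0'; i++) {
        unsigned char b = (unsigned char)utf8[i];

        if ((b & 0xC0) == 0x80) {
            continue;   /* continuation byte (10xxxxxx): not a new character */
        }
        if (b >= 0xF0) {
            units += 2; /* four-byte sequence: surrogate pair in UTF-16 */
        } else {
            units += 1; /* one to three bytes: a single UTF-16 unit */
        }
    }
    return units;
}
```

For a string made of an emoji (four bytes), a space and 'and', the term starts at byte 5. Counting code points gives 2, but the index NSString expects is 3, which is what this variant returns.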

(The code snippet is from a project of mine; feel free to use it.)


Inspired by this thread, and in particular by Simon's solution, this is how I do it.

Returning something other than NSRange would probably be more "Swifty", but I need to apply the range to an NSAttributedString.

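    import Foundation

    extension String {
        // Converts a byte range in the UTF-8 encoding of the string into a
        // character range, counting Unicode code points as in Simon's answer.
        func charRangeForByteRange(range: NSRange) -> NSRange {
            let bytes = [UInt8](utf8)

            // Count the characters that start before the byte range.
            var charOffset = 0
            for i in 0..<range.location {
                if (bytes[i] & 0xc0) != 0x80 {
                    charOffset += 1  // `++` was removed in Swift 3
                }
            }
            let location = charOffset

            // Count the characters that start inside the byte range.
            for i in range.location..<(range.location + range.length) {
                if (bytes[i] & 0xc0) != 0x80 {
                    charOffset += 1
                }
            }
            let length = charOffset - location

            return NSMakeRange(location, length)
        }
    }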