Encoding of SMS messages with non-ASCII characters

I have a Nokia N900 phone, and when sending an SMS, the widget displays the number of characters remaining in the message (and the number of actual short messages needed to send the entire message).

I live in France, where I noticed the following unusual thing when writing messages with non-ASCII characters:

  • some non-ASCII characters take a single char / byte, just like ASCII ones, for example "é", "è", "à", "ù"
  • the first occurrence of certain other non-ASCII characters, such as "ç", "ê", "ô", consumes a fixed 90 chars / bytes plus 1 for the character itself
  • each subsequent "ç", "ê", etc. consumes only 1 extra char / byte.

So I would like to know how the messages are encoded, because this scheme does not match any of the traditional encodings I know (ISO-8859-1, UTF-8, UTF-16, ...).

+7
3 answers

https://en.wikipedia.org/wiki/SMS#Message_size

Depending on the encoding, a single SMS can carry 160 (7-bit), 140 (8-bit) or 70 (16-bit) characters. If any character outside the GSM 7-bit default alphabet is used, the entire message must be encoded in UTF-16, hence the “consumption” you experienced.
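The three figures follow from the fixed 140-octet payload of a single SMS. A minimal sketch of the arithmetic (the names `PAYLOAD_OCTETS` and `capacity` are made up for illustration):

```python
# A single SMS carries 140 octets of user data, i.e. 1120 bits.
PAYLOAD_OCTETS = 140


def capacity(bits_per_char: int) -> int:
    """Number of fixed-width characters that fit in one SMS payload."""
    return PAYLOAD_OCTETS * 8 // bits_per_char


for name, bits in [("GSM 7-bit", 7), ("8-bit data", 8), ("16-bit", 16)]:
    print(f"{name}: {capacity(bits)} characters")

# Switching from 7-bit to 16-bit encoding costs this many characters at once:
print(capacity(7) - capacity(16))
```

The last line is the fixed cost of 90 characters that shows up as soon as one character forces the 16-bit encoding; after that, every character costs one 16-bit unit.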

+10

@Vicky and @timdream are right, except that I believe it is technically UCS-2, not UTF-16, that the phone sometimes uses. UCS-2 has a fixed size of 16 bits per character, whereas UTF-16 uses a variable width of two or four bytes per character, depending on the character being encoded. This Wikipedia article explains this in detail. UCS-2 strictly limits a message to 70 characters (140 bytes). Although the Unicode Consortium's description of UCS-2 is a bit confusing, several SMS sites on the Internet confirm that Wikipedia is right.
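To make the distinction concrete, here is a small sketch that counts 16-bit units with Python's standard `utf-16-be` codec (the helper name `utf16_units` is made up for illustration). Characters in the Basic Multilingual Plane take one unit, which is all that UCS-2 can represent; characters outside it need a surrogate pair in UTF-16.

```python
def utf16_units(text: str) -> int:
    """Count the 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-be")) // 2


for sample in ["e", "ê", "€", "\U0001F600"]:
    units = utf16_units(sample)
    note = "fits in UCS-2" if units == 1 else "needs a surrogate pair"
    print(f"U+{ord(sample):04X}: {units} unit(s), {note}")

# 70 sixteen-bit units fill the 140-byte payload exactly.
print(70 * 2)
```

For French text the two encodings produce identical bytes, since every accented letter is a single 16-bit unit.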

+6

You already have an answer from @timdream; the only point I would add is that some of the extended characters you mention are included in the GSM 7-bit alphabet as single characters, some are encoded in GSM 7-bit via an escape sequence (so two 7-bit codes are needed to represent the character), and some cannot be encoded in GSM 7-bit at all and must be encoded as UTF-16.

The full definition of the alphabet is here: http://www.unicode.org/Public/MAPPINGS/ETSI/GSM0338.TXT

Note the special case of c-cedilla. From this file:

The ETSI GSM 03.38 specification shows an uppercase C-cedilla glyph at 0x09. This may be the result of limited display capabilities for handling characters with descenders. However, the language coverage intent is clearly for the lowercase c-cedilla, as shown in the mapping below. The mapping for uppercase C-cedilla is shown in a commented line in the mapping table.

Some devices encode both upper and lower case c-cedilla as the same encoded character (0x09).
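The three cases in this answer (a single 7-bit code, an escape code followed by a second 7-bit code, or not encodable at all) can be sketched as a lookup. This is only an illustration: the names `GSM_BASIC`, `GSM_EXTENSION`, `classify` and `septet_count` are made up, and the table assumes the ETSI glyph chart, with uppercase Ç at 0x09 and no lowercase ç. A device that follows the Unicode mapping file for 0x09 would treat "ç" differently.

```python
from typing import Optional

# GSM 03.38 default alphabet; the index of each character is its 7-bit code.
# "\x1b" (0x1B) is the escape code that introduces the extension table.
# Assumption: 0x09 is the uppercase Ç, as drawn in the ETSI chart.
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)

# Characters reached through the escape code (two 7-bit codes each).
GSM_EXTENSION = "^{}\\[~]|€\f"


def classify(ch: str) -> str:
    """Return 'basic', 'extension' or 'unsupported' for one character."""
    if ch != "\x1b" and ch in GSM_BASIC:
        return "basic"
    if ch in GSM_EXTENSION:
        return "extension"
    return "unsupported"


def septet_count(text: str) -> Optional[int]:
    """7-bit codes needed for text, or None if it needs a 16-bit encoding."""
    total = 0
    for ch in text:
        kind = classify(ch)
        if kind == "unsupported":
            return None
        total += 1 if kind == "basic" else 2
    return total


for ch in "éèàùçêô€":
    print(ch, classify(ch))
```

Under these assumptions "é", "è", "à" and "ù" come out as basic, "€" as extension, and "ç", "ê" and "ô" as unsupported, which matches the behaviour described in the question.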

+5
