Parsing ASCII characters with Erlang

Question

What confuses me is what parsing needs to be done, and at which end: the client or the server.

When I send an umlaut 'Ö' to my ejabberd, it is received by ejabberd as <<195,150>>

After that, I send this to my client as a push notification (via GCM / APNS). The client then applies UTF-8 decoding to each number one at a time (which is wrong).

 i.e. 195 is first decoded to a gibberish character on its own, and so on.

The reconstruction needs to know whether to consume two bytes, three, or more per character. That depends on the language of the letters (German here, for example).

How does the client know how many bytes to decode at a time for the language it is recovering?

To add more

 lists:flatten(mochijson2:encode({struct,[{registration_ids,[Reg_id]},{data ,[{message,Message},{type,Type},{enum,ENUM},{groupid,Groupid},{groupname,Groupname},{sender,Sender_list},{receiver,Content_list}]},{time_to_live,2419200}]})).

This produces the following JSON:

 "{\"registration_ids\":[\"APA91bGLjnkhqZlqFEp7mTo9p1vu9s92_A0UIzlUHnhl4xdFTaZ_0HpD5SISB4jNRPi2D7_c8D_mbhUT_k-T2Bo_i_G3Jt1kIqbgQKrFwB3gp1jeGatrOMsfG4gAJSEkClZFFIJEEyow\"],\"data\":{\"message\":[104,105],\"type\":[71,82,79,85,80],\"enum\":2001,\"groupid\":[71,73,68],\"groupname\":[71,114,111,117,112,78,97,109,101],\"sender\":[49,64,100,101,118,108,97,98,47,115,100,115],\"receiver\":[97,115,97,115]},\"time_to_live\":2419200}"

where I gave "hi" as the message, and mochijson gave me the ASCII values [104,105].

 The groupname field was given the value "GroupName", and the ASCII values are also correct after the JSON is created, i.e. 71,114,111,117,112,78,97,109,101

However, when I use http://www.unit-conversion.info/texttools/ascii/

 it decodes this as Ǎo  me and not "GroupName".

So who should take care of the parsing, and how should it be handled?

My reconstructed message is all gibberish when the ASCII values are decoded.

thanks

android erlang xmpp apple-push-notifications ejabberd

ankitrana_ Jun 19 '15 at 5:20

1 answer

I GIVE TERRIBLE ADVICE · Answer 1 · 2015-06-20T14:12:49+0000

What you need to worry about is which meaning is attached to which encoding or data structure. Erlang handles text in one of the following ways:

Lists of bytes ( [0..255, ...] )
- This is what you get if you are listening on a socket and the data is returned as a list.
- The VM assumes no encoding. They are bytes and little more.
- However, the VM may interpret them as strings (for example, in io:format("~s~n", [List]) ). When this happens (with the specific ~s flag), the VM assumes the encoding is Latin-1 (ISO-8859-1).
Lists of Unicode code points ( [0..1114111, ...] )
- You may get these from reading files that are opened as unicode and returned as a list.
- You can use them in output if you have a formatter such as io:format("~ts~n", [List]) , where ~ts is like ~s , but for Unicode.
- These lists hold code points as you see them in the Unicode standard, without any encoding (they are not UTF-x ).
- This can work in conjunction with lists of Latin-1 characters, because Unicode code points and Latin-1 characters have the same numbers up to 255.
Binaries ( <<0..255, ...>> )
- This is what you get if you are listening on or reading from anything in binary format.
- The VM can be told to assume many things about them:
  - They are byte sequences ( 0..255 ) with no specific meaning ( <<Bin/binary>> )
  - They are UTF-8 encoded sequences ( <<Bin/utf8>> )
  - They are UTF-16 encoded sequences ( <<Bin/utf16>> )
  - They are UTF-32 encoded sequences ( <<Bin/utf32>> )
- io:format("~s~n", [Bin]) will still assume that any such sequence is Latin-1; io:format("~ts~n", [Bin]) will assume UTF-8 only.
A mixed list of both Unicode lists and UTF-encoded binaries (known as iodata() ), used exclusively for output.
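
As a rough illustration of the representations listed above, the following sketch shows the same character, 'Ö', in each form. The module name text_forms and its function names are made up for this example; only unicode:characters_to_binary/3, unicode:characters_to_list/2 and binary_to_list/1 are standard library calls.

```erlang
-module(text_forms).
-export([code_points/0, utf8_binary/0, utf8_bytes/0,
         latin1_binary/0, decoded/0]).

%% 'O' with umlaut is Unicode code point 214 (U+00D6);
%% as a code point list it is a single element.
code_points() -> [214].

%% Encoded as UTF-8 the character takes two bytes: 195 and 150.
utf8_binary() ->
    unicode:characters_to_binary(code_points(), unicode, utf8).

%% The same two bytes as a plain list of bytes, the way a socket
%% in list mode would hand them over.
utf8_bytes() -> binary_to_list(utf8_binary()).

%% In Latin-1 the character is the single byte 214.
latin1_binary() ->
    unicode:characters_to_binary(code_points(), unicode, latin1).

%% Decoding the UTF-8 binary as a whole gives back the one code point.
decoded() -> unicode:characters_to_list(utf8_binary(), utf8).
```

Reading 195 and 150 as two separate Latin-1 characters is what produces the gibberish described in the question; the two bytes only mean 'Ö' when they are decoded together as UTF-8.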

So, in a nutshell:

lists of bytes
lists of Latin-1 characters
lists of Unicode code points
binaries of bytes
UTF-8 binaries
UTF-16 binaries
UTF-32 binaries
lists mixing many of these, for quick concatenation

It should also be noted: prior to version 17.0, Erlang source files were treated as Latin-1 by default. A source file can be declared as Unicode for the compiler by adding this header:

 %% -*- coding: utf-8 -*-
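
As a small sketch of where that header goes, the hypothetical module below (the name umlaut is made up) declares its encoding on the first line and returns a string literal containing a non-ASCII character. It assumes the file itself is saved as UTF-8.

```erlang
%% -*- coding: utf-8 -*-
-module(umlaut).
-export([text/0, as_utf8/0]).

%% With the source read as UTF-8, this literal is a list holding
%% one Unicode code point (assuming the file is saved as UTF-8).
text() -> "Ö".

%% Encode the code point list as a UTF-8 binary before sending it out.
as_utf8() -> unicode:characters_to_binary(text(), unicode, utf8).
```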

The next factor is that JSON, by specification, assumes UTF-8 as the encoding for everything it carries. In addition, Erlang's JSON libraries tend to assume that a binary is a string and that a list is a JSON array.

This means that if you want your output to be adequate, you must use UTF-8 encoded binaries to represent any JSON string.

If what you have is:

A list of bytes representing a UTF-8 encoded string: use list_to_binary(List) to get the proper binary representation
A list of code points: use unicode:characters_to_binary(List, unicode, utf8) to get a UTF-8 encoded binary
A binary representing a Latin-1 string: unicode:characters_to_binary(Bin, latin1, utf8)
A binary in any other UTF encoding: unicode:characters_to_binary(Bin, utf16 | utf32, utf8)
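
The four cases above could be wrapped in a small helper along these lines. This is only a sketch: the module name to_utf8 and the tags utf8_bytes, code_points, latin1, utf16 and utf32 are invented here to select a case, since the data itself does not say how it is encoded.

```erlang
-module(to_utf8).
-export([convert/2]).

%% A list of bytes that already holds UTF-8: just pack it into a binary.
convert(utf8_bytes, List) when is_list(List) ->
    list_to_binary(List);
%% A list of Unicode code points: encode it as UTF-8.
convert(code_points, List) when is_list(List) ->
    unicode:characters_to_binary(List, unicode, utf8);
%% A binary in Latin-1 or another UTF encoding: re-encode it as UTF-8.
convert(latin1, Bin) when is_binary(Bin) ->
    unicode:characters_to_binary(Bin, latin1, utf8);
convert(utf16, Bin) when is_binary(Bin) ->
    unicode:characters_to_binary(Bin, utf16, utf8);
convert(utf32, Bin) when is_binary(Bin) ->
    unicode:characters_to_binary(Bin, utf32, utf8).
```

With the question's data, converting Message this way before building the mochijson2 term should make the field come out as a JSON string rather than as the array [104,105].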

Take that UTF-8 binary and send it to the JSON library. If the JSON library is correct and the client parses it properly, then the result should be right.
