Parsing ASCII characters with Erlang

I am confused about what needs to be parsed, and on which end (client or server) the parsing should happen.

When I send an umlaut 'Ö' to my ejabberd, it is received by ejabberd as <<195,150>>

After that, I send this to my client in the form of push notifications (via GCM / APNS). There, the client applies UTF-8 decoding to each number one at a time (this is wrong).

That is, 195 is first decoded on its own to a gibberish character, and so on.

The reconstruction needs to know whether two bytes, three bytes, or more must be decoded together. That depends on the language of the letters (German here, for example).

How does the client identify which language it is going to reconstruct (that is, the number of bytes to decode at a time)?

To add more detail:

 lists:flatten(mochijson2:encode({struct,[{registration_ids,[Reg_id]},{data ,[{message,Message},{type,Type},{enum,ENUM},{groupid,Groupid},{groupname,Groupname},{sender,Sender_list},{receiver,Content_list}]},{time_to_live,2419200}]})). 

This produces the following JSON:

 "{\"registration_ids\":[\"APA91bGLjnkhqZlqFEp7mTo9p1vu9s92_A0UIzlUHnhl4xdFTaZ_0HpD5SISB4jNRPi2D7_c8D_mbhUT_k-T2Bo_i_G3Jt1kIqbgQKrFwB3gp1jeGatrOMsfG4gAJSEkClZFFIJEEyow\"],\"data\":{\"message\":[104,105],\"type\":[71,82,79,85,80],\"enum\":2001,\"groupid\":[71,73,68],\"groupname\":[71,114,111,117,112,78,97,109,101],\"sender\":[49,64,100,101,118,108,97,98,47,115,100,115],\"receiver\":[97,115,97,115]},\"time_to_live\":2419200}" 

where I gave "hi" as the message, and mochijson gave me the ASCII values [104,105].

The groupname field was given the value "Groupname", and the ASCII values are also correct after JSON creation, i.e. 71,114,111,117,112,78,97,109,101

However, when I use http://www.unit-conversion.info/texttools/ascii/

it decodes them as Ǎo  me and not "Groupname".

So who should do the parsing, and how should this be handled?

My reconstructed message is all gibberish when it is rebuilt from the ASCII values.

thanks

1 answer

What you need to worry about is what exactly the data means, relative to the intended encoding or data structure. Erlang handles text in one of the following ways:

  • lists of bytes ( [0..255, ...] )
    • This is what you get if you are listening on a socket and the data is returned as a list.
    • The VM assumes no encoding. They are bytes and little more.
    • However, the VM can interpret them as strings (for example, in io:format("~s~n", [List]) ). When this happens (with the specific ~s flag), the VM assumes the encoding is latin-1 (ISO-8859-1).
  • Lists of Unicode code points ( [0..1114111, ...] ).
    • You can get these from files that are read as unicode and as a list.
    • You can use them in output if you have the right formatter, for example io:format("~ts~n", [List]) , where ~ts is like ~s , but unicode-aware.
    • These lists hold the code points as you see them in the unicode standard, without any encoding (they are not UTF-x )
    • This can work together with lists of latin-1 characters, because unicode code points and latin-1 characters have the same numbers up to 255.
  • Binaries ( <<0..255, ...>> )
    • This is what you get when listening on, or reading from, anything in binary format.
    • The VM can be told to treat them as a number of things:
      • Byte sequences ( 0..255 ) without a specific meaning ( <<Bin/binary>> )
      • utf-8 encoded sequences ( <<Char/utf8>> )
      • utf-16 encoded sequences ( <<Char/utf16>> )
      • utf-32 encoded sequences ( <<Char/utf32>> )
    • io:format("~s~n", [Bin]) will still assume that any sequence is latin-1; io:format("~ts~n", [Bin]) will only accept UTF-8 .
  • A mixed list of both unicode lists and utf-encoded binaries (known as iodata() ), used exclusively for output.
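To make the difference between these forms concrete, here is a minimal sketch that takes the 'Ö' from the question through them. The module name text_forms and its function names are hypothetical, chosen only for illustration; the calls themselves are from the standard unicode module and the bit syntax.

```erlang
-module(text_forms).
-export([demo/0, first_char/1]).

%% The same character, 'Ö' (code point 214), in three of the forms above.
demo() ->
    CodePoints = [214],                               % list of Unicode code points
    Utf8 = unicode:characters_to_binary(CodePoints),  % UTF-8 encoded binary
    Bytes = binary_to_list(Utf8),                     % plain list of bytes
    Decoded = unicode:characters_to_list(Utf8, utf8), % back to code points
    {Utf8, Bytes, Decoded}.

%% The utf8 segment type consumes as many bytes as the first character
%% needs; the length comes from the bytes themselves, not from the language.
first_char(<<Char/utf8, _Rest/binary>>) ->
    Char.
```

Here text_forms:demo() returns {<<195,150>>, [195,150], [214]}: the two bytes 195 and 150 only mean 'Ö' when they are decoded together as UTF-8, which is why decoding them one number at a time produces gibberish.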

So, in a nutshell:

  • lists of bytes
  • lists of latin-1 characters
  • lists of Unicode code points
  • byte binaries
  • utf-8 binaries
  • utf-16 binaries
  • utf-32 binaries
  • mixed lists of several of these, for cheap concatenation

It should also be noted: Erlang source files were originally latin-1 only. Since R16 the compiler can read your source file as UTF-8 if you add this header, and from 17.0 onward UTF-8 is the default source encoding:

 %% -*- coding: utf-8 -*- 

The next factor is that JSON, by specification, assumes UTF-8 as the encoding for everything it contains. In addition, Erlang's JSON libraries tend to assume that a binary is a string and that a list is a JSON array. That is why the lists in your output came out as arrays of numbers.

This means that if you want your output to be correct, you must use UTF-8 encoded binaries to represent any JSON string.
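As a rough sketch of that preparation step, the helper below turns every list value in a property list into a UTF-8 binary before the data is handed to a JSON encoder. The module json_fields and both function names are hypothetical, and the sketch assumes that every list value in the fields really is text (a value that is a genuine list of items would need to be left alone).

```erlang
-module(json_fields).
-export([to_json_string/1, prepare/1]).

%% Turn a code point list (or an already UTF-8 binary) into a UTF-8 binary.
%% Note: on invalid input, unicode:characters_to_binary/3 returns an
%% {error, ...} or {incomplete, ...} tuple, which is passed through as is.
to_json_string(Text) when is_list(Text); is_binary(Text) ->
    unicode:characters_to_binary(Text, unicode, utf8).

%% Assumption: every list value in Fields is a string, not a real list.
prepare(Fields) ->
    [{Key, prep_value(Value)} || {Key, Value} <- Fields].

prep_value(Value) when is_list(Value) -> to_json_string(Value);
prep_value(Value) -> Value.
```

With this, json_fields:prepare([{message, "hi"}, {enum, 2001}]) gives [{message, <<"hi">>}, {enum, 2001}], so the message is a binary by the time it reaches the encoder and the integer is left untouched.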

If you have:

  • A list of bytes representing a utf-8 encoded string: use list_to_binary(List) to get the correct binary representation
  • A list of code points: use unicode:characters_to_binary(List, unicode, utf8) to get a utf-8 encoded binary
  • A binary representing a latin-1 string: unicode:characters_to_binary(Bin, latin1, utf8)
  • A binary in any other UTF encoding: unicode:characters_to_binary(Bin, utf16 | utf32, utf8)
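The four cases above can be sketched as one small module. The module name to_utf8 and its function names are made up for this example; each function wraps exactly the call named in the list, and utf16 here means big-endian, which is what the unicode module assumes when no byte order is given.

```erlang
-module(to_utf8).
-export([from_utf8_bytes/1, from_code_points/1, from_latin1/1, from_utf16/1]).

%% List of bytes that already are UTF-8: just pack them into a binary.
from_utf8_bytes(Bytes) ->
    list_to_binary(Bytes).

%% List of Unicode code points: encode them as UTF-8.
from_code_points(CodePoints) ->
    unicode:characters_to_binary(CodePoints, unicode, utf8).

%% latin-1 binary: re-encode as UTF-8.
from_latin1(Bin) ->
    unicode:characters_to_binary(Bin, latin1, utf8).

%% UTF-16 (big-endian) binary: re-encode as UTF-8.
from_utf16(Bin) ->
    unicode:characters_to_binary(Bin, utf16, utf8).
```

For the 'Ö' from the question, all four inputs end up as the same binary: [195,150] as bytes, [214] as a code point, <<214>> as latin-1 and <<0,214>> as UTF-16 each give <<195,150>>.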

Take this UTF-8 binary and pass it to the JSON library. If the JSON library is correct and the client decodes the JSON as UTF-8, the text should come out right.
