JavaScript: Unicode character to byte-based hexadecimal escape sequences (NOT surrogates)

In JavaScript, I am trying to turn Unicode text into a byte sequence of hexadecimal escape sequences that is compatible with C:

T. 😄

becomes: \xF0\x9F\x98\x84 (correct)

NOT javascript surrogates, not \uD83D\uDE04 (incorrect)

I cannot work out the mathematical relationship between the four bytes C wants and the two surrogates JavaScript uses. I suspect the algorithm is more complicated than my weak attempts.

Thanks for any advice.

3 answers

encodeURIComponent works:

    var input = "\uD83D\uDE04";
    var result = encodeURIComponent(input).replace(/%/g, "\\x");
    // \xF0\x9F\x98\x84
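Note that encodeURIComponent leaves unreserved ASCII characters (letters, digits, and `-_.!~*'()`) untouched, so only the non-ASCII bytes come out as `\xNN` escapes. A quick sketch of that behaviour (the helper name `toCEscapes` is mine, not from the answer):

```javascript
// Wraps the one-liner above in a hypothetical helper for illustration.
function toCEscapes(input) {
  return encodeURIComponent(input).replace(/%/g, "\\x");
}

console.log(toCEscapes("\uD83D\uDE04"));    // "\xF0\x9F\x98\x84"
console.log(toCEscapes("T. \uD83D\uDE04")); // "T.\x20\xF0\x9F\x98\x84" ("T" and "." pass through)
```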

Update: actually, C strings can contain digits and letters without escaping, but if you really need to escape them too:

    function escape(s, escapeEverything) {
      if (escapeEverything) {
        // Temporarily mark printable ASCII with "-x"; the "%" escapes come next.
        s = s.replace(/[\x10-\x7f]/g, function (s) {
          return "-x" + s.charCodeAt(0).toString(16).toUpperCase();
        });
      }
      s = encodeURIComponent(s).replace(/%/g, "\\x");
      if (escapeEverything) {
        s = s.replace(/\-/g, "\\");
      }
      return s;
    }

Your C code expects a UTF-8 string (where this character is represented as 4 bytes). The JS representation you see is UTF-16, however (the character is represented as two uint16 code units, a surrogate pair).
First you need to get the (Unicode) code point of your character from the JS UTF-16 string, and then build the UTF-8 representation from it.

With ES6, you can use the codePointAt method for the first part; I would recommend polyfilling it where it is not supported, since I think you do not want to decode surrogate pairs yourself :-)
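For reference, the surrogate-pair arithmetic that codePointAt performs internally looks like this (a sketch for illustration; the variable names are mine):

```javascript
var s = "\uD83D\uDE04"; // 😄 as a UTF-16 surrogate pair
var hi = s.charCodeAt(0); // 0xD83D (high surrogate)
var lo = s.charCodeAt(1); // 0xDE04 (low surrogate)

// Each surrogate carries 10 bits; the pair encodes code points >= U+10000.
var cp = ((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000;

console.log(cp.toString(16));         // "1f604"
console.log(cp === s.codePointAt(0)); // true
```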
Otherwise, I don't think there is a built-in method for the second part, but you can write it yourself according to the specification:

    function hex(x) {
      x = x.toString(16);
      return (x.length > 2 ? "\\u0000" : "\\x00").slice(0, -x.length) + x.toUpperCase();
    }

    var c = "😄";
    console.log(c.length, hex(c.charCodeAt(0)) + hex(c.charCodeAt(1)));
    // 2, "\uD83D\uDE04"

    var cp = c.codePointAt(0);
    var bytes = new Uint8Array(4);
    bytes[3] = 0x80 | cp & 0x3F;
    bytes[2] = 0x80 | (cp >>>= 6) & 0x3F;
    bytes[1] = 0x80 | (cp >>>= 6) & 0x3F;
    bytes[0] = 0xF0 | (cp >>>= 6) & 0x3F;
    console.log(Array.prototype.map.call(bytes, hex).join(""));
    // "\xF0\x9F\x98\x84"

(tested in Chrome)
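In current engines you can also skip the manual bit-twiddling entirely: TextEncoder (standard in browsers and in Node.js) always encodes to UTF-8. A sketch of the same conversion:

```javascript
// TextEncoder.encode() returns the UTF-8 bytes of a string as a Uint8Array.
var bytes = new TextEncoder().encode("😄"); // Uint8Array [240, 159, 152, 132]

var escaped = Array.from(bytes, function (b) {
  return "\\x" + b.toString(16).toUpperCase().padStart(2, "0");
}).join("");

console.log(escaped); // "\xF0\x9F\x98\x84"
```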


Found a solution here: http://jonisalonen.com/2012/from-utf-16-to-utf-8-in-javascript/

I would never have figured out this math myself, wow.

Slightly minified:

    function UTF8seq(s) {
      var i, c, u = [];
      for (i = 0; i < s.length; i++) {
        c = s.charCodeAt(i);
        if (c < 0x80) {
          u.push(c); // 1 byte: ASCII
        } else if (c < 0x800) {
          u.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f)); // 2 bytes
        } else if (c < 0xd800 || c >= 0xe000) {
          u.push(0xe0 | (c >> 12), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f)); // 3 bytes
        } else {
          // surrogate pair: combine with the next code unit, then 4 bytes
          i++;
          c = 0x10000 + (((c & 0x3ff) << 10) | (s.charCodeAt(i) & 0x3ff));
          u.push(0xf0 | (c >> 18), 0x80 | ((c >> 12) & 0x3f),
                 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
        }
      }
      for (i = 0; i < u.length; i++) {
        u[i] = u[i].toString(16);
      }
      return '\\x' + u.join('\\x');
    }
