How to determine if a string contains multibyte characters in Javascript?

Question


Is it possible in Javascript to detect whether a string contains multibyte characters? If so, is it possible to tell which ones?

The problem I'm running into is this (apologies if the Unicode character doesn't display properly for you):

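    s = "𝌆";
    alert(s.length);    // '2'
    alert(s.charAt(0)); // '\uD834', an unpaired high surrogate (shown as a placeholder glyph)
    alert(s.charAt(1)); // '\uDF06', an unpaired low surrogate (shown as a placeholder glyph)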

Edit for a little clarity here (hopefully). As I now understand it, all strings in Javascript are represented as a series of UTF-16 code units, which means that regular characters actually take up 2 bytes (16 bits), so my use of "multibyte" in the title was a bit off. Some characters do not fall in the Basic Multilingual Plane (BMP), such as the one in the example above, and so they take up two code units (32 bits). That is the question I was asking. I'm also not editing the original title, since to someone who doesn't know much about this material (and who would therefore be searching SO for information about it), "multibyte" makes sense.

+16

javascript string internationalization multibyte

nickf Feb 02 '11 at 16:56


2 answers

This is my implementation for showing larger emojis when the message contains no other text.

Markup

    <div>
      <input id="message" placeholder="Nice support for one or multiple emojis">
      <button id="post-message">Send</button>
      <ul id="messages"></ul>
    </div>

Script

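    // True when the string consists solely of surrogate code units,
    // i.e. only of characters outside the BMP (which covers most emoji)
    function jumbotron(str) {
        return /^[\uD800-\uDFFF]+$/.test(str);
    }

    document.getElementById('post-message').onclick = function() {
        var message = document.getElementById('message').value;
        var list_element = document.createElement('li');
        var list_element_span = document.createElement('span');

        // textContent rather than innerHTML, so user input is not parsed as HTML
        list_element_span.textContent = message;
        list_element.appendChild(list_element_span);

        // Enlarge messages that contain nothing but emoji
        if (jumbotron(message)) {
            list_element_span.style.fontSize = '2em';
            list_element_span.style.lineHeight = 'normal';
        }

        document.getElementById('messages').appendChild(list_element);
    };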
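One caveat with the surrogate-only test above: it only recognises emoji that are stored as surrogate pairs, so emoji inside the BMP, such as U+263A, are treated as ordinary text. A possible alternative is sketched below, assuming an engine that supports Unicode property escapes in regular expressions (ES2018 or later). The names `isSurrogateOnly` and `isEmojiOnly` are hypothetical, and the character classes are an approximation rather than a complete emoji grammar.

```javascript
// Same test as the answer's jumbotron(): every code unit must be a surrogate.
function isSurrogateOnly(str) {
  return /^[\uD800-\uDFFF]+$/.test(str);
}

// Hypothetical alternative: accept pictographic characters, skin-tone
// modifiers, the zero-width joiner and the emoji variation selector.
// Assumes the engine supports \p{...} with the "u" flag.
function isEmojiOnly(str) {
  return /^(?:\p{Extended_Pictographic}|\p{Emoji_Modifier}|\u200D|\uFE0F)+$/u.test(str);
}

console.log(isSurrogateOnly("\u263A"));      // false: U+263A is a single BMP code unit
console.log(isEmojiOnly("\u{1F600}"));       // emoji only
console.log(isEmojiOnly("hi \u{1F600}"));    // mixed text and emoji
```

Either function could be passed to the click handler in place of `jumbotron`; which one fits depends on the browsers that need to be supported.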

0

Henrik Albrechtsson Dec 05 '17 at 2:43

Tim Down · Accepted Answer · 2011-02-03 10:36

JavaScript strings are sequences of 16-bit code units (often described as UCS-2), but they can represent Unicode code points outside the Basic Multilingual Plane (the BMP covers U+0000 to U+D7FF and U+E000 to U+FFFF) using two 16-bit numbers (a UTF-16 surrogate pair). The first of the pair, the high surrogate, lies in the range U+D800 to U+DBFF, and the second, the low surrogate, lies in U+DC00 to U+DFFF.

Based on this, it is easy to detect whether a string contains any characters that lie outside the Basic Multilingual Plane (which is what I think you are asking: you want to identify whether a string contains any characters outside the range of code points that JavaScript represents as a single character):
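To make the pairing concrete, here is a minimal sketch of the arithmetic that maps a code point above U+FFFF onto its two surrogates, using the character from the question (U+1D306). The function name `toSurrogatePair` is hypothetical and the input is assumed to lie between U+10000 and U+10FFFF.

```javascript
// Split a supplementary code point (U+10000 to U+10FFFF) into its
// UTF-16 surrogate pair. No range checking is done in this sketch.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;      // 20-bit value
  var high = 0xD800 + (offset >> 10);    // top 10 bits, lands in D800..DBFF
  var low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits, lands in DC00..DFFF
  return [high, low];
}

var pair = toSurrogatePair(0x1D306);
console.log(pair[0].toString(16), pair[1].toString(16)); // d834 df06
console.log(String.fromCharCode(pair[0], pair[1]) === "\uD834\uDF06"); // true
```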

    function containsSurrogatePair(str) {
        return /[\uD800-\uDFFF]/.test(str);
    }

    alert( containsSurrogatePair("foo") ); // false
    alert( containsSurrogatePair("f𝌆") ); // true
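Note that the test above also returns true for an unpaired surrogate, since it looks for any single surrogate code unit. If the goal is to list which characters lie outside the BMP, one option is to match only well-formed pairs (a high surrogate followed immediately by a low surrogate). A minimal sketch, with `findAstralCharacters` as a hypothetical name:

```javascript
// Return every well-formed surrogate pair in the string, in order.
// Unpaired surrogates are ignored by this pattern.
function findAstralCharacters(str) {
  return str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
}

console.log(findAstralCharacters("f\uD834\uDF06o").length); // 1 (the character U+1D306)
console.log(findAstralCharacters("foo").length);            // 0
console.log(findAstralCharacters("\uD834").length);         // 0, a lone high surrogate is not a pair
```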

Working out exactly which code points are in your string is a little harder and requires a UTF-16 decoder. The following converts a string into an array of Unicode code points:

    var getStringCodePoints = (function() {
        function surrogatePairToCodePoint(charCode1, charCode2) {
            return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
        }

        // Read string in character by character and create an array of code points
        return function(str) {
            var codePoints = [], i = 0, charCode;
            while (i < str.length) {
                charCode = str.charCodeAt(i);
                if ((charCode & 0xF800) == 0xD800) {
                    codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
                } else {
                    codePoints.push(charCode);
                }
                ++i;
            }
            return codePoints;
        }
    })();

    alert( getStringCodePoints("f𝌆").join(",") ); // 102,119558
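For comparison, engines that implement ES2015 provide `String.prototype.codePointAt` and iterate strings by code point, so the same result can be obtained without a hand-written decoder. A minimal sketch, assuming such an engine; `getCodePoints` is a hypothetical name and not part of the answer above:

```javascript
// Array.from walks the string with its iterator, which yields one
// whole code point (one or two code units) per step.
function getCodePoints(str) {
  return Array.from(str, function (ch) {
    return ch.codePointAt(0);
  });
}

console.log(getCodePoints("f\uD834\uDF06").join(",")); // 102,119558
```

Unlike the decoder above, this version reports an unpaired surrogate as its own code unit value instead of combining it with whatever follows.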
