String length in bytes in JavaScript

Question


In my JavaScript code, I need to write a message to the server in this format:

 <size in bytes>CRLF
 <data>CRLF

Example:

 3
 foo

Data may contain Unicode characters. I need to send them as UTF-8.

I am looking for the most cross-browser way to calculate string length in bytes in JavaScript.

I tried this to make up the payload:

 return unescape(encodeURIComponent(str)).length + "\n" + str + "\n"

But this does not give accurate results in older browsers (or maybe strings in those browsers are stored as UTF-16?).

Any clues?

Update:

Example: the length in bytes of the string ! Naïve? in UTF-8 is 15 bytes, but some browsers report 23 bytes instead.

+81

javascript unicode

Alexander Gladysh Apr 01 '11 at 15:59


12 answers

Years have passed, and now you can do it natively:

 (new TextEncoder().encode('foo')).length

Note that, at the time of writing, it is not supported by IE or Edge (you can use a polyfill for those).
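As a minimal sketch of how this could be applied to the message format from the question, the hypothetical helper frameMessage below (an illustrative name, not part of any API) prefixes the data with its UTF-8 byte count. It assumes a runtime where TextEncoder is available as a global.

```javascript
// Hypothetical helper: builds "<size in bytes>CRLF<data>CRLF".
// Assumes TextEncoder exists as a global (modern browsers, recent Node.js).
function frameMessage(data) {
  // encode() returns a Uint8Array of UTF-8 bytes; its length is the byte count.
  const size = new TextEncoder().encode(data).length;
  return size + "\r\n" + data + "\r\n";
}

// JSON.stringify makes the CR and LF characters visible in the output.
console.log(JSON.stringify(frameMessage("foo")));
```

The size is computed from the encoded bytes rather than from data.length, which counts UTF-16 code units and therefore differs for any non-ASCII input.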

MDN Documentation

Standard specifications

+81

Riccardo Galli Dec 17 '15 at 10:21


Here is a much faster version that uses neither regular expressions nor encodeURIComponent():

 function byteLength(str) {
   // returns the byte length of a UTF-8 string
   var s = str.length;
   for (var i = str.length - 1; i >= 0; i--) {
     var code = str.charCodeAt(i);
     if (code > 0x7f && code <= 0x7ff) s++;
     else if (code > 0x7ff && code <= 0xffff) s += 2;
     if (code >= 0xDC00 && code <= 0xDFFF) i--; // trail surrogate: skip its lead surrogate
   }
   return s;
 }

Here is a performance comparison.

It simply computes the number of UTF-8 bytes contributed by each UTF-16 code unit returned by charCodeAt() (based on Wikipedia's descriptions of UTF-8 and of UTF-16 surrogate characters).

This follows RFC 3629 (where UTF-8 characters are no more than 4 bytes long).
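The same arithmetic can be written over whole code points instead of UTF-16 code units. The following is only a sketch, assuming ES2015 string iteration and codePointAt are available; utf8Length is a hypothetical name and not the function from this answer.

```javascript
// Hypothetical variant: sums the UTF-8 width of each code point (RFC 3629 ranges).
function utf8Length(str) {
  let bytes = 0;
  // for...of iterates by code point, so a surrogate pair is visited once.
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (cp <= 0x7f) bytes += 1;        // U+0000..U+007F
    else if (cp <= 0x7ff) bytes += 2;  // U+0080..U+07FF
    else if (cp <= 0xffff) bytes += 3; // U+0800..U+FFFF
    else bytes += 4;                   // U+10000..U+10FFFF
  }
  return bytes;
}

console.log(utf8Length("foo"), utf8Length("\u00ef"), utf8Length("\u20ac"), utf8Length("\u{1F600}"));
```

This trades some speed for readability, since iterating by code point removes the need for the explicit surrogate check.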

+57

lovasoa Apr 27 '14 at 9:42


For simple UTF-8 encoding, with slightly better compatibility than TextEncoder, Blob does the trick. However, it does not work in very old browsers.

 new Blob(["😀"]).size; // -> 4

+38

simap Mar 09 '17 at 0:41


This function will return the byte size of any UTF-8 string you pass to it.

 function byteCount(s) { return encodeURI(s).split(/%..|./).length - 1; }
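To see why the one-liner works, it may help to look at the intermediate string. The sketch below repeats byteCount so that it is self-contained and prints what encodeURI produces: each UTF-8 byte of a non-ASCII character becomes one %XX escape, and the regular expression consumes either one escape or one plain character per match.

```javascript
// Repeated from the answer above so this block runs on its own.
function byteCount(s) {
  return encodeURI(s).split(/%..|./).length - 1;
}

// "a" stays as is; "ï" (U+00EF) becomes the two escapes %C3 and %AF.
const encoded = encodeURI("a\u00ef");
console.log(encoded);

// Splitting on three matches ("a", "%C3", "%AF") yields four empty pieces,
// so length - 1 is the number of bytes.
console.log(byteCount("a\u00ef"));
```

One caveat: encodeURI throws a URIError for strings that contain an unpaired surrogate, so this approach only suits well-formed input.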

Source

+29

Lauri Oherd Aug 30 '12 at 18:56


Another very simple approach using Buffer (NodeJS only):

 Buffer.from(string).length

+14

Iván Pérez Sep 20 '17 at 11:43


Actually, I figured out what was wrong. For the code to work, the page's <head> must contain this tag:

 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Or, as suggested in the comments, it should also work if the server declares the charset in the HTTP Content-Type header.

Then the results from different browsers are consistent.

Here is an example:

 <html>
 <head>
   <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
   <title>mini string length test</title>
 </head>
 <body>
 <script type="text/javascript">
 document.write('<div style="font-size:100px">'
     + (unescape(encodeURIComponent("! Naïve?")).length) + '</div>'
 );
 </script>
 </body>
 </html>

Note: I suspect that specifying any (accurate) encoding would fix the problem. It is just a coincidence that I need UTF-8.

+4

Alexander Gladysh Apr 01 '11 at 16:25


Here is an independent and efficient method for counting bytes of UTF-8 strings.

 // count UTF-8 bytes of a string
 function byteLengthOf(s) {
   // assuming the String is UCS-2 (aka UTF-16) encoded
   var n = 0;
   for (var i = 0, l = s.length; i < l; i++) {
     var hi = s.charCodeAt(i);
     if (hi < 0x0080) { // [0x0000, 0x007F]
       n += 1;
     } else if (hi < 0x0800) { // [0x0080, 0x07FF]
       n += 2;
     } else if (hi < 0xD800) { // [0x0800, 0xD7FF]
       n += 3;
     } else if (hi < 0xDC00) { // [0xD800, 0xDBFF]
       var lo = s.charCodeAt(++i);
       if (i < l && lo >= 0xDC00 && lo <= 0xDFFF) { // followed by [0xDC00, 0xDFFF]
         n += 4;
       } else {
         throw new Error("UCS-2 String malformed");
       }
     } else if (hi < 0xE000) { // [0xDC00, 0xDFFF]
       throw new Error("UCS-2 String malformed");
     } else { // [0xE000, 0xFFFF]
       n += 3;
     }
   }
   return n;
 }

 var s = "\u0000\u007F\u07FF\uD7FF\uDBFF\uDFFF\uFFFF";
 console.log("expect byteLengthOf(s) to be 14, actually it is %s.", byteLengthOf(s));

Note that the method throws an error if the input string is malformed, that is, if it contains an unpaired surrogate.
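If an exception is not wanted, one option is to test for unpaired surrogates before counting. The helper below is a hypothetical sketch (hasLoneSurrogate is an illustrative name); it uses a regular expression without the u flag, so that it works on UTF-16 code units.

```javascript
// Hypothetical helper: true if the string contains an unpaired surrogate.
// First alternative: a lead surrogate not followed by a trail surrogate.
// Second alternative: a trail surrogate not preceded by a lead surrogate.
var LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/;

function hasLoneSurrogate(s) {
  return LONE_SURROGATE.test(s);
}

console.log(hasLoneSurrogate("\uD83D\uDE00")); // a valid pair
console.log(hasLoneSurrogate("a\uDE00"));      // a trail surrogate on its own
```

A caller could then decide whether to reject such input, replace the stray code units, or count them in some agreed way before measuring the length.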

+3

fuweichin Jan 21 '16 at 9:56


It took me a while to find a solution for React Native, so I will post it here:

First install the buffer package:

 npm install --save buffer

Then use the node method:

 const { Buffer } = require('buffer');
 const length = Buffer.byteLength(string, 'utf-8');

+3

laurent Feb 15 '18 at 18:01


This will work for BMP and SIP / SMP characters.

 String.prototype.lengthInUtf8 = function() {
   // ASCII characters take one byte each.
   var ascii = this.match(/[\u0000-\u007f]/g);
   var asciiLength = ascii ? ascii.length : 0;
   // Strip the ASCII characters, percent-encode the rest, and count one byte per '%'.
   var escapes = encodeURI(this.replace(/[\u0000-\u007f]/g, '')).match(/%/g);
   var multiByteLength = escapes ? escapes.length : 0;
   return asciiLength + multiByteLength;
 }

 'test'.lengthInUtf8(); // returns 4
 '\u{2f894}'.lengthInUtf8(); // returns 4
 'سلام علیکم'.lengthInUtf8(); // returns 19, each Arabic/Persian alphabet character takes 2 bytes.
 '你好，JavaScript 世界'.lengthInUtf8(); // returns 26, each Chinese character/punctuation takes 3 bytes.

+1

chrislau Dec 21 '16 at 19:17


In NodeJS, Buffer.byteLength is a method specifically for this purpose:

 let strLengthInBytes = Buffer.byteLength(str); // byte length of str when encoded as UTF-8

Note that by default the method computes the length for UTF-8 encoding. If another encoding is required, pass it as the second argument.
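To illustrate the second argument, here is a short sketch that measures the same one-character string under three encodings. It assumes a Node.js environment, where Buffer is a global.

```javascript
// Assumes Node.js, where Buffer is available without an import.
const euro = "\u20ac"; // the euro sign, a single UTF-16 code unit

const asUtf8 = Buffer.byteLength(euro);               // default encoding is UTF-8
const asUtf16 = Buffer.byteLength(euro, "utf16le");   // two bytes per code unit
const asLatin1 = Buffer.byteLength(euro, "latin1");   // one byte per code unit

console.log(asUtf8, asUtf16, asLatin1);
```

The numbers differ because each encoding maps the same character to a different number of bytes, which is why the encoding expected by the server has to be known before the size prefix is computed.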

+1

Boaz May 7 '19 at 15:40


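You can try the following, which counts every character outside the Latin-1 range (U+0000 to U+00FF) as two bytes. Note that this is not the same as the UTF-8 length: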

 function getLengthInBytes(str) { var b = str.match(/[^\x00-\xff]/g); return (str.length + (!b ? 0: b.length)); }

This works for me.

0

anh tran Jan 10 '13 at 4:48


Mike Samuel · Accepted Answer · 2011-04-01 16:07

~~There is no way to do this in JavaScript natively.~~ (See Riccardo Galli's answer for a modern approach.)

The rest of this answer is kept for historical reference, or for cases where the TextEncoder APIs are still unavailable.

If you know the character encoding, you can calculate it yourself.

encodeURIComponent assumes UTF-8 as the character encoding, so if you need that encoding, you can do:

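 function lengthInUtf8Bytes(str) {
   // Matches only the 10.. bytes that are non-initial characters in a multi-byte sequence.
   var m = encodeURIComponent(str).match(/%[89ABab]/g);
   // A surrogate pair is two UTF-16 code units but a single code point, so count it once.
   var pairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
   return str.length - (pairs ? pairs.length : 0) + (m ? m.length : 0);
 }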

This works because of the way UTF-8 encodes multi-byte sequences. The first encoded byte always starts either with a high bit of zero (for a single-byte sequence) or with a byte whose first hexadecimal digit is C, D, E, or F. The second and subsequent bytes are the ones whose first two bits are 10. Those are the extra bytes you want to count in UTF-8.

The table in Wikipedia makes it clearer:

 Bits  Last code point  Byte 1    Byte 2    Byte 3
 7     U+007F           0xxxxxxx
 11    U+07FF           110xxxxx  10xxxxxx
 16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
 ...
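The bit patterns in the table can be checked directly. The sketch below is an illustration only: utf8Bits and continuationCount are hypothetical names, and the code assumes TextEncoder is available as a global.

```javascript
// Hypothetical helper: the UTF-8 bytes of a string as 8-digit binary strings.
function utf8Bits(s) {
  // Array.from with a mapper returns a plain array of strings
  // (Uint8Array.prototype.map would coerce the results back to numbers).
  return Array.from(new TextEncoder().encode(s), function (b) {
    return b.toString(2).padStart(8, "0");
  });
}

// Hypothetical helper: counts the continuation bytes, i.e. those starting with 10.
function continuationCount(s) {
  return utf8Bits(s).filter(function (bits) {
    return bits.slice(0, 2) === "10";
  }).length;
}

console.log(utf8Bits("\u00ef").join(" ")); // ï, a two-byte sequence
console.log(utf8Bits("\u20ac").join(" ")); // €, a three-byte sequence
console.log(continuationCount("\u20ac"));
```

Each character contributes exactly one leading byte, so adding the number of continuation bytes to the number of characters (code points) gives the total byte count.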

If instead you need the length in the page's own encoding, you can use this trick:

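 function lengthInPageEncoding(s) {
   // Let the browser percent-encode the string by assigning it as a URL fragment.
   var a = document.createElement('A');
   a.href = '#' + s;
   var sEncoded = a.href;
   sEncoded = sEncoded.substring(sEncoded.indexOf('#') + 1);
   // Each %XX escape is three characters but one byte; the i flag also matches uppercase hex digits.
   var m = sEncoded.match(/%[0-9a-f]{2}/gi);
   return sEncoded.length - (m ? m.length * 2 : 0);
 }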
