Conversion between UTF-8 ArrayBuffer and String

Question

I have an ArrayBuffer that contains a string encoded as UTF-8, and I can't find a standard way of converting such an ArrayBuffer into a JS String (which, as I understand it, is encoded as UTF-16).

I've seen this code in numerous places, but I fail to see how it would work with any UTF-8 code points that are longer than 1 byte.

return String.fromCharCode.apply(null, new Uint8Array(data));

Similarly, I can't find a standard way of converting from a String to a UTF-8 encoded ArrayBuffer.

@LightStyle Thanks, completely missed that spelling mistake! :P
– Tom Leese, Commented Jun 19, 2013 at 13:06
var uintArray = new Uint8Array("string".split('').map(function(char) {return char.charCodeAt(0);}));
– Niccolò Campolungo, Commented Jun 19, 2013 at 13:10
If that is what you need, I can explain it in an answer; otherwise I'll leave it as a comment ;)
– Niccolò Campolungo, Commented Jun 19, 2013 at 13:16
Will that definitely work on UTF-8 code points that are longer than 1 byte?
– Tom Leese, Commented Jun 19, 2013 at 13:19
The one-liner you posted will decode bytes in the range 0x00–0xFF to their corresponding Unicode code points U+0000–U+00FF. In other words, it can’t represent anywhere near the whole Unicode range. However, it just so happens that Unicode code points U+0000–U+00FF correspond exactly to ISO 8859-1 (Latin 1), so what you have written is in effect an ISO 8859-1 decoder. LightStyle’s one-liner is the encoder that corresponds to the decoder in the question. In other words, it is an ISO 8859-1 encoder.
– Daniel Cassidy, Commented Mar 24, 2014 at 14:40
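
To make the point in the comments concrete, here is a small sketch that compares the one-liner from the question with a real UTF-8 decode. The sample bytes are an illustrative choice (the UTF-8 encoding of "é", U+00E9), and the variable names are made up for this example.

```javascript
// UTF-8 encoding of "é" (U+00E9) is the two bytes 0xC3 0xA9.
const bytes = new Uint8Array([0xc3, 0xa9]);

// The one-liner maps each byte to the code point with the same value,
// i.e. it decodes the bytes as ISO 8859-1 (Latin 1).
const asLatin1 = String.fromCharCode.apply(null, bytes);

// A UTF-8 decoder combines the two bytes into a single code point.
const asUtf8 = new TextDecoder("utf-8").decode(bytes);

console.log(asLatin1); // "Ã©" (two characters: U+00C3, U+00A9)
console.log(asUtf8);   // "é" (one character: U+00E9)
```

For pure ASCII input both approaches give the same result, which is probably why the one-liner appears to work in many places.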

LWC · Accepted Answer · 2022-07-01 11:36:18Z

111

Using TextEncoder and TextDecoder

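// TextEncoder always produces UTF-8; its constructor takes no encoding argument.
var uint8array = new TextEncoder().encode("Plain Text");
// TextDecoder defaults to "utf-8"; the label is spelled out here for clarity.
var string = new TextDecoder("utf-8").decode(uint8array);
console.log(uint8array, string);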
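
Since the question asks about an ArrayBuffer rather than a Uint8Array, here is a minimal sketch of both directions. The helper names stringToUtf8Buffer and utf8BufferToString are hypothetical, chosen only for this example.

```javascript
// Hypothetical helper: String -> ArrayBuffer holding the UTF-8 bytes.
function stringToUtf8Buffer(str) {
  const bytes = new TextEncoder().encode(str); // Uint8Array
  // Copy exactly the view's range, so this also works if the view
  // does not span its whole underlying buffer.
  return bytes.buffer.slice(bytes.byteOffset, bytes.byteOffset + bytes.byteLength);
}

// Hypothetical helper: ArrayBuffer (or typed array) of UTF-8 bytes -> String.
function utf8BufferToString(buffer) {
  // decode() accepts an ArrayBuffer or any ArrayBuffer view.
  return new TextDecoder("utf-8").decode(buffer);
}

// U+2661 takes 3 bytes, the space 1 byte, U+1F4A9 takes 4 bytes.
const buffer = stringToUtf8Buffer("\u2661 \uD83D\uDCA9");
console.log(buffer.byteLength);          // 8
console.log(utf8BufferToString(buffer)); // the original string
```

Because decode() takes the ArrayBuffer directly, wrapping it in a Uint8Array first is optional.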

edited Jul 1, 2022 at 11:36 by LWC

answered Dec 16, 2016 at 8:47 by PPB

8 Comments

Benproductions1 Over a year ago

Support for this feature is sorely lacking in IE and Edge.

PeterS Over a year ago

And for some reason there is only a polyfill for TextEncoder, I'm assuming TextDecoding just simply wouldn't work in IE right now.

MaMazav Over a year ago

Note that the TextEncoder constructor doesn't accept any argument (it's always UTF-8, no matter what you pass in). The decoder, however, does accept an argument (the documentation and the actual behavior agree on this).

Qix - MONICA WAS MISTREATED Over a year ago

@JosephGarrone "plain text" isn't a term that is restricted to cryptography...

uryga Over a year ago

For anyone coming across this question in 2021, every major browser supports TextEncoder/Decoder now: caniuse.com/textencoder
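
The constructor behavior described in the comments above can be checked directly. This is a small sketch; the UTF-16LE sample bytes are an illustrative choice, not something from the original answer.

```javascript
// Arguments passed to TextEncoder are ignored; the encoding is always UTF-8.
const encoder = new TextEncoder("utf-16le");
console.log(encoder.encoding); // "utf-8"

// TextDecoder does honor its label argument.
const utf16Bytes = new Uint8Array([0x68, 0x00, 0x69, 0x00]); // "hi" in UTF-16LE
const decoded = new TextDecoder("utf-16le").decode(utf16Bytes);
console.log(decoded); // "hi"

// Without an argument the decoder defaults to UTF-8.
console.log(new TextDecoder().encoding); // "utf-8"
```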


Anna · Accepted Answer · 2019-11-02 19:56:55Z

46

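function stringToUint(string) {
    // unescape(encodeURIComponent(...)) yields one char per UTF-8 byte;
    // btoa then Base64-encodes it, so the array holds Base64 text, not raw UTF-8.
    var base64 = btoa(unescape(encodeURIComponent(string))),
        charList = base64.split(''),
        uintArray = [];
    for (var i = 0; i < charList.length; i++) {
        uintArray.push(charList[i].charCodeAt(0));
    }
    return new Uint8Array(uintArray);
}

function uintToString(uintArray) {
    // Reverses stringToUint: bytes -> Base64 text -> UTF-8 byte chars -> string.
    var encodedString = String.fromCharCode.apply(null, uintArray),
        decodedString = decodeURIComponent(escape(atob(encodedString)));
    return decodedString;
}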

I wrote these small functions with some help from the internet; they should solve your problem. Here is the working JSFiddle.

EDIT:

Since the source of the Uint8Array is external and you can't use atob, you just need to remove it (working fiddle):

function uintToString(uintArray) {
    var encodedString = String.fromCharCode.apply(null, uintArray),
        decodedString = decodeURIComponent(escape(encodedString));
    return decodedString;
}

Warning: escape and unescape are deprecated and no longer recommended for new code. See this.
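
For readers wondering why the escape/unescape trick produces UTF-8 at all, here is a sketch of the same idea without the Base64 step, with the intermediate values spelled out. The function names are hypothetical, and the deprecation warning above still applies.

```javascript
// Hypothetical helper: String -> Uint8Array of UTF-8 bytes (legacy trick).
function stringToUtf8Bytes(str) {
  // encodeURIComponent percent-encodes the UTF-8 bytes of each
  // non-ASCII character, e.g. "é" -> "%C3%A9".
  // unescape turns every %XX into one character with char code 0xXX,
  // giving a "binary string" with one character per UTF-8 byte.
  const binary = unescape(encodeURIComponent(str));
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Hypothetical helper: Uint8Array of UTF-8 bytes -> String (legacy trick).
function utf8BytesToString(bytes) {
  // One character per byte, then escape -> "%C3%A9",
  // which decodeURIComponent interprets as UTF-8.
  const binary = String.fromCharCode.apply(null, bytes);
  return decodeURIComponent(escape(binary));
}

console.log(Array.from(stringToUtf8Bytes("\u00e9")));        // [195, 169]
console.log(utf8BytesToString(new Uint8Array([0xc3, 0xa9]))); // "é"
```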

edited Nov 2, 2019 at 19:56 by Anna

answered Jun 19, 2013 at 13:42 by Niccolò Campolungo

11 Comments

Esailija Over a year ago

atob/btoa do base64 encoding/decoding; if you pass an honest UTF-8 byte array, it won't work: jsfiddle.net/Z9pQE/1

Niccolò Campolungo Over a year ago

It is meant to work only with a Uint8Array produced from an encoded string; otherwise it won't work because of the btoa and atob conversion.

Niccolò Campolungo Over a year ago

Done. The same is true for the stringToUint function, just remove the btoa function and you're done :)

Pengő Dzsó Over a year ago

You saved my day! Just one addition, that if you use it with huge arrays, you can easily get: [Error] RangeError: Maximum call stack size exceeded. To fix that I use .slice() and apply it in chunks

Tchakabam Over a year ago

This answer is outdated, go here: stackoverflow.com/questions/6965107/…
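
The chunking workaround mentioned in the comments could look roughly like the sketch below. The function names and the chunk size of 0x8000 are assumptions made for this example; the comment itself used .slice(), while subarray() is used here to avoid copying.

```javascript
// Hypothetical helper: build the one-char-per-byte string in chunks so that
// Function.prototype.apply never receives a huge argument list.
function bytesToBinaryString(bytes, chunkSize = 0x8000) {
  let result = "";
  for (let i = 0; i < bytes.length; i += chunkSize) {
    // subarray() returns a view on the same memory, not a copy.
    result += String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize));
  }
  return result;
}

// Hypothetical helper: chunked variant of the second uintToString above.
// The UTF-8 decode runs on the whole string, so a multi-byte sequence
// that straddles a chunk boundary is still decoded correctly.
function uintToStringChunked(bytes) {
  return decodeURIComponent(escape(bytesToBinaryString(bytes)));
}

const big = new Uint8Array(300000).fill(0x61); // 300000 times "a"
console.log(uintToStringChunked(big).length);  // 300000
```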


Will · Accepted Answer · 2018-05-02 05:21:27Z

This should work:

// http://www.onicos.com/staff/iz/amuse/javascript/expert/utf.txt

/* utf.js - UTF-8 <=> UTF-16 convertion
 *
 * Copyright (C) 1999 Masanao Izumo <[email protected]>
 * Version: 1.0
 * LastModified: Dec 25 1999
 * This library is free.  You can redistribute it and/or modify it.
 */

function Utf8ArrayToStr(array) {
  var out, i, len, c;
  var char2, char3;

  out = "";
  len = array.length;
  i = 0;
  while (i < len) {
    c = array[i++];
    switch (c >> 4)
    { 
      case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
        // 0xxxxxxx
        out += String.fromCharCode(c);
        break;
      case 12: case 13:
        // 110x xxxx   10xx xxxx
        char2 = array[i++];
        out += String.fromCharCode(((c & 0x1F) << 6) | (char2 & 0x3F));
        break;
      case 14:
        // 1110 xxxx  10xx xxxx  10xx xxxx
        char2 = array[i++];
        char3 = array[i++];
        out += String.fromCharCode(((c & 0x0F) << 12) |
                                   ((char2 & 0x3F) << 6) |
                                   ((char3 & 0x3F) << 0));
        break;
    }
  }    
  return out;
}

It's somewhat cleaner than the other solutions because it uses no hacks and doesn't depend on browser-specific functions, so it also works in other JS environments.
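
One limitation worth noting: the switch above has no case for lead bytes 0xF0–0xF7, so 4-byte sequences (code points above U+FFFF, such as most emoji) are silently skipped. Below is a minimal sketch of a decoder that also covers them. It is not part of the original utf.js; the name decodeUtf8 is hypothetical, and like the original it does no real validation of malformed input.

```javascript
// Hypothetical helper: decode 1- to 4-byte UTF-8 sequences.
// Assumes well-formed input; bad lead bytes become U+FFFD.
function decodeUtf8(bytes) {
  let out = "";
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i++];
    let cp;
    if (b0 < 0x80) {
      cp = b0; // 0xxxxxxx
    } else if ((b0 & 0xe0) === 0xc0) {
      cp = ((b0 & 0x1f) << 6) | (bytes[i++] & 0x3f); // 110xxxxx 10xxxxxx
    } else if ((b0 & 0xf0) === 0xe0) {
      cp = ((b0 & 0x0f) << 12) | ((bytes[i++] & 0x3f) << 6) | (bytes[i++] & 0x3f);
    } else if ((b0 & 0xf8) === 0xf0) {
      // 11110xxx followed by three continuation bytes
      cp = ((b0 & 0x07) << 18) | ((bytes[i++] & 0x3f) << 12) |
           ((bytes[i++] & 0x3f) << 6) | (bytes[i++] & 0x3f);
    } else {
      cp = 0xfffd; // stray continuation byte or invalid lead byte
    }
    if (cp > 0x10ffff) cp = 0xfffd; // outside the Unicode range
    // fromCodePoint emits a surrogate pair for code points above U+FFFF.
    out += String.fromCodePoint(cp);
  }
  return out;
}

console.log(decodeUtf8([0xf0, 0x9f, 0x92, 0xa9]) === "\uD83D\uDCA9"); // true
```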

Check out the JSFiddle demo.

Also see the related questions: here, here

This is the least readable code I've ever seen to implement char-code to string conversion. I appreciate and admire the effort put into it, but there are hundreds of more maintainable ways to achieve that.

popham · Accepted Answer · 2014-05-13 22:05:44Z

25

There's a polyfill for Encoding over on Github: text-encoding. It's easy for Node or the browser, and the Readme advises the following:

var uint8array = new TextEncoder(encoding).encode(string);
var string = new TextDecoder(encoding).decode(uint8array);

If I recall, 'utf-8' is the encoding you need, and of course you'll need to wrap your buffer:

var uint8array = new Uint8Array(utf8buffer);

Hope it works as well for you as it has for me.

answered May 13, 2014 at 22:05 by popham

3 Comments

Evan Hu Over a year ago

For anyone lazy like me, npm install text-encoding, var textEncoding = require('text-encoding'); var TextDecoder = textEncoding.TextDecoder;. No thanks.

Benproductions1 Over a year ago

@KarthikHande That's what the polyfill is for. It's not supported by all browsers, so you also supply a pure JS implementation as an alternative.

wayofthefuture Over a year ago

Beware the library is HUGE

Esailija · Accepted Answer · 2013-06-19 13:39:45Z

13

If you are doing this in a browser that has no built-in character encoding library, you can get by with:

function pad(n) {
    return n.length < 2 ? "0" + n : n;
}

var array = new Uint8Array(data);
var str = "";
for( var i = 0, len = array.length; i < len; ++i ) {
    str += ( "%" + pad(array[i].toString(16)))
}

str = decodeURIComponent(str);

Here's a demo that decodes a 3-byte UTF-8 sequence: http://jsfiddle.net/Z9pQE/
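
One thing to be aware of with this approach: decodeURIComponent throws a URIError when the bytes are not valid UTF-8. The sketch below wraps the same technique in a function that reports that case by returning null; the function name and the null convention are assumptions made for this example.

```javascript
// Hypothetical helper: percent-encode every byte, then let
// decodeURIComponent interpret the result as UTF-8.
function percentDecodeUtf8(bytes) {
  let encoded = "";
  for (let i = 0; i < bytes.length; i++) {
    // Two hex digits per byte, e.g. 10 -> "%0a".
    encoded += "%" + bytes[i].toString(16).padStart(2, "0");
  }
  try {
    return decodeURIComponent(encoded);
  } catch (e) {
    if (e instanceof URIError) return null; // input was not valid UTF-8
    throw e;
  }
}

console.log(percentDecodeUtf8(new Uint8Array([0xe2, 0x82, 0xac]))); // "€"
console.log(percentDecodeUtf8(new Uint8Array([0xff])));             // null
```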

answered Jun 19, 2013 at 13:39 by Esailija

1 Comment

fiatjaf Over a year ago

You're the best person in the world.

Martin Wantke · Accepted Answer · 2017-10-28 13:12:19Z

The readAsArrayBuffer and readAsText methods of a FileReader object asynchronously convert a Blob object to an ArrayBuffer or to a DOMString.

A Blob can be created from raw text or a byte array, for example:

let blob = new Blob([text], { type: "text/plain" });

let reader = new FileReader();
reader.onload = event =>
{
    let buffer = event.target.result;
};
reader.readAsArrayBuffer(blob);

I think it's better to wrap this in a promise-like helper that takes a done callback:

function textToByteArray(text)
{
    let blob = new Blob([text], { type: "text/plain" });
    let reader = new FileReader();
    let done = function() { };

    reader.onload = event =>
    {
        done(new Uint8Array(event.target.result));
    };
    reader.readAsArrayBuffer(blob);

    return { done: function(callback) { done = callback; } }
}

function byteArrayToText(bytes, encoding)
{
    let blob = new Blob([bytes], { type: "application/octet-stream" });
    let reader = new FileReader();
    let done = function() { };

    reader.onload = event =>
    {
        done(event.target.result);
    };

    if(encoding) { reader.readAsText(blob, encoding); } else { reader.readAsText(blob); }

    return { done: function(callback) { done = callback; } }
}

let text = "\uD83D\uDCA9 = \u2661";
textToByteArray(text).done(bytes =>
{
    console.log(bytes);
    byteArrayToText(bytes, 'UTF-8').done(text => 
    {
        console.log(text); // 💩 = ♡
    });
});
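
If real Promises are preferred over the done callback, a shorter variant is possible with the Promise-returning Blob methods arrayBuffer() and text(). This is only a sketch under the assumption that those methods are available (current browsers, and Node 18 or later where Blob is a global). Note that text() always decodes as UTF-8, so unlike the FileReader version above there is no encoding parameter.

```javascript
// Sketch: same helper names as above, but returning real Promises.
function textToByteArray(text) {
  // A Blob built from a string stores the string as UTF-8.
  return new Blob([text]).arrayBuffer().then(buffer => new Uint8Array(buffer));
}

function byteArrayToText(bytes) {
  // Blob.prototype.text() decodes the Blob's bytes as UTF-8.
  return new Blob([bytes]).text();
}

textToByteArray("\uD83D\uDCA9 = \u2661")
  .then(bytes => {
    console.log(bytes.length); // 10 (4 + 1 + 1 + 1 + 3 bytes)
    return byteArrayToText(bytes);
  })
  .then(text => console.log(text)); // 💩 = ♡
```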

Rosberg Linhares · Accepted Answer · 2019-12-05 04:24:47Z

If you don't want to use any external polyfill library, you can use this function provided by the Mozilla Developer Network website:

function utf8ArrayToString(aBytes) {
    var sView = "";
    
    for (var nPart, nLen = aBytes.length, nIdx = 0; nIdx < nLen; nIdx++) {
        nPart = aBytes[nIdx];
        
        sView += String.fromCharCode(
            nPart > 251 && nPart < 254 && nIdx + 5 < nLen ? /* six bytes */
                /* (nPart - 252 << 30) may be not so safe in ECMAScript! So...: */
                (nPart - 252) * 1073741824 + (aBytes[++nIdx] - 128 << 24) + (aBytes[++nIdx] - 128 << 18) + (aBytes[++nIdx] - 128 << 12) + (aBytes[++nIdx] - 128 << 6) + aBytes[++nIdx] - 128
            : nPart > 247 && nPart < 252 && nIdx + 4 < nLen ? /* five bytes */
                (nPart - 248 << 24) + (aBytes[++nIdx] - 128 << 18) + (aBytes[++nIdx] - 128 << 12) + (aBytes[++nIdx] - 128 << 6) + aBytes[++nIdx] - 128
            : nPart > 239 && nPart < 248 && nIdx + 3 < nLen ? /* four bytes */
                (nPart - 240 << 18) + (aBytes[++nIdx] - 128 << 12) + (aBytes[++nIdx] - 128 << 6) + aBytes[++nIdx] - 128
            : nPart > 223 && nPart < 240 && nIdx + 2 < nLen ? /* three bytes */
                (nPart - 224 << 12) + (aBytes[++nIdx] - 128 << 6) + aBytes[++nIdx] - 128
            : nPart > 191 && nPart < 224 && nIdx + 1 < nLen ? /* two bytes */
                (nPart - 192 << 6) + aBytes[++nIdx] - 128
            : /* nPart < 127 ? */ /* one byte */
                nPart
        );
    }
    
    return sView;
}

let str = utf8ArrayToString([50,72,226,130,130,32,43,32,79,226,130,130,32,226,135,140,32,50,72,226,130,130,79]);

// Must show 2H₂ + O₂ ⇌ 2H₂O
console.log(str);

see up-to-date answer: stackoverflow.com/questions/6965107/…

konak · Accepted Answer · 2018-01-20 14:24:54Z

The main difficulty when converting a byte array into a string is that UTF-8 is a variable-length encoding: a single Unicode character may occupy one to four bytes. This code will help you:

var getString = function (strBytes) {

    var MAX_SIZE = 0x4000;
    var codeUnits = [];
    var highSurrogate;
    var lowSurrogate;
    var index = -1;

    var result = '';

    while (++index < strBytes.length) {
        var codePoint = Number(strBytes[index]);

        if (codePoint === (codePoint & 0x7F)) {

        } else if (0xF0 === (codePoint & 0xF0)) {
            codePoint ^= 0xF0;
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
        } else if (0xE0 === (codePoint & 0xE0)) {
            codePoint ^= 0xE0;
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
        } else if (0xC0 === (codePoint & 0xC0)) {
            codePoint ^= 0xC0;
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
        }

        if (!isFinite(codePoint) || codePoint < 0 || codePoint > 0x10FFFF || Math.floor(codePoint) != codePoint)
            throw RangeError('Invalid code point: ' + codePoint);

        if (codePoint <= 0xFFFF)
            codeUnits.push(codePoint);
        else {
            codePoint -= 0x10000;
            highSurrogate = (codePoint >> 10) | 0xD800;
            lowSurrogate = (codePoint % 0x400) | 0xDC00;
            codeUnits.push(highSurrogate, lowSurrogate);
        }
        if (index + 1 == strBytes.length || codeUnits.length > MAX_SIZE) {
            result += String.fromCharCode.apply(null, codeUnits);
            codeUnits.length = 0;
        }
    }

    return result;
}

All the best !
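
The surrogate-pair step near the end of getString is the least obvious part, so here it is in isolation. The function name toSurrogatePair is hypothetical; the arithmetic is the same as in the code above (code point minus 0x10000, then split into two 10-bit halves).

```javascript
// Hypothetical helper: split a code point above U+FFFF into the
// two UTF-16 code units that JavaScript strings use for it.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xd800 | (offset >> 10);  // top 10 bits
  const low = 0xdc00 | (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f4a9);
console.log(high.toString(16), low.toString(16)); // d83d dca9
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1f4a9)); // true
```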

That's not complete. For example, German umlauts are missing!
By the way, I have noticed that the if statements were in the wrong order. Maybe that is why your string was not processed. I corrected it in my own code but forgot to correct it in this post.
ö = RangeError: Invalid code point: 1581184, ü = RangeError: Invalid code point: 3678336
I have changed the code above; please try it again. There was a problem with the ordering of the "else if" statements. Now it should work for your case too. The code was tested with more than 30 languages, including Japanese, Korean, and Arabic.
For example, here are words I transferred as bytes and restored to a string in JavaScript: Hälfte, Über,
