I have a string which is base64 and I need to convert it into utf-8.

base64_string = "VABpAG0AZQAgAHMAZQByAGUAaQBzAA=="

I am trying to convert base64_string into utf-8 in the following env:

In browser

method : atob(base64_string)

Result = "Time series"

which is correct. We can verify the same in https://www.base64decode.org

In Node.js, I am converting with the npm package "atob"

method : atob(base64_string)

Result = "T i m e  s e r i e s".

For some reason, I am getting a space between each character and I don't know why. I have tried trim(), but that does not work either.

1 Answer

TL;DR

Your string is actually UTF-16, not UTF-8. Here's how to decode it properly.

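// Decode a base64 string whose payload is UTF-16LE text.
function atob(b64txt) {
  const buff = Buffer.from(b64txt, 'base64'); // base64 -> raw bytes
  const txt = buff.toString('utf16le');       // raw bytes -> text
  return txt;
}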

Explanation: Your base64-encoded string isn't actually UTF-8 or ASCII data. It's UTF-16 (little-endian), in which every character takes at least two bytes.

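UTF-8 is different: it is a variable-length encoding. Any byte below 128 (0x80) is a complete single-byte (ASCII) character. A lead byte in the range 0xC2-0xF4 starts a multi-byte sequence, and its high bits say whether the character takes two, three, or four bytes; the remaining bytes of the sequence are continuation bytes in the range 0x80-0xBF.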
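A quick way to see UTF-8's variable-length behavior in Node.js is to encode a few characters and count the bytes. This is only an illustrative sketch; the sample characters are my own choice and do not come from the question.

```javascript
// Byte length of a few sample characters when encoded as UTF-8.
// U+0054 'T', U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (emoji).
const samples = ['T', '\u00e9', '\u20ac', '\u{1F600}'];
const utf8Lengths = samples.map((ch) => Buffer.from(ch, 'utf8').length);
console.log(utf8Lengths);
// >> [ 1, 2, 3, 4 ]

// The euro sign starts with lead byte 0xE2, which announces a 3-byte sequence.
const euroHex = Buffer.from('\u20ac', 'utf8').toString('hex');
console.log(euroHex);
// >> e282ac
```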

So let's decode your string to character codes and see what it looks like:

const b64txt = 'VABpAG0AZQAgAHMAZQByAGUAaQBzAA==';
const buff = Buffer.from(b64txt, 'base64');
console.log(JSON.stringify(buff));
// >> {"type":"Buffer","data":[84,0,105,0,109,0,101,0,32,0,115,0,101,0,114,0,101,0,105,0,115,0]}

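The first byte (84) is the ASCII code for T. It is less than 128, so UTF-8 would treat it as a complete single-byte character, yet it is still followed by a 0 byte. Decoded one byte per character, those 0 bytes turn into NUL characters between the letters, which is what shows up as "spaces" in your output. So... not UTF-8.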

That's the clue that this string has two bytes per character, making it UTF-16. And the fact that the 0 follows the character code is the clue that it's "little-endian": the low-order byte (bits 0-7) of each 16-bit code unit comes first, and the high-order byte (bits 8-15) comes second.
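The byte order can be checked directly. The sketch below encodes a single 'T' as UTF-16LE and then swaps each byte pair to show what the big-endian layout would look like. Node's Buffer has no 'utf16be' encoding, so the swap via swap16() is just one way to illustrate it.

```javascript
// 'T' is U+0054: low-order byte 0x54 (84), high-order byte 0x00.
const le = Buffer.from('T', 'utf16le');
console.log([...le]);
// >> [ 84, 0 ]

// Copy the buffer, then swap every pair of bytes to get big-endian order.
const be = Buffer.from(le).swap16();
console.log([...be]);
// >> [ 0, 84 ]
```

Your data has the [84, 0] pattern, which is why 'utf16le' is the right choice here.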

If you want to change this buffer into text, you need to interpret it as the correct type of string:

const txt = buff.toString('utf16le'); // <- UTF-16, little-endian
console.log(txt);
// >> Time sereis
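As a side note on the original symptom, here is a small sketch of what happens when the same bytes are decoded one byte per character. I am assuming this is roughly what the "atob" npm package does internally; 'latin1' is used here as a stand-in for that behavior.

```javascript
const buff = Buffer.from('VABpAG0AZQAgAHMAZQByAGUAaQBzAA==', 'base64');

// 'latin1' maps each byte to one character, so the 0 bytes become NUL characters.
const bytewise = buff.toString('latin1');
console.log(JSON.stringify(bytewise.slice(0, 4)));
// >> "T\u0000i\u0000"

// trim() only strips whitespace at the ends, and NUL is not whitespace anyway.
console.log(bytewise.trim().length === bytewise.length);
// >> true
```

The "spaces" are really NUL characters, which is why trimming has no effect.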

So in Node.js, combining the Buffer.from(..., 'base64') and toString('utf16le') calls gives you a complete solution for decoding your string, as in the TL;DR above.

Of course if your encoding type changes, you'd have to change this as well, and do toString('utf8') or whatever the appropriate encoding is.
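One way to keep the encoding adjustable is to pass it in as a parameter. This is a minimal sketch: decodeBase64 is a hypothetical helper name of my own (not part of Node or the "atob" package), and the second sample string is made up for illustration.

```javascript
// Hypothetical helper: decode base64, then interpret the bytes
// with the given encoding (defaults to UTF-8).
function decodeBase64(b64txt, encoding = 'utf8') {
  return Buffer.from(b64txt, 'base64').toString(encoding);
}

// The string from the question is UTF-16LE.
const asUtf16 = decodeBase64('VABpAG0AZQAgAHMAZQByAGUAaQBzAA==', 'utf16le');
console.log(asUtf16);
// >> Time sereis

// A made-up UTF-8 payload for comparison.
const asUtf8 = decodeBase64('VGltZQ==');
console.log(asUtf8);
// >> Time
```

The caller still has to know which encoding the data was produced with; the helper does not try to detect it.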

(credit: I referenced this and this as I was drafting this answer.)
