Base-64 Encoding and Decoding in JavaScript

The atob() and btoa() functions do not operate on UTF-8 text: they treat each character of a string as a single byte. Base64-encoding text that contains characters outside the Latin-1 range therefore requires an explicit conversion step.

The Problem

btoa() and atob() cannot encode or decode arbitrary Unicode text on their own. Handling such text correctly requires either helper functions or modern APIs such as TextEncoder and TextDecoder, both described below.

Here's where the problem arises:

Character Encoding Mismatch: atob() and btoa() operate on binary strings, in which each character stands for one byte and must have a code point in the range 0–255 (the Latin-1, or ISO-8859-1, range). UTF-8 encodes every character above U+007F as two to four bytes. If you pass btoa() a string containing any character above U+00FF, it throws an InvalidCharacterError. Characters between U+0080 and U+00FF, such as "é", do not throw, but btoa() encodes each of them as a single Latin-1 byte rather than as its two-byte UTF-8 sequence, so the output is not what a UTF-8-aware decoder expects.
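
Both failure modes can be seen in a short sketch. It assumes an environment where btoa() is available as a global (browsers and recent Node.js versions); the variable names are illustrative only.

```javascript
// "é" is U+00E9: inside the Latin-1 range, so btoa() accepts it,
// but it encodes the single byte 0xE9, not the UTF-8 bytes 0xC3 0xA9.
const latin1Result = btoa("é");
console.log(latin1Result); // "6Q=="

// "€" is U+20AC: outside the Latin-1 range, so btoa() throws.
let thrownName = null;
try {
    btoa("€");
} catch (err) {
    thrownName = err.name;
}
console.log(thrownName); // "InvalidCharacterError"
```

The first case is the more dangerous one in practice, because nothing signals that the result is not UTF-8.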

JavaScript's Internal Representation: JavaScript strings are sequences of UTF-16 code units. Characters outside the Basic Multilingual Plane (BMP), such as most emoji, are stored as surrogate pairs, and each surrogate is a code unit above 0xFF, so btoa() rejects them as well.
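
As a rough illustration of the surrogate-pair point, the following inspects the code units of a single emoji. The variable name is an assumption made for this example.

```javascript
const globe = "🌍"; // U+1F30D, outside the BMP

console.log(globe.length); // 2 (two UTF-16 code units, one character)
console.log(globe.charCodeAt(0).toString(16)); // "d83c" (high surrogate)
console.log(globe.charCodeAt(1).toString(16)); // "df0d" (low surrogate)
console.log(globe.codePointAt(0).toString(16)); // "1f30d"
```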

Base64 and UTF-8: Base64 encoding expects binary data as input. To encode a UTF-8 string to Base64, you must first convert it to its binary representation (as a byte array) and then encode it. The reverse is true for decoding: you must decode the Base64 into a binary byte array and then interpret it as a UTF-8 string.
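
To make "binary representation" concrete, this minimal sketch converts a short string to its UTF-8 bytes and back, without any Base64 step. It assumes TextEncoder and TextDecoder are available as globals; the variable names are illustrative.

```javascript
// TextEncoder always produces UTF-8 bytes.
const utf8Bytes = new TextEncoder().encode("é🌍");
const hexBytes = Array.from(utf8Bytes, b => b.toString(16)).join(" ");
console.log(hexBytes); // "c3 a9 f0 9f 8c 8d"

// TextDecoder (UTF-8 by default) turns the bytes back into a string.
const roundTripped = new TextDecoder().decode(utf8Bytes);
console.log(roundTripped === "é🌍"); // true
```

These six bytes, not the three UTF-16 code units of the original string, are what a Base64 encoder should receive.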

Practical Solutions

To safely handle Base64 encoding and decoding of UTF-8 text, you need intermediate conversion steps. Here's how to do it:

Encoding UTF-8 to Base64

function utf8ToBase64(str) {
    return btoa(unescape(encodeURIComponent(str)));
}

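encodeURIComponent(str) converts the string to UTF-8 and percent-encodes every byte outside the unreserved ASCII set (for example, "é" becomes "%C3%A9").
unescape() converts each percent-encoded byte into a single character, producing a Latin-1 string suitable for btoa().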
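
The intermediate values for a single character can be traced as follows. This is only an illustration of the two steps above; the variable names are assumptions.

```javascript
const percentEncoded = encodeURIComponent("é");
console.log(percentEncoded); // "%C3%A9"

// unescape() maps each %XX escape to one character with that code.
const latin1String = unescape(percentEncoded);
console.log(latin1String.length); // 2
console.log(latin1String.charCodeAt(0).toString(16), latin1String.charCodeAt(1).toString(16)); // "c3 a9"

// Every character is now in the range 0-255, so btoa() accepts it.
const encodedE = btoa(latin1String);
console.log(encodedE); // "w6k="
```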

Decoding Base64 to UTF-8

function base64ToUtf8(base64) {
    return decodeURIComponent(escape(atob(base64)));
}

atob(base64) decodes the Base64 into a Latin-1 string.
escape() converts the Latin-1 string to a percent-encoded string.
decodeURIComponent() interprets the percent-encoded string as UTF-8.
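
The same trace in the decoding direction, starting from "w6k=" (the Base64 form of the UTF-8 bytes 0xC3 0xA9). Variable names are again illustrative.

```javascript
const decodedLatin1 = atob("w6k=");
console.log(decodedLatin1.length); // 2 (code units 0xC3 and 0xA9)

// escape() turns each non-ASCII character back into a %XX escape.
const reEncoded = escape(decodedLatin1);
console.log(reEncoded); // "%C3%A9"

// decodeURIComponent() reads the escapes as one UTF-8 sequence.
const decodedE = decodeURIComponent(reEncoded);
console.log(decodedE); // "é"
```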

Example Usage

const utf8String = "Hello, 🌍!"; // String containing an emoji (a non-BMP character)
const base64 = utf8ToBase64(utf8String);
console.log(base64); // Encoded Base64 string

const decodedString = base64ToUtf8(base64);
console.log(decodedString); // "Hello, 🌍!"

Alternative with Modern APIs

Using modern browser APIs like TextEncoder and TextDecoder, you can work with UTF-8 and Base64 more directly:

Encoding UTF-8 to Base64

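function utf8ToBase64Modern(str) {
    const data = new TextEncoder().encode(str); // UTF-8 bytes
    let binaryString = "";
    // Append one byte at a time: spreading a large array into
    // String.fromCharCode(...data) can exceed the engine's argument limit.
    for (const byte of data) {
        binaryString += String.fromCharCode(byte);
    }
    return btoa(binaryString);
}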

Decoding Base64 to UTF-8

function base64ToUtf8Modern(base64) {
    const binaryString = atob(base64);
    const binaryData = Uint8Array.from(binaryString, char => char.charCodeAt(0));
    const decoder = new TextDecoder();
    return decoder.decode(binaryData);
}
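
By default, TextDecoder replaces malformed byte sequences with U+FFFD instead of reporting them. Where invalid input should be rejected, the fatal option can be used. The sketch below is one possible variant, not part of the functions above; the name base64ToUtf8Strict is hypothetical.

```javascript
// Hypothetical strict variant: throws a TypeError instead of silently
// substituting U+FFFD when the decoded bytes are not valid UTF-8.
function base64ToUtf8Strict(base64) {
    const binaryString = atob(base64);
    const bytes = Uint8Array.from(binaryString, char => char.charCodeAt(0));
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

console.log(base64ToUtf8Strict("SGVsbG8sIPCfjI0h")); // "Hello, 🌍!"

// "6Q==" is the lone byte 0xE9, which is not valid UTF-8 on its own.
let strictError = null;
try {
    base64ToUtf8Strict("6Q==");
} catch (err) {
    strictError = err;
}
console.log(strictError instanceof TypeError); // true
```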

Why Use Modern APIs?

Standards compliance: Avoids escape() and unescape(), which are deprecated and retained in the language only for web compatibility.
Clarity: Converts between text and UTF-8 bytes explicitly, so each step of the pipeline is visible in the code.