
UTF-16 - the encoding behind JavaScript strings

Last updated: 2026-05-23 · ~4 min read

If your JavaScript code says "🌸".length is 2, you've already met UTF-16. The encoding is invisible until emoji enter the picture.

UTF-16 is a Unicode encoding that represents most characters in 2 bytes and supplementary characters (including most emoji) in 4 bytes via surrogate pairs. It's the internal string representation in JavaScript, Java, C#, and parts of Windows. Understanding its quirks is essential for handling emoji correctly.

Definition

UTF-16 (16-bit Unicode Transformation Format) encodes Unicode codepoints into sequences of 16-bit code units. Codepoints in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) take one 16-bit unit (2 bytes). Codepoints above U+FFFF are encoded as surrogate pairs - two 16-bit units together (4 bytes total).

The two-tier structure

| Codepoint range | Code units | Bytes | Examples |
| --- | --- | --- | --- |
| U+0000 - U+FFFF (BMP) | 1 | 2 | ASCII, Latin, CJK characters |
| U+10000 - U+10FFFF | 2 (surrogate pair) | 4 | Most emoji, ancient scripts |

Surrogate pairs

For codepoints above U+FFFF, UTF-16 splits the value into two 16-bit code units:

  • High surrogate: U+D800 to U+DBFF (1024 possible values)
  • Low surrogate: U+DC00 to U+DFFF (1024 possible values)

The combination yields 1024 × 1024 = 1,048,576 possible codepoints, exactly covering the 16 supplementary planes (U+10000 to U+10FFFF). Surrogate values themselves are reserved: no characters are assigned to them, and they never appear as standalone characters in valid Unicode text.
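
The split described above can be sketched in a few lines. This is a minimal illustration rather than a library routine: `toSurrogatePair` is a hypothetical helper name, and the arithmetic (subtract 0x10000, then divide the remaining 20 bits into two 10-bit halves) follows the standard UTF-16 scheme.

```javascript
// Split a supplementary codepoint (U+10000..U+10FFFF) into a surrogate pair.
// toSurrogatePair is a hypothetical helper, not a built-in.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a codepoint in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits select the high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits select the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F338);
console.log(high.toString(16), low.toString(16));     // d83c df38
console.log(String.fromCharCode(high, low) === "🌸"); // true
```

Each half carries 10 bits, which is where the 1024 × 1024 figure comes from.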

Example: 🌸 (U+1F338)

The cherry blossom emoji at codepoint U+1F338 is encoded as:

  • High surrogate: 0xD83C
  • Low surrogate: 0xDF38
  • UTF-16 bytes: D8 3C DF 38 in big-endian order (UTF-16BE), or 3C D8 38 DF in little-endian order (UTF-16LE); 4 bytes either way

In a JavaScript string, this is stored as two code units. "🌸".length returns 2 because length counts code units, not codepoints.
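
To see how those two code units turn into bytes, the sketch below writes them into a buffer in either byte order. `utf16Bytes` is a hypothetical helper written for this illustration; it relies only on the standard `DataView` API, and its boolean parameter is an assumption of this example.

```javascript
// Serialize a string's UTF-16 code units to hex bytes in a chosen byte order.
// utf16Bytes is a hypothetical helper for illustration.
function utf16Bytes(str, littleEndian) {
  const view = new DataView(new ArrayBuffer(str.length * 2));
  for (let i = 0; i < str.length; i++) {
    // charCodeAt returns one 16-bit code unit per index
    view.setUint16(i * 2, str.charCodeAt(i), littleEndian);
  }
  return Array.from(new Uint8Array(view.buffer), (b) =>
    b.toString(16).padStart(2, "0")
  ).join(" ");
}

console.log(utf16Bytes("🌸", false)); // d8 3c df 38 (big-endian)
console.log(utf16Bytes("🌸", true));  // 3c d8 38 df (little-endian)
```

The code units are the same in both cases; only the order of the two bytes inside each unit changes.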

Where UTF-16 is used

  • JavaScript: String internally; length, charAt, charCodeAt operate on code units
  • Java: String and char are UTF-16
  • C# / .NET: string is UTF-16
  • Windows API: many APIs use UTF-16 for strings
  • Qt, Cocoa, ICU: UTF-16 internally for many string operations

UTF-8, by contrast, is the dominant on-the-wire format (HTTP, JSON, files). Many systems use UTF-8 for storage and transport, UTF-16 for in-memory processing.

The JavaScript gotchas

length vs. visible characters

"a".length              // 1
"あ".length             // 1 (BMP)
"🌸".length             // 2 (surrogate pair)
"👨‍👩‍👧".length            // 8 (3 emoji × 2 code units + 2 ZWJ × 1 code unit)
[..."🌸"].length        // 1 (codepoint iteration)
[..."👨‍👩‍👧"].length       // 5 (codepoints, not graphemes)
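
Neither code units nor codepoints match what a reader sees. One way to count visible characters is `Intl.Segmenter` with grapheme granularity, sketched below. `countGraphemes` is a hypothetical helper, the locale `"en"` is an arbitrary choice, and the family emoji is written with escapes so the zero width joiners stay visible in the source.

```javascript
// Count user-perceived characters (grapheme clusters).
// countGraphemes is a hypothetical helper; "en" is an arbitrary locale choice.
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}

// man + ZWJ + woman + ZWJ + girl
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

console.log(family.length);          // 8 code units
console.log([...family].length);     // 5 codepoints
console.log(countGraphemes(family)); // 1 grapheme cluster
```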

String reversal breaks emoji

"🌸".split("").reverse().join("")
// Broken: surrogates separated, displays as ?? or invalid characters

[..."🌸"].reverse().join("")
// Correct: codepoint-aware reversal

// Even better: use Intl.Segmenter for grapheme clusters
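
A grapheme-aware reversal along those lines might look like the sketch below. `reverseGraphemes` is a hypothetical name and `"en"` an arbitrary locale; the point is that ZWJ sequences stay intact, which codepoint-level reversal does not guarantee.

```javascript
// Reverse a string by grapheme cluster rather than by code unit or codepoint.
// reverseGraphemes is a hypothetical helper; "en" is an arbitrary locale choice.
function reverseGraphemes(str) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return Array.from(segmenter.segment(str), (s) => s.segment)
    .reverse()
    .join("");
}

// man + ZWJ + woman + ZWJ + girl
const familyEmoji = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
const input = "a" + familyEmoji + "b";

console.log(reverseGraphemes(input) === "b" + familyEmoji + "a");     // true
console.log([...input].reverse().join("") === "b" + familyEmoji + "a"); // false: the sequence is reordered
```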

charCodeAt returns surrogate code units

"🌸".charCodeAt(0) returns 55356 (0xD83C, the high surrogate), not the codepoint 127800. Use codePointAt(0) to get the actual codepoint, which combines the surrogates.
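
The combination step can be written out by hand to show what codePointAt does with a surrogate pair. `fromSurrogatePair` is a hypothetical helper for illustration only; in real code, codePointAt is the simpler choice.

```javascript
// Combine a high and low surrogate into the codepoint they encode.
// fromSurrogatePair is a hypothetical helper, not a built-in.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

const blossom = "🌸";
const combined = fromSurrogatePair(blossom.charCodeAt(0), blossom.charCodeAt(1));

console.log(combined);                           // 127800 (0x1F338)
console.log(combined === blossom.codePointAt(0)); // true
console.log(blossom.codePointAt(1));             // 57144 (0xDF38, the low surrogate on its own)
```

Note that codePointAt takes a code unit index, so calling it at index 1 lands in the middle of the pair and returns the low surrogate.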

UTF-8 vs. UTF-16 trade-offs

| Aspect | UTF-8 | UTF-16 |
| --- | --- | --- |
| ASCII size | 1 byte | 2 bytes |
| CJK size | 3 bytes | 2 bytes |
| Emoji size | 4 bytes | 4 bytes |
| Self-synchronizing | Yes | Yes (via surrogate ranges) |
| Endianness | None | BE/LE variants exist |
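
The size rows can be checked from JavaScript. The sketch below assumes a runtime that provides the standard `TextEncoder` (browsers and current Node.js do); `encodedSizes` is a hypothetical helper, and the UTF-16 figure is simply the code unit count times two.

```javascript
// Compare the encoded size of a string in UTF-8 and UTF-16.
// encodedSizes is a hypothetical helper for illustration.
function encodedSizes(str) {
  return {
    utf8: new TextEncoder().encode(str).length, // TextEncoder always produces UTF-8
    utf16: str.length * 2,                      // 2 bytes per 16-bit code unit
  };
}

console.log(encodedSizes("a"));  // { utf8: 1, utf16: 2 }
console.log(encodedSizes("あ")); // { utf8: 3, utf16: 2 }
console.log(encodedSizes("🌸")); // { utf8: 4, utf16: 4 }
```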

Common misconceptions

  • ❌ "UTF-16 always uses 2 bytes per character" → ✅ Supplementary characters use 4 bytes via surrogate pairs
  • ❌ "str.length returns the character count" → ✅ It returns the code unit count in UTF-16
  • ❌ "UTF-16 is becoming obsolete" → ✅ It's deeply embedded in JavaScript, Java, .NET, and Windows APIs and isn't going anywhere soon

Related terms

  • Codepoint - what UTF-16 encodes
  • Grapheme cluster - the user-perceived unit, distinct from code units
  • ZWJ (zero width joiner) - joins emoji into sequences that span multiple surrogate pairs
