UTF-16 - the encoding behind JavaScript strings
If your JavaScript code says "🌸".length is 2, you've already met UTF-16. The encoding is invisible until emoji enter the picture.
UTF-16 is a Unicode encoding that represents most characters in 2 bytes and supplementary characters (including most emoji) in 4 bytes via surrogate pairs. It's the internal string representation in JavaScript, Java, C#, and parts of Windows. Understanding its quirks is essential for handling emoji correctly.
Definition
UTF-16 (16-bit Unicode Transformation Format) encodes Unicode codepoints into sequences of 16-bit code units. Codepoints in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) take one 16-bit unit (2 bytes). Codepoints above U+FFFF are encoded as surrogate pairs - two 16-bit units together (4 bytes total).
The two-tier structure
| Codepoint range | Code units | Bytes | Examples |
|---|---|---|---|
| U+0000 - U+FFFF (BMP) | 1 | 2 | ASCII, Latin, CJK characters |
| U+10000 - U+10FFFF | 2 (surrogate pair) | 4 | Most emoji, ancient scripts |
Surrogate pairs
For codepoints above U+FFFF, UTF-16 splits the value into two 16-bit code units:
- High surrogate: U+D800 to U+DBFF (1024 possible values)
- Low surrogate: U+DC00 to U+DFFF (1024 possible values)
The combination yields 1024 × 1024 = 1,048,576 possible codepoints, exactly covering the 16 supplementary planes (U+10000 to U+10FFFF). Surrogate values themselves are reserved and never appear as standalone characters in valid Unicode text.
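The arithmetic behind the split can be sketched as follows. This is a minimal illustration, not a built-in API: toSurrogatePair and fromSurrogatePair are hypothetical helper names. It subtracts 0x10000 to get a 20-bit offset, then puts the top 10 bits in the high surrogate and the bottom 10 bits in the low surrogate.

```javascript
// Hypothetical helpers illustrating the surrogate-pair arithmetic.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("expected a codepoint in U+10000 to U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // 20-bit value: 0x00000 to 0xFFFFF
  const high = 0xd800 + (offset >> 10);   // top 10 bits select the high surrogate
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits select the low surrogate
  return [high, low];
}

function fromSurrogatePair(high, low) {
  // Reverse the split: recombine the two 10-bit halves and add the offset back.
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

// The first and last supplementary codepoints map to the extremes of both ranges.
for (const cp of [0x10000, 0x10ffff]) {
  const [high, low] = toSurrogatePair(cp);
  console.log(cp.toString(16), "->", high.toString(16), low.toString(16),
    "->", fromSurrogatePair(high, low).toString(16));
}
// 10000 -> d800 dc00 -> 10000
// 10ffff -> dbff dfff -> 10ffff
```

In everyday code, String.fromCodePoint and codePointAt do this work for you; the sketch only makes the bit layout visible.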
Example: 🌸 (U+1F338)
The cherry blossom emoji at codepoint U+1F338 is encoded as:
- High surrogate: 0xD83C
- Low surrogate: 0xDF38
- UTF-16 bytes (big-endian order): D8 3C DF 38 (4 bytes total)
In a JavaScript string, this is stored as two code units. "🌸".length returns 2 because length counts code units, not codepoints.
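To see those bytes from code, one option is to write each code unit into a buffer with an explicit byte order. The helper name utf16Bytes below is an assumption made for illustration; DataView and charCodeAt are standard.

```javascript
// Hypothetical helper: serialize a string's UTF-16 code units to hex bytes.
function utf16Bytes(str, littleEndian = false) {
  const buffer = new ArrayBuffer(str.length * 2); // 2 bytes per code unit
  const view = new DataView(buffer);
  for (let i = 0; i < str.length; i++) {
    // charCodeAt returns the code unit at index i, surrogates included.
    view.setUint16(i * 2, str.charCodeAt(i), littleEndian);
  }
  return Array.from(new Uint8Array(buffer), (b) =>
    b.toString(16).padStart(2, "0").toUpperCase()
  ).join(" ");
}

console.log(utf16Bytes("\u{1F338}"));       // D8 3C DF 38 (big-endian)
console.log(utf16Bytes("\u{1F338}", true)); // 3C D8 38 DF (little-endian)
```

The code units are the same in both cases; only the order of the two bytes inside each unit changes.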
Where UTF-16 is used
- JavaScript: String internally; length, charAt, charCodeAt operate on code units
- Java: String and char are UTF-16
- C# / .NET: string is UTF-16
- Windows API: many APIs use UTF-16 for strings
- Qt, Cocoa, ICU: UTF-16 internally for many string operations
UTF-8, by contrast, is the dominant on-the-wire format (HTTP, JSON, files). Many systems use UTF-8 for storage and transport, UTF-16 for in-memory processing.
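In JavaScript, the hand-off between the two usually happens at the I/O boundary. A minimal sketch, assuming an environment that provides the standard TextEncoder and TextDecoder globals (modern browsers and Node.js do):

```javascript
const encoder = new TextEncoder();        // always encodes to UTF-8
const decoder = new TextDecoder("utf-8");

const text = "a\u{1F338}";                // "a" followed by the cherry blossom emoji
const wireBytes = encoder.encode(text);   // Uint8Array of UTF-8 bytes for transport
const roundTripped = decoder.decode(wireBytes); // back to an in-memory string

console.log(text.length);           // 3 (UTF-16 code units in memory)
console.log(wireBytes.length);      // 5 (UTF-8 bytes on the wire)
console.log(roundTripped === text); // true
```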
The JavaScript gotchas
length vs. visible characters
"a".length // 1
"あ".length // 1 (BMP)
"🌸".length // 2 (surrogate pair)
"👨\u200D👩\u200D👧".length // 8 (3 surrogate pairs + 2 ZWJ code units)
[..."🌸"].length // 1 (codepoint iteration)
[..."👨\u200D👩\u200D👧"].length // 5 (codepoints, not graphemes)
String reversal breaks emoji
"🌸".split("").reverse().join("")
// Broken: the two surrogates end up in the wrong order, which typically renders as replacement characters
[..."🌸"].reverse().join("")
// Correct: codepoint-aware reversal
// Even better: use Intl.Segmenter for grapheme clusters
charCodeAt returns surrogate code units
"🌸".charCodeAt(0) returns 55356 (0xD83C, the high surrogate), not the codepoint 127800. Use codePointAt(0) to get the actual codepoint, which combines the surrogates.
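The three levels discussed in this section (code units, codepoints, grapheme clusters) can be compared side by side. The sketch below assumes a runtime that ships Intl.Segmenter with current Unicode data; countGraphemes and reverseGraphemes are hypothetical helper names.

```javascript
// Three emoji joined by two zero-width joiners (U+200D), written with escapes
// so the invisible characters are explicit.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

function graphemes(str) {
  // Locale is left undefined so the runtime default is used.
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(str)].map((part) => part.segment);
}

function countGraphemes(str) {
  return graphemes(str).length;
}

function reverseGraphemes(str) {
  return graphemes(str).reverse().join("");
}

console.log(family.length);            // 8 code units
console.log([...family].length);       // 5 codepoints
console.log(countGraphemes(family));   // 1 grapheme cluster
console.log("\u{1F338}".charCodeAt(0));  // 55356 (high surrogate only)
console.log("\u{1F338}".codePointAt(0)); // 127800 (full codepoint)
console.log(reverseGraphemes("a\u{1F338}b") === "b\u{1F338}a"); // true
```

Reversing by grapheme keeps ZWJ sequences intact, which codepoint-level reversal with the spread operator does not.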
UTF-8 vs. UTF-16 trade-offs
| Aspect | UTF-8 | UTF-16 |
|---|---|---|
| ASCII size | 1 byte | 2 bytes |
| CJK size | 3 bytes | 2 bytes |
| Emoji size | 4 bytes | 4 bytes |
| Self-synchronizing | Yes | Yes (via surrogate ranges) |
| Endianness | None | BE/LE variants exist |
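The size rows in the table can be checked with a few lines of code. This is a rough sketch: byteSizes is a hypothetical helper, and the UTF-16 figure is simply the code unit count times two, ignoring any byte order mark.

```javascript
const utf8Encoder = new TextEncoder(); // standard global; encodes to UTF-8

// Hypothetical helper: encoded size of a string in each encoding.
function byteSizes(str) {
  return {
    utf8: utf8Encoder.encode(str).length, // actual UTF-8 byte count
    utf16: str.length * 2,                // 2 bytes per UTF-16 code unit
  };
}

for (const sample of ["a", "あ", "\u{1F338}"]) {
  const sizes = byteSizes(sample);
  console.log(`${sample}: UTF-8 ${sizes.utf8} bytes, UTF-16 ${sizes.utf16} bytes`);
}
// a: UTF-8 1 bytes, UTF-16 2 bytes
// あ: UTF-8 3 bytes, UTF-16 2 bytes
// 🌸: UTF-8 4 bytes, UTF-16 4 bytes
```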
Common misconceptions
- ❌ "UTF-16 always uses 2 bytes per character" → ✅ Supplementary characters use 4 bytes via surrogate pairs
- ❌ "str.length returns the character count" → ✅ It returns the code unit count in UTF-16
- ❌ "UTF-16 is becoming obsolete" → ✅ It's deeply embedded in JavaScript, Java, .NET, and Windows APIs and isn't going anywhere soon
Related terms
- Codepoint - what UTF-16 encodes
- Grapheme cluster - the user-perceived unit, distinct from code units
- ZWJ - sequences that span multiple surrogate pairs