Codepoint - the address of every character in Unicode
Every character you can type, paste, or display has a unique address in Unicode called a codepoint. Written as U+ followed by a hexadecimal number (like U+1F600 for 😀), codepoints are the foundation of how computers store and transmit text. Understanding them helps you debug emoji issues, count characters correctly, and build better text-handling code.
Definition
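A codepoint is a numerical value in the Unicode codespace (ranging from U+0000 to U+10FFFF). A codepoint identifies at most one encoded character: many codepoints are still unassigned, and some, such as the surrogate range U+D800 to U+DFFF, are reserved and never represent a character on their own. The total codespace allows for 1,114,112 possible codepoints, of which about 150,000 are currently assigned to characters.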
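The size of the codespace follows from its layout as 17 planes of 65,536 codepoints each. The arithmetic can be checked in Python. This is a minimal sketch, and the constant names are chosen here for illustration rather than taken from the Unicode standard:

```python
import sys

PLANES = 17                      # planes 0 through 16
CODEPOINTS_PER_PLANE = 0x10000   # 65,536 codepoints in each plane

total = PLANES * CODEPOINTS_PER_PLANE
highest = total - 1

print(total)         # 1114112
print(hex(highest))  # 0x10ffff

# Python exposes the same upper bound, and chr() rejects anything beyond it.
print(sys.maxunicode == highest)  # True
try:
    chr(highest + 1)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```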
How to read a codepoint
The notation U+XXXX uses hexadecimal digits, zero-padded to at least four; characters above U+FFFF take five or six digits:
- U+0041 = A (Latin capital letter A)
- U+3042 = あ (Hiragana letter A)
- U+1F600 = 😀 (Grinning face)
- U+1F338 = 🌸 (Cherry blossom)
Characters in the range U+0000 to U+FFFF are in the Basic Multilingual Plane (BMP). Most emoji live above U+FFFF in the Supplementary Multilingual Plane (SMP), which is why they require special handling in some programming languages.
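The notation and the plane boundary can both be derived from the codepoint's numeric value. A small Python sketch, with hypothetical helper names u_notation and plane:

```python
def u_notation(ch: str) -> str:
    """Format a single character as U+XXXX, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"


def plane(ch: str) -> int:
    """Return the plane number: 0 is the BMP, 1 is the SMP, up to 16."""
    return ord(ch) >> 16


# The four characters listed above: A, Hiragana A, grinning face, cherry blossom.
for ch in ["A", "\u3042", "\U0001F600", "\U0001F338"]:
    print(u_notation(ch), "plane", plane(ch))
# U+0041 plane 0
# U+3042 plane 0
# U+1F600 plane 1
# U+1F338 plane 1
```

Shifting right by 16 bits works because each plane holds exactly 0x10000 codepoints.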
Codepoints vs. characters vs. glyphs
| Concept | What it is | Example |
|---|---|---|
| Codepoint | The number (address) | U+1F600 |
| Character | The abstract concept | "Grinning face" |
| Glyph | The visual rendering | 😀 (varies by platform) |
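The three concepts can be seen side by side in Python, where the standard unicodedata module maps a character to its formal Unicode name. This is only an illustration; the variable names are arbitrary:

```python
import unicodedata

codepoint = 0x1F600                  # the number (address)
character = chr(codepoint)           # the character stored at that address
name = unicodedata.name(character)   # its formal Unicode name

print(f"U+{codepoint:04X}", name)    # U+1F600 GRINNING FACE
print(character)                     # the glyph you see depends on the font that renders it
```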
Why emoji codepoints matter for developers
String length surprises
In JavaScript, "😀".length returns 2, not 1. This is because JavaScript strings use UTF-16 encoding, and emoji above U+FFFF require a surrogate pair (two 16-bit code units). To count actual characters, use:
- [..."😀"].length → 1 (spread into array of codepoints)
- new Intl.Segmenter() for grapheme cluster counting
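The difference between code units and codepoints can be reproduced outside JavaScript. In the Python sketch below, len() counts codepoints, and the hypothetical helper utf16_units counts 16-bit code units, which is what JavaScript's .length reports. Python's standard library has no grapheme segmenter, so that third count is not shown:

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units, which is what JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2


grin = "\U0001F600"
# Family emoji: man, ZWJ, woman, ZWJ, girl (one visible symbol on most platforms).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(utf16_units(grin), len(grin))      # 2 1
print(utf16_units(family), len(family))  # 8 5
```

The family emoji shows why counting codepoints is not always enough: it is one visible symbol but five codepoints, so only grapheme cluster counting would report 1.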
Database storage
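MySQL's utf8 charset (an alias for utf8mb3) stores at most 3 bytes per character, which covers only the BMP. Emoji above U+FFFF take 4 bytes in UTF-8, so you need utf8mb4 to store them without errors or data loss. PostgreSQL's text type stores emoji without special configuration when the database encoding is UTF8.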
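The 3-byte limit corresponds exactly to the BMP boundary, which makes it easy to check text before it reaches the database. A minimal Python sketch, assuming you want to detect the problem in application code; utf8_width and needs_utf8mb4 are hypothetical helper names, not part of any database driver:

```python
def utf8_width(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text: str) -> bool:
    """True if any codepoint lies above the BMP, i.e. takes 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


print(utf8_width("A"), utf8_width("\u3042"), utf8_width("\U0001F600"))  # 1 3 4
print(needs_utf8mb4("caf\u00e9 \u3042"))   # False
print(needs_utf8mb4("hi \U0001F600"))      # True
```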
Regular expressions
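Matching emoji in regex requires Unicode-aware patterns. In JavaScript, /\p{Emoji}/u matches any character with the Unicode Emoji property. That property also covers the digits 0-9, # and *, so /\p{Extended_Pictographic}/u is often closer to what you want. The u flag is required: without it, \p{...} is not recognized as a property escape, and the pattern is applied to UTF-16 code units, so surrogate pairs are not treated as single characters.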
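Python's built-in re module has no \p{...} property escapes, so the sketch below swaps in a cruder technique: a character class covering every codepoint above the BMP. It is not equivalent to an emoji property match. It only illustrates that a codepoint-aware engine treats each supplementary character as one unit. The name ASTRAL is an arbitrary choice:

```python
import re

# Crude stand-in for an emoji matcher: any codepoint above the BMP.
# Assumption: this misses BMP emoji such as U+2764 and also matches
# supplementary-plane characters that are not emoji.
ASTRAL = re.compile("[\U00010000-\U0010FFFF]")

text = "hi \U0001F600 and \U0001F338, \u3042"
found = ASTRAL.findall(text)

print([f"U+{ord(ch):04X}" for ch in found])  # ['U+1F600', 'U+1F338']
```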
Finding a character's codepoint
- Browser DevTools: "🌸".codePointAt(0).toString(16) → "1f338"
- Python: hex(ord("🌸")) → "0x1f338"
- Command line: printf '%x\n' "'🌸"
- Online: Unicode Charts for browsing by block
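Going the other way, from a U+ label back to the character, is just a hexadecimal parse. A short Python sketch with a hypothetical helper named from_u_notation:

```python
def from_u_notation(label: str) -> str:
    """Turn a label such as 'U+1F338' into the character it names."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"expected U+XXXX notation, got {label!r}")
    return chr(int(label[2:], 16))


blossom = from_u_notation("U+1F338")
print(blossom)             # the cherry blossom character
print(hex(ord(blossom)))   # 0x1f338
```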
Codepoints in emoji art
When crafting emoji combos, knowing codepoints helps you:
- Identify why a character renders as a blank square (unsupported codepoint)
- Distinguish between visually similar characters from different blocks
- Debug copy-paste issues where invisible characters (ZWJ, variation selectors) get stripped
- Calculate the true "cost" of a combo in platforms that count codepoints
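Most of these checks come down to listing every codepoint in a string, including the invisible ones. A minimal Python sketch, assuming a hypothetical helper named dump and the standard unicodedata module for names:

```python
import unicodedata


def dump(text: str) -> list[tuple[str, str]]:
    """List each codepoint in text as (U+XXXX, Unicode name), invisible ones included."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
        for ch in text
    ]


# Rainbow flag: white flag, variation selector-16, ZWJ, rainbow.
flag = "\U0001F3F3\uFE0F\u200D\U0001F308"

for code, name in dump(flag):
    print(code, name)

print(len(dump(flag)))  # 4 codepoints for one visible symbol
```

If a paste strips the ZWJ or the variation selector, the dump shows fewer rows, which usually explains why the combo fell apart into separate symbols.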
Related terms
- ZWJ (Zero Width Joiner, U+200D) - invisible character that joins adjacent emoji into a single combined emoji