Codepoint - the address of every character in Unicode

Last updated: 2026-05-19 · ~3 min

Every character you can type, paste, or display has a unique address in Unicode called a codepoint. Written as U+ followed by a hexadecimal number (like U+1F600 for 😀), codepoints are the foundation of how computers store and transmit text. Understanding them helps you debug emoji issues, count characters correctly, and build better text-handling code.

Definition

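A codepoint is a numerical value in the Unicode codespace (ranging from U+0000 to U+10FFFF). Each assigned codepoint identifies one abstract character; some codepoints, such as the surrogates U+D800 to U+DFFF, are reserved and never represent a character on their own. The codespace holds 1,114,112 possible codepoints (17 planes of 65,536 each), of which more than 150,000 are currently assigned to characters.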
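
The size of the codespace follows directly from its layout. As a minimal sketch of that arithmetic (the constant names and the isInCodespace helper are illustrative, not part of any standard API):

```javascript
// Unicode is organized into 17 planes of 65,536 (0x10000) codepoints each.
const PLANES = 17;
const CODEPOINTS_PER_PLANE = 0x10000;
const MAX_CODEPOINT = 0x10ffff;

const codespaceSize = PLANES * CODEPOINTS_PER_PLANE;
console.log(codespaceSize);                       // 1114112
console.log(codespaceSize === MAX_CODEPOINT + 1); // true

// Hypothetical helper: is n a number inside the Unicode codespace?
function isInCodespace(n) {
  return Number.isInteger(n) && n >= 0 && n <= MAX_CODEPOINT;
}

console.log(isInCodespace(0x1f600));  // true
console.log(isInCodespace(0x110000)); // false, one past the last codepoint
```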

How to read a codepoint

The notation U+XXXX uses four to six hexadecimal digits: four for the Basic Multilingual Plane (padded with leading zeros), and five or six for characters above it:

U+0041 = A (Latin capital letter A)
U+3042 = あ (Hiragana letter A)
U+1F600 = 😀 (Grinning face)
U+1F338 = 🌸 (Cherry blossom)

Characters in the range U+0000 to U+FFFF are in the Basic Multilingual Plane (BMP). Most emoji live above U+FFFF in the Supplementary Multilingual Plane (SMP), which is why they require special handling in some programming languages.
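
To make the notation concrete, here is a rough sketch that parses a U+ string and reports which plane it falls in. The parseCodepoint and planeOf helpers are hypothetical names for this example; String.fromCodePoint is the standard JavaScript call.

```javascript
// Hypothetical helper: parse "U+XXXX" notation (4 to 6 hex digits) into a number.
function parseCodepoint(notation) {
  const match = /^U\+([0-9A-F]{4,6})$/i.exec(notation);
  if (!match) throw new SyntaxError(`Not U+ notation: ${notation}`);
  const value = parseInt(match[1], 16);
  if (value > 0x10ffff) throw new RangeError(`Outside the codespace: ${notation}`);
  return value;
}

// Hypothetical helper: each plane holds 0x10000 codepoints, so the plane
// number is everything above the low 16 bits. Plane 0 is the BMP, 1 the SMP.
function planeOf(codepoint) {
  return codepoint >> 16;
}

console.log(String.fromCodePoint(parseCodepoint("U+0041"))); // A
console.log(planeOf(parseCodepoint("U+3042")));              // 0
console.log(planeOf(parseCodepoint("U+1F600")));             // 1
```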

Codepoints vs. characters vs. glyphs

Concept	What it is	Example
Codepoint	The number (address)	U+1F600
Character	The abstract concept	"Grinning face"
Glyph	The visual rendering	😀 (varies by platform)

Why emoji codepoints matter for developers

String length surprises

In JavaScript, "😀".length returns 2, not 1. JavaScript strings are sequences of UTF-16 code units, and any codepoint above U+FFFF is stored as a surrogate pair (two 16-bit code units). To count something closer to what a reader sees, use:

[..."😀"].length → 1 (spreading iterates by codepoint, not by code unit)
new Intl.Segmenter() for grapheme cluster counting, which treats multi-codepoint sequences such as ZWJ combos as one unit
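
The three ways of counting can be compared side by side. This is a sketch that assumes a runtime with Intl.Segmenter available (recent browsers and Node.js); countGraphemes is a hypothetical helper name, and the "en" locale is an arbitrary choice for the example.

```javascript
const face = "😀"; // U+1F600, one codepoint above the BMP

console.log(face.length);      // 2 (UTF-16 code units)
console.log([...face].length); // 1 (codepoints)

// The two code units are the surrogate pair for U+1F600.
console.log(face.charCodeAt(0).toString(16)); // d83d
console.log(face.charCodeAt(1).toString(16)); // de00

// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

// Family emoji: man + ZWJ + woman + ZWJ + girl.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(family.length);          // 8 (code units)
console.log([...family].length);     // 5 (codepoints)
console.log(countGraphemes(family)); // 1 (grapheme cluster)
```

Which count is "right" depends on the question: storage limits usually care about code units or bytes, while cursor movement and character limits shown to users are closer to grapheme clusters.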

Database storage

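MySQL's utf8 charset (an alias for utf8mb3) stores at most 3 bytes per character, which covers only the BMP. Emoji above U+FFFF need 4 bytes in UTF-8, so use utf8mb4 to store them; with utf8, such values are rejected or truncated, depending on the SQL mode. PostgreSQL's text type stores any Unicode character when the database encoding is UTF8.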
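
One way to check input before it reaches a 3-byte column is to measure its UTF-8 size in application code. The sketch below uses the standard TextEncoder API; utf8ByteLength and needsUtf8mb4 are hypothetical helper names, and the check only mirrors the byte-width rule described above, not MySQL's actual validation.

```javascript
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Hypothetical helper: number of bytes the text occupies in UTF-8.
function utf8ByteLength(text) {
  return encoder.encode(text).length;
}

// Hypothetical helper: true if any codepoint lies above U+FFFF and
// therefore needs 4 bytes in UTF-8.
function needsUtf8mb4(text) {
  for (const ch of text) { // for...of iterates by codepoint
    if (ch.codePointAt(0) > 0xffff) return true;
  }
  return false;
}

console.log(utf8ByteLength("A"));  // 1
console.log(utf8ByteLength("あ")); // 3
console.log(utf8ByteLength("😀")); // 4

console.log(needsUtf8mb4("caf\u00E9 あ")); // false
console.log(needsUtf8mb4("hi 🌸"));        // true
```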

Regular expressions

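Matching emoji in regex requires Unicode-aware patterns. In JavaScript, /\p{Extended_Pictographic}/u matches pictographic emoji characters. /\p{Emoji}/u is broader than its name suggests: it also matches the digits 0-9, # and *, because those characters carry the Emoji property as bases of keycap sequences. The u flag is required. Without it, \p is not treated as a property escape and the pattern operates on UTF-16 code units, so surrogate pairs won't match correctly.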
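
A short sketch of how these patterns behave on a sample string (the sample text is made up for illustration; the property names are the standard Unicode ones supported by JavaScript's u flag):

```javascript
const sample = "Score: 5 🌸😀";

// \p{Emoji} includes digits, so the "5" is matched too.
const broad = sample.match(/\p{Emoji}/gu);
console.log(broad); // [ '5', '🌸', '😀' ]

// \p{Extended_Pictographic} skips plain digits.
const pictographic = sample.match(/\p{Extended_Pictographic}/gu);
console.log(pictographic); // [ '🌸', '😀' ]

// Without the u flag, "." matches a single UTF-16 code unit,
// so one emoji above U+FFFF looks like two characters.
console.log(/^.$/.test("😀"));  // false
console.log(/^.$/u.test("😀")); // true

// Without the u flag, \p{Emoji} is just the literal text "p{Emoji}".
console.log(/\p{Emoji}/.test("😀"));       // false
console.log(/\p{Emoji}/.test("p{Emoji}")); // true
```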

Finding a character's codepoint

Browser DevTools: "🌸".codePointAt(0).toString(16) → "1f338"
Python: hex(ord("🌸")) → "0x1f338"
Command line (bash, UTF-8 locale): printf '%x\n' "'🌸" → 1f338
Online: Unicode Charts for browsing by block
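
The DevTools one-liner can be wrapped into a small formatter that produces the conventional U+ notation. This is a sketch; toUPlus is a hypothetical helper name, and it only looks at the first codepoint of its input.

```javascript
// Hypothetical helper: format the first codepoint of a string as U+XXXX,
// uppercase and zero-padded to at least four hex digits.
function toUPlus(char) {
  const codepoint = char.codePointAt(0);
  return "U+" + codepoint.toString(16).toUpperCase().padStart(4, "0");
}

console.log(toUPlus("A"));  // U+0041
console.log(toUPlus("あ")); // U+3042
console.log(toUPlus("🌸")); // U+1F338
```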

Codepoints in emoji art

When crafting emoji combos, knowing codepoints helps you:

Identify why a character renders as a blank square (unsupported codepoint)
Distinguish between visually similar characters from different blocks
Debug copy-paste issues where invisible characters (ZWJ, variation selectors) get stripped
Calculate the true "cost" of a combo in platforms that count codepoints
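
For the debugging cases above, it can help to break a combo into its codepoints and flag the invisible ones. The following is a rough sketch: describeCombo and the INVISIBLE table are hypothetical names, and the table lists only the three invisible characters relevant to this example, not every one in Unicode.

```javascript
// A small, non-exhaustive table of invisible codepoints common in emoji sequences.
const INVISIBLE = new Map([
  [0x200d, "ZWJ"],
  [0xfe0e, "VS15 (text presentation)"],
  [0xfe0f, "VS16 (emoji presentation)"],
]);

// Hypothetical helper: list each codepoint in a combo and label invisible ones.
function describeCombo(text) {
  return [...text].map((ch) => {
    const cp = ch.codePointAt(0);
    return {
      codepoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
      invisible: INVISIBLE.get(cp) ?? null,
    };
  });
}

// "Heart on fire": heavy black heart + VS16 + ZWJ + fire.
const combo = "\u2764\uFE0F\u200D\u{1F525}";
const parts = describeCombo(combo);

console.log(parts.length); // 4 (the codepoint "cost" of the combo)
console.log(parts.map((p) => p.codepoint).join(" "));
// U+2764 U+FE0F U+200D U+1F525
console.log(parts.filter((p) => p.invisible).map((p) => p.invisible));
// [ 'VS16 (emoji presentation)', 'ZWJ' ]
```

If a pasted combo shows fewer codepoints than expected, a stripped ZWJ or variation selector is a likely cause.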

Related terms

ZWJ (Zero Width Joiner, U+200D) - invisible character that joins adjacent emoji into a single glyph on platforms that support the sequence