Codepoint - the address of every character in Unicode

Last updated: 2026-05-19·~3 min

Every character you can type, paste, or display has a unique address in Unicode called a codepoint. Written as U+ followed by a hexadecimal number (like U+1F600 for 😀), codepoints are the foundation of how computers store and transmit text. Understanding them helps you debug emoji issues, count characters correctly, and build better text-handling code.

Definition

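A codepoint is a numerical value in the Unicode codespace, which ranges from U+0000 to U+10FFFF. Each codepoint that is assigned to a character identifies exactly one abstract character; the rest are either reserved (for example the surrogate range U+D800 to U+DFFF) or not yet assigned. The codespace allows for 1,114,112 possible codepoints, of which roughly 150,000 are currently assigned to characters.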
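The size of the codespace follows directly from its two endpoints. A minimal sketch of the arithmetic (the variable names are illustrative):

```python
import sys

# The Unicode codespace runs from U+0000 to U+10FFFF inclusive.
FIRST = 0x0000
LAST = 0x10FFFF

total = LAST - FIRST + 1       # number of codepoints in the codespace
planes = total // 0x10000      # each plane holds 65,536 (0x10000) codepoints

print(total)                   # 1114112
print(planes)                  # 17
print(sys.maxunicode == LAST)  # True: Python's chr() accepts values up to U+10FFFF
```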

How to read a codepoint

The notation U+XXXX (or U+XXXXX for characters above the Basic Multilingual Plane) uses hexadecimal digits:

  • U+0041 = A (Latin capital letter A)
  • U+3042 = あ (Hiragana letter A)
  • U+1F600 = 😀 (Grinning face)
  • U+1F338 = 🌸 (Cherry blossom)

Characters in the range U+0000 to U+FFFF are in the Basic Multilingual Plane (BMP). Most emoji live above U+FFFF in the Supplementary Multilingual Plane (SMP, U+10000 to U+1FFFF), which is why they need special handling in languages whose strings are built on 16-bit code units, such as JavaScript.
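The notation and the plane split can be sketched in a few lines of Python. The helper names to_notation, from_notation, and plane are hypothetical, chosen only for this illustration:

```python
def to_notation(ch):
    """Format one character as U+XXXX, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"

def from_notation(text):
    """Turn a string such as 'U+1F600' back into the character it names."""
    if not text.upper().startswith("U+"):
        raise ValueError(f"not in U+ notation: {text!r}")
    return chr(int(text[2:], 16))

def plane(ch):
    """Plane index of a character: 0 is the BMP, 1 is the SMP."""
    return ord(ch) >> 16  # each plane spans 0x10000 codepoints

print(to_notation("A"))           # U+0041
print(to_notation("\U0001F600"))  # U+1F600
print(plane("A"))                 # 0
print(plane("\U0001F600"))        # 1
```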

Codepoints vs. characters vs. glyphs

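| Concept   | What it is           | Example                |
|-----------|----------------------|------------------------|
| Codepoint | The number (address) | U+1F600                |
| Character | The abstract concept | "Grinning face"        |
| Glyph     | The visual rendering | 😀 (varies by platform) |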
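The first two of these can be inspected from code; the glyph cannot, because it depends on the font doing the rendering. A small sketch using Python's standard unicodedata module:

```python
import unicodedata

ch = "\U0001F600"  # 😀

codepoint = f"U+{ord(ch):04X}"  # the number (address)
name = unicodedata.name(ch)     # the name of the abstract character

print(codepoint)  # U+1F600
print(name)       # GRINNING FACE
# The glyph is whatever the display font draws for this codepoint.
```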

Why emoji codepoints matter for developers

String length surprises

In JavaScript, "😀".length returns 2, not 1. This is because JavaScript strings are sequences of UTF-16 code units, and emoji above U+FFFF require a surrogate pair (two 16-bit code units). To count codepoints or user-perceived characters rather than code units, use:

  • [..."😀"].length → 1 (spread into an array of codepoints)
  • new Intl.Segmenter() for grapheme cluster counting, which treats a multi-codepoint sequence such as a ZWJ emoji as one character
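The same three-way difference can be reproduced outside JavaScript. The sketch below uses Python, where len() counts codepoints, and derives the UTF-16 view explicitly; utf16_units and surrogate_pair are hypothetical helper names for this illustration:

```python
def utf16_units(text):
    """Number of 16-bit code units, i.e. what JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2

def surrogate_pair(ch):
    """Split a codepoint above U+FFFF into its high and low surrogates."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("BMP characters do not use surrogate pairs")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

grin = "\U0001F600"                       # 😀
print(utf16_units(grin), len(grin))       # 2 1
print([hex(u) for u in surrogate_pair(grin)])  # ['0xd83d', '0xde00']

# A ZWJ sequence shows as one emoji but contains three codepoints.
technologist = "\U0001F469\u200D\U0001F4BB"   # 👩‍💻
print(utf16_units(technologist), len(technologist))  # 5 3
```

Counting the last example as a single character requires grapheme cluster segmentation, which Python's standard library does not provide.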

Database storage

MySQL's utf8 charset (an alias for utf8mb3) only supports up to 3 bytes per character, which covers the BMP only. Emoji above U+FFFF require 4 bytes in UTF-8, so you need utf8mb4 to store them without errors or data loss. PostgreSQL's text type handles all of Unicode when the database encoding is UTF8.
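Whether a string would survive a 3-byte charset can be checked before it reaches the database. A minimal sketch, with utf8_len and fits_in_3_byte_utf8 as hypothetical helper names:

```python
def utf8_len(ch):
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

def fits_in_3_byte_utf8(text):
    """True if every codepoint is in the BMP, i.e. needs at most 3 UTF-8 bytes."""
    return all(ord(ch) <= 0xFFFF for ch in text)

print(utf8_len("A"))           # 1
print(utf8_len("\u3042"))      # 3  (あ)
print(utf8_len("\U0001F338"))  # 4  (🌸)

print(fits_in_3_byte_utf8("A\u3042"))        # True
print(fits_in_3_byte_utf8("hi \U0001F338"))  # False
```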

Regular expressions

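Matching emoji in regex requires Unicode-aware patterns. In JavaScript, /\p{Emoji}/u matches any character with the Emoji property, but that property also covers the digits 0-9, # and *, so /\p{Extended_Pictographic}/u is often closer to what you want. Without the u flag, \p{...} is not treated as a property escape at all, and the pattern works on UTF-16 code units, so surrogate pairs are not matched as single characters.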
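Python's built-in re module has no \p{...} property escapes, so a standard-library version has to fall back on explicit codepoint ranges. The sketch below is a rough approximation only: the ranges are an assumption made for this example, they include some non-emoji symbols, and they miss emoji that live elsewhere. ROUGH_EMOJI and contains_rough_emoji are hypothetical names.

```python
import re

# Approximate ranges (an assumption, not the Unicode Emoji property):
#   U+2600-U+27BF    Miscellaneous Symbols and Dingbats
#   U+1F300-U+1FAFF  the main pictograph and emoticon blocks in the SMP
ROUGH_EMOJI = re.compile("[\u2600-\u27BF\U0001F300-\U0001FAFF]")

def contains_rough_emoji(text):
    """True if any character falls inside the approximate ranges above."""
    return ROUGH_EMOJI.search(text) is not None

print(contains_rough_emoji("hello \U0001F600"))  # True
print(contains_rough_emoji("abc 123 #"))         # False
```

Python strings are sequences of codepoints, so a character class like this one matches a whole emoji without any surrogate handling.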

Finding a character's codepoint

  • Browser DevTools: "🌸".codePointAt(0).toString(16) → "1f338"
  • Python: hex(ord("🌸")) → "0x1f338"
  • Command line (recent bash or zsh with a UTF-8 locale): printf '%x\n' "'🌸"
  • Online: Unicode Charts for browsing by block

Codepoints in emoji art

When crafting emoji combos, knowing codepoints helps you:

  • Identify why a character renders as a blank square (no available font has a glyph for that codepoint)
  • Distinguish between visually similar characters from different blocks
  • Debug copy-paste issues where invisible characters (ZWJ, variation selectors) get stripped
  • Calculate the true "cost" of a combo in platforms that count codepoints
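A small inspection helper makes these checks concrete. The sketch below lists every codepoint in a combo and labels the invisible ones; inspect_combo and the INVISIBLE table are hypothetical names, and the table covers only the three invisible characters most relevant to emoji:

```python
import unicodedata

# Invisible codepoints that commonly appear inside emoji sequences.
INVISIBLE = {
    0x200D: "ZWJ",
    0xFE0E: "VS15 (text presentation)",
    0xFE0F: "VS16 (emoji presentation)",
}

def inspect_combo(text):
    """Return a (U+XXXX, label) pair for every codepoint in the text."""
    rows = []
    for ch in text:
        cp = ord(ch)
        label = INVISIBLE.get(cp) or unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{cp:04X}", label))
    return rows

combo = "\U0001F469\u200D\U0001F4BB"  # 👩‍💻, built from three codepoints

for code, label in inspect_combo(combo):
    print(code, label)

print(len(combo))                      # 3: the cost where codepoints are counted
print(len(combo.replace("\u200D", "")))  # 2: what remains if the ZWJ is stripped
```

If a paste loses the ZWJ, the output shows two separate emoji codepoints where one joined sequence was expected.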
