Unicode Confusables Detector
The Unicode Consortium publishes confusables.txt, a data file that maps characters to the visually similar characters they can be mistaken for. This tool uses a curated subset of that data to flag the most security-relevant confusables in any text.
A confusable (Unicode term) is any character that a reasonable reader could mistake for a different character. The Unicode Security Mechanisms specification (UTS#39) defines a "skeleton" transform that reduces confusable sequences to a common form so they can be compared, along with mechanisms for detecting mixed-script identifiers. The full confusables.txt database contains several thousand mappings; the most security-critical are the Latin lookalikes from the Cyrillic, Greek, and fullwidth blocks covered by the Homoglyph Detector.
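The skeleton idea can be sketched in a few lines of Python. This is a minimal illustration, not this tool's implementation: `CONFUSABLE_MAP` is a hypothetical hand-picked subset rather than the real confusables.txt data, and the sketch covers only the core steps (decompose, map each character to its prototype, decompose again). It skips refinements found in recent revisions of UTS#39.

```python
import unicodedata

# Hypothetical hand-picked subset for illustration only;
# the real mappings come from confusables.txt.
CONFUSABLE_MAP = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def skeleton(text: str) -> str:
    """Reduce text to a comparison form: NFD, map prototypes, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(CONFUSABLE_MAP.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def is_confusable(a: str, b: str) -> bool:
    """Two distinct strings are confusable if their skeletons match."""
    return a != b and skeleton(a) == skeleton(b)


if __name__ == "__main__":
    spoof = "p\u0430yp\u0430l"  # Cyrillic 'а' in place of Latin 'a'
    print(is_confusable("paypal", spoof))
```

Skeletons are meant for comparison only. They are not a display form and should not replace the original string.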
Script categories detected
- Cyrillic (highlighted red) — А В С Е Н І К М О Р Ѕ Т У Х and their lowercase equivalents. Primary source of IDN homograph attacks and phishing domains.
- Greek (highlighted orange) — Α Β Ε Ζ Η Ι Κ Μ Ν Ο Ρ Τ Υ Χ and lowercase α ο ρ υ χ ν. Used in mixed-script usernames and code identifiers.
- Fullwidth Latin (highlighted blue) — Ａ–Ｚ, ａ–ｚ, ０–９ (U+FF21–U+FF3A, U+FF41–U+FF5A, U+FF10–U+FF19). Designed for CJK typesetting, abused in social-media name spoofing.
- Mathematical variants (highlighted purple) — 𝐀–𝐙, 𝐚–𝐳 (mathematical bold), and italic capitals. Used in emoji-style display names and package name spoofing.
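The four categories above could be expressed as a small classifier. The sketch below is an assumption about how such a check might look, not this tool's code. The Cyrillic and Greek sets are built from the code points of the letters listed above, the fullwidth and mathematical ranges are taken from the Unicode code charts, and the names `classify` and `find_confusables` are hypothetical.

```python
from typing import List, Optional, Tuple

# Code points of the uppercase lookalikes listed above.
CYRILLIC_UPPER = [0x0410, 0x0412, 0x0421, 0x0415, 0x041D, 0x0406, 0x041A,
                  0x041C, 0x041E, 0x0420, 0x0405, 0x0422, 0x0423, 0x0425]
GREEK_UPPER = [0x0391, 0x0392, 0x0395, 0x0396, 0x0397, 0x0399, 0x039A,
               0x039C, 0x039D, 0x039F, 0x03A1, 0x03A4, 0x03A5, 0x03A7]
GREEK_LOWER = [0x03B1, 0x03BF, 0x03C1, 0x03C5, 0x03C7, 0x03BD]

# Cyrillic lowercase equivalents are derived with str.lower().
CYRILLIC = {chr(cp) for cp in CYRILLIC_UPPER} | {chr(cp).lower() for cp in CYRILLIC_UPPER}
GREEK = {chr(cp) for cp in GREEK_UPPER + GREEK_LOWER}

# Highlight colors as described in the list above.
COLORS = {"cyrillic": "red", "greek": "orange",
          "fullwidth": "blue", "mathematical": "purple"}


def classify(ch: str) -> Optional[str]:
    """Return the category of a single character, or None if not flagged."""
    cp = ord(ch)
    if ch in CYRILLIC:
        return "cyrillic"
    if ch in GREEK:
        return "greek"
    # Fullwidth digits, capitals, and small letters (punctuation excluded).
    if 0xFF10 <= cp <= 0xFF19 or 0xFF21 <= cp <= 0xFF3A or 0xFF41 <= cp <= 0xFF5A:
        return "fullwidth"
    # Mathematical bold A-Z and a-z, then mathematical italic capitals.
    if 0x1D400 <= cp <= 0x1D433 or 0x1D434 <= cp <= 0x1D44D:
        return "mathematical"
    return None


def find_confusables(text: str) -> List[Tuple[int, str, str]]:
    """List (index, character, category) for every flagged character."""
    hits = []
    for i, ch in enumerate(text):
        category = classify(ch)
        if category is not None:
            hits.append((i, ch, category))
    return hits
```

Matching explicit sets for Cyrillic and Greek, rather than whole blocks, keeps legitimate non-lookalike letters such as Д or Ω from being flagged.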
How to detect Unicode confusables
- Paste the text you want to check into the Homoglyph Detector input area.
- Click Analyze. Each confusable character is highlighted by script category with its Unicode codepoint name and the ASCII character it resembles.
- Use Clean to produce an ASCII-only version, or Compare mode to diff two strings at the codepoint level.
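As a rough illustration of what Clean and Compare do, the sketch below folds compatibility characters with NFKC, translates a few lookalikes, and diffs two strings code point by code point. It is an assumption about the general approach, not the tool's actual code. `LOOKALIKES` is a hypothetical six-entry table, so unlike the real Clean the output is only ASCII when every non-ASCII character is covered by NFKC or by that table.

```python
import unicodedata
from itertools import zip_longest

# Hypothetical minimal lookalike table; a real tool would carry a larger one.
LOOKALIKES = str.maketrans({
    "\u0430": "a", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0441": "c", "\u03bf": "o",
})


def clean(text: str) -> str:
    """Fold fullwidth and mathematical forms via NFKC, then map lookalikes."""
    folded = unicodedata.normalize("NFKC", text)
    return folded.translate(LOOKALIKES)


def describe(ch):
    """Format a character as 'U+XXXX NAME', or mark a missing position."""
    if ch is None:
        return "(missing)"
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"


def compare(a: str, b: str):
    """Return (index, left, right) for each position where code points differ."""
    diffs = []
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            diffs.append((i, describe(x), describe(y)))
    return diffs


if __name__ == "__main__":
    for row in compare("paypal", "p\u0430ypal"):
        print(row)
```

The comparison is positional, so an inserted or deleted character shifts every later position. A real diff view would likely align the strings first.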
UTS#39 and the Unicode Confusables Data
Unicode Technical Standard #39 (Unicode Security Mechanisms) defines the confusable-detection mechanism that browsers draw on, alongside their own policies, when deciding how to display internationalized domain names (IDNs). The source data at unicode.org/Public/security/latest/confusables.txt is updated with each Unicode version. This tool implements a curated, security-focused subset covering the characters most frequently observed in real-world attacks.
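For readers who want to work with the source data directly, here is a minimal parsing sketch. Each data line in confusables.txt holds a source code point sequence, a target (prototype) sequence, and a type field, separated by semicolons, with a trailing `#` comment. The `SAMPLE` lines below only imitate that layout for illustration and should be checked against the real file. The function name `parse_confusables` is hypothetical.

```python
# Illustrative lines in the confusables.txt field layout (not copied verbatim).
SAMPLE = """\
# source ; target ; type # comment
0441 ;\t0063 ;\tMA\t# CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C
0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
006D ;\t0072 006E ;\tMA\t# LATIN SMALL LETTER M -> LATIN SMALL LETTER R, N
"""


def parse_confusables(lines):
    """Build a {source: target} mapping from confusables.txt-style lines."""
    mapping = {}
    for line in lines:
        line = line.lstrip("\ufeff")              # tolerate a leading BOM
        data = line.split("#", 1)[0].strip()      # drop the trailing comment
        if not data:
            continue                              # blank or comment-only line
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2:
            continue                              # not a data line
        # Each field is a space-separated list of hexadecimal code points.
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping


if __name__ == "__main__":
    table = parse_confusables(SAMPLE.splitlines())
    for source, target in table.items():
        print(f"U+{ord(source[0]):04X} -> {target!r}")
```

A target can span several code points, which is why the mapping values are strings rather than single characters.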
Related tools
- Homoglyph Detector — interactive detection and cleaning tool
- Cyrillic Homoglyph Reference — per-character Cyrillic table
- Phishing Text Checker — check suspicious URLs
- Invisible Character Detector — zero-width and BiDi attacks
- Unicode Lookup — look up any codepoint by name or number