jarvisbox

Unicode Confusables Detector

The Unicode Consortium publishes confusables.txt, a data file that maps characters to the visually similar characters they can be mistaken for. This tool uses a curated subset of that data to detect the most security-relevant confusables in any text.

A confusable, in Unicode terminology, is a character that a reader could reasonably mistake for a different character. Unicode Technical Standard #39 (UTS #39, Unicode Security Mechanisms) defines a "skeleton" transformation for comparing confusable strings, along with mechanisms for detecting mixed-script identifiers. The full confusables.txt database contains over 7,000 pairs; the most security-critical are the Latin lookalikes from the Cyrillic, Greek, and fullwidth blocks, which are the ones the Homoglyph Detector covers.
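
The skeleton idea can be illustrated with a minimal Python sketch. It follows the general shape described in UTS #39 (decompose, map each character to a prototype, decompose again), but the `PROTOTYPES` table here is a tiny hand-picked assumption for illustration, not the real confusables.txt data, and the function names are hypothetical rather than this tool's implementation.

```python
import unicodedata

# Illustrative subset only (assumption): a handful of Latin lookalikes.
# A real implementation would load the full mapping from confusables.txt.
PROTOTYPES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def skeleton(text):
    """Reduce text to a comparable form: NFD, map to prototypes, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable(a, b):
    """Two strings count as confusable here when their skeletons are equal."""
    return skeleton(a) == skeleton(b)
```

With this table, `confusable("\u0440\u0430ypal", "paypal")` is true because the Cyrillic er and a map to Latin p and a, while `confusable("paypal", "paypa1")` is false because the digit 1 has no entry in the sample table.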

Script categories detected

How to detect Unicode confusables

  1. Paste the text you want to check into the Homoglyph Detector input area.
  2. Click Analyze. Each confusable character is highlighted by script category and shown with its Unicode code point name and the ASCII character it resembles.
  3. Use Clean to produce an ASCII-only version, or Compare mode to diff two strings at the codepoint level.
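
The three operations in the steps above (Analyze, Clean, Compare) can be approximated in a short Python sketch. Everything here is an assumption made for illustration: the `LOOKALIKES` table is a small hand-picked sample, the function names are hypothetical, and the script category is guessed from the first word of the character name because the Python standard library does not expose the Unicode Script property.

```python
import unicodedata

# Small illustrative sample (assumption), not the tool's actual data set.
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
    "\uff41": "a",  # FULLWIDTH LATIN SMALL LETTER A
}


def analyze(text):
    """Report each known lookalike with its position, code point, and name."""
    findings = []
    for index, ch in enumerate(text):
        if ch in LOOKALIKES:
            name = unicodedata.name(ch, "UNKNOWN")
            findings.append({
                "index": index,
                "codepoint": f"U+{ord(ch):04X}",
                "name": name,
                # Heuristic: first word of the name stands in for the script.
                "category": name.split()[0].title(),
                "resembles": LOOKALIKES[ch],
            })
    return findings


def clean(text):
    """Replace known lookalikes; characters outside the table pass through."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)


def compare(a, b):
    """List positions where code points differ (extra trailing text ignored)."""
    diffs = []
    for index, (x, y) in enumerate(zip(a, b)):
        if x != y:
            diffs.append((index, f"U+{ord(x):04X}", f"U+{ord(y):04X}"))
    return diffs
```

Note that this `clean` only substitutes characters present in the sample table, so its output is ASCII-only only when every non-ASCII character in the input has an entry.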

UTS#39 and the Unicode Confusables Data

Unicode Technical Standard #39 (Unicode Security Mechanisms) defines the confusable-detection mechanisms that browser engines draw on when evaluating internationalized domain names (IDNs). The source data at unicode.org/Public/security/latest/confusables.txt is updated with each Unicode version. This tool implements a curated, security-focused subset covering the characters most frequently observed in real-world attacks.
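
For readers who want to work with the source data directly, the following Python sketch parses lines in the confusables.txt layout: semicolon-separated fields holding a source code point, one or more target code points in hexadecimal, and a type tag, with `#` starting a comment. The `SAMPLE` lines are written here to mirror that layout and are an assumption for illustration; they are not a verified excerpt of the current file, and `parse_confusables` is a hypothetical helper, not part of this tool.

```python
def parse_confusables(lines):
    """Build a {source: target} mapping from confusables.txt-style lines."""
    table = {}
    for raw in lines:
        # Drop a leading byte order mark, then anything after '#'.
        line = raw.lstrip("\ufeff").split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only line
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue  # not a mapping line
        # Each field holds one or more space-separated hex code points.
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


# Sample lines modeled on the file's layout (assumption, for illustration).
SAMPLE = """\
# source ; target ; type # comment
0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
0441 ; 0063 ; MA # CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C
006D ; 0072 006E ; MA # LATIN SMALL LETTER M -> LATIN SMALL LETTER R, N
""".splitlines()

table = parse_confusables(SAMPLE)
```

A target can span several code points, as in the last sample line, so the mapping values are strings rather than single characters.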
