Stack Overflow Unanswered: Does the direction of brackets need to be flipped in an RTL lexer

2026-04-27

Tags: string, compiler-construction, right-to-left, bidi

Score: 0 | Views: 48

The asker is building a compiler for a Hebrew-based programming language and has hit a subtle but fundamental question: when a user presses Shift+0 in an RTL text editor, the character visually appears as an opening bracket on the right side of the screen. But what Unicode code point is actually stored in the file? And does the lexer need to account for this?

This is more interesting than it first appears because it sits at the intersection of three distinct layers that are easy to conflate:

The keyboard layer: What physical key produces what code point.
The rendering layer: How the Unicode Bidirectional Algorithm (UBA) displays paired characters.
The lexer layer: What bytes are actually present in the source file.

The key insight is that Unicode treats parentheses as mirrored characters. U+0028 ( is LEFT PARENTHESIS and U+0029 ) is RIGHT PARENTHESIS, and both carry the Bidi_Mirrored property. Despite the names, they function as the opening and closing parenthesis respectively. When they resolve to a right-to-left direction under the UBA, a compliant renderer draws each with its mirrored glyph, so U+0028 appears on the right side of the expression and looks like what an LTR reader would call a closing paren. In the file, it is still U+0028.
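The properties involved can be inspected with Python's standard unicodedata module. The following is a minimal sketch: the helper name describe and the sample Hebrew identifiers are made up for illustration and do not come from the question.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point, Unicode name, bidi class, mirrored flag) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.bidirectional(ch),
        bool(unicodedata.mirrored(ch)),
    )


# A hypothetical call expression, written with escapes so that the order
# seen here is the logical (stored) order and not whatever a bidi-aware
# viewer would draw: three Hebrew letters, "(", two Hebrew letters, ")".
source = "\u05E7\u05E8\u05D0(\u05E9\u05DD)"

for ch in source:
    if ch in "()":
        print(describe(ch))
```

The first parenthesis in stored order is U+0028 even though an RTL renderer draws it with the mirrored glyph. The mirrored flag describes how the character is displayed, not a different character in the file.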

The practical answer is: no, the lexer should not flip brackets. The lexer operates on the stored sequence of code points, not on the visual rendering. If a Hebrew programmer types what visually looks like an opening bracket in their RTL editor, the editor is likely emitting U+0028 (or possibly U+0029, depending on keyboard layout and OS). The lexer should simply match the actual code points in the file.

However, there are important gotchas:

Keyboard layout variance: Hebrew keyboard layouts on Windows, macOS, and Linux may differ in which code point Shift+9/Shift+0 produce. The compiler author should test what major Hebrew layouts actually emit and document the expected behavior.
Editor behavior: Some editors apply "auto-pairing" logic that may swap bracket direction in RTL contexts. The source file on disk is what matters.
Mixed directionality: If the language allows embedded LTR identifiers (e.g., calling into English-named APIs), bracket semantics in mixed-direction lines become visually confusing. Clear documentation and good error messages will be essential.
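Unicode has script-specific brackets: U+FD3E ORNATE LEFT PARENTHESIS and U+FD3F ORNATE RIGHT PARENTHESIS are ornate parentheses used with Arabic script, and they are separate code points from U+0028 and U+0029. The language designer must decide whether to accept these as valid bracket tokens or reject them with a clear diagnostic.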
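To check the gotchas above empirically, one option is a small audit script that reports which bracket-like code points a piece of source text actually contains. This is a sketch under assumptions: the function name bracket_report, the ACCEPTED set (a grammar that accepts only ASCII parentheses), and the "accepted"/"rejected" labels are hypothetical choices, not something the question specifies.

```python
import unicodedata

# Assumption: the grammar accepts only the ASCII parentheses.
ACCEPTED = {"(", ")"}

# The ornate parentheses are listed explicitly so the check does not
# depend on which general category they are assigned.
ORNATE = {"\uFD3E", "\uFD3F"}


def bracket_report(text: str) -> list:
    """List every bracket-like character in text, in stored (logical) order."""
    report = []
    for index, ch in enumerate(text):
        # Ps and Pe are the general categories for opening and closing punctuation.
        if unicodedata.category(ch) in ("Ps", "Pe") or ch in ORNATE:
            status = "accepted" if ch in ACCEPTED else "rejected"
            report.append((index, f"U+{ord(ch):04X}", unicodedata.name(ch), status))
    return report


if __name__ == "__main__":
    sample = "\u05E7\u05E8\u05D0(\u05E9\u05DD)"
    for row in bracket_report(sample):
        print(row)
```

Typing the same expression on each Hebrew layout of interest, saving the file as UTF-8, and running the report on its contents would show which code points each layout and editor combination really writes to disk.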

The safest design is to define the grammar purely in terms of Unicode code points (U+0028 opens, U+0029 closes) and let the rendering be the editor's responsibility. This matches how bidirectional text is handled in general: text is stored in logical order, and the logical structure is direction-independent even if the visual presentation is mirrored.
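A lexer that follows this design can be sketched in a few lines. Everything here is an assumption made for illustration: the token names, the identifier rule (ASCII letters, digits, underscore, and the Hebrew letters U+05D0 to U+05EA), and the error message are not taken from the asker's compiler.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    pos: int


# The grammar is defined on code points only: U+0028 opens, U+0029 closes.
# Nothing in this table depends on the direction the text is displayed in.
TOKEN_RE = re.compile(
    r"(?P<LPAREN>\()"
    r"|(?P<RPAREN>\))"
    r"|(?P<IDENT>[A-Za-z_\u05D0-\u05EA][A-Za-z0-9_\u05D0-\u05EA]*)"
    r"|(?P<WS>\s+)"
)


def tokenize(source: str) -> Iterator[Token]:
    """Yield tokens in stored (logical) order, skipping whitespace."""
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            # Naming the code point helps when the offending character
            # looks identical to an accepted one on screen.
            raise SyntaxError(f"unexpected U+{ord(source[pos]):04X} at offset {pos}")
        if match.lastgroup != "WS":
            yield Token(match.lastgroup, match.group(), pos)
        pos = match.end()


if __name__ == "__main__":
    for token in tokenize("\u05E7\u05E8\u05D0(\u05E9\u05DD)"):
        print(token.kind, f"offset={token.pos}")
```

Because the error names the code point, a user who typed an ornate parenthesis, or whose editor inserted an invisible directional mark, gets a message that identifies the stored character and not its on-screen shape.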

The challenge: Separating the visual rendering of bidirectional text from the actual byte-level semantics a lexer must operate on, in a domain where almost no prior art exists for RTL-native programming languages.
