2026-04-27
Stack Overflow: View Question
Tags: string, compiler-construction, right-to-left, bidi
Score: 0 | Views: 48
The asker is building a compiler for a Hebrew-based programming language and has hit a subtle but fundamental question: when a user presses Shift+0 in an RTL text editor, the character visually appears as an opening bracket on the right side of the screen. But what Unicode code point is actually stored in the file? And does the lexer need to account for this?
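This is more interesting than it first appears because it sits at the intersection of three distinct layers that are easy to conflate: the key the user presses (keyboard layout), the code point stored in the file, and the glyph the renderer draws on screen.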
The key insight is that Unicode treats parentheses as a mirrored pair. U+0028 ( is LEFT PARENTHESIS and U+0029 ) is RIGHT PARENTHESIS, and both carry the Bidi_Mirrored property. When text is laid out with the Unicode Bidirectional Algorithm (UBA, UAX #9) and these characters end up in an RTL run, a compliant renderer draws the mirrored glyph, so U+0028 appears at the right-hand end of the bracketed text and looks like what an LTR reader would call a closing paren. But in the file, it's still U+0028.
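To check the mirroring property for yourself, here is a small sketch using Python's standard `unicodedata` module. The helper name `describe` is made up for illustration; it only reports what the Unicode character database says about a character and does no rendering.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point, Unicode name, bidi class, mirrored flag) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.bidirectional(ch),   # e.g. "ON" = Other Neutral, "R" = strong RTL
        bool(unicodedata.mirrored(ch)),  # True if the glyph is mirrored in RTL runs
    )


if __name__ == "__main__":
    # Parentheses plus one Hebrew letter (alef) for comparison.
    for ch in ("\u0028", "\u0029", "\u05D0"):
        print(describe(ch))
```

The parentheses are bidi-neutral and mirrored, while the Hebrew letter is strongly RTL and not mirrored. Nothing here changes the stored code point; mirroring is purely a display-time decision.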
The practical answer is: no, the lexer should not flip brackets. The lexer operates on the stored code points (decoded from the file's bytes), not on the visual rendering. If a Hebrew programmer types what visually looks like an opening bracket in their RTL editor, the editor is most likely storing U+0028 (or possibly U+0029, depending on keyboard layout and OS). The lexer should simply match the actual code points in the file.
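As a rough illustration of a lexer that matches stored code points and ignores direction entirely, here is a minimal sketch. The token names, the `tokenize` function, and the Hebrew identifiers are all hypothetical and are not taken from the asker's compiler. The sample source is built from escapes so that the logical order is unambiguous regardless of how an editor displays it.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# Tokens are defined by code point only; no bidi logic appears anywhere.
TOKEN_RE = re.compile(
    r"(?P<LPAREN>\()"            # U+0028 LEFT PARENTHESIS always opens
    r"|(?P<RPAREN>\))"           # U+0029 RIGHT PARENTHESIS always closes
    r"|(?P<IDENT>[^\W\d]\w*)"    # letters (including Hebrew) or underscore, then word chars
    r"|(?P<NUMBER>\d+)"
    r"|(?P<WS>\s+)"
)


def tokenize(source: str) -> Iterator[Token]:
    """Yield tokens in logical (storage) order, skipping whitespace."""
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected U+{ord(source[pos]):04X} at offset {pos}")
        if m.lastgroup != "WS":
            yield Token(m.lastgroup, m.group())
        pos = m.end()


if __name__ == "__main__":
    # Logical order: identifier, U+0028, identifier, U+0029.
    source = "\u05D4\u05D3\u05E4\u05E1" + "\u0028" + "\u05E9\u05DC\u05D5\u05DD" + "\u0029"
    for tok in tokenize(source):
        print(tok.kind, [f"U+{ord(c):04X}" for c in tok.text])
```

The token stream is the same whether the file is viewed in an LTR or an RTL editor, because the lexer never sees glyphs.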
However, there are important gotchas:
The safest design is to define the grammar purely in terms of Unicode code points (U+0028 opens, U+0029 closes) and leave rendering to the editor. This matches how RTL-aware programming environments generally work: the logical structure is direction-independent even if the visual presentation is mirrored.
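To make the "U+0028 opens, U+0029 closes" rule concrete, here is a small sketch of a bracket matcher that works only on code points in storage order. The function name `match_parens` and the error messages are illustrative assumptions, not part of any existing tool.

```python
OPEN = "\u0028"   # LEFT PARENTHESIS: always the opener in the grammar
CLOSE = "\u0029"  # RIGHT PARENTHESIS: always the closer in the grammar


def match_parens(source: str) -> list:
    """Return (open_offset, close_offset) pairs, in the order the pairs are closed.

    Offsets are positions in the stored string; display direction is never consulted.
    """
    stack = []
    pairs = []
    for offset, ch in enumerate(source):
        if ch == OPEN:
            stack.append(offset)
        elif ch == CLOSE:
            if not stack:
                raise SyntaxError(f"unmatched U+0029 at offset {offset}")
            pairs.append((stack.pop(), offset))
    if stack:
        raise SyntaxError(f"unclosed U+0028 at offset {stack[-1]}")
    return pairs


if __name__ == "__main__":
    # alef ( bet ( gimel ) ) in logical order
    nested = "\u05D0\u0028\u05D1\u0028\u05D2\u0029\u0029"
    print(match_parens(nested))
```

A source that stores U+0029 before U+0028 is rejected by this check no matter how it happens to be displayed, which is the point of keeping the grammar independent of rendering.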
