keydr Multilingual Dictionary + Keyboard Layout Internationalization Plan
Context
We currently use an English-only dictionary and an ASCII-centric adaptive model:
- Dictionary is hardcoded to assets/words-en.json in src/generator/dictionary.rs.
- Dictionary ingestion filters to ASCII lowercase only (is_ascii_lowercase).
- Transition table building (src/generator/transition_table.rs) skips non-ASCII words.
- Adaptive drill generation in src/app.rs builds lowercase filter from is_ascii_lowercase.
- Skill tree lowercase branch is fixed to English a-z frequency in src/engine/skill_tree.rs.
- Keyboard rendering/hit-testing logic has hardcoded row offsets and row count assumptions in src/ui/components/keyboard_diagram.rs and src/ui/components/stats_dashboard.rs.
Explicit product decision: clean break
This app is currently work-in-progress and has no real user base. We explicitly do not need to preserve old config/state/export compatibility for this change. If data must be recreated from scratch, that is acceptable.
Goals
- Add user-selectable dictionary language (default en) using keybr-provided dictionary files.
- Add user-selectable keyboard layout profiles for multiple languages.
- Ensure keyboard visualizations, explorer, and stats heatmaps render correctly for variable row shapes and non-English keycaps.
- Use a clean-break implementation with no backward-compatibility requirements.
- Maintain license compliance for newly imported dictionaries.
Non-goals (first delivery)
- Full IME/dead-key composition support.
- Full rewrite of adaptive model for every script from day one.
- Perfect locale-specific pedagogy for all languages in phase 1.
- Backward compatibility for old config/profile/export data.
Execution constraints (must be explicit before implementation)
- Unicode normalization policy: Use NFC as canonical storage/matching form for dictionary ingestion, generated text, keystroke comparison, and persisted stats keys. Do not use NFKC in phase 1 to avoid compatibility-fold surprises.
- Character equivalence policy: Equality is by normalized scalar sequence (NFC), not by glyph appearance. Composed/decomposed equivalents must compare equal after normalization.
- Clean-break schema cutover policy: This rollout uses hard reset semantics for old unscoped stats/profile files. On first run of the new schema version, old files are ignored (optionally archived with .legacy suffix); no partial migration path.
- Capability gating policy: Only language/layout pairs marked supported in the registry capability matrix are selectable in UI during phased rollout.
- Performance envelope policy: Keyboard geometry recomputation must be bounded and cached by profile key + render mode + viewport size.
Upstream data availability
keybr-content-words includes dictionaries for:
ar, be, cs, da, de, el, en, es, et, fa, fi, fr, he, hr, hu, it, ja, lt, lv, nb, nl, pl, pt, ro, ru, sl, sv, th, tr, uk
Recommended rollout strategy:
- Initial support for Latin-script languages first (en, de, es, fr, it, pt, nl, sv, da, nb, fi, pl, cs, ro, hr, hu, lt, lv, sl, et, tr).
- Later support for non-Latin scripts (el, ru, uk, be, ar, fa, he, ja, th) after script-specific input/model behavior is in place.
Key Architectural Decisions
1) Language Pack registry
Add a registry module (e.g. src/l10n/language_pack.rs) containing:
- language_key
- display_name
- script
- dictionary_asset_id
- supported_keyboard_layout_keys
- primary_letter_sequence (for ranked progression)
- starter_weights and optional vowel_set for generator fallback behavior
- support_level (full, experimental, blocked)
- normalization_form (phase 1 fixed to NFC)
- input_capabilities (for example direct_letters_only, needs_ime)
This becomes the single source of truth for language behavior.
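As a rough illustration of the registry shape, a minimal sketch follows. The field names come from the list above; the field types, the layout key us_qwerty, the helper find_language_pack, and the seed values (including both letter sequences) are placeholders assumed for this example, not decided data. Several fields (starter_weights, vowel_set, normalization_form, input_capabilities) are omitted for brevity.

```rust
/// Rollout status of a language pack (variant names taken from the plan).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SupportLevel {
    Full,
    Experimental,
    Blocked,
}

/// One registry entry. Field names follow the plan; the types are assumptions.
#[derive(Debug)]
pub struct LanguagePack {
    pub language_key: &'static str,
    pub display_name: &'static str,
    pub script: &'static str,
    pub dictionary_asset_id: &'static str,
    pub supported_keyboard_layout_keys: &'static [&'static str],
    pub primary_letter_sequence: &'static str,
    pub support_level: SupportLevel,
}

/// Illustrative seed data only; the letter sequences are placeholders that
/// the build-time frequency utility would replace.
pub static LANGUAGE_PACKS: &[LanguagePack] = &[
    LanguagePack {
        language_key: "en",
        display_name: "English",
        script: "Latin",
        dictionary_asset_id: "words-en",
        supported_keyboard_layout_keys: &["us_qwerty"], // hypothetical key
        primary_letter_sequence: "etaoinshrdlcumwfgypbvkjxqz",
        support_level: SupportLevel::Full,
    },
    LanguagePack {
        language_key: "de",
        display_name: "Deutsch",
        script: "Latin",
        dictionary_asset_id: "words-de",
        supported_keyboard_layout_keys: &["de_qwertz"],
        primary_letter_sequence: "enisratdhulcgmobwfkzvpjyxq", // placeholder
        support_level: SupportLevel::Experimental,
    },
];

/// Registry lookup; returns None for unknown keys instead of falling back.
pub fn find_language_pack(key: &str) -> Option<&'static LanguagePack> {
    LANGUAGE_PACKS.iter().find(|pack| pack.language_key == key)
}
```

Returning Option rather than a default pack keeps unknown keys visible to the caller, which fits the registry-backed validation described later in this plan.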
2) Runtime dictionary/generator rebuild is required
Changing dictionary_language must immediately take effect without restart.
Implement App::rebuild_language_assets(&mut self) that rebuilds:
- Dictionary
- TransitionTable
- any cached generator state derived from language assets
- focused-character transforms derived from language rules
- drill-generation allowlists that depend on language pack data
Call it whenever language or language-dependent layout changes in settings.
rebuild_language_assets must also refresh capitalization/case behavior inputs used by adaptive generation.
rebuild_language_assets invalidation contract (required):
- always invalidate and rebuild Dictionary and TransitionTable
- clear adaptive cross-drill dictionary history cache
- clear/refresh any cached language-specific focus mapping
- do not mutate in-progress drill text
- all newly generated drills after rebuild must use new language assets
3) Asset loading strategy: compile-time embedded assets
For Phase 1 scope, dictionaries will be embedded at compile-time (generated asset map + include_str!/equivalent), not runtime file discovery.
Rationale:
- deterministic packaging
- no runtime path resolution complexity
- simpler cross-platform behavior
Tradeoff: larger binary size, acceptable for this phase.
4) Transition table fallback strategy
TransitionTable::build_english() will be gated to language_key == "en" only.
For non-English languages:
- use dictionary-derived transition table only
- if sparse, degrade gracefully to simple dictionary sampling behavior rather than English fallback model
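The gating rule can be sketched as a small decision function. This assumes the English table is only consulted when the dictionary-derived table is sparse, and that sparseness is judged by a transition-count threshold; the names TransitionSource, choose_transition_source, and min_transitions are hypothetical and not part of the existing code.

```rust
/// Where drill generation gets its character transitions from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransitionSource {
    /// Table built from the active language's dictionary.
    DictionaryDerived,
    /// Prebuilt English table (what `TransitionTable::build_english()` provides).
    EnglishFallbackTable,
    /// Plain sampling of dictionary words, no transition model.
    DictionarySampling,
}

/// Picks a source. `min_transitions` is an assumed sparseness threshold.
pub fn choose_transition_source(
    language_key: &str,
    learned_transitions: usize,
    min_transitions: usize,
) -> TransitionSource {
    if learned_transitions >= min_transitions {
        TransitionSource::DictionaryDerived
    } else if language_key == "en" {
        // The English table is gated to English only.
        TransitionSource::EnglishFallbackTable
    } else {
        // Sparse non-English data degrades to sampling, never to English.
        TransitionSource::DictionarySampling
    }
}
```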
5) Keyboard geometry refactor strategy
Refactoring src/ui/components/keyboard_diagram.rs is substantial: it touches all render and hit-test paths.
Implement shared KeyboardGeometry computed once per render context and consumed by:
- compact/full/fallback renderers
- all key hit-testing paths
- shift hit-testing paths
No duplicate hardcoded offsets should remain.
Performance constraints for geometry:
- geometry cache key: (layout_key, render_mode, viewport_width, viewport_height)
- recompute only when cache key changes
- hit-testing must be O(number_of_keys) or better per event with no per-key allocation
- include a benchmark/smoke check to detect regressions in repeated render/hit-test loops
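One possible shape for the cache and the allocation-free hit test is sketched below. KeyboardGeometry is the name used in this plan; GeometryKey, RenderMode, KeyRect, GeometryCache, and the rebuilds counter are assumptions made for illustration, as is the use of u16 terminal-cell coordinates.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum RenderMode {
    Compact,
    Full,
    Fallback,
}

/// Cache key from the plan: layout, render mode, and viewport size.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct GeometryKey {
    pub layout_key: String,
    pub render_mode: RenderMode,
    pub viewport_width: u16,
    pub viewport_height: u16,
}

/// Screen rectangle of one key (hypothetical representation).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct KeyRect {
    pub ch: char,
    pub x: u16,
    pub y: u16,
    pub w: u16,
    pub h: u16,
}

#[derive(Debug, Default)]
pub struct KeyboardGeometry {
    pub keys: Vec<KeyRect>,
}

impl KeyboardGeometry {
    /// Linear scan over the keys: O(number_of_keys), no allocation.
    pub fn hit_test(&self, x: u16, y: u16) -> Option<char> {
        self.keys
            .iter()
            .find(|k| x >= k.x && x - k.x < k.w && y >= k.y && y - k.y < k.h)
            .map(|k| k.ch)
    }
}

/// Single-entry cache; `rebuilds` lets a smoke test detect needless recomputation.
#[derive(Debug, Default)]
pub struct GeometryCache {
    cached: Option<(GeometryKey, KeyboardGeometry)>,
    pub rebuilds: usize,
}

impl GeometryCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns the cached geometry, rebuilding only when the key changed.
    pub fn get_or_build(
        &mut self,
        key: &GeometryKey,
        build: impl FnOnce(&GeometryKey) -> KeyboardGeometry,
    ) -> &KeyboardGeometry {
        let stale = match &self.cached {
            Some((cached_key, _)) => cached_key != key,
            None => true,
        };
        if stale {
            let geometry = build(key);
            self.cached = Some((key.clone(), geometry));
            self.rebuilds += 1;
        }
        &self.cached.as_ref().expect("cache was filled above").1
    }
}
```

A single-entry cache is enough if only one keyboard is on screen at a time; a map keyed by GeometryKey would be the obvious extension if several render modes are shown together.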
6) Finger assignment source of truth
Finger assignment must be profile metadata, not inferred by QWERTY column heuristics.
Each keyboard profile defines finger mapping for each physical key position.
7) Stats isolation strategy
Stats are language-scoped and layout-scoped.
Adopt per-scope storage files (for example):
- key_stats_<language>_<layout>.json
- key_stats_ranked_<language>_<layout>.json
- optional scoped drill history files
No mixed-language key stats in a single store.
Profile/scoring scoping policy:
- skill_tree progress is language-scoped (at minimum by language_key).
- total_score, total_drills, streak_days, and best_streak remain global.
- ProfileData will separate global fields from language-scoped progression state.
Scoped-file discovery mechanism:
- registry-driven + current-config driven only
- app loads current scope directly and only enumerates scopes from supported language/layout registry pairs
- no unconstrained glob-based discovery of arbitrary stale files
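A minimal sketch of registry-driven file naming and enumeration, assuming the file-name patterns given above. The helper names scoped_stats_file and expected_stats_files are hypothetical, and the list of supported pairs would come from the registry rather than being passed in by hand.

```rust
/// Builds the per-scope stats file name from the plan's naming pattern.
pub fn scoped_stats_file(language_key: &str, layout_key: &str, ranked: bool) -> String {
    let prefix = if ranked { "key_stats_ranked" } else { "key_stats" };
    format!("{prefix}_{language_key}_{layout_key}.json")
}

/// Enumerates the files the app may touch, derived only from supported
/// (language_key, layout_key) pairs. Nothing is discovered by globbing.
pub fn expected_stats_files(supported_pairs: &[(&str, &str)]) -> Vec<String> {
    supported_pairs
        .iter()
        .flat_map(|&(language, layout)| {
            [
                scoped_stats_file(language, layout, false),
                scoped_stats_file(language, layout, true),
            ]
        })
        .collect()
}
```

Because names are generated from known pairs and never parsed back, underscores inside a layout key such as de_qwertz do not cause ambiguity.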
Import/export strategy for scoped stats:
- export bundles all supported scoped stats files present in the data dir
- each bundle entry includes explicit language_key and layout_key metadata
- import applies two-phase commit per scoped target file
- export/import also includes language-scoped skill_tree progress entries with language_key metadata
Atomicity requirements for scoped import:
- stage writes to <target>.tmp
- flush file contents (sync_all) before rename
- rename temp file onto target atomically where supported
- on any failure, remove temp file and keep existing target untouched
- no commit of partially imported scope bundles
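For a single scoped target file, the stage/flush/rename sequence could look like the sketch below, which uses only the standard library. The function names write_scope_atomically and temp_path_for are hypothetical. Committing a whole bundle would additionally need all files staged before any rename, which this sketch does not cover.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Returns `<target>.tmp`, appended to the full file name.
pub fn temp_path_for(target: &Path) -> PathBuf {
    let mut name = target.as_os_str().to_os_string();
    name.push(".tmp");
    PathBuf::from(name)
}

/// Stages `contents` next to `target`, flushes them to disk, then renames
/// the staged file over the target. On any failure the temp file is removed
/// and the existing target is left untouched.
pub fn write_scope_atomically(target: &Path, contents: &[u8]) -> io::Result<()> {
    let tmp = temp_path_for(target);
    let staged = (|| -> io::Result<()> {
        let mut file = File::create(&tmp)?;
        file.write_all(contents)?;
        // Flush data and metadata before the rename makes the file visible.
        file.sync_all()?;
        fs::rename(&tmp, target)
    })();
    if staged.is_err() {
        // Best-effort cleanup; the original error is what gets reported.
        let _ = fs::remove_file(&tmp);
    }
    staged
}
```

Keeping the temp file in the same directory as the target matters because a rename across filesystems is not atomic and may fail outright.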
8) Settings architecture
Current index-based settings handling is fragile.
Phase 1 includes a refactor from positional integer indices to enum/struct-based settings entries, done before multilingual controls are added.
Profile key validation must be registry-backed. Do not rely on KeyboardModel::from_name() fallback behavior.
Validation error taxonomy (typed, stable):
- UnknownLanguage
- UnknownLayout
- UnsupportedLanguageLayoutPair
- LanguageBlockedBySupportLevel
UI must show deterministic user-facing error text for each class (used by tests).
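A typed error with fixed display text might be sketched as follows. The four variant names are the ones listed above; the enum name ProfileValidationError, the payload fields, and the exact message wording are assumptions for illustration.

```rust
use std::fmt;

/// Validation failures for a (language, layout) selection.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProfileValidationError {
    UnknownLanguage(String),
    UnknownLayout(String),
    UnsupportedLanguageLayoutPair { language: String, layout: String },
    LanguageBlockedBySupportLevel(String),
}

impl fmt::Display for ProfileValidationError {
    /// Message text is illustrative; what matters is that each variant maps
    /// to exactly one stable string that tests can assert on.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::UnknownLanguage(key) => {
                write!(f, "Unknown dictionary language: {key}")
            }
            Self::UnknownLayout(key) => {
                write!(f, "Unknown keyboard layout: {key}")
            }
            Self::UnsupportedLanguageLayoutPair { language, layout } => {
                write!(f, "Layout {layout} is not supported for language {language}")
            }
            Self::LanguageBlockedBySupportLevel(key) => {
                write!(f, "Language {key} is not available yet")
            }
        }
    }
}

impl std::error::Error for ProfileValidationError {}
```

Routing all user-facing text through Display keeps the status line and the tests reading from the same source.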
In-progress drill behavior on language/layout change:
- language/layout changes rebuild assets immediately for future generation
- current in-progress drill text is not mutated mid-drill
- new language/layout applies on the next drill generation
9) Unicode handling architecture
Define one shared Unicode utility module used by dictionary ingestion, generators, and input matching:
- normalize all dictionary entries to NFC at load time
- normalize typed characters before comparison against expected text
- normalize persisted per-key identifiers before write/read
- provide helper tests for composed/decomposed equivalence (for example é vs e + ◌́)
10) Rollout capability matrix architecture
Add a single registry-backed capability matrix keyed by (language_key, layout_key):
- enabled: selectable and fully supported
- preview: selectable with warning banner
- disabled: visible but not selectable
Phase-gating must read this matrix in settings and selection screens; no ad-hoc checks.
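A minimal sketch of the matrix and the selectability rule, assuming a static table. The three states are taken from the list above; the seed rows, the layout keys us_qwerty and ru_jcuken, and the function names are hypothetical.

```rust
/// Capability states from the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Capability {
    Enabled,
    Preview,
    Disabled,
}

/// Illustrative seed rows: (language_key, layout_key, capability).
const CAPABILITY_MATRIX: &[(&str, &str, Capability)] = &[
    ("en", "us_qwerty", Capability::Enabled),
    ("de", "de_qwertz", Capability::Enabled),
    ("fr", "fr_azerty", Capability::Preview),
    ("ru", "ru_jcuken", Capability::Disabled),
];

/// Looks up a pair; None means the pair is not in the registry at all.
pub fn capability(language_key: &str, layout_key: &str) -> Option<Capability> {
    CAPABILITY_MATRIX
        .iter()
        .find(|(language, layout, _)| *language == language_key && *layout == layout_key)
        .map(|(_, _, cap)| *cap)
}

/// Enabled and preview pairs are selectable; disabled and unknown pairs are not.
pub fn is_selectable(language_key: &str, layout_key: &str) -> bool {
    matches!(
        capability(language_key, layout_key),
        Some(Capability::Enabled | Capability::Preview)
    )
}
```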
Phased Implementation
Phase 0: Data + compliance groundwork
Tasks
- Import selected dictionaries to assets/dictionaries/words-<lang>.json.
- Add sidecar license/provenance files for each imported dictionary.
- Update THIRD_PARTY_NOTICES.md with imported assets.
- Add validation script for dictionary manifest/checksums.
- Define language pack registry seed data (including temporary primary_letter_sequence values).
- Add support_level and capability-matrix seed entries for every language/layout pair.
- Add a build-time utility that derives letter frequency sequence from each dictionary (seed data source of truth; manual overrides allowed but documented).
- Write docs/unicode-normalization-policy.md (NFC/equivalence rules + examples).
Verification
- All imported dictionaries listed in third-party notices.
- Sidecar license/provenance file exists for each imported dictionary.
- Manifest validation script passes.
- Build-time frequency derivation utility emits reproducible output for seeded languages.
- Unicode policy doc exists and includes composed/decomposed test cases.
Phase 1: Settings and configuration foundation
Tasks
- Add dictionary_language to Config.
- Refactor settings implementation from raw indices to typed settings entries (enum/descriptor model).
- Add settings controls for:
  - dictionary language
  - canonical keyboard layout profile key
- Implement explicit invalid combination handling (reject with message), not silent fallback.
- Wire language/layout change actions to App::rebuild_language_assets(&mut self).
- Introduce clean-break schema/version update for config/profile/store formats with hard-reset behavior for old files.
- Replace from_name wildcard fallback paths with explicit lookup failure handling tied to registry validation.
- Update import/export schema and transaction flow for scoped stats bundles.
- Split profile persistence into global fields + language-scoped skill tree progress map.
- Enforce capability-matrix gating in settings/selectors (enabled/preview/disabled states).
- Add typed validation errors and stable user-facing status messages.
Code areas
- src/config.rs
- src/main.rs (settings UI rendering and input handling)
- src/app.rs (settings action handlers, rebuild trigger)
- src/store/schema.rs
- src/store/json_store.rs
Verification
- Unit tests for config defaults/validation.
- Unit tests for settings navigation/editing after index refactor.
- Runtime test: changing dictionary language updates generated drills without restart.
- Runtime test: invalid language/layout pair is rejected with visible error/status.
- Export/import test: scoped stats for multiple language/layout pairs round-trip correctly.
- Runtime test: changing language mid-drill preserves current drill text and applies new language on next drill.
- Schema cutover test: old-format files are ignored/archived and never partially loaded.
- UI test: disabled/preview capability-matrix entries render and behave correctly.
Phase 2: Dictionary, transition table, and generator internationalization
Tasks
- Refactor Dictionary::load(language_key) with embedded asset map.
- Remove ASCII-only filtering from dictionary ingestion and transition building.
- Extend phonetic.rs to remove English hardcoding:
  - replace hardcoded starter biases with language-pack starter data or derived frequencies
  - replace fallback "the" with language-aware fallback (for example: top dictionary word)
  - make vowel recovery optional/parameterized by language pack
  - remove is_ascii_lowercase focus filtering and rely on allowed-character logic
- Implement transition fallback policy:
  - build_english() only for English
  - non-English graceful degradation path without English fallback table
- Address adaptive and non-adaptive mode filters:
  - remove hardcoded ('a'..='z') filters in code/passage modes
  - use language-pack allowed sets where applicable
- Refactor capitalization pipeline to Unicode-aware behavior:
  - replace ASCII-only case checks/conversions in capitalize.rs
  - use Unicode case mapping and language-pack constraints
  - ensure non-ASCII letters (for example ä/Ä, é/É) are handled correctly
- Implement shared normalization utility and apply it consistently in:
  - dictionary load path
  - generated text comparison/matching paths
  - persisted key identity paths
- Multilingual audit checklist (required pass/fail):
  - rg -n "is_ascii" src/app.rs src/generator/*.rs has no unreviewed hits affecting multilingual behavior
  - every remaining is_ascii* hit has a documented justification comment or issue reference
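To illustrate the capitalization item, the sketch below uses the standard library's Unicode case mapping in place of ASCII-only conversion. The function name capitalize_first is hypothetical and is not the existing capitalize.rs API. The input is assumed to be NFC-normalized already.

```rust
/// Uppercases the first character of `word` using Unicode case mapping.
/// `char::to_uppercase` returns an iterator because one character can map
/// to several (for example the German sharp s).
pub fn capitalize_first(word: &str) -> String {
    let mut chars = word.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}
```

Two caveats are worth recording as language-pack constraints. A leading ß expands to SS, so a pack may prefer to exclude such words from capitalization drills. The standard library mapping is also locale-independent, so the Turkish dotted/dotless i distinction is not handled and would need a per-language override.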
Code areas
- src/generator/dictionary.rs
- src/generator/transition_table.rs
- src/generator/phonetic.rs
- src/generator/capitalize.rs
- src/app.rs (adaptive/code/passage filter construction)
Verification
- Unit tests for dictionary loading per supported language.
- Unit tests for transition table generation with non-English characters.
- Unit tests for phonetic fallback behavior per language pack.
- Unit tests for capitalization correctness on non-ASCII letters.
- Regression tests for English output quality.
- Unit tests for NFC normalization and composed/decomposed equivalence.
Phase 3: Keyboard layout profile system
Tasks
- Replace ad-hoc constructors with canonical keyboard profile registry.
- Add language-relevant profiles (de_qwertz, fr_azerty, etc.).
- Add profile metadata:
  - key rows and shifted/base pairs
  - geometry hints
  - modifier placement metadata
  - per-key finger assignments
- Remove legacy alias layer and enforce canonical profile keys.
- Evaluate src/keyboard/layout.rs usage:
  - if unused, delete it
  - otherwise fold it into the new profile registry without duplicate sources of truth
Code areas
- src/keyboard/model.rs
- src/keyboard/layout.rs
- src/keyboard/display.rs (if locale labels/short labels need extension)
- src/config.rs
Verification
- Unit tests for all canonical profile keys.
- Unit tests for profile completeness and unique key mapping.
- Unit tests for finger assignment coverage/consistency.
Phase 4: Keyboard visualization and hit-testing refactor
Tasks
- Implement shared KeyboardGeometry used by all keyboard rendering modes.
- Rewrite keyboard diagram rendering paths to use shared geometry.
- Rewrite all keyboard hit-testing paths to use shared geometry.
- Refactor stats dashboard keyboard heatmap/timing rendering to use profile geometry metadata.
- Ensure explorer and selection logic works for variable row counts and locale keycaps.
- Update sentinel boundary tests if new files must reference sentinel constants.
- Remove ASCII shift-display guards in keyboard rendering:
  - replace is_ascii_alphabetic()-based shifted display checks
  - use profile-defined shiftability (base != shifted or explicit shiftable set)
- Audit and replace ASCII-specific input-handling logic in main.rs:
  - caps-lock inference
  - depressed-key normalization
  - shift guidance and shifted-key detection in keyboard UI paths
- Add geometry cache and recompute guards keyed by (layout_key, render_mode, viewport) with benchmark coverage.
Code areas
- src/ui/components/keyboard_diagram.rs
- src/ui/components/stats_dashboard.rs
- src/main.rs keyboard explorer handlers
- src/main.rs input handling (handle_key, caps/shift logic, keyboard guidance/render helpers)
- src/app.rs explorer state/focus use
- src/keyboard/display.rs tests
Verification
- Snapshot/golden tests for compact/full/fallback rendering per profile.
- Hit-test roundtrip tests per profile.
- Manual keyboard explorer smoke tests for US + non-US profiles.
- Sentinel boundary tests pass with updated policy.
- Manual test: shifted rendering works for non-ASCII letter keys where profile defines shifted forms.
- Manual test: caps/shift guidance and depressed-key behavior are correct for non-ASCII key input.
- Benchmark/smoke test: repeated render + hit-test loops meet baseline without per-frame geometry rebuild when cache key is unchanged.
Phase 5: Skill tree and ranked progression internationalization
Tasks
- Replace fixed English lowercase progression with language-pack primary_letter_sequence.
- Replace hardcoded "lowercase as background" branch logic with language-pack primary-letter background behavior.
- Remove UI copy assumptions of "26 lowercase" and a-z.
- Ensure ranked gating uses language-pack readiness (sequence + profile support).
- Define letter-frequency derivation approach:
  - derive initial sequence from dictionary frequency data (build-time utility), not hand-curated long-term
- Milestone-copy audit checklist (required pass/fail):
  - grep for hardcoded milestone language in main.rs (26, a-z, A-Z, lowercase)
  - replace with language-pack-aware dynamic copy
  - add tests asserting copy adjusts with different sequence lengths
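The frequency-derived sequence could be computed along the lines of the sketch below. The function name derive_letter_sequence is hypothetical, the input is assumed to be NFC-normalized dictionary words, and every occurrence is weighted equally (a real utility might weight by word frequency if the dictionary provides it).

```rust
use std::collections::HashMap;

/// Ranks letters by how often they occur across `words`, most frequent first.
/// Ties are broken by code point so the output is reproducible across runs.
pub fn derive_letter_sequence<'a>(words: impl IntoIterator<Item = &'a str>) -> String {
    let mut counts: HashMap<char, usize> = HashMap::new();
    for word in words {
        for c in word.chars().filter(|c| c.is_alphabetic()) {
            // Lowercasing can yield more than one char; count each of them.
            for lower in c.to_lowercase() {
                *counts.entry(lower).or_insert(0) += 1;
            }
        }
    }
    let mut ranked: Vec<(char, usize)> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    ranked.into_iter().map(|(c, _)| c).collect()
}
```

The explicit tie-break matters because HashMap iteration order is not stable, and the verification step for the build-time utility requires reproducible output.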
Code areas
- src/engine/skill_tree.rs
- src/app.rs (focus/background/filter logic)
- src/main.rs (milestone/help copy)
Verification
- Tests for progression with multiple language sequences.
- Tests for background-branch selection correctness.
- Snapshot tests for milestone text across languages.
Phase 6: UX polish, test parameterization, and rollout
Tasks
- Add dedicated language/layout selector screens where needed.
  - Implemented in src/main.rs + src/app.rs with DictionaryLanguageSelect and KeyboardLayoutSelect.
- Add explicit support-matrix messaging for partially supported scripts.
  - Implemented in selector + settings UI copy in src/main.rs (preview/disabled state messaging).
- Add parameterized test helpers:
  - language-aware allowed key sets
  - expected progression counts
  - profile fixtures
  - Implemented via cross-language/layout fixtures and property tests in src/l10n/language_pack.rs, src/engine/skill_tree.rs, and src/ui/components/keyboard_diagram.rs.
- Document that Phase 2 may temporarily allow language/dictionary mismatch with keyboard visuals until Phase 3/4 is complete.
  - Add explicit note in docs that Phase 2 mismatch window is expected and resolved by Phase 4.
  - Implemented in docs/multilingual-rollout-notes.md.
- Add cross-language property tests:
  - key uniqueness per profile
  - hit-test round-trip invariants
  - progression monotonicity per language sequence
  - Implemented in src/keyboard/model.rs, src/ui/components/keyboard_diagram.rs, and src/engine/skill_tree.rs.
Code areas
- src/main.rs
- src/app.rs
- test modules across src/*
- docs/
Verification
- End-to-end manual flows for language switch + layout switch + drill generation + keyboard explorer + stats.
- Performance checks for embedded dictionary footprint and startup latency.
- Test suite passes with parameterized language/profile cases.
- Property/invariant tests pass for key uniqueness, hit-test round-trip, and progression monotonicity.
File-by-file Impact Matrix
Core config and app wiring
- src/config.rs
  - add dictionary_language and canonical keyboard_layout profile key validation
- src/app.rs
  - add rebuild_language_assets
  - remove ASCII-only filters and audit residual ASCII assumptions (rg is_ascii pass)
  - wire settings actions to runtime rebuild
- src/main.rs
  - refactor settings UI to typed entries
  - add/update selectors and error/status handling
  - audit/replace ASCII-specific input/caps/shift handling
Generators and adaptive engine
- src/generator/dictionary.rs
  - dynamic, language-aware load via embedded registry
- src/generator/transition_table.rs
  - non-ASCII support and explicit English-only fallback gating
- src/generator/phonetic.rs
  - remove hardcoded English starter/vowel/fallback assumptions
- src/generator/capitalize.rs
  - replace ASCII-only casing logic with Unicode-aware capitalization rules
Skill progression
- src/engine/skill_tree.rs
  - language-pack primary sequence
  - language-pack background branch behavior
Keyboard modeling and visualization
- src/keyboard/model.rs
  - canonical profile registry with per-key finger mapping
- src/keyboard/layout.rs
  - delete or fold into model registry
- src/ui/components/keyboard_diagram.rs
  - shared geometry + full hit-test rewrite
- src/ui/components/stats_dashboard.rs
  - geometry-driven keyboard heatmap/timing rendering
- src/keyboard/display.rs
  - sentinel boundary test updates as needed
Persistence/schema
- src/store/schema.rs
  - clean-break schema/version bump as needed
  - split profile data into global fields + language-scoped skill tree progress
- src/store/json_store.rs
  - scoped stats storage by language/layout
  - scoped file discovery based on supported registry pairs
  - export/import scoped bundle handling with language/layout metadata
  - export/import language-scoped skill tree progress entries
Assets/compliance/docs
- assets/dictionaries/*
- assets/dictionaries/*.license
- THIRD_PARTY_NOTICES.md
- docs/license-compliance.md
- docs/unicode-normalization-policy.md
Risks and mitigations
- Risk: Non-Latin scripts break assumptions in multiple modules.
- Mitigation: staged rollout by script; support matrix gating.
- Risk: Keyboard visualization regressions during geometry rewrite.
- Mitigation: shared geometry abstraction + dedicated hit-test/render tests.
- Risk: Clean-break schema reset discards local data.
- Mitigation: explicitly documented and accepted by product decision.
- Risk: Settings refactor increases short-term scope.
- Mitigation: do it early to avoid repeated index-cascade bugs.
- Risk: Embedded dictionary set increases binary size/startup memory.
- Mitigation: track size/startup metrics per release and switch to hybrid packaging if thresholds are exceeded.
Definition of Done
- Language switch updates dictionary-driven generation without restart.
- Keyboard profiles are canonical and language-aware; no legacy alias dependency.
- Keyboard diagram, explorer, and stats views are geometry-driven and correct for supported profiles.
- Ranked progression uses language-pack primary sequences and background logic.
- Code/passage/adaptive modes no longer depend on hardcoded a-z filters.
- Stats are isolated by language/layout scope.
- Skill tree progression is language-scoped while streak/score totals remain global.
- Third-party attributions and license sidecars cover all imported dictionary assets.
- Automated tests cover runtime rebuild, generator behavior, keyboard geometry/hit-testing, progression invariants, and parameterized language/profile cases.
- Unicode normalization policy is implemented and tested across ingestion, generation, input matching, and persisted stats keys.
- Clean-break schema cutover behavior is deterministic (hard-reset semantics) and covered by automated tests.
- Capability matrix gating is enforced consistently across settings/selectors and covered by UI/runtime tests.