N-gram Error Tracking for Adaptive Drill Selection

Context

keydr currently tracks typing errors at the single-character level only. The adaptive algorithm picks the weakest character by confidence score and biases drill text to include words containing that character. This misses transition difficulties -- sequences where individual characters are easy but the combination is hard (e.g., same-finger bigrams, awkward hand transitions). Research strongly supports that these transition effects are real and distinct from single-character difficulty.

Goal: Add bigram (n=2) and trigram (n=3) error tracking, with a redundancy detection formula that distinguishes genuine transition difficulties from errors that are just proxies for single-character weakness. Integrate problematic bigrams into the adaptive drill selection pipeline. Trigrams are tracked for observation only and not used for drill generation until empirically proven useful.


Research Summary

  1. N-gram tracking is genuinely novel -- No existing typing tutor does comprehensive n-gram error tracking with adaptive drill selection.

  2. Bigrams capture real, distinct information -- The 136M Keystrokes study (Dhakal et al., CHI 2018) found letter pairs typed by different hands are more predictive of speed than character repetitions. This cannot be inferred from single-char data.

  3. Motor chunking is real -- The motor cortex plans keystrokes in chunks, not individually. Single-character optimization misses this.

  4. Bigrams are the sweet spot -- Nearly all keyboard layout research focuses on bigrams. Trigrams likely offer diminishing returns.


Core Innovation: Redundancy Detection

The key question: "Is a high-error bigram just a proxy for a high-error character?"

Error Rate Estimation (Laplace-smoothed)

Raw error rates are unstable at low sample counts. All error rates use Laplace smoothing:

smoothed_error_rate(errors, samples) = (errors + 1) / (samples + 2)

This is a Bayesian estimate with a uniform prior (50% error rate) that gets pulled toward the true rate as samples accumulate. At 10 samples with 3 errors it yields 0.333 instead of raw 0.3 -- a small correction. At 2 samples with 2 errors it yields 0.75 instead of raw 1.0 -- a strong stabilizing pull exactly where raw rates are least trustworthy.
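A minimal sketch of the smoothing function in free-function form (the plan attaches equivalents to the stores):

/// Laplace-smoothed error rate: a 50% prior that converges to the raw
/// rate as samples accumulate.
fn smoothed_error_rate(errors: usize, samples: usize) -> f64 {
    (errors as f64 + 1.0) / (samples as f64 + 2.0)
}

#[test]
fn smoothing_behaves() {
    assert_eq!(smoothed_error_rate(0, 0), 0.5);               // no data: pure prior
    assert!((smoothed_error_rate(3, 10) - 1.0 / 3.0).abs() < 1e-9);
    assert!((smoothed_error_rate(2, 2) - 0.75).abs() < 1e-9); // raw would be 1.0
}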

Bigram Redundancy Formula

For bigram "ab" with characters a and b:

e_a = smoothed_error_rate(char_a.errors, char_a.samples)
e_b = smoothed_error_rate(char_b.errors, char_b.samples)
e_ab = smoothed_error_rate(bigram_ab.errors, bigram_ab.samples)

// Expected if errors on 'a' and 'b' were independent
expected_ab = 1.0 - (1.0 - e_a) * (1.0 - e_b)
redundancy_ab = e_ab / max(expected_ab, 0.01)
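A sketch of the same formula as a Rust free function over pre-smoothed rates:

/// Redundancy of bigram "ab": observed error rate over the rate expected
/// if the two keystrokes failed independently.
fn bigram_redundancy(e_a: f64, e_b: f64, e_ab: f64) -> f64 {
    let expected_ab = 1.0 - (1.0 - e_a) * (1.0 - e_b);
    e_ab / expected_ab.max(0.01)
}

For Example 2 below, bigram_redundancy(0.04, 0.05, 0.22) evaluates to 0.22 / 0.088 = 2.5.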

Trigram Redundancy Formula

For trigram "abc", redundancy is computed against BOTH individual chars AND constituent bigrams:

// Expected from chars alone (independence assumption)
expected_from_chars = 1.0 - (1.0 - e_a) * (1.0 - e_b) * (1.0 - e_c)

// Expected from bigrams (takes the max -- if either bigram explains the error, no trigram signal)
expected_from_bigrams = max(e_ab, e_bc)

// Use the higher expectation (harder to exceed = more conservative)
expected_abc = max(expected_from_chars, expected_from_bigrams)
redundancy_abc = e_abc / max(expected_abc, 0.01)

This ensures trigrams only flag as informative when NEITHER the individual characters NOR constituent bigrams explain the difficulty.
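The trigram variant, again sketched over pre-smoothed rates:

/// Trigram redundancy: observed rate over the most conservative expectation
/// derived from constituent characters and bigrams.
fn trigram_redundancy(e_chars: [f64; 3], e_ab: f64, e_bc: f64, e_abc: f64) -> f64 {
    let expected_from_chars =
        1.0 - (1.0 - e_chars[0]) * (1.0 - e_chars[1]) * (1.0 - e_chars[2]);
    let expected_from_bigrams = e_ab.max(e_bc);
    e_abc / expected_from_chars.max(expected_from_bigrams).max(0.01)
}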

Focus Eligibility (Stability-Gated)

An n-gram becomes eligible for focus only when ALL conditions hold:

  1. sample_count >= 20 -- minimum statistical reliability
  2. redundancy > 1.5 -- genuine transition difficulty, not a proxy
  3. redundancy_stable == true -- the redundancy score has been > 1.5 for the last 3 consecutive update checks (prevents focus flapping from noisy estimates)

The difficulty score for ranking eligible n-grams:

ngram_difficulty = (1.0 - confidence) * redundancy
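A sketch of the gate and score over a minimal subset of the NgramStat fields defined in Phase 1:

// Minimal subset of the Phase 1 NgramStat, for illustration.
struct NgramStat {
    sample_count: usize,
    confidence: f64,
    redundancy_streak: u8, // consecutive checks with redundancy > 1.5
}

impl NgramStat {
    /// All three eligibility conditions must hold.
    fn focus_eligible(&self, redundancy: f64) -> bool {
        self.sample_count >= 20 && redundancy > 1.5 && self.redundancy_streak >= 3
    }

    /// Ranking score; mastered n-grams (confidence > 1.0) score below zero.
    fn ngram_difficulty(&self, redundancy: f64) -> f64 {
        (1.0 - self.confidence) * redundancy
    }
}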

Worked Examples

Example 1 -- Proxy (should NOT focus): User struggles with 's'. e_s = 0.25, e_i = 0.03. Expected bigram "is" error: 1 - 0.75 * 0.97 = 0.273. Observed "is" error: 0.28. Redundancy: 0.28 / 0.273 = 1.03. This is ~1.0, confirming "is" errors are just 's' errors. Not eligible.

Example 2 -- Genuine difficulty (should focus): User is fine with 'e' and 'd' individually. e_e = 0.04, e_d = 0.05. Expected "ed" error: 1 - 0.96 * 0.95 = 0.088. Observed "ed" error: 0.22. Redundancy: 0.22 / 0.088 = 2.5. This exceeds 1.5 -- the "ed" transition is genuinely hard. Eligible for focus.

Example 3 -- Trigram vs bigram: e_t = 0.03, e_h = 0.04, e_e = 0.04, with bigram error rates e_th = 0.15 (a genuine bigram difficulty) and e_he = 0.04. Expected trigram "the" error from chars: 1 - 0.97 * 0.96 * 0.96 = 0.106. Expected from bigrams: max(0.15, 0.04) = 0.15. Observed "the" error: 0.16. Redundancy: 0.16 / 0.15 = 1.07. Not significant -- the "th" bigram already explains the trigram difficulty. Trigram NOT eligible.


Confidence Scale

NgramStat.confidence uses the same formula as KeyStat.confidence:

target_time_ms = 60000.0 / target_cpm    // 342.86ms at 175 CPM
confidence = target_time_ms / filtered_time_ms
  • confidence < 1.0: Slower than target (needs practice)
  • confidence == 1.0: Exactly at target speed
  • confidence > 1.0: Faster than target (mastered)

For n-grams, target_time_ms scales linearly with order: a bigram target is 2 * single_char_target, a trigram target is 3 * single_char_target. This is approximate but consistent.
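As a sketch:

/// Confidence for an n-gram of the given order (2 = bigram, 3 = trigram).
fn ngram_confidence(order: usize, filtered_time_ms: f64, target_cpm: f64) -> f64 {
    let single_char_target_ms = 60_000.0 / target_cpm; // 342.86ms at 175 CPM
    (single_char_target_ms * order as f64) / filtered_time_ms
}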


Hesitation Tracking

Hesitations indicate cognitive uncertainty even when the correct key is pressed. The threshold is relative to the user's rolling baseline:

hesitation_threshold = max(800.0, 2.5 * user_median_transition_ms)

Where user_median_transition_ms is the median of the user's last 200 inter-keystroke intervals across all drills. The 800ms absolute floor keeps the threshold from dropping so low for fast typists that ordinary pauses get flagged as hesitations. The 2.5x multiplier flags transitions notably slower than the user's norm.

user_median_transition_ms is stored as a single rolling value on the App struct, updated from per_key_times after each drill.
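A sketch of both pieces; the 200-entry buffer cap and App wiring are elided:

/// Threshold relative to the user's rolling baseline; the 800ms floor
/// guards fast typists, the 2.5x multiplier flags unusually slow transitions.
fn hesitation_threshold(user_median_transition_ms: f64) -> f64 {
    (2.5 * user_median_transition_ms).max(800.0)
}

/// Median of the rolling interval buffer (caller caps it at 200 entries).
/// An empty buffer yields 0.0, so the threshold falls back to the 800ms floor.
fn median_transition_ms(transition_buffer: &[f64]) -> f64 {
    let mut sorted = transition_buffer.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    match sorted.len() {
        0 => 0.0,
        n if n % 2 == 1 => sorted[n / 2],
        n => (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0,
    }
}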


N-gram Key Representation

N-gram keys use typed arrays instead of strings to avoid encoding/canonicalization issues:

#[derive(Clone, Debug, PartialEq, Eq, Hash, Serialize, Deserialize)]
pub struct BigramKey(pub [char; 2]);

#[derive(Clone, Debug, PartialEq, Eq, Hash, Serialize, Deserialize)]
pub struct TrigramKey(pub [char; 3]);

Normalization rules (applied at extraction boundary in extract_ngram_events):

  • All characters are Unicode scalar values (Rust char) -- no grapheme cluster handling needed since the app only supports ASCII typing
  • No case folding -- 'A' and 'a' are distinct (they require different motor actions: shift+a vs a)
  • Punctuation is included (transitions to/from punctuation are legitimate motor sequences)
  • BACKSPACE characters are filtered out before windowing
  • Space characters split windows (no cross-word-boundary n-grams)

Implementation

Phase 1: Core Data Structures & Extraction

New file: src/engine/ngram_stats.rs

  • BigramKey(pub [char; 2]) and TrigramKey(pub [char; 3]) -- typed keys with Hash/Eq/Serialize
  • NgramStat struct:
    • filtered_time_ms: f64 -- EMA-smoothed transition time (alpha=0.1)
    • best_time_ms: f64 -- personal best EMA time
    • confidence: f64 -- (target_time_ms * order) / filtered_time_ms
    • sample_count: usize -- total observations
    • error_count: usize -- total errors (mistype or hesitation)
    • hesitation_count: usize -- total hesitations specifically
    • recent_times: Vec<f64> -- last 30 observations
    • recent_correct: Vec<bool> -- last 30 correctness values
    • redundancy_streak: u8 -- consecutive updates where redundancy > 1.5 (for stability gate, max 255)
  • BigramStatsStore -- HashMap<BigramKey, NgramStat> (concrete, not generic)
    • update(&mut self, key: BigramKey, time_ms: f64, correct: bool, hesitation: bool)
    • get_confidence(&self, key: &BigramKey) -> f64
    • smoothed_error_rate(&self, key: &BigramKey) -> f64 -- Laplace-smoothed
    • redundancy_score(&self, key: &BigramKey, char_stats: &KeyStatsStore) -> f64
    • weakest_bigram(&self, char_stats: &KeyStatsStore, unlocked: &[char]) -> Option<(BigramKey, f64)> -- stability-gated
  • TrigramStatsStore -- HashMap<TrigramKey, NgramStat> (concrete, not generic)
    • Same update/query methods as BigramStatsStore
    • prune(&mut self, max_entries: usize) -- composite utility pruning (see below)
  • Internal: shared helper functions/trait for the common EMA update logic to avoid duplication between bigram and trigram stores
  • BigramEvent / TrigramEvent structs -- { key, total_time_ms, correct, has_hesitation }
  • extract_ngram_events(per_key_times: &[KeyTime], hesitation_threshold: f64) -> (Vec<BigramEvent>, Vec<TrigramEvent>) -- single pass, returns both orders
  • FocusTarget enum -- Char(char) | Bigram(BigramKey) -- lives in src/engine/ngram_stats.rs, re-exported from src/engine/mod.rs

Note: KeyStatsStore needs a new method smoothed_error_rate(key: char) -> f64 to provide Laplace-smoothed error rates. This requires error tracking on KeyStat, which currently only records timing for correct keystrokes. Add error_count: usize and total_count: usize fields, increment them from the keystroke path (update_key() for correct strokes, the new update_key_error() for errors), and mark both #[serde(default)] for backward-compatible deserialization.

Modify: src/engine/key_stats.rs (additive)

  • Add error_count: usize and total_count: usize to KeyStat with #[serde(default)]
  • Add update_key_error(&mut self, key: char) -- increments error/total counts without updating timing
  • Add smoothed_error_rate(&self, key: char) -> f64 -- Laplace-smoothed

Modify: src/engine/mod.rs (additive) -- add pub mod ngram_stats, re-export FocusTarget

Extraction detail: For bigram "th", transition time = window[1].time_ms. For trigram "the", transition time = window[1].time_ms + window[2].time_ms. The first element's time_ms is the transition FROM the previous character and is NOT part of this n-gram.
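A sketch of the bigram half of this pass, assuming a KeyTime shape with key, time_ms, and correct fields and BACKSPACE encoded as '\u{8}' (both assumptions -- the real KeyTime lives in the engine). The trigram pass is analogous: windows(3) with total_time_ms = w[1].time_ms + w[2].time_ms.

// Shapes assumed for illustration only.
struct KeyTime { key: char, time_ms: f64, correct: bool }
struct BigramEvent { key: [char; 2], total_time_ms: f64, correct: bool, has_hesitation: bool }

fn extract_bigram_events(per_key_times: &[KeyTime], hesitation_threshold: f64) -> Vec<BigramEvent> {
    // BACKSPACE is filtered out before windowing.
    let kept: Vec<&KeyTime> = per_key_times.iter().filter(|kt| kt.key != '\u{8}').collect();
    let mut events = Vec::new();
    // Space splits windows: no cross-word-boundary n-grams.
    for word in kept.split(|kt| kt.key == ' ') {
        for w in word.windows(2) {
            // w[0].time_ms is the transition FROM the previous character,
            // so only w[1].time_ms belongs to this bigram.
            let transition = w[1].time_ms;
            events.push(BigramEvent {
                key: [w[0].key, w[1].key],
                total_time_ms: transition,
                correct: w[0].correct && w[1].correct,
                has_hesitation: transition > hesitation_threshold,
            });
        }
    }
    events
}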

Phase 2: Persistence (Replay-Only, No Caching)

Architecture: drill_history (lesson_history.json) is the sole source of truth. N-gram stats are always rebuilt from drill history on startup. There are no separate n-gram cache files in this initial implementation. This eliminates all cache coherency concerns at the cost of ~200-500ms startup replay. Caching can be added later as an optimization if rebuild latency becomes problematic.

Modify: src/store/schema.rs (additive)

  • Add concrete BigramStatsData { stats: BigramStatsStore } with Default impl
  • Add concrete TrigramStatsData { stats: TrigramStatsStore } with Default impl
  • These types are used for export/import serialization only, not for runtime caching

Modify: src/app.rs (additive + modify existing)

  • Add 4 fields to App: bigram_stats, ranked_bigram_stats, trigram_stats, ranked_trigram_stats
  • Add user_median_transition_ms: f64 and transition_buffer: Vec<f64> (rolling last 200 intervals)
  • On startup: rebuild all n-gram stats + hesitation baseline by replaying drill_history
  • save_data(): no n-gram files to save (stats are always derived)

Trigram pruning: Max 5,000 entries. Prune by composite utility score after history replay:

utility = recency_weight * (1.0 / (drills_since_last_seen + 1))
        + signal_weight * redundancy_score.min(3.0)
        + data_weight * (sample_count as f64).ln()

Where recency_weight=0.3, signal_weight=0.5, data_weight=0.2. Entries with highest utility are kept. This preserves rare-but-informative trigrams over frequent-but-noisy ones.
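As a sketch (drills_since_last_seen is a per-entry recency counter this plan implies but does not name as a field; sample_count >= 1 for any stored entry, so ln() never sees zero):

/// Composite utility for trigram pruning; weights from this plan.
fn prune_utility(drills_since_last_seen: usize, redundancy_score: f64, sample_count: usize) -> f64 {
    0.3 * (1.0 / (drills_since_last_seen as f64 + 1.0))
        + 0.5 * redundancy_score.min(3.0)
        + 0.2 * (sample_count as f64).ln()
}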

Phase 3: Drill Integration

Modify: src/app.rs -- finish_drill() (modify existing, after line 847)

  • Compute hesitation_threshold = max(800.0, 2.5 * self.user_median_transition_ms)
  • Call extract_ngram_events(&result.per_key_times, hesitation_threshold)
  • Update bigram_stats and trigram_stats with each event
  • For incorrect keystrokes: also call self.key_stats.update_key_error(kt.key) to build char-level error counts
  • Same pattern for ranked stats in the ranked block (after line 854)
  • Update transition_buffer and recompute user_median_transition_ms

Modify: src/app.rs -- finish_partial_drill() -- same pattern

Hesitation baseline rebuild: During startup history replay, also accumulate transition times into transition_buffer to rebuild user_median_transition_ms. This ensures the hesitation threshold is consistent across restarts.

Phase 4: Adaptive Focus Selection (Bigram Only)

The focus pipeline uses a thin adapter at the App boundary rather than changing generator signatures directly. This minimizes cross-cutting risk.

Modify: src/app.rs -- generate_text() (modify existing, line 628)

// Adapter: compute focus target, then decompose into existing generator knobs
let focus_target = select_focus_target(
    &self.skill_tree, scope, &self.ranked_key_stats, &self.ranked_bigram_stats
);

let (focused_char, focused_bigram) = match &focus_target {
    FocusTarget::Char(ch) => (Some(*ch), None),
    FocusTarget::Bigram(key) => (Some(key.0[0]), Some(key.clone())),
};

// Existing generators take an optional lowercase focus char, unchanged
let lowercase_focused_char = focused_char.map(|c| c.to_ascii_lowercase());
let mut text = generator.generate(&filter, lowercase_focused_char, word_count);
// ... existing capitalize/punctuate/numbers pipeline unchanged ...

// After all generation: if bigram focus, swap some words for bigram-containing words
if let Some(ref bigram) = focused_bigram {
    text = self.apply_bigram_focus(&text, &filter, bigram);
}

New method on App: apply_bigram_focus()

  • Scans generated words, replaces up to 40% with dictionary words containing the target bigram
  • Only replaces when suitable alternatives exist and pass the CharFilter
  • Maintains word count and approximate text length
  • Diversity cap: No more than 3 consecutive bigram-focused words to prevent repetitive feel

This approach keeps ALL existing generator APIs unchanged. If the adapter proves insufficient (e.g., bigram-focused words are too rare in dictionary), we can widen generator APIs in a follow-up.
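A minimal sketch of the replacement loop; the dictionary lookup (find_matching plus CharFilter) is abstracted into a pre-filtered candidates slice:

// Sketch only: `candidates` stands in for a find_matching-style lookup,
// already restricted to words containing the bigram and passing the CharFilter.
fn apply_bigram_focus(text: &str, candidates: &[String]) -> String {
    let words: Vec<&str> = text.split_whitespace().collect();
    let budget = (words.len() as f64 * 0.40).floor() as usize; // replace up to 40%
    let mut replaced = 0;
    let mut consecutive = 0; // diversity cap: at most 3 focused words in a row
    let mut out = Vec::with_capacity(words.len());
    for (i, &w) in words.iter().enumerate() {
        if replaced < budget && consecutive < 3 && !candidates.is_empty() {
            // A real implementation would also match word length to keep
            // the text length roughly stable.
            out.push(candidates[i % candidates.len()].as_str());
            replaced += 1;
            consecutive += 1;
        } else {
            out.push(w);
            consecutive = 0;
        }
    }
    out.join(" ")
}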

Focus selection logic (new function select_focus_target() in src/engine/ngram_stats.rs):

  1. Compute weakest single character via existing focused_key()
  2. Compute weakest eligible bigram via weakest_bigram() (stability-gated: sample >= 20, redundancy > 1.5 for 3 consecutive checks)
  3. If bigram ngram_difficulty > char_difficulty * 0.8, focus on bigram
  4. Otherwise, fall back to single-char focus
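A sketch of this rule, with the store and skill-tree plumbing reduced to the two (target, difficulty) pairs the comparison needs:

#[derive(Clone, Debug)]
pub struct BigramKey(pub [char; 2]);

pub enum FocusTarget {
    Char(char),
    Bigram(BigramKey),
}

/// Inputs are the pairs produced by focused_key() and the
/// stability-gated weakest_bigram().
fn select_focus_target(
    weakest_char: Option<(char, f64)>,
    weakest_bigram: Option<(BigramKey, f64)>,
) -> Option<FocusTarget> {
    match (weakest_char, weakest_bigram) {
        (Some((_, char_diff)), Some((bg, bg_diff))) if bg_diff > char_diff * 0.8 => {
            Some(FocusTarget::Bigram(bg))
        }
        (Some((ch, _)), _) => Some(FocusTarget::Char(ch)),
        (None, Some((bg, _))) => Some(FocusTarget::Bigram(bg)),
        (None, None) => None,
    }
}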

Phase 5: Information Gain Analysis (Trigram Observation)

Add to src/engine/ngram_stats.rs:

pub fn trigram_marginal_gain(
    trigram_stats: &TrigramStatsStore,
    bigram_stats: &BigramStatsStore,
    char_stats: &KeyStatsStore,
) -> f64

Computes what fraction of trigrams with >= 20 samples have redundancy > 1.5 vs their constituent bigrams. Returns a value in [0.0, 1.0].

  • Called every 50 drills, result logged to a trigram_gain_history: Vec<f64> on the App
  • If the most recent 3 measurements all show gain > 10%, trigrams could be promoted to active focus (future work)
  • This metric is primarily for analysis -- it answers "are trigrams adding value beyond bigrams for this user?"
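A sketch over simplified inputs; the real function walks TrigramStatsStore and computes redundancy against the other two stores:

/// Fraction of well-sampled trigrams whose errors are not explained by
/// their constituent bigrams. Inputs reduced to (sample_count, redundancy)
/// pairs for illustration.
fn trigram_marginal_gain(trigrams: &[(usize, f64)]) -> f64 {
    let well_sampled: Vec<&(usize, f64)> =
        trigrams.iter().filter(|t| t.0 >= 20).collect();
    if well_sampled.is_empty() {
        return 0.0;
    }
    let informative = well_sampled.iter().filter(|t| t.1 > 1.5).count();
    informative as f64 / well_sampled.len() as f64
}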

Phase 6: Export/Import

Modify: src/store/schema.rs (additive) -- add n-gram fields to ExportData with #[serde(default)]

Modify: src/store/json_store.rs (additive) -- update export_all() to serialize n-gram stats from memory; import_all() routes imported data through the drill_history replay pipeline


Performance Budgets

| Operation | Budget | Notes |
| --- | --- | --- |
| N-gram extraction per drill | < 1ms | Linear scan of ~200-500 keystrokes |
| Stats update per drill | < 1ms | ~400 bigram + ~300 trigram hash map inserts |
| Focus selection | < 5ms | Iterate all bigrams (~2K), filter + rank |
| History replay (full rebuild) | < 500ms | Replay 500 drills x extraction + update (fixture: 500 drills, 300 keystrokes each) |
| Memory for n-gram stores | < 5MB | ~3K bigrams + 5K trigrams x ~200 bytes each |

Benchmark tests enforce extraction (<1ms for 500 keystrokes), update (<1ms for 400 events), and focus selection (<5ms for 3K bigrams) budgets.


Files Summary

| File | Action | Breaking? | What Changes |
| --- | --- | --- | --- |
| src/engine/ngram_stats.rs | New | No | All n-gram structs, extraction, redundancy formula, FocusTarget, focus selection |
| src/engine/mod.rs | Modify | No (additive) | Add pub mod ngram_stats, re-export FocusTarget |
| src/engine/key_stats.rs | Modify | No (additive) | Add error_count/total_count to KeyStat with #[serde(default)], add smoothed_error_rate() |
| src/store/schema.rs | Modify | No (additive) | BigramStatsData/TrigramStatsData types, ExportData update with #[serde(default)] |
| src/store/json_store.rs | Modify | No (additive) | Export/import n-gram data |
| src/app.rs | Modify | No (internal) | App fields, finish_drill() n-gram extraction, generate_text() adapter + apply_bigram_focus(), startup replay |
| src/generator/dictionary.rs | Unchanged | - | Existing find_matching used as-is via adapter |
| src/generator/phonetic.rs | Unchanged | - | Existing API used as-is via adapter |

Verification

  1. Unit tests for extract_ngram_events -- verify bigram/trigram extraction from known keystroke sequences, BACKSPACE filtering, space-boundary skipping, hesitation detection at threshold boundary
  2. Unit tests for redundancy_score -- the 3 worked examples above as test cases, plus edge cases (zero samples, all errors, no errors)
  3. Unit tests for Laplace smoothing -- verify convergence behavior at low and high sample counts
  4. Unit tests for stability gate -- verify redundancy_streak increments/resets correctly, focus eligibility requires 3 consecutive hits
  5. Deterministic integration tests for focus selection -- seed SmallRng with fixed seed, verify tie-breaking behavior between char and bigram focus, verify fallback when no bigrams are eligible
  6. Regression test -- verify existing single-character focus works unchanged when no bigrams have sufficient samples (cold start path)
  7. Benchmark tests (non-blocking, #[bench] or criterion):
    • Extraction: < 1ms for 500 KeyTime entries
    • Update: < 1ms for 400 bigram events
    • Focus selection: < 5ms for 3,000 bigram entries
    • History replay: < 500ms for 500 drills of 300 keystrokes each
  8. Manual test -- deliberately mistype a specific bigram repeatedly, verify it becomes the focus target and subsequent drills contain words with that bigram

Future Considerations (Not in Scope)

  • N-gram cache files for faster startup if replay latency becomes problematic (hybrid append-only cursor approach)
  • Per-order empirical confidence targets instead of linear scaling (calibrate from user data, log diagnostics)
  • Bigram placement control in phonetic generator (prefix/medial/suffix weighting) if adapter approach proves insufficient
  • Trigram-driven focus if marginal gain metric consistently shows > 10% incremental value