Increase adaptive drill word diversity

This commit is contained in:
2026-02-26 21:33:16 +00:00
parent 54ddebf054
commit 3ef433404e
3 changed files with 570 additions and 15 deletions

View File

@@ -0,0 +1,90 @@
# Adaptive Drill Word Diversity
## Context
When adaptive drills focus on characters/bigrams with few matching dictionary words, the same words repeat excessively both within and across drills. Currently:
- **Within-drill dedup** uses a sliding window of only 4 words — too small when the matching word pool is small
- **Cross-drill**: no tracking at all — each drill creates a fresh `PhoneticGenerator` with no memory of previous drills
- **Dictionary vs phonetic is binary**: if `matching_words >= 15` use dictionary only, if `< 15` use phonetic only. A pool of 16 words gets 100% dictionary (lots of repeats), while 14 gets 0% dictionary
## Changes
### 1. Cross-drill word history
Add `adaptive_word_history: VecDeque<HashSet<String>>` to `App` that tracks words from the last 5 adaptive drills. Pass a flattened `HashSet<String>` into `PhoneticGenerator::new()`.
**Word normalization**: Capture words from the generator output *before* capitalization/punctuation/numbers post-processing (the `generator.generate()` call in `generate_text()` produces lowercase-only text). This means words in history are always lowercase ASCII with no punctuation — no normalization function needed since the generator already guarantees this format.
**`src/app.rs`**:
- Add `adaptive_word_history: VecDeque<HashSet<String>>` to `App` struct, initialize empty
- In `generate_text()`, before creating the generator: flatten history into `HashSet` and pass to constructor
- After `generator.generate()` returns (before capitalization/punctuation): `split_whitespace()` into a `HashSet`, push to history, pop front if `len > 5`
**Lifecycle/reset rules**:
- Clear `adaptive_word_history` when `drill_mode` changes away from `Adaptive` (i.e., switching to Code/Passage mode)
- Clear when `drill_scope` changes (switching between branches or global/branch)
- Do NOT persist across app restarts — session-local only (it's a `VecDeque`, not serialized)
- Do NOT clear on gradual key unlocks — as the skill tree progresses one key at a time, history should carry over to maintain cross-drill diversity within the same learning progression
- The effective "adaptive context key" is `(drill_mode, drill_scope)` — history clears when either changes. Other parameters (focus char, focus bigram, filter) change naturally within a learning progression and should not trigger resets
- This prevents cross-contamination between unrelated drill contexts while preserving continuity during normal adaptive flow
**`src/generator/phonetic.rs`**:
- Add `cross_drill_history: HashSet<String>` field to `PhoneticGenerator`
- Update constructor to accept it
- In `pick_tiered_word()`, use weighted suppression instead of hard exclusion:
- When selecting a candidate word, if it's in within-drill `recent`, always reject
- If it's in `cross_drill_history`, accept it with reduced probability based on pool coverage:
- Guard: if pool is empty, skip suppression logic entirely (fall through to phonetic generation in hybrid mode)
- `history_coverage = cross_drill_history.intersection(pool).count() as f64 / pool.len() as f64`
- `accept_prob = 0.15 + 0.60 * history_coverage` (range: 15% when history covers few pool words → 75% when history covers most of the pool)
- This prevents over-suppression in small pools where history covers most words, while still penalizing repeats in large pools
- Scale attempt count to `pool_size.clamp(6, 12)` with final fallback accepting any non-recent word
- Compute `accept_prob` once at the start of `generate()` alongside tier categorization (not per-attempt)
### 2. Hybrid dictionary + phonetic mode
Replace the binary threshold with a gradient that mixes dictionary and phonetic words.
**`src/generator/phonetic.rs`**:
- Change constants: `MIN_REAL_WORDS = 8` (below: phonetic only), add `FULL_DICT_THRESHOLD = 60` (above: dictionary only)
- Calculate `dict_ratio` as linear interpolation: `(count - 8) / (60 - 8)` clamped to `[0.0, 1.0]`
- In the word generation loop, for each word: roll against `dict_ratio` to decide dictionary vs phonetic
- Tier categorization still happens when `count >= MIN_REAL_WORDS` (needed for dictionary picks)
- Phonetic words also participate in the `recent` dedup window (already handled since all words push to `recent`)
### 3. Scale within-drill dedup window
Replace the fixed window of 4 with a window proportional to the **filtered dictionary match count** (the `matching_words` vec computed at the top of `generate()`):
- `pool_size <= 20`: window = `pool_size.saturating_sub(1).max(4)`
- `pool_size > 20`: window = `(pool_size / 4).min(20)`
- In hybrid mode, this is based on the dictionary pool size regardless of phonetic mixing — phonetic words add diversity naturally, so the window governs dictionary repeat pressure
### 4. Tests
All tests use seeded `SmallRng::seed_from_u64()` for determinism (existing pattern in codebase).
**Update existing tests**: Add `HashSet::new()` to `PhoneticGenerator::new()` constructor calls (3 tests).
**New tests** (all use `SmallRng::seed_from_u64()` for determinism):
1. **Cross-drill history suppresses repeats**: Generate drill 1 with seeded RNG and constrained filter (~20 matching words), collect word set. Generate drill 2 with same filter but different seed, no history — compute Jaccard index as baseline. Generate drill 2 again with drill 1's words as history — compute Jaccard index. Assert history Jaccard is at least 0.15 lower than baseline Jaccard (i.e., measurably less overlap). Use 100-word drills.
2. **Hybrid mode produces mixed output**: Use a filter that yields ~30 dictionary matches. Generate 500 words with seeded RNG. Collect output words and check against the dictionary match set. With ~30 matches, `dict_ratio ≈ 0.42`. Since the seed is fixed, the output is deterministic — the band of 25%-65% accommodates potential future seed changes rather than runtime variance. Assert dictionary word percentage is within this range, and document the actual observed value for the chosen seed in a comment.
3. **Boundary conditions**: With 5 matching words → assert 0% dictionary words (all phonetic). With 100+ matching words → assert 100% dictionary words. Seeded RNG.
4. **Weighted suppression graceful degradation**: Create a pool of 10 words with history containing 8 of them. Generate 50 words. Verify no panics, output is non-empty, and history words still appear (suppression is soft, not hard exclusion).
## Files to modify
- `src/generator/phonetic.rs` — core changes: hybrid mixing, cross-drill history field, weighted suppression in `pick_tiered_word`, dedup window scaling
- `src/app.rs` — add `adaptive_word_history` field, wire through `generate_text()`, add reset logic on mode/scope changes
- `src/generator/mod.rs` — no changes (`TextGenerator` trait signature unchanged for API stability; the `cross_drill_history` parameter is internal to `PhoneticGenerator`'s constructor, not the trait interface)
## Verification
1. `cargo test` — all existing and new tests pass
2. Manual test: start adaptive drill on an early skill tree branch (few unlocked letters, ~15-30 matching words). Run 5+ consecutive drills. Measure: unique words across 5 drills should be notably higher than before (target: >70% unique across 5 drills for pools of 20+ words)
3. Full alphabet test: with all keys unlocked, behavior should be essentially unchanged (dict_ratio ≈ 1.0, large pool, no phonetic mixing)
4. Scope change test: switch between branch drill and global drill, verify no stale history leaks

View File

@@ -280,6 +280,7 @@ pub struct App {
pub trigram_gain_history: Vec<f64>, pub trigram_gain_history: Vec<f64>,
pub current_focus: Option<FocusSelection>, pub current_focus: Option<FocusSelection>,
pub post_drill_input_lock_until: Option<Instant>, pub post_drill_input_lock_until: Option<Instant>,
adaptive_word_history: VecDeque<HashSet<String>>,
rng: SmallRng, rng: SmallRng,
transition_table: TransitionTable, transition_table: TransitionTable,
#[allow(dead_code)] #[allow(dead_code)]
@@ -432,6 +433,7 @@ impl App {
trigram_gain_history: Vec::new(), trigram_gain_history: Vec::new(),
current_focus: None, current_focus: None,
post_drill_input_lock_until: None, post_drill_input_lock_until: None,
adaptive_word_history: VecDeque::new(),
rng: SmallRng::from_entropy(), rng: SmallRng::from_entropy(),
transition_table, transition_table,
dictionary, dictionary,
@@ -711,10 +713,21 @@ impl App {
let table = self.transition_table.clone(); let table = self.transition_table.clone();
let dict = Dictionary::load(); let dict = Dictionary::load();
let rng = SmallRng::from_rng(&mut self.rng).unwrap(); let rng = SmallRng::from_rng(&mut self.rng).unwrap();
let mut generator = PhoneticGenerator::new(table, dict, rng); let cross_drill_history: HashSet<String> =
self.adaptive_word_history.iter().flatten().cloned().collect();
let mut generator =
PhoneticGenerator::new(table, dict, rng, cross_drill_history);
let mut text = let mut text =
generator.generate(&filter, lowercase_focused, focused_bigram, word_count); generator.generate(&filter, lowercase_focused, focused_bigram, word_count);
// Track words for cross-drill history (before capitalization/punctuation)
let drill_words: HashSet<String> =
text.split_whitespace().map(|w| w.to_string()).collect();
self.adaptive_word_history.push_back(drill_words);
if self.adaptive_word_history.len() > 5 {
self.adaptive_word_history.pop_front();
}
// Apply capitalization if uppercase keys are in scope // Apply capitalization if uppercase keys are in scope
let cap_keys: Vec<char> = all_keys let cap_keys: Vec<char> = all_keys
.iter() .iter()
@@ -1497,8 +1510,13 @@ impl App {
self.save_data(); self.save_data();
// Use adaptive mode with branch-specific scope // Use adaptive mode with branch-specific scope
let old_mode = self.drill_mode;
let old_scope = self.drill_scope;
self.drill_mode = DrillMode::Adaptive; self.drill_mode = DrillMode::Adaptive;
self.drill_scope = DrillScope::Branch(branch_id); self.drill_scope = DrillScope::Branch(branch_id);
if old_mode != DrillMode::Adaptive || old_scope != self.drill_scope {
self.adaptive_word_history.clear();
}
self.start_drill(); self.start_drill();
} }
@@ -1657,6 +1675,7 @@ impl App {
// Step 4: Start the drill // Step 4: Start the drill
self.code_download_attempted = false; self.code_download_attempted = false;
self.adaptive_word_history.clear();
self.drill_mode = DrillMode::Code; self.drill_mode = DrillMode::Code;
self.drill_scope = DrillScope::Global; self.drill_scope = DrillScope::Global;
self.start_drill(); self.start_drill();
@@ -1846,6 +1865,7 @@ impl App {
} }
} }
self.adaptive_word_history.clear();
self.drill_mode = DrillMode::Passage; self.drill_mode = DrillMode::Passage;
self.drill_scope = DrillScope::Global; self.drill_scope = DrillScope::Global;
self.start_drill(); self.start_drill();
@@ -2149,3 +2169,138 @@ fn insert_line_breaks(text: &str) -> String {
result result
} }
#[cfg(test)]
mod tests {
use super::*;
use crate::engine::skill_tree::BranchId;
#[test]
fn adaptive_word_history_clears_on_code_mode_switch() {
let mut app = App::new();
// App starts in Adaptive/Global; new() calls start_drill() which populates history
assert_eq!(app.drill_mode, DrillMode::Adaptive);
assert!(
!app.adaptive_word_history.is_empty(),
"History should be populated after initial adaptive drill"
);
// Use the real start_code_drill path. Pre-set language override to skip
// download logic and ensure it reaches the drill-start code path.
app.code_drill_language_override = Some("rust".to_string());
app.start_code_drill();
assert_eq!(app.drill_mode, DrillMode::Code);
assert!(
app.adaptive_word_history.is_empty(),
"History should clear when switching to Code mode via start_code_drill"
);
}
#[test]
fn adaptive_word_history_clears_on_passage_mode_switch() {
let mut app = App::new();
assert_eq!(app.drill_mode, DrillMode::Adaptive);
assert!(!app.adaptive_word_history.is_empty());
// Use the real start_passage_drill path. Pre-set selection override to
// skip download logic and use built-in passages.
app.config.passage_downloads_enabled = false;
app.passage_drill_selection_override = Some("builtin".to_string());
app.start_passage_drill();
assert_eq!(app.drill_mode, DrillMode::Passage);
assert!(
app.adaptive_word_history.is_empty(),
"History should clear when switching to Passage mode via start_passage_drill"
);
}
#[test]
fn adaptive_word_history_clears_on_scope_change() {
let mut app = App::new();
// Start in Adaptive/Global — drill already started in new()
assert_eq!(app.drill_scope, DrillScope::Global);
assert!(!app.adaptive_word_history.is_empty());
// Use start_branch_drill to switch from Global to Branch scope.
// This is the real production path for scope changes.
app.start_branch_drill(BranchId::Lowercase);
assert_eq!(app.drill_scope, DrillScope::Branch(BranchId::Lowercase));
assert_eq!(app.drill_mode, DrillMode::Adaptive);
// History was cleared by the Global->Branch scope change, then repopulated
// by the single start_drill call inside start_branch_drill.
assert_eq!(
app.adaptive_word_history.len(),
1,
"History should have exactly 1 entry after Global->Branch clear + new drill"
);
// Record history state, then switch to a different branch
let history_before = app.adaptive_word_history.clone();
app.start_branch_drill(BranchId::Capitals);
assert_eq!(app.drill_scope, DrillScope::Branch(BranchId::Capitals));
// History was cleared by scope change and repopulated with new drill words.
// New history should not contain the old drill's words.
let old_words: HashSet<String> = history_before.into_iter().flatten().collect();
let new_words: HashSet<String> = app
.adaptive_word_history
.iter()
.flatten()
.cloned()
.collect();
// After clearing, the new history has exactly 1 drill entry (the one just generated).
assert_eq!(
app.adaptive_word_history.len(),
1,
"History should have exactly 1 entry after scope-clearing branch switch"
);
// The new words should mostly differ from old (not a superset or continuation)
assert!(
!new_words.is_subset(&old_words) || new_words.is_empty(),
"New history should not be a subset of old history"
);
}
#[test]
fn adaptive_word_history_persists_within_same_context() {
let mut app = App::new();
// Adaptive/Global: run multiple drills, history should accumulate
let history_after_first = app.adaptive_word_history.len();
app.start_drill();
let history_after_second = app.adaptive_word_history.len();
assert!(
history_after_second > history_after_first,
"History should accumulate across drills: {} -> {}",
history_after_first,
history_after_second
);
assert!(
app.adaptive_word_history.len() <= 5,
"History should be capped at 5 drills"
);
}
#[test]
fn adaptive_word_history_not_cleared_on_same_branch_redrill() {
let mut app = App::new();
// Start a branch drill
app.start_branch_drill(BranchId::Lowercase);
let history_after_first = app.adaptive_word_history.len();
assert_eq!(history_after_first, 1);
// Re-drill the same branch via start_branch_drill — scope doesn't change,
// so history should NOT clear; it should accumulate.
app.start_branch_drill(BranchId::Lowercase);
assert!(
app.adaptive_word_history.len() > history_after_first,
"History should accumulate when re-drilling same branch: {} -> {}",
history_after_first,
app.adaptive_word_history.len()
);
}
}

View File

@@ -1,3 +1,5 @@
use std::collections::HashSet;
use rand::Rng; use rand::Rng;
use rand::rngs::SmallRng; use rand::rngs::SmallRng;
@@ -8,20 +10,32 @@ use crate::generator::transition_table::TransitionTable;
const MIN_WORD_LEN: usize = 3; const MIN_WORD_LEN: usize = 3;
const MAX_WORD_LEN: usize = 10; const MAX_WORD_LEN: usize = 10;
const MIN_REAL_WORDS: usize = 15; const MIN_REAL_WORDS: usize = 8;
const FULL_DICT_THRESHOLD: usize = 60;
pub struct PhoneticGenerator { pub struct PhoneticGenerator {
table: TransitionTable, table: TransitionTable,
dictionary: Dictionary, dictionary: Dictionary,
rng: SmallRng, rng: SmallRng,
cross_drill_history: HashSet<String>,
#[cfg(test)]
pub dict_picks: usize,
} }
impl PhoneticGenerator { impl PhoneticGenerator {
pub fn new(table: TransitionTable, dictionary: Dictionary, rng: SmallRng) -> Self { pub fn new(
table: TransitionTable,
dictionary: Dictionary,
rng: SmallRng,
cross_drill_history: HashSet<String>,
) -> Self {
Self { Self {
table, table,
dictionary, dictionary,
rng, rng,
cross_drill_history,
#[cfg(test)]
dict_picks: 0,
} }
} }
@@ -234,18 +248,33 @@ impl PhoneticGenerator {
char_indices: &[usize], char_indices: &[usize],
other_indices: &[usize], other_indices: &[usize],
recent: &[String], recent: &[String],
cross_drill_accept_prob: f64,
) -> String { ) -> String {
for _ in 0..6 { let max_attempts = all_words.len().clamp(6, 12);
for _ in 0..max_attempts {
let tier = self.select_tier(bigram_indices, char_indices, other_indices); let tier = self.select_tier(bigram_indices, char_indices, other_indices);
let idx = tier[self.rng.gen_range(0..tier.len())]; let idx = tier[self.rng.gen_range(0..tier.len())];
let word = &all_words[idx]; let word = &all_words[idx];
if recent.contains(word) {
continue;
}
if self.cross_drill_history.contains(word) {
if self.rng.gen_bool(cross_drill_accept_prob) {
return word.clone();
}
continue;
}
return word.clone();
}
// Fallback: accept any non-recent word from full pool
for _ in 0..all_words.len() {
let idx = self.rng.gen_range(0..all_words.len());
let word = &all_words[idx];
if !recent.contains(word) { if !recent.contains(word) {
return word.clone(); return word.clone();
} }
} }
// Fallback: accept any word from full pool all_words[self.rng.gen_range(0..all_words.len())].clone()
let idx = self.rng.gen_range(0..all_words.len());
all_words[idx].clone()
} }
fn select_tier<'a>( fn select_tier<'a>(
@@ -328,13 +357,46 @@ impl TextGenerator for PhoneticGenerator {
.iter() .iter()
.map(|s| s.to_string()) .map(|s| s.to_string())
.collect(); .collect();
let use_real_words = matching_words.len() >= MIN_REAL_WORDS; let pool_size = matching_words.len();
let use_dict = pool_size >= MIN_REAL_WORDS;
// Pre-categorize words into tiers for real-word mode // Hybrid ratio: linear interpolation between MIN_REAL_WORDS and FULL_DICT_THRESHOLD
let dict_ratio = if pool_size <= MIN_REAL_WORDS {
0.0
} else if pool_size >= FULL_DICT_THRESHOLD {
1.0
} else {
(pool_size - MIN_REAL_WORDS) as f64
/ (FULL_DICT_THRESHOLD - MIN_REAL_WORDS) as f64
};
// Scaled within-drill dedup window based on dictionary pool size
let dedup_window = if pool_size <= 20 {
pool_size.saturating_sub(1).max(4)
} else {
(pool_size / 4).min(20)
};
// Cross-drill history accept probability (computed once)
let cross_drill_accept_prob = if pool_size > 0 {
let pool_set: HashSet<&str> =
matching_words.iter().map(|s| s.as_str()).collect();
let history_in_pool = self
.cross_drill_history
.iter()
.filter(|w| pool_set.contains(w.as_str()))
.count();
let history_coverage = history_in_pool as f64 / pool_size as f64;
0.15 + 0.60 * history_coverage
} else {
1.0
};
// Pre-categorize words into tiers for dictionary picks
let bigram_str = focused_bigram.map(|b| format!("{}{}", b[0], b[1])); let bigram_str = focused_bigram.map(|b| format!("{}{}", b[0], b[1]));
let focus_char_lower = focused_char.filter(|ch| ch.is_ascii_lowercase()); let focus_char_lower = focused_char.filter(|ch| ch.is_ascii_lowercase());
let (bigram_indices, char_indices, other_indices) = if use_real_words { let (bigram_indices, char_indices, other_indices) = if use_dict {
let mut bi = Vec::new(); let mut bi = Vec::new();
let mut ci = Vec::new(); let mut ci = Vec::new();
let mut oi = Vec::new(); let mut oi = Vec::new();
@@ -356,21 +418,31 @@ impl TextGenerator for PhoneticGenerator {
let mut recent: Vec<String> = Vec::new(); let mut recent: Vec<String> = Vec::new();
for _ in 0..word_count { for _ in 0..word_count {
if use_real_words { let use_dict_word = use_dict && self.rng.gen_bool(dict_ratio);
if use_dict_word {
#[cfg(test)]
{
self.dict_picks += 1;
}
let word = self.pick_tiered_word( let word = self.pick_tiered_word(
&matching_words, &matching_words,
&bigram_indices, &bigram_indices,
&char_indices, &char_indices,
&other_indices, &other_indices,
&recent, &recent,
cross_drill_accept_prob,
); );
recent.push(word.clone()); recent.push(word.clone());
if recent.len() > 4 { if recent.len() > dedup_window {
recent.remove(0); recent.remove(0);
} }
words.push(word); words.push(word);
} else { } else {
let word = self.generate_phonetic_word(filter, focused_char, focused_bigram); let word = self.generate_phonetic_word(filter, focused_char, focused_bigram);
recent.push(word.clone());
if recent.len() > dedup_window {
recent.remove(0);
}
words.push(word); words.push(word);
} }
} }
@@ -394,6 +466,7 @@ mod tests {
table.clone(), table.clone(),
Dictionary::load(), Dictionary::load(),
SmallRng::seed_from_u64(42), SmallRng::seed_from_u64(42),
HashSet::new(),
); );
let focused_text = focused_gen.generate(&filter, Some('k'), None, 1200); let focused_text = focused_gen.generate(&filter, Some('k'), None, 1200);
let focused_count = focused_text let focused_count = focused_text
@@ -402,7 +475,7 @@ mod tests {
.count(); .count();
let mut baseline_gen = let mut baseline_gen =
PhoneticGenerator::new(table, Dictionary::load(), SmallRng::seed_from_u64(42)); PhoneticGenerator::new(table, Dictionary::load(), SmallRng::seed_from_u64(42), HashSet::new());
let baseline_text = baseline_gen.generate(&filter, None, None, 1200); let baseline_text = baseline_gen.generate(&filter, None, None, 1200);
let baseline_count = baseline_text let baseline_count = baseline_text
.split_whitespace() .split_whitespace()
@@ -425,6 +498,7 @@ mod tests {
table.clone(), table.clone(),
Dictionary::load(), Dictionary::load(),
SmallRng::seed_from_u64(42), SmallRng::seed_from_u64(42),
HashSet::new(),
); );
let bigram_text = bigram_gen.generate(&filter, None, Some(['t', 'h']), 1200); let bigram_text = bigram_gen.generate(&filter, None, Some(['t', 'h']), 1200);
let bigram_count = bigram_text let bigram_count = bigram_text
@@ -433,7 +507,7 @@ mod tests {
.count(); .count();
let mut baseline_gen = let mut baseline_gen =
PhoneticGenerator::new(table, Dictionary::load(), SmallRng::seed_from_u64(42)); PhoneticGenerator::new(table, Dictionary::load(), SmallRng::seed_from_u64(42), HashSet::new());
let baseline_text = baseline_gen.generate(&filter, None, None, 1200); let baseline_text = baseline_gen.generate(&filter, None, None, 1200);
let baseline_count = baseline_text let baseline_count = baseline_text
.split_whitespace() .split_whitespace()
@@ -453,7 +527,7 @@ mod tests {
let filter = CharFilter::new(('a'..='z').collect()); let filter = CharFilter::new(('a'..='z').collect());
let mut generator = let mut generator =
PhoneticGenerator::new(table, Dictionary::load(), SmallRng::seed_from_u64(42)); PhoneticGenerator::new(table, Dictionary::load(), SmallRng::seed_from_u64(42), HashSet::new());
let text = generator.generate(&filter, Some('k'), Some(['t', 'h']), 200); let text = generator.generate(&filter, Some('k'), Some(['t', 'h']), 200);
let words: Vec<&str> = text.split_whitespace().collect(); let words: Vec<&str> = text.split_whitespace().collect();
@@ -474,4 +548,240 @@ mod tests {
"Max consecutive repeats = {max_consecutive}, expected <= 3" "Max consecutive repeats = {max_consecutive}, expected <= 3"
); );
} }
#[test]
fn cross_drill_history_suppresses_repeats() {
let dictionary = Dictionary::load();
let table = TransitionTable::build_from_words(&dictionary.words_list());
// Use a filter yielding a pool above FULL_DICT_THRESHOLD so dict_ratio=1.0
// (all words are dictionary picks, maximizing history suppression signal).
// Focus on 'k' to constrain the effective tier pool further.
let allowed: Vec<char> = "abcdefghijklmn ".chars().collect();
let filter = CharFilter::new(allowed);
// Use 200-word drills for stronger statistical signal
let word_count = 200;
// Drill 1: generate words and collect the set
let mut gen1 = PhoneticGenerator::new(
table.clone(),
Dictionary::load(),
SmallRng::seed_from_u64(100),
HashSet::new(),
);
let text1 = gen1.generate(&filter, Some('k'), None, word_count);
let words1: HashSet<String> = text1.split_whitespace().map(|w| w.to_string()).collect();
// Drill 2 without history (baseline)
let mut gen2_no_hist = PhoneticGenerator::new(
table.clone(),
Dictionary::load(),
SmallRng::seed_from_u64(200),
HashSet::new(),
);
let text2_no_hist = gen2_no_hist.generate(&filter, Some('k'), None, word_count);
let words2_no_hist: HashSet<String> =
text2_no_hist.split_whitespace().map(|w| w.to_string()).collect();
let baseline_intersection = words1.intersection(&words2_no_hist).count();
let baseline_union = words1.union(&words2_no_hist).count();
let baseline_jaccard = baseline_intersection as f64 / baseline_union as f64;
// Drill 2 with history from drill 1
let mut gen2_with_hist = PhoneticGenerator::new(
table.clone(),
Dictionary::load(),
SmallRng::seed_from_u64(200),
words1.clone(),
);
let text2_with_hist = gen2_with_hist.generate(&filter, Some('k'), None, word_count);
let words2_with_hist: HashSet<String> =
text2_with_hist.split_whitespace().map(|w| w.to_string()).collect();
let hist_intersection = words1.intersection(&words2_with_hist).count();
let hist_union = words1.union(&words2_with_hist).count();
let hist_jaccard = hist_intersection as f64 / hist_union as f64;
// With seeds 100/200 and filter "abcdefghijklmn", 200-word drills:
// baseline_jaccard≈0.31, hist_jaccard≈0.13, reduction≈0.18
assert!(
baseline_jaccard - hist_jaccard >= 0.15,
"History should reduce overlap by at least 0.15: baseline_jaccard={baseline_jaccard:.3}, \
hist_jaccard={hist_jaccard:.3}, reduction={:.3}",
baseline_jaccard - hist_jaccard,
);
}
#[test]
fn hybrid_mode_produces_mixed_output() {
let dictionary = Dictionary::load();
let table = TransitionTable::build_from_words(&dictionary.words_list());
// Use a constrained filter to get a pool in the hybrid range (8-60).
let allowed: Vec<char> = "abcdef ".chars().collect();
let filter = CharFilter::new(allowed);
let matching: HashSet<String> = dictionary
.find_matching(&filter, None)
.iter()
.map(|s| s.to_string())
.collect();
let match_count = matching.len();
// Verify pool is in hybrid range
assert!(
match_count >= MIN_REAL_WORDS && match_count < FULL_DICT_THRESHOLD,
"Expected pool in hybrid range ({MIN_REAL_WORDS}-{FULL_DICT_THRESHOLD}), got {match_count}"
);
let mut generator = PhoneticGenerator::new(
table,
Dictionary::load(),
SmallRng::seed_from_u64(42),
HashSet::new(),
);
let text = generator.generate(&filter, None, None, 500);
let words: Vec<&str> = text.split_whitespace().collect();
let dict_count = words.iter().filter(|w| matching.contains(**w)).count();
let dict_pct = dict_count as f64 / words.len() as f64;
// dict_ratio = (22-8)/(60-8) ≈ 0.27. Phonetic words generated by
// the Markov chain often coincidentally match dictionary entries, so
// observed dict_pct exceeds the intentional dict_ratio.
// With seed 42 and filter "abcdef" (pool=22): observed dict_pct ≈ 0.59
assert!(
dict_pct >= 0.25 && dict_pct <= 0.65,
"Dict word percentage {dict_pct:.2} (count={dict_count}/{}, pool={match_count}) \
outside expected 25%-65% range",
words.len()
);
// Verify it's actually mixed: not all dictionary and not all phonetic
assert!(
dict_count > 0 && dict_count < words.len(),
"Expected mixed output, got dict_count={dict_count}/{}",
words.len()
);
}
#[test]
fn boundary_phonetic_only_below_threshold() {
let dictionary = Dictionary::load();
let table = TransitionTable::build_from_words(&dictionary.words_list());
// Very small filter — should yield < MIN_REAL_WORDS (8) dictionary matches.
// With pool < MIN_REAL_WORDS, use_dict=false so 0% intentional dictionary
// selections (the code never enters pick_tiered_word).
let allowed: Vec<char> = "xyz ".chars().collect();
let filter = CharFilter::new(allowed);
let matching: Vec<String> = dictionary
.find_matching(&filter, None)
.iter()
.map(|s| s.to_string())
.collect();
assert!(
matching.len() < MIN_REAL_WORDS,
"Expected < {MIN_REAL_WORDS} matches, got {}",
matching.len()
);
let mut generator = PhoneticGenerator::new(
table,
Dictionary::load(),
SmallRng::seed_from_u64(42),
HashSet::new(),
);
let text = generator.generate(&filter, None, None, 50);
let words: Vec<&str> = text.split_whitespace().collect();
assert!(
!words.is_empty(),
"Should generate non-empty output even with tiny filter"
);
// Verify the dictionary selection path was never taken (0 intentional picks).
// Phonetic words may coincidentally match dictionary entries, but the
// dict_picks counter only increments when the dictionary branch is chosen.
assert_eq!(
generator.dict_picks, 0,
"Below threshold: expected 0 intentional dictionary picks, got {}",
generator.dict_picks
);
}
#[test]
fn boundary_full_dict_above_threshold() {
let dictionary = Dictionary::load();
let table = TransitionTable::build_from_words(&dictionary.words_list());
// Full alphabet — should yield 100+ dictionary matches
let filter = CharFilter::new(('a'..='z').collect());
let matching: HashSet<String> = dictionary
.find_matching(&filter, None)
.iter()
.map(|s| s.to_string())
.collect();
assert!(
matching.len() >= FULL_DICT_THRESHOLD,
"Expected >= {FULL_DICT_THRESHOLD} matches, got {}",
matching.len()
);
// With pool >= FULL_DICT_THRESHOLD, dict_ratio=1.0 and gen_bool(1.0)
// always returns true, so every word goes through pick_tiered_word.
// All picks come from matching_words → 100% dictionary.
let mut generator = PhoneticGenerator::new(
table,
Dictionary::load(),
SmallRng::seed_from_u64(42),
HashSet::new(),
);
let text = generator.generate(&filter, None, None, 200);
let words: Vec<&str> = text.split_whitespace().collect();
let dict_count = words.iter().filter(|w| matching.contains(**w)).count();
assert_eq!(
dict_count,
words.len(),
"Above threshold: expected 100% dictionary words, got {dict_count}/{}",
words.len()
);
}
#[test]
fn weighted_suppression_graceful_degradation() {
let dictionary = Dictionary::load();
let table = TransitionTable::build_from_words(&dictionary.words_list());
// Use a small filter to get a small pool
let allowed: Vec<char> = "abcdefghijk ".chars().collect();
let filter = CharFilter::new(allowed);
let matching: Vec<String> = dictionary
.find_matching(&filter, None)
.iter()
.map(|s| s.to_string())
.collect();
// Create history containing most of the pool words (up to 8)
let history: HashSet<String> = matching.iter().take(8.min(matching.len())).cloned().collect();
let mut generator = PhoneticGenerator::new(
table,
Dictionary::load(),
SmallRng::seed_from_u64(42),
history.clone(),
);
let text = generator.generate(&filter, None, None, 50);
let words: Vec<&str> = text.split_whitespace().collect();
// Should not panic and should produce output
assert!(!words.is_empty(), "Should generate non-empty output");
// History words should still appear (suppression is soft, not hard exclusion)
let history_words_in_output: usize = words
.iter()
.filter(|w| history.contains(**w))
.count();
// With soft suppression, at least some history words should appear
// (they're accepted with reduced probability, not blocked)
assert!(
history_words_in_output > 0 || matching.len() > history.len(),
"History words should still appear with soft suppression, or non-history pool words used"
);
}
} }