Voice control and speech-to-text for the Linux desktop.
Dictation and voice-triggered actions. Audio processing runs locally.
> [!WARNING]
> Early Development: this project is in an early stage. Features may change, break, or be incomplete. Contributions and feedback are welcome.
KaleoCtrl captures speech through the microphone, converts it to text using a local whisper.cpp model, and either types the text into the active application or executes system commands. Transcription happens locally; no audio is sent to a network service.
KaleoCtrl is being developed and tested on the following system:
| Component | Specification |
|---|---|
| OS | EndeavourOS (Arch-based) — Kernel 6.19 |
| CPU | Intel Core i9-13900H (20 threads) |
| RAM | 64 GB DDR5 |
| GPU | Intel Iris Xe Graphics (RPL-P) — integrated |
| Display Server | Wayland (KDE Plasma) |
KaleoCtrl runs whisper.cpp with the Vulkan backend so inference can use the Intel Iris Xe integrated GPU. A dedicated NVIDIA or AMD GPU is not required for this setup.
The setup:
- Vulkan backend enabled via features = ["vulkan"]
- Vulkan (Intel open-source Mesa driver) provides the GPU compute layer
- Model: large-v3-turbo (809M params, GGML format), chosen as an accuracy/speed compromise

Limitation: Real-time simultaneous speech-to-text is not achievable on this hardware. The Intel Iris Xe iGPU lacks the compute power for true simultaneous transcription, so there is a noticeable delay between speaking and text output. For low-latency real-time STT, a dedicated GPU (e.g. NVIDIA with CUDA) is recommended.
Smaller models (small, quantized q5_0) are available for systems with less GPU memory or processing power and can reduce latency at the cost of accuracy.
Main dashboard showing current application state.
Values are polled every 2 seconds and also updated on backend events.

Core application settings. Changes are written to
A "Saved" indicator appears briefly after each change.

Voice keywords that map to actions. Editable per language; changes are written to the keyword JSON file.
Each entry shows the internal action on the left and the spoken keyword on the right. Click x to remove an entry; use the input row at the bottom to add a new mapping.

Mapping from spoken words to keyboard keys, plus an overview of operating modes.
Any key listed in the

The user said "keyboard": KaleoCtrl is waiting for a key name.
The orange box pulses while the app waits for the next spoken word, which will be interpreted as a key press.

The user said "enter": the Return key was pressed.
Total latency depends on hardware and the loaded model; on the reference setup above the prefix-plus-key sequence completes in a few seconds.
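
The prefix-plus-key sequence described above ("keyboard", then "enter") can be pictured as a small two-state machine. The following is only a sketch under assumptions: `KeyState`, `KeyOutcome`, and `step` are hypothetical names that do not come from the KaleoCtrl source, and the real implementation may handle timeouts, aliases, and unknown words differently.

```rust
use std::collections::HashMap;

/// Hypothetical listener state: idle, or waiting for a key name.
#[derive(Debug, Clone, Copy, PartialEq)]
enum KeyState {
    Idle,
    AwaitingKey,
}

/// What the sketch decides to do with one spoken word.
#[derive(Debug, PartialEq)]
enum KeyOutcome {
    /// The prefix was heard; the UI would show the pulsing box.
    Waiting,
    /// A key name was resolved, e.g. "enter" -> "Return".
    Press(String),
    /// The word was neither a prefix nor a known key.
    Ignored,
}

/// Advance the state machine by one spoken word.
/// `prefixes` stands in for key_prefix plus key_prefix_aliases,
/// `keys` for the spoken-word-to-key mapping.
fn step(
    state: &mut KeyState,
    word: &str,
    prefixes: &[&str],
    keys: &HashMap<&str, &str>,
) -> KeyOutcome {
    let word = word.trim().to_lowercase();
    match *state {
        KeyState::Idle if prefixes.contains(&word.as_str()) => {
            *state = KeyState::AwaitingKey;
            KeyOutcome::Waiting
        }
        KeyState::Idle => KeyOutcome::Ignored,
        KeyState::AwaitingKey => {
            // Assumption: one word is consumed, then the machine returns to idle.
            *state = KeyState::Idle;
            match keys.get(word.as_str()) {
                Some(key) => KeyOutcome::Press(key.to_string()),
                None => KeyOutcome::Ignored,
            }
        }
    }
}
```

Starting from `KeyState::Idle`, calling `step` with "keyboard" and then "enter" would yield `Waiting` followed by `Press("Return")`, given a key map that contains "enter" → "Return".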

      Microphone → Audio Capture (cpal) → Voice Activity Detection
                                    ↓
                          Streaming Transcriber
                            (whisper.cpp FFI)
                                    ↓
                            Killswitch check ─── (emergency stop, any mode)
                                    ↓
                  Command Planner (pure, unit-testable)
                     text + mode + keywords → Action
                                    ↓
                        Command Executor (trait)
                                    ↓
                ┌───────────────────┼───────────────────┐
                ↓                   ↓                   ↓
         System Command        Key Command       Text Injection
          (open, close,       (key prefix +     (type into active
           switch, etc.)       key name)         application)
                ↓                   ↓                   ↓
         System Actions      Key Press Sim.        Text Output
          (wmctrl, xdg)      (xdotool/wtype)     (wtype/xdotool)

The planner (commander::plan) is a pure function: same input produces the same output, with no side effects. The executor is behind a CommandExecutor trait, which allows the planner to be exercised with a mock executor in tests.

- Mode switch ("<name> mode desktop")
- Wake / sleep ("<name> wake up" / "<name> sleep")
- System commands ("open firefox", "close window")
- Key commands ("keyboard enter", "keyboard tab")
- Dictation commands ("new line", "delete word")

| Mode | Behavior |
|---|---|
| Desktop | System commands are active directly — "open firefox" works without prefix |
| Dictation | Speech is transcribed and typed. Commands require the assistant name as prefix (e.g. "Kaleo open firefox") |
| Terminal | Voice commands are sent to the terminal |
| Sleep | KaleoCtrl is paused. Only the wake phrase is recognized |
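
The planner/executor split described earlier, combined with the desktop and dictation rows of the table above, might look roughly like this. It is a sketch, not the project's code: the signature of `plan`, the `Action` variants, the `execute` method, and `MockExecutor` are all assumptions made for illustration. What desktop mode does with non-command speech is not stated here, so the sketch simply ignores it.

```rust
/// Hypothetical subset of the modes in the table above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Desktop,
    Dictation,
}

/// Hypothetical planner output; the real Action type may differ.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    System(String),
    TypeText(String),
    Ignore,
}

/// Pure planning step: the result depends only on the arguments.
fn plan(text: &str, mode: Mode, name: &str, commands: &[&str]) -> Action {
    let text = text.trim();
    // Detect and strip the assistant name if the utterance starts with it.
    let (first, rest) = text.split_once(' ').unwrap_or((text, ""));
    let (prefixed, body) = if first.eq_ignore_ascii_case(name) {
        (true, rest.trim_start())
    } else {
        (false, text)
    };
    let verb = body.split_whitespace().next().unwrap_or("").to_lowercase();
    let is_command = commands.contains(&verb.as_str());
    match (mode, is_command, prefixed) {
        // Desktop: commands work with or without the prefix.
        (Mode::Desktop, true, _) => Action::System(body.to_string()),
        // Dictation: commands require the assistant name as prefix.
        (Mode::Dictation, true, true) => Action::System(body.to_string()),
        // Dictation: everything else is typed verbatim.
        (Mode::Dictation, _, _) => Action::TypeText(text.to_string()),
        // Assumption: non-command speech in desktop mode is dropped.
        (Mode::Desktop, false, _) => Action::Ignore,
    }
}

/// Side effects live behind a trait so tests can swap in a double.
trait CommandExecutor {
    fn execute(&mut self, action: &Action);
}

/// Test double that records actions instead of touching the desktop.
#[derive(Default)]
struct MockExecutor {
    seen: Vec<Action>,
}

impl CommandExecutor for MockExecutor {
    fn execute(&mut self, action: &Action) {
        self.seen.push(action.clone());
    }
}
```

Because `plan` takes everything it needs as arguments, a test can call it with "open the file" in dictation mode and "Kaleo open firefox" in dictation mode and compare the two results directly, then hand the results to `MockExecutor` to check what would have been executed.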
Switch modes by saying: "<assistant_name> mode <mode_name>" — e.g. "Kaleo mode dictation"
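
As an illustration of the "<assistant_name> mode <mode_name>" pattern, here is a minimal parser sketch. The function name `parse_mode_switch` and the hard-coded English mode names are assumptions; in the project the mode names and the switch word are read from the per-language keyword file. Punctuation that a transcriber may add (commas, trailing periods) is not handled here.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Desktop,
    Dictation,
    Terminal,
    Sleep,
}

/// Returns the requested mode if the utterance has exactly the shape
/// "<assistant_name> <mode_word> <mode_name>", compared case-insensitively.
fn parse_mode_switch(utterance: &str, assistant_name: &str, mode_word: &str) -> Option<Mode> {
    let lower = utterance.to_lowercase();
    let mut words = lower.split_whitespace();
    if words.next()? != assistant_name.to_lowercase() {
        return None;
    }
    if words.next()? != mode_word.to_lowercase() {
        return None;
    }
    // Assumption: English mode names; real names come from the keyword file.
    let mode = match words.next()? {
        "desktop" => Mode::Desktop,
        "dictation" => Mode::Dictation,
        "terminal" => Mode::Terminal,
        "sleep" => Mode::Sleep,
        _ => return None,
    };
    // Reject trailing words so ordinary sentences are not misread.
    if words.next().is_some() {
        return None;
    }
    Some(mode)
}
```

With these assumptions, `parse_mode_switch("Kaleo mode dictation", "Kaleo", "mode")` returns `Some(Mode::Dictation)`, while the same phrase without the assistant name returns `None`.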
The assistant has a configurable name (default: "Kaleo"). This name acts as a global command prefix and is recognized in every mode:

- In sleep mode, only "<name> wake up" is recognized

This separates dictation from commands: in dictation mode, "open the file" is typed verbatim, while "Kaleo open firefox" is executed as a command.
If KaleoCtrl gets stuck waiting for a key, starts injecting unwanted text, or a hard stop is needed, say the killswitch phrase (default: "killswitch").
The phrase is checked before any other command parsing and is recognized in every mode, including sleep:
- Switches to sleep mode so commands do not fire if audio is restarted from the UI
- Emits a killswitch_triggered event so the UI can reflect the new state

The phrase is configured per language under killswitch_phrase in keywords_<lang>.json.
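
The ordering described above (killswitch first, everything else afterwards) could be sketched as a guard that runs before any other parsing. All names here (`State`, `Event`, `check_killswitch`) are hypothetical. Substring matching and the reset of a pending key wait are assumptions, not documented behavior.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Desktop,
    Dictation,
    Terminal,
    Sleep,
}

/// Stand-in for the event the UI would receive.
#[derive(Debug, PartialEq)]
enum Event {
    KillswitchTriggered,
}

/// Hypothetical slice of application state touched by the killswitch.
struct State {
    mode: Mode,
    awaiting_key: bool,
}

/// Runs before any other command parsing, in every mode.
/// Returns an event when the phrase was heard, otherwise None.
fn check_killswitch(text: &str, phrase: &str, state: &mut State) -> Option<Event> {
    // An empty phrase would match everything, so treat it as disabled.
    if phrase.trim().is_empty() {
        return None;
    }
    // Assumption: case-insensitive substring match on the transcript.
    if !text.to_lowercase().contains(&phrase.to_lowercase()) {
        return None;
    }
    // Assumption: a pending key wait is cancelled as part of the stop.
    state.awaiting_key = false;
    state.mode = Mode::Sleep;
    Some(Event::KillswitchTriggered)
}
```

A caller would invoke `check_killswitch` on each transcript and skip the planner entirely whenever it returns `Some(..)`.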
KaleoCtrl works with any language whisper.cpp can transcribe. Keywords are defined in per-language JSON files:
config/keywords_en.json # English keywords
config/keywords_de.json # German keywords
config/keywords_<lang>.json # Add your own
To add a new language, create a keyword file following the same schema and select it in settings. The default STT model (large-v3-turbo) covers the languages listed in its model card.
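
For illustration, resolving the keyword file for a selected language could look like the sketch below. The helper `keyword_file` is hypothetical, and the character check on the language code is an added assumption meant to keep the result inside the config directory; the application may resolve paths differently.

```rust
use std::path::{Path, PathBuf};

/// Builds "<config_dir>/keywords_<lang>.json" for a language code.
/// Returns None for codes that are empty or contain characters other
/// than ASCII letters, digits, '-' and '_' (an assumption of this sketch).
fn keyword_file(config_dir: &Path, lang: &str) -> Option<PathBuf> {
    let valid = !lang.is_empty()
        && lang
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_');
    if !valid {
        return None;
    }
    Some(config_dir.join(format!("keywords_{lang}.json")))
}
```

For example, `keyword_file(Path::new("config"), "de")` yields the path `config/keywords_de.json`, and a code such as "../etc" is rejected.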
| Component | Technology |
|---|---|
| Framework | Tauri v2 |
| Backend | Rust |
| Frontend | Svelte + CSS |
| STT Engine | whisper.cpp (via Rust FFI) |
| Default Model | whisper-large-v3-turbo (GGML, 809M params) |
| Audio Capture | CPAL |
| GPU Support | CUDA, Vulkan, Metal, SYCL |
| Config | JSON |
install.sh detects the distribution and installs the required packages:
# Prerequisites: git, curl
git clone https://github.com/Maik-0000FF/KaleoCtrl.git
cd KaleoCtrl
./install.sh
The script provides three modes:
- cargo tauri dev

Supported distributions: Arch/EndeavourOS/Manjaro, Ubuntu/Debian/Mint, Fedora/Nobara, openSUSE
After installation, open Settings > Model Manager in the app to download a whisper model.
A matching uninstall.sh is shipped alongside the installer:
./uninstall.sh
Three modes; each step is idempotent and requires interactive confirmation:
- Removes ~/.local/bin/kaleoctrl, icons, and desktop entry
- Removes models/, node_modules/, dist/, and src-tauri/target/ (each prompted individually)
- Removes config/ and the system packages installed by install.sh (with warnings)

The uninstall script does not modify the Rust toolchain or Node.js installation and does not delete any path without a confirmation prompt.
If you prefer to install dependencies yourself:
npm install # frontend dependencies
cargo tauri dev # development mode (hot reload)
cargo tauri build # production build
cargo check # type-check
cargo clippy # lint
cargo test # run tests
All config files live in config/:
config/settings.json
{
"assistant_name": "Kaleo",
"language": "de",
"stt_model": "large-v3-turbo",
"default_mode": "dictation"
}
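
A typed view of these four fields might look like the following sketch. `Settings` and `validate` are hypothetical names, the defaults simply mirror the example values above, and JSON loading is left out to keep the block free of external crates. Whether the application validates settings this way is not known.

```rust
/// Hypothetical in-memory mirror of config/settings.json.
#[derive(Debug, Clone, PartialEq)]
struct Settings {
    assistant_name: String,
    language: String,
    stt_model: String,
    default_mode: String,
}

impl Default for Settings {
    /// Mirrors the example file shown above.
    fn default() -> Self {
        Settings {
            assistant_name: "Kaleo".into(),
            language: "de".into(),
            stt_model: "large-v3-turbo".into(),
            default_mode: "dictation".into(),
        }
    }
}

impl Settings {
    /// Returns a list of problems; an empty list means the values look usable.
    fn validate(&self) -> Vec<String> {
        const MODES: [&str; 4] = ["desktop", "dictation", "terminal", "sleep"];
        let mut problems = Vec::new();
        if self.assistant_name.trim().is_empty() {
            problems.push("assistant_name must not be empty".to_string());
        }
        if !MODES.contains(&self.default_mode.as_str()) {
            problems.push(format!("unknown default_mode: {}", self.default_mode));
        }
        problems
    }
}
```

Checking `default_mode` against the four documented modes catches a typo in the file before it can leave the application in an undefined starting state.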
config/keywords_<lang>.json
Contains per-language definitions for:

- language: language code
- modes: mode names and descriptions (desktop, dictation, terminal, sleep)
- mode_switch: the word that triggers mode changes (e.g. "mode" / "modus")
- wake_phrase / sleep_phrase: wake/sleep the assistant
- killswitch_phrase: emergency stop, recognized in any mode (default: "killswitch")
- commands: system command keywords (open, close, switch, minimize, maximize, fullscreen, stop, undo)
- dictation: dictation control keywords (new_line, new_paragraph, delete_word, delete_sentence, select_all, copy, paste, cut)
- key_prefix: word that activates key-press mode (default: "key" in en, "keyboard" in de)
- key_prefix_aliases: additional words that also activate key-press mode
- keys: mapping from spoken word to keyboard key name (e.g. "enter" → "Return")
- initial_prompt and post-processing (not model fine-tuning)

This project is licensed under the PolyForm Noncommercial License 1.0.0.
You are free to use, modify, and share this software for any noncommercial purpose — personal use, research, education, hobby projects. Commercial use (including selling or embedding in commercial products) requires explicit permission from the author.