Voice control and speech-to-text for the Linux desktop.
Speak to type. Speak to command. Fully offline. Fully private.
Early Development — This project is in an early stage. Features may change, break, or be incomplete. Contributions and feedback are welcome.
KaleoCtrl turns your voice into actions on the Linux desktop. It captures speech through your microphone, converts it to text using a local whisper.cpp model, and either types the text into any active application or executes system commands — all without sending a single byte to the cloud.
KaleoCtrl is being developed and tested on the following system:
| Component | Specification |
|---|---|
| OS | EndeavourOS (Arch-based) — Kernel 6.19 |
| CPU | Intel Core i9-13900H (20 threads) |
| RAM | 64 GB DDR5 |
| GPU | Intel Iris Xe Graphics (RPL-P) — integrated |
| Display Server | Wayland (KDE Plasma) |
KaleoCtrl runs whisper.cpp with the Vulkan backend to leverage the Intel Iris Xe integrated GPU for inference. This means no dedicated NVIDIA/AMD GPU is required — the model runs accelerated on the iGPU that's already in your laptop.
The setup:
features = ["vulkan"]Intel open-source Mesa driver) provides the GPU compute layerlarge-v3-turbo (809M params, GGML format) — best balance of accuracy and speedLimitation: Real-time simultaneous speech-to-text is not achievable on this hardware. The Intel Iris Xe iGPU lacks the compute power for true simultaneous transcription — there is a noticeable delay between speaking and text output. For low-latency real-time STT, a dedicated GPU (e.g. NVIDIA with CUDA) is recommended.
Smaller models (small, quantized q5_0) are available for systems with less GPU memory or processing power and can reduce latency at the cost of accuracy.
**Dashboard.** The main dashboard shows real-time application state at a glance. All values update live: the status panel refreshes automatically every 2 seconds and reacts instantly to backend events.

**Settings.** Configure core application settings. All changes apply instantly — no save button, no restart. A "Saved" indicator briefly appears after each change to confirm the setting was applied.

**Keywords.** Define voice keywords that map to actions. Fully editable, per-language, auto-saved. Each entry shows the internal action on the left and the spoken keyword on the right. Click `x` to remove a mapping, or use the input row at the bottom to add new ones.

**Key Commands.** Map spoken words to keyboard keys and view operating modes. Key commands bridge the gap between voice and keyboard: any key that can be typed can be triggered by voice.

**Key command in action.** The user says "keyboard", and KaleoCtrl waits for a key name: the orange box pulses to clearly signal that the next spoken word will be interpreted as a key press. The user then says "enter", and the Return key is pressed. The entire flow (say "keyboard", wait for the prompt, say "enter") takes under 2 seconds. Any mapped key can be triggered this way.
```
Microphone → Audio Capture (cpal) → Voice Activity Detection
                      ↓
             Streaming Transcriber
              (whisper.cpp FFI)
                      ↓
              Speech Recognition
                      ↓
  ┌───────────────────┼───────────────────┐
  ↓                   ↓                   ↓
Command Parser    Key Command       Text Injection
(open, close,    (keyboard + key)   (type into active
 switch, etc.)                       application)
  ↓                   ↓                   ↓
System Actions    Key Press Sim.    Text Output
(wmctrl, xdg)    (xdotool/wtype)    (wtype/xdotool)
```
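The three-way split after speech recognition can be sketched as a small dispatcher. This is an illustrative reconstruction of the flow in the diagram, not KaleoCtrl's actual API — the type and function names (`VoiceAction`, `dispatch`) and the exact prefix words are assumptions:

```rust
/// The three downstream paths from the diagram above.
#[derive(Debug, PartialEq)]
enum VoiceAction {
    KeyCommand(String),    // "keyboard enter" -> simulate a key press
    SystemCommand(String), // "open firefox"   -> run a system action
    TextInjection(String), // anything else    -> type into the active app
}

/// Route a recognized transcript to one of the three paths.
fn dispatch(transcript: &str) -> VoiceAction {
    let t = transcript.trim().to_lowercase();
    if let Some(key) = t.strip_prefix("keyboard ") {
        VoiceAction::KeyCommand(key.to_string())
    } else if t.starts_with("open ") || t.starts_with("close ") || t.starts_with("switch ") {
        VoiceAction::SystemCommand(t)
    } else {
        VoiceAction::TextInjection(transcript.trim().to_string())
    }
}
```

In the real application the command words come from the per-language keyword files rather than being hard-coded.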
"<name> mode desktop")"<name> wake up" / "<name> sleep")"open firefox", "close window")"keyboard enter", "keyboard tab")"new line", "delete word")| Mode | Behavior |
|---|---|
| Desktop | System commands are active directly — say "open firefox" without any prefix |
| Dictation | Pure speech-to-text. Everything you say gets typed. Commands require the assistant name as prefix (e.g. "prometheus open firefox") |
| Terminal | Voice commands are sent to the terminal |
| Sleep | KaleoCtrl is paused. Only listens for the wake phrase |
Switch modes by saying: "<assistant_name> mode <mode_name>" — e.g. "prometheus mode dictation"
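The mode-switch phrase above maps naturally onto a tiny parser. A minimal sketch, assuming ASCII assistant names — the names (`Mode`, `parse_mode_switch`) are illustrative, not KaleoCtrl's internals:

```rust
/// The four operating modes from the table above.
#[derive(Debug, PartialEq)]
enum Mode {
    Desktop,
    Dictation,
    Terminal,
    Sleep,
}

/// Parse "<assistant_name> mode <mode_name>", e.g. "prometheus mode dictation".
/// Returns None if the transcript is not a mode-switch phrase.
fn parse_mode_switch(transcript: &str, assistant: &str) -> Option<Mode> {
    let t = transcript.trim().to_lowercase();
    let prefix = format!("{} mode ", assistant.to_lowercase());
    match t.strip_prefix(&prefix)? {
        "desktop" => Some(Mode::Desktop),
        "dictation" => Some(Mode::Dictation),
        "terminal" => Some(Mode::Terminal),
        "sleep" => Some(Mode::Sleep),
        _ => None,
    }
}
```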
You assign a custom name to your assistant (default: "prometheus"). This name acts as a global command prefix and works in every mode:
"<name> wake up" is recognizedThis prevents false triggers — in dictation mode, saying "open the file" just types that text, while "prometheus open firefox" executes the command.
KaleoCtrl supports any language that whisper.cpp can transcribe. Keywords are defined in per-language JSON files:
```
config/keywords_en.json      # English keywords
config/keywords_de.json      # German keywords
config/keywords_<lang>.json  # Add your own
```
To add a new language, create a keyword file following the same schema and select it in settings. The STT model (large-v3-turbo) supports 99+ languages natively.
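Resolving the right keyword file for the selected language could look like this. A sketch only — the English fallback is an assumption on my part, not documented behavior:

```rust
use std::path::{Path, PathBuf};

/// Pick the keyword file for `lang`, e.g. "de" -> config/keywords_de.json.
/// Falls back to the English file when no file exists for that language
/// (assumed behavior, for illustration).
fn keyword_file(config_dir: &Path, lang: &str) -> PathBuf {
    let candidate = config_dir.join(format!("keywords_{lang}.json"));
    if candidate.exists() {
        candidate
    } else {
        config_dir.join("keywords_en.json")
    }
}
```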
| Component | Technology |
|---|---|
| Framework | Tauri v2 |
| Backend | Rust |
| Frontend | Svelte + CSS |
| STT Engine | whisper.cpp (via Rust FFI) |
| Default Model | whisper-large-v3-turbo (GGML, 809M params) |
| Audio Capture | CPAL |
| GPU Support | CUDA, Vulkan, Metal, SYCL |
| Config | JSON |
The install script detects your distribution and handles everything:
```bash
# Prerequisites: git, curl
git clone https://github.com/Maik-0000FF/KaleoCtrl.git
cd KaleoCtrl
chmod +x install.sh
./install.sh
```
The script offers three installation modes, including a development mode (`cargo tauri dev`).

Supported distributions: Arch/EndeavourOS/Manjaro, Ubuntu/Debian/Mint, Fedora/Nobara, openSUSE
After installation, open Settings > Model Manager in the app to download a whisper model.
If you prefer to install dependencies yourself:
```bash
npm install        # frontend dependencies
cargo tauri dev    # development mode (hot reload)
cargo tauri build  # production build
cargo check        # type-check
cargo clippy       # lint
cargo test         # run tests
```
All config files live in config/:
`config/settings.json`:

```json
{
  "assistant_name": "prometheus",
  "language": "de",
  "stt_model": "large-v3-turbo",
  "default_mode": "dictation"
}
```
`config/keywords_<lang>.json` contains the per-language keyword definitions. Language behavior is tuned via `initial_prompt` and post-processing (not model fine-tuning).

This project is licensed under the PolyForm Noncommercial License 1.0.0.
You are free to use, modify, and share this software for any noncommercial purpose — personal use, research, education, hobby projects. Commercial use (including selling or embedding in commercial products) requires explicit permission from the author.