Welcome to the VUI repository! Here, you will find small conversational speech models designed to run directly on your device. Our models are lightweight yet powerful, enabling engaging interactions without needing extensive server resources.
To get started with VUI, you need to install the package. Use the following command:
```bash
pip install -e .
```
This command installs the VUI package in editable mode, allowing you to make changes and test them quickly.
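A quick way to confirm the install worked is to check that the package resolves and, optionally, that a GPU is visible. This sketch assumes the package installs a module named `vui` and depends on PyTorch; both are assumptions, so adjust if the repository differs:

```python
# Sanity-check the installation. Assumes the package exposes a `vui` module
# and that PyTorch is a dependency -- adjust if the repository differs.
import importlib.util

import torch

print("vui importable:", importlib.util.find_spec("vui") is not None)
print("CUDA available:", torch.cuda.is_available())  # a GPU is optional but speeds up inference
```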
Want to see VUI in action? You can try it out in the hosted Gradio demo, or run it locally with the following command:
```bash
python demo.py
```
This will start the demo, allowing you to interact with the models directly on your machine.
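To get a feel for what a demo like this wires together, here is a minimal, self-contained Gradio sketch with a placeholder synthesizer (a sine tone standing in for the real model). It is an illustration only; the repository's actual demo.py will differ:

```python
# Minimal Gradio text-to-audio sketch. The sine-tone "synthesizer" is a
# placeholder for a real VUI model; the actual demo.py will differ.
import numpy as np
import gradio as gr

SAMPLE_RATE = 22050

def synthesize(text: str):
    # Placeholder: roughly 0.1 s of a 440 Hz tone per input character.
    duration = max(1.0, len(text) * 0.1)
    t = np.linspace(0, duration, int(SAMPLE_RATE * duration), endpoint=False)
    wave = (0.2 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)
    return SAMPLE_RATE, wave  # Gradio's audio output accepts (rate, ndarray)

demo = gr.Interface(fn=synthesize, inputs="text", outputs="audio")

if __name__ == "__main__":
    demo.launch()
```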
VUI offers several models, each with unique capabilities:
- `Vui.BASE`: the foundational checkpoint, trained on 40,000 hours of audio conversations. It serves as a robust starting point for a wide range of applications.
- `Vui.ABRAHAM`: a single-speaker model that responds with context awareness, drawing on previous turns so conversations feel more natural.
- `Vui.COHOST`: a checkpoint with two speakers that converse with each other, simulating natural dialogue.
You can use the base model for voice cloning. It performs reasonably well, but don't expect perfection: the model was exposed to a comparatively small amount of audio and wasn't trained for an extended period, so you can achieve good results but may run into limitations.
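As a rough sketch of how selecting a checkpoint and synthesizing speech might look in code (the `Vui.from_pretrained` loader and `render` helper below are assumed names for illustration, not the repository's confirmed API; check the source for the real entry points):

```python
# Hypothetical usage sketch -- `Vui.from_pretrained` and `render` are assumed
# names, not confirmed API; consult the repository for the real entry points.
import torch
from vui.model import Vui          # assumed module path
from vui.inference import render   # assumed helper

model = Vui.from_pretrained(Vui.BASE)  # or Vui.ABRAHAM / Vui.COHOST
if torch.cuda.is_available():
    model = model.cuda()

waveform = render(model, "Hey, how are you doing today?")
```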
VUI is built on a LLaMA-based transformer architecture that predicts audio tokens, generating speech from the input it receives. This research aims to push the boundaries of what is possible with on-device conversational AI.
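To make "predicting audio tokens" concrete, here is a toy autoregressive loop in PyTorch: a model scores the next audio token given the tokens generated so far, and sampling repeats one token at a time. The tiny stand-in model below is purely illustrative, not VUI's architecture:

```python
# Toy autoregressive audio-token generation. A real system would use a
# LLaMA-style transformer over a learned audio-codec vocabulary, then
# decode the token sequence back to a waveform.
import torch
import torch.nn as nn

VOCAB = 1024  # size of the audio-token codebook (illustrative)
model = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))

tokens = torch.zeros(1, 1, dtype=torch.long)  # start token
for _ in range(20):                           # generate 20 audio tokens
    logits = model(tokens)[:, -1, :]          # score the next token given the prefix
    probs = torch.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    tokens = torch.cat([tokens, next_token], dim=1)

print(tokens.shape)  # (1, 21): audio tokens a codec decoder would turn into speech
```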
Here are some frequently asked questions regarding VUI:
What hardware was used for development?
The models were developed on two NVIDIA RTX 4090 GPUs; more details are available in the accompanying Twitter post.
Does the model hallucinate?
Yes, the model does hallucinate at times. Despite these challenges, it is the best version we could create with the resources available.
Why does Voice Activity Detection (VAD) slow things down?
VAD is essential for removing stretches of silence, which improves the overall quality of the interaction; the cost is that audio has to be buffered and analyzed before it moves on, which introduces some latency.
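A simple energy-based trimmer illustrates the trade-off (a simplified sketch, not the VAD VUI actually uses): audio must be split into frames and scored before anything downstream can proceed, and that buffering is where the latency comes from.

```python
# Energy-based silence trimming -- a simplified stand-in for a real VAD.
# Framing and scoring the audio before passing it on is what adds latency.
import numpy as np

def trim_silence(audio: np.ndarray, sr: int, frame_ms: int = 30, threshold: float = 1e-4) -> np.ndarray:
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)            # mean energy per frame
    return frames[energy > threshold].reshape(-1)  # keep only voiced frames

sr = 16000
audio = np.concatenate([np.zeros(sr), 0.1 * np.random.randn(sr)]).astype(np.float32)
print(len(audio), "->", len(trim_silence(audio, sr)))  # one second of silence dropped
```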
If you wish to cite this work, please use the following reference:
```bibtex
@software{vui_2025,
  author = {Coultas Blum, Harry},
  title  = {VUI},
  month  = {01},
  year   = {2025}
}
```
For the latest versions and updates, please visit the Releases section, where you can download the files you need to get started with the newest features.
If you encounter issues or have questions, feel free to reach out: open an issue in the repository and we will assist you as soon as possible.
Thank you for your interest in VUI! We hope you enjoy using our conversational speech models and find them useful for your projects.