AI-powered audio processing platform for multi-file audio analysis, cross-recording speaker tracking, and intelligent audio extraction. Transforms audio files into structured, searchable content using state-of-the-art ML models.
curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenAudio/main/setup-openaudio.sh | bash
The script downloads the compose files from GitHub, pulls pre-built images from DockerHub, and downloads the ML models via containers. No git clone, no local Python, no building.
Alternatively, clone the repository and run the setup script yourself:

git clone https://github.com/davidamacey/OpenAudio.git
cd OpenAudio
./setup-openaudio.sh
./openaudio.sh start
Access the application:
- Frontend: http://localhost:5473
- Backend API: http://localhost:5474
- Flower dashboard: http://localhost:5475
Option 1 - One-line install:

curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenAudio/main/setup-openaudio.sh | bash

Option 2 - Clone and run the setup script:

git clone https://github.com/davidamacey/OpenAudio.git
cd OpenAudio
./setup-openaudio.sh
Both options will:
- Create `.env` from the template with auto-generated secure passwords
- Create the required model and data directories
- Pull pre-built images and download the ML models

Manual setup, if you'd rather run the steps yourself:

# 1. Copy environment template
cp .env.example .env
# 2. Edit .env - at minimum set secure passwords for:
# POSTGRES_PASSWORD, MINIO_ROOT_PASSWORD, REDIS_PASSWORD,
# JWT_SECRET_KEY, ENCRYPTION_KEY
nano .env
# 3. Create directories
mkdir -p models/{huggingface,torch,beats,modelscope} data/{uploads,outputs,exports} logs backups
# 4. Start services
./openaudio.sh start
# 5. Download models
./scripts/download-models.sh
Note: MinIO buckets are automatically created by the backend on startup.
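The setup script auto-generates the secrets in step 2; if you are writing `.env` by hand, any strong random values work. A minimal Python sketch (variable names follow the template above; the exact length requirements are assumptions, so check `.env.example`):

```python
import secrets

# Random values for the secrets referenced in .env.example.
# The passwords accept any strong random string; the JWT and
# encryption keys are hex strings, as the template notes.
for key in ("POSTGRES_PASSWORD", "MINIO_ROOT_PASSWORD", "REDIS_PASSWORD"):
    print(f"{key}={secrets.token_urlsafe(24)}")
for key in ("JWT_SECRET_KEY", "ENCRYPTION_KEY"):
    print(f"{key}={secrets.token_hex(32)}")
```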
# Start services
./openaudio.sh start # Development mode
./openaudio.sh start prod # Production mode
./openaudio.sh start gpu # With GPU support
# Service management
./openaudio.sh stop # Stop all services
./openaudio.sh restart # Restart services
./openaudio.sh status # Show service status
./openaudio.sh logs [service] # View logs
./openaudio.sh health # Health check
# Database operations
./openaudio.sh db migrate # Run migrations
./openaudio.sh db backup # Create backup
./openaudio.sh db restore FILE # Restore from backup
./openaudio.sh db reset # Reset database (WARNING: deletes data)
# Development
./openaudio.sh shell backend # Open backend shell
./openaudio.sh shell db # Open database shell
./openaudio.sh build # Rebuild containers
./openaudio.sh rebuild SERVICE # Rebuild specific service
# Model management
./openaudio.sh download-models # Download ML models (via Docker)
# Image management (from cloned repo only)
./openaudio.sh push # Build + push images to DockerHub
./openaudio.sh push --version=0.1.0 # Tag with version
# Maintenance
./openaudio.sh clean # Clean up Docker resources
./openaudio.sh purge # Remove all data (destructive)
# Create a collection
curl -X POST "http://localhost:5474/api/v1/collections" \
-H "Content-Type: application/json" \
-d '{"name": "My Collection"}'
# Upload audio files
curl -X POST "http://localhost:5474/api/v1/collections/{collection_id}/files" \
-F "[email protected]" -F "[email protected]"
# Run full analysis pipeline on all files
curl -X POST "http://localhost:5474/api/v1/analysis/full?collectionId={collection_id}"
# Run speaker diarization on a single file
curl -X POST "http://localhost:5474/api/v1/analysis/diarize/{file_id}" \
-H "Content-Type: application/json" \
-d '{"extract_embeddings": true}'
# Separate audio with a text prompt
curl -X POST "http://localhost:5474/api/v1/analysis/separate/{file_id}" \
-H "Content-Type: application/json" \
-d '{"text_prompt": "singing voice"}'
# Search for events across a collection
curl -X POST "http://localhost:5474/api/v1/analysis/search?query=speech&collectionId={collection_id}"
# Export a time region
curl -X POST "http://localhost:5474/api/v1/export/region/{file_id}?start_time=10&end_time=30&format=mp3"
+------------------+
| Frontend |
| (SvelteKit) |
| :5473 |
+--------+---------+
|
v
+------------------+ +----------+---------+ +------------------+
| MinIO |<------------>| Backend API |<------------>| PostgreSQL |
| (Object Storage) | | (FastAPI) | | (Database) |
| :5478 | | :5474 | | :5476 |
+------------------+ +----------+---------+ +------------------+
|
+------------+-----------+-----------+------------+
| | | |
v v v v
+-------+------+ +---+--------+ +------+------+ +---+--------+
| Torch Worker | | SAM Worker | | Enhance Wkr | | CPU Worker |
| (GPU queue) | | (SAM queue)| |(enhance q) | | (cpu queue)|
+-------+------+ +---+--------+ +------+------+ +---+--------+
| | | |
v v v v
+-------+------+ +---+--------+ +------+------+ +---+--------+
| PyAnnote | | SAM Audio | | DeepFilter | | Silero VAD |
| BEATs/YAMNet | | (Separate) | | (Enhance) | | Alignment |
| AST/Emotion | | | | | | Features |
+-------+------+ +------------+ +-------------+ +------------+
| Service | Port | Description |
|---|---|---|
| Frontend | 5473 | SvelteKit 2 + Svelte 5 web application |
| Backend API | 5474 | FastAPI REST API |
| Flower | 5475 | Celery task monitoring dashboard |
| PostgreSQL | 5476 | Primary database (pgvector) |
| Redis | 5477 | Celery task queue broker |
| MinIO | 5478/5479 | Object storage (API/Console) |
Pre-built images are published to DockerHub under the `davidamacey/` namespace:

| Image | Dockerfile | Used by |
|---|---|---|
| `openaudio-backend` | `backend/Dockerfile` | backend, cpu-worker, beat, flower |
| `openaudio-torch` | `backend/Dockerfile.torch` | celery-torch-worker |
| `openaudio-sam` | `backend/Dockerfile.sam` | celery-sam-worker |
| `openaudio-enhance` | `backend/Dockerfile.enhance` | celery-enhance-worker |
| `openaudio-frontend` | `frontend/Dockerfile.prod` | frontend |
Pin a version by setting `IMAGE_TAG=0.1.0` in `.env` (default: `latest`).
| Model | Purpose | VRAM | Queue |
|---|---|---|---|
| PyAnnote v4 | Speaker diarization + overlap detection | ~2 GB | gpu |
| WeSpeaker | 256-dim speaker embeddings | (with PyAnnote) | gpu |
| YAMNet (PyTorch) | Sound event classification (521 classes) | <1 GB | gpu |
| BEATs (Microsoft) | Audio event classification (527 classes) | ~400 MB | gpu |
| AST | Audio spectrogram classification (527 classes) | ~1 GB | gpu |
| Emotion2Vec+ Large | Speech emotion detection (8 emotions) | ~1 GB | gpu |
| SAM Audio Base (Lite) | Text-guided source separation | ~5 GB | sam |
| SAM Audio Small (Lite) | Text-guided source separation | ~3.5 GB | sam |
| DeepFilterNet | Audio enhancement / noise reduction | ~500 MB | enhance |
| Silero VAD v6 | Voice activity detection (ONNX, CPU-only) | 0 GB | cpu |
| Queue | Concurrency | Tasks |
|---|---|---|
| gpu | 1 | Diarization, YAMNet/BEATs/AST event detection, emotion detection, speaker embedding extraction |
| sam | 1 | SAM Audio text-guided source separation (warm-start preloading, chunked processing) |
| enhance | 4 | DeepFilterNet noise reduction |
| cpu | 8 | Silero VAD, cross-correlation alignment, LibROSA feature extraction, speaker clustering, audio metrics |
| export | 2 | Region/speaker/event/collection/multi-region exports |
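This separation is standard Celery routing plus per-queue worker processes; a minimal sketch under assumed module paths (the real configuration lives in `backend/app/core/`):

```python
from celery import Celery

app = Celery("openaudio", broker="redis://redis:6379/0")

# Route each task family to its own queue so GPU work never
# competes with CPU-bound or export jobs.
app.conf.task_routes = {
    "workers.gpu_tasks.*": {"queue": "gpu"},
    "workers.sam_tasks.*": {"queue": "sam"},
    "workers.enhance_tasks.*": {"queue": "enhance"},
    "workers.cpu_tasks.*": {"queue": "cpu"},
    "workers.export_tasks.*": {"queue": "export"},
}

# Fetch one task at a time: a prefetched GPU job would otherwise
# sit holding memory while the current one runs.
app.conf.worker_prefetch_multiplier = 1
```

Each worker container then consumes a single queue at the concurrency listed above (e.g. `-Q gpu -c 1` for the torch worker).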
OpenAudio uses a "Lite Mode" for Meta's SAM Audio model that removes unused video-related components to enable audio-only inference on consumer GPUs:
| Component Removed | VRAM Saved | Why Not Needed |
|---|---|---|
| Vision Encoder | ~2 GB | Video frame encoding |
| Visual Ranker | ~2 GB | Video-based source ranking |
| Text Ranker | ~2 GB | Text-based reranking |
| Span Predictor | ~1-2 GB | Temporal span prediction |
Total reduction: 11 GB → 4-5 GB (~55% savings), enough to run on RTX 3060/4060-class (6 GB) GPUs.
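Mechanically this is ordinary PyTorch surgery: drop the unused submodules before inference. A generic illustration (attribute names are invented; see the upstream issue credited at the end of this README for the actual patch):

```python
import torch

def to_lite(model: torch.nn.Module) -> torch.nn.Module:
    """Remove video-only components so audio inference fits in ~4-5 GB.
    Attribute names here are illustrative, not the real sam-audio ones."""
    for name in ("vision_encoder", "visual_ranker", "text_ranker", "span_predictor"):
        if hasattr(model, name):
            setattr(model, name, None)  # drop the parameters with the reference
    torch.cuda.empty_cache()            # return freed blocks to the CUDA allocator
    return model
```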
A further runtime optimization: Celery prefetching is disabled (`worker_prefetch_multiplier=1`), which prevents GPU memory contention.

Interactive API documentation is available at the FastAPI defaults:
- Swagger UI: http://localhost:5474/docs
- ReDoc: http://localhost:5474/redoc

(Both are disabled in production mode, i.e. when `ENVIRONMENT=production`.)
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/v1/analysis/models` | List available ML models (filter by type) |
| POST | `/api/v1/analysis/full` | Run full pipeline on collection (VAD + diarization + 3x event detection + emotion) |
| POST | `/api/v1/analysis/diarize/{file_id}` | Speaker diarization (PyAnnote v4) |
| POST | `/api/v1/analysis/detect-events/{file_id}` | Sound event detection (YAMNet, BEATs, or AST) |
| POST | `/api/v1/analysis/detect-emotions/{file_id}` | Emotion detection (Emotion2Vec+) |
| POST | `/api/v1/analysis/vad/{file_id}` | Voice activity detection (Silero VAD) |
| POST | `/api/v1/analysis/separate/{file_id}` | Text-guided source separation (SAM Audio) |
| POST | `/api/v1/analysis/enhance/{file_id}` | Audio enhancement (DeepFilterNet) |
| POST | `/api/v1/analysis/features/{file_id}` | Audio feature extraction (LibROSA) |
| POST | `/api/v1/analysis/cluster-speakers` | Cross-file speaker clustering |
| POST | `/api/v1/analysis/alignment` | Automatic multi-track alignment |
| POST | `/api/v1/analysis/search` | Semantic search (event labels + SAM separation) |
| GET | `/api/v1/analysis/diarization/{file_id}/results` | Get speaker segments and overlaps |
| GET | `/api/v1/analysis/events/{file_id}/results` | Get detected events (filter by model) |
| GET | `/api/v1/analysis/emotions/{file_id}/results` | Get emotion segments |
| GET | `/api/v1/analysis/vad/{file_id}/results` | Get speech/silence segments |
| GET | `/api/v1/analysis/separation/{file_id}/results` | Get separation audio (presigned URLs) |
| GET | `/api/v1/analysis/enhancement/{file_id}/results` | Get enhanced audio (presigned URLs) |
| DELETE | `/api/v1/analysis/separation/{task_id}` | Delete separation result and audio |
| DELETE | `/api/v1/analysis/enhancement/{task_id}` | Delete enhancement result and audio |
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/v1/collections` | Create collection |
| GET | `/api/v1/collections` | List collections (paginated, searchable) |
| GET | `/api/v1/collections/{id}` | Get collection with files, speakers, events |
| PUT | `/api/v1/collections/{id}` | Update collection |
| DELETE | `/api/v1/collections/{id}` | Delete collection |
| POST | `/api/v1/collections/{id}/files` | Upload audio files |
| GET | `/api/v1/files/{id}` | Get file metadata |
| DELETE | `/api/v1/files/{id}` | Delete file |
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/v1/export/region/{file_id}` | Export time region (WAV/MP3/FLAC/OGG) |
| POST | `/api/v1/export/speaker/{speaker_id}` | Export speaker segments (merged or separate) |
| POST | `/api/v1/export/event/{event_id}` | Export event audio with padding |
| POST | `/api/v1/export/collection/{collection_id}` | Export full collection as ZIP |
| POST | `/api/v1/export/multi-region` | Export same region from multiple files (separate or mixed) |
| GET | `/api/v1/export/download/{task_id}` | Download completed export |
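Exports run asynchronously; a Python sketch that requests a region export, waits for it, and downloads the file (the task and status field names are assumptions):

```python
import time
import requests

BASE = "http://localhost:5474/api/v1"
file_id = "your-file-id"

# Request a 20-second MP3 export of seconds 10-30
task = requests.post(f"{BASE}/export/region/{file_id}",
                     params={"start_time": 10, "end_time": 30, "format": "mp3"}).json()

# Wait for the export task to finish (status values are assumptions)
tid = task["task_id"]
while requests.get(f"{BASE}/tasks/{tid}").json().get("status") not in ("completed", "failed"):
    time.sleep(1)

# Download the finished export
audio = requests.get(f"{BASE}/export/download/{tid}")
with open("region.mp3", "wb") as out:
    out.write(audio.content)
```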
| Method | Endpoint | Description |
|---|---|---|
| GET/PUT | `/api/v1/speakers/{id}` | Get/update speaker (rename, merge) |
| POST | `/api/v1/speakers/merge` | Merge multiple speakers |
| GET | `/api/v1/tasks/{id}` | Get task status and progress |
| DELETE | `/api/v1/tasks/{id}` | Cancel a running task |
| POST | `/api/v1/auth/register` | User registration |
| POST | `/api/v1/auth/login` | User login (JWT) |
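When `AUTH_ENABLED=true`, API calls need a JWT bearer token. A Python sketch (the request payloads and the `access_token` field follow common FastAPI conventions and are assumptions here):

```python
import requests

BASE = "http://localhost:5474/api/v1"

# Register once, then log in to obtain a token
requests.post(f"{BASE}/auth/register",
              json={"email": "user@example.com", "password": "change-me"})
login = requests.post(f"{BASE}/auth/login",
                      json={"email": "user@example.com", "password": "change-me"}).json()

# Attach the token to subsequent API calls
headers = {"Authorization": f"Bearer {login['access_token']}"}
collections = requests.get(f"{BASE}/collections", headers=headers).json()
```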
OpenAudio/
├── backend/
│ ├── app/
│ │ ├── api/v1/ # API endpoints (analysis, collections, export, files, speakers, tasks, auth)
│ │ ├── core/ # Configuration, security, Celery, constants
│ │ ├── db/models/ # SQLAlchemy models (collection, audio_file, speaker, speaker_segment,
│ │ │ # sound_event, emotion_segment, vad_segment, task)
│ │ ├── ml/models/ # ML model wrappers (pyannote_v4, sam_audio, yamnet, beats, ast,
│ │ │ # emotion2vec, deepfilter, silero_vad, audio_features)
│ │ ├── ml/registry.py # Model registry with VRAM tracking and auto-unloading
│ │ ├── schemas/ # Pydantic request/response schemas
│ │ └── services/ # Business logic (collection, alignment, minio, opensearch)
│ ├── workers/
│ │ ├── gpu_tasks.py # GPU: diarization, YAMNet, BEATs, AST, emotion, embeddings
│ │ ├── sam_tasks.py # SAM: source separation with warm-start preloading
│ │ ├── enhance_tasks.py # Enhancement: DeepFilterNet noise reduction
│ │ ├── cpu_tasks.py # CPU: VAD, alignment, features, clustering, metrics
│ │ └── export_tasks.py # Export: region, speaker, event, collection, multi-region
│ ├── alembic/ # Database migrations
│ └── tests/ # Backend tests (unit, auth, ML pipeline integration)
├── frontend/
│ ├── src/
│ │ ├── components/
│ │ │ ├── waveform/ # MultiTrackEditor, WaveformTrack, SeparationTrack, EnhanceTrack,
│ │ │ │ # RegionSelector, AlignmentControls, SyncDialog, TimeRuler
│ │ │ ├── timeline/ # UnifiedTimeline, AnalysisRow, SearchRow, SpeakerLabelEditor
│ │ │ ├── speakers/ # SpeakerList, SpeakerTimeline, SpeakerBadge
│ │ │ ├── analysis/ # AnalysisPanel, ModelSelector, SeparationPromptInput
│ │ │ ├── export/ # ExportDialog
│ │ │ └── common/ # Header, Modal, Toast, Button, ProgressBar, AuthGuard
│ │ ├── lib/
│ │ │ ├── stores/ # Svelte 5 rune stores (playback, collections, timeline, selection,
│ │ │ │ # tasks, files, separations, enhancements, auth, settings)
│ │ │ ├── api/ # API clients (analysis, collections, export, files, auth)
│ │ │ ├── types/ # TypeScript types (collection, task, speaker, separation, enhancement)
│ │ │ └── utils/ # Timeline, time, transform, waveform, audioExtraction utilities
│ │ └── routes/ # Pages: landing, login, register, library, collection workspace
│ └── static/ # Static assets
├── scripts/ # Shell scripts (setup, model download, image build/push)
├── models/ # ML model cache (host-mounted)
├── data/ # Uploaded/processed files
├── docker-compose.yml # Base Docker services
├── docker-compose.override.yml # Development overrides
├── docker-compose.prod.yml # Production overrides
├── docker-compose.gpu.yml # GPU runtime overlay
├── docker-compose.offline.yml # Offline mode
├── openaudio.sh # Management CLI
└── setup-openaudio.sh # First-time setup
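One notable piece above is `ml/registry.py`, the model registry with VRAM tracking and auto-unloading. A minimal sketch of that pattern (illustrative only, not the actual implementation):

```python
import torch

class ModelRegistry:
    """Sketch: track loaded models and their VRAM cost, evicting the
    oldest entries when a new load would exceed the budget."""

    def __init__(self, budget_gb: float = 10.0) -> None:
        self.budget_gb = budget_gb
        self._loaded: dict[str, tuple[torch.nn.Module, float]] = {}

    def _used_gb(self) -> float:
        return sum(cost for _, cost in self._loaded.values())

    def register(self, name: str, model: torch.nn.Module, vram_gb: float) -> None:
        # Evict in insertion order (oldest first) until the new model fits.
        while self._loaded and self._used_gb() + vram_gb > self.budget_gb:
            evicted, _ = self._loaded.pop(next(iter(self._loaded)))
            del evicted
            torch.cuda.empty_cache()  # hand the freed blocks back to CUDA
        self._loaded[name] = (model, vram_gb)
```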
| Layer | Technology |
|---|---|
| Frontend | SvelteKit 2 + Svelte 5 (runes), TypeScript, Tailwind CSS, WaveSurfer.js |
| Backend | Python 3.11+, FastAPI, SQLAlchemy, Pydantic, Alembic |
| Task Queue | Celery + Redis, specialized worker pools |
| Database | PostgreSQL 17 (pgvector), OpenSearch (HNSW vector index) |
| Storage | MinIO (S3-compatible), presigned URLs for streaming |
| ML | PyTorch, ONNX, CUDA 11.8+, bfloat16 mixed precision |
| DevOps | Docker Compose, pre-commit hooks, Ruff, MyPy, Prettier |
# Backend tests (inside Docker)
docker compose exec backend pytest tests/ -v
# ML pipeline integration tests (per worker)
docker compose exec celery-torch-worker pytest tests/test_ml_pipeline.py -m torch -v -s --timeout=600
docker compose exec celery-sam-worker pytest tests/test_ml_pipeline.py -m sam -v -s --timeout=600
docker compose exec celery-enhance-worker pytest tests/test_ml_pipeline.py -m enhance -v -s --timeout=300
docker compose exec celery-cpu-worker pytest tests/test_ml_pipeline.py -m cpu -v -s --timeout=300
# Unit/auth tests
docker compose exec backend pytest tests/ -m "not (torch or sam or enhance)" -v --timeout=120
# Frontend type check
cd frontend && npm run check
# Lint all code
pre-commit run --all-files
This project uses pre-commit hooks for code quality:
# Install pre-commit
pip install pre-commit
# Install hooks
pre-commit install
# Run manually
pre-commit run --all-files
Configured tools include Ruff (Python linting/formatting), MyPy (type checking), and Prettier (frontend formatting).
Key configuration options in .env:
# Database
POSTGRES_PASSWORD=secure-password # Auto-generated by setup
# Storage
MINIO_ROOT_PASSWORD=secure-password # Auto-generated by setup
# Redis
REDIS_PASSWORD=secure-password # Auto-generated by setup
# Security
JWT_SECRET_KEY=hex-string # Auto-generated by setup
ENCRYPTION_KEY=hex-string # Auto-generated by setup
AUTH_ENABLED=false # Set true to require login
# GPU
USE_GPU=auto # auto, true, false
TORCH_DEVICE=auto # auto, cuda, mps, cpu
GPU_MEMORY_FRACTION=0.9 # Max GPU memory fraction
# Models
MODEL_CACHE_DIR=./models
HF_TOKEN=your-huggingface-token # Required for gated models
DEEPFILTER_ATTEN_LIMIT=100 # Default attenuation limit (dB)
# Ports (all in 547X range)
FRONTEND_PORT=5473
BACKEND_PORT=5474
See .env.example for the complete list of configuration options.
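For reference, a sketch of how a `TORCH_DEVICE=auto` style setting can resolve to a concrete device (illustrative; the backend's actual selection logic may differ):

```python
import torch

def resolve_device(preference: str = "auto") -> torch.device:
    """Map a TORCH_DEVICE-style setting onto an available torch device."""
    if preference != "auto":
        return torch.device(preference)   # explicit cuda / mps / cpu
    if torch.cuda.is_available():
        return torch.device("cuda")       # NVIDIA GPU
    if torch.backends.mps.is_available():
        return torch.device("mps")        # Apple Silicon
    return torch.device("cpu")
```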
For GPU acceleration:
Install NVIDIA Container Toolkit:
# Ubuntu/Debian
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Start with GPU support:
./openaudio.sh start gpu
| File | Purpose |
|---|---|
| `docker-compose.yml` | Base services (all containers) |
| `docker-compose.override.yml` | Development mode (hot reload, debug, builds from source) |
| `docker-compose.prod.yml` | Production mode (pulls from `davidamacey/openaudio-*` on DockerHub) |
| `docker-compose.gpu.yml` | GPU runtime overlay (NVIDIA) |
| `docker-compose.offline.yml` | Offline mode (no image pulls, no model downloads) |
Docker out of memory
# Increase Docker memory limit in Docker Desktop settings
# Or reduce concurrent workers in .env
GPU not detected
# Check NVIDIA driver
nvidia-smi
# Check container toolkit
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
CUDA OOM during analysis
- Keep `use_lite=True` (default) for SAM Audio
- Keep `worker_concurrency=1` for GPU queues

Model download fails
# Set HuggingFace token
export HF_TOKEN=your-token
# Re-run download (uses Docker containers by default)
./scripts/download-models.sh
Celery tasks stuck
# Check Redis connection
./openaudio.sh logs redis
# Check worker status
./openaudio.sh logs celery-torch-worker
# View Flower dashboard at http://localhost:5475
# All logs
./openaudio.sh logs
# Specific service
./openaudio.sh logs backend
./openaudio.sh logs celery-torch-worker
./openaudio.sh logs celery-sam-worker
./openaudio.sh logs celery-cpu-worker
1. Create a feature branch: `git checkout -b feature/amazing-feature`
2. Install the pre-commit hooks: `pre-commit install`
3. Run the tests: `pytest`
4. Commit your changes: `git commit -m 'feat(scope): add amazing feature'`
5. Push the branch: `git push origin feature/amazing-feature`

This project uses Conventional Commits:
feat(api): add semantic search endpoint
fix(worker): handle empty audio files
docs(readme): update installation guide
This project is licensed under the AGPL-3.0 License - see the LICENSE file for details.
The SAM Audio VRAM optimization ("Lite Mode") is based on the contribution by NilanEkanayake in facebookresearch/sam-audio Issue #24. This optimization removes unused video-related components to enable audio-only inference on consumer GPUs.