AI-powered audio processing platform for multi-file audio analysis, cross-recording speaker tracking, and intelligent audio extraction. Transforms audio files into structured, searchable content using state-of-the-art ML models.
curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenAudio/main/setup-openaudio.sh | bash
The script downloads the compose files from GitHub, pulls pre-built images from DockerHub, and downloads the ML models via containers. No git clone, no local Python, no building.
Alternatively, clone the repository and run the setup script yourself:

git clone https://github.com/davidamacey/OpenAudio.git
cd OpenAudio
./setup-openaudio.sh
./openaudio.sh start
Access the application:
- Frontend: http://localhost:5473
- Backend API: http://localhost:5474
- Flower dashboard: http://localhost:5475
Option 1 - One-line install:

curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenAudio/main/setup-openaudio.sh | bash

Option 2 - Clone and run the setup script:

git clone https://github.com/davidamacey/OpenAudio.git
cd OpenAudio
./setup-openaudio.sh
Both options will:
- Create `.env` from the template with auto-generated secure passwords
- Create the required model and data directories
- Pull pre-built images and download the ML models

Manual setup, if you'd rather run the steps yourself:

# 1. Copy environment template
cp .env.example .env
# 2. Edit .env - at minimum set secure passwords for:
# POSTGRES_PASSWORD, MINIO_ROOT_PASSWORD, REDIS_PASSWORD,
# JWT_SECRET_KEY, ENCRYPTION_KEY
nano .env
# 3. Create directories
mkdir -p models/{huggingface,torch,beats,modelscope} data/{uploads,outputs,exports} logs backups
# 4. Start services
./openaudio.sh start
# 5. Download models
./scripts/download-models.sh
Note: MinIO buckets are automatically created by the backend on startup.
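The setup script auto-generates the secrets in step 2; if you are writing `.env` by hand, any strong random values work. A minimal Python sketch (variable names follow the template above; the exact length requirements are assumptions, so check `.env.example`):

```python
import secrets

# Random values for the secrets referenced in .env.example.
# The passwords accept any strong random string; the JWT and
# encryption keys are hex strings, as the template notes.
for key in ("POSTGRES_PASSWORD", "MINIO_ROOT_PASSWORD", "REDIS_PASSWORD"):
    print(f"{key}={secrets.token_urlsafe(24)}")
for key in ("JWT_SECRET_KEY", "ENCRYPTION_KEY"):
    print(f"{key}={secrets.token_hex(32)}")
```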
# Start services
./openaudio.sh start # Development mode
./openaudio.sh start prod # Production mode
./openaudio.sh start gpu # With GPU support
# Service management
./openaudio.sh stop # Stop all services
./openaudio.sh restart # Restart services
./openaudio.sh status # Show service status
./openaudio.sh logs [service] # View logs
./openaudio.sh health # Health check
# Database operations
./openaudio.sh db migrate # Run migrations
./openaudio.sh db backup # Create backup
./openaudio.sh db restore FILE # Restore from backup
./openaudio.sh db reset # Reset database (WARNING: deletes data)
# Development
./openaudio.sh shell backend # Open backend shell
./openaudio.sh shell db # Open database shell
./openaudio.sh build # Rebuild containers
./openaudio.sh rebuild SERVICE # Rebuild specific service
# Model management
./openaudio.sh download-models # Download ML models (via Docker)
# Image management (from cloned repo only)
./openaudio.sh push # Build + push images to DockerHub
./openaudio.sh push --version=0.1.0 # Tag with version
# Maintenance
./openaudio.sh clean # Clean up Docker resources
./openaudio.sh purge # Remove all data (destructive)
# Create a collection
curl -X POST "http://localhost:5474/api/v1/collections" \
-H "Content-Type: application/json" \
-d '{"name": "My Collection"}'
# Upload audio files
curl -X POST "http://localhost:5474/api/v1/collections/{collection_id}/files" \
-F "[email protected]" -F "[email protected]"
# Run full analysis pipeline on all files
curl -X POST "http://localhost:5474/api/v1/analysis/full?collectionId={collection_id}"
# Run speaker diarization on a single file
curl -X POST "http://localhost:5474/api/v1/analysis/diarize/{file_id}" \
-H "Content-Type: application/json" \
-d '{"extract_embeddings": true}'
# Separate audio with a text prompt
curl -X POST "http://localhost:5474/api/v1/analysis/separate/{file_id}" \
-H "Content-Type: application/json" \
-d '{"text_prompt": "singing voice"}'
# Search for events across a collection
curl -X POST "http://localhost:5474/api/v1/analysis/search?query=speech&collectionId={collection_id}"
# Export a time region
curl -X POST "http://localhost:5474/api/v1/export/region/{file_id}?start_time=10&end_time=30&format=mp3"
+------------------+
| Frontend |
| (SvelteKit) |
| :5473 |
+--------+---------+
|
v
+------------------+ +----------+---------+ +------------------+
| MinIO |<------------>| Backend API |<------------>| PostgreSQL |
| (Object Storage) | | (FastAPI) | | (Database) |
| :5478 | | :5474 | | :5476 |
+------------------+ +----------+---------+ +------------------+
|
+------------+-----------+-----------+------------+
| | | |
v v v v
+-------+------+ +---+--------+ +------+------+ +---+--------+
| Torch Worker | | SAM Worker | | Enhance Wkr | | CPU Worker |
| (GPU queue) | | (SAM queue)| |(enhance q) | | (cpu queue)|
+-------+------+ +---+--------+ +------+------+ +---+--------+
| | | |
v v v v
+-------+------+ +---+--------+ +------+------+ +---+--------+
| PyAnnote | | SAM Audio | | DeepFilter | | Silero VAD |
| BEATs/YAMNet | | (Separate) | | (Enhance) | | Alignment |
| AST/Emotion | | | | | | Features |
+-------+------+ +------------+ +-------------+ +------------+
| Service | Port | Description |
|---|---|---|
| Frontend | 5473 | SvelteKit 2 + Svelte 5 web application |
| Backend API | 5474 | FastAPI REST API |
| Flower | 5475 | Celery task monitoring dashboard |
| PostgreSQL | 5476 | Primary database (pgvector) |
| Redis | 5477 | Celery task queue broker |
| MinIO | 5478/5479 | Object storage (API/Console) |
Pre-built images are published to DockerHub under the `davidamacey/` namespace:

| Image | Dockerfile | Used by |
|---|---|---|
| `openaudio-backend` | `backend/Dockerfile` | backend, cpu-worker, beat, flower |
| `openaudio-torch` | `backend/Dockerfile.torch` | celery-torch-worker |
| `openaudio-sam` | `backend/Dockerfile.sam` | celery-sam-worker |
| `openaudio-enhance` | `backend/Dockerfile.enhance` | celery-enhance-worker |
| `openaudio-frontend` | `frontend/Dockerfile.prod` | frontend |
Pin a version by setting `IMAGE_TAG=0.1.0` in `.env` (default: `latest`).
| Model | Purpose | VRAM | Queue |
|---|---|---|---|
| PyAnnote v4 | Speaker diarization + overlap detection | ~2 GB | gpu |
| WeSpeaker | 256-dim speaker embeddings | (with PyAnnote) | gpu |
| YAMNet (PyTorch) | Sound event classification (521 classes) | <1 GB | gpu |
| BEATs (Microsoft) | Audio event classification (527 classes) | ~400 MB | gpu |
| AST | Audio spectrogram classification (527 classes) | ~1 GB | gpu |
| Emotion2Vec+ Large | Speech emotion detection (8 emotions) | ~1 GB | gpu |
| SAM Audio Base (Lite) | Text-guided source separation | ~5 GB | sam |
| SAM Audio Small (Lite) | Text-guided source separation | ~3.5 GB | sam |
| DeepFilterNet | Audio enhancement / noise reduction | ~500 MB | enhance |
| Silero VAD v6 | Voice activity detection (ONNX, CPU-only) | 0 GB | cpu |
| Queue | Concurrency | Tasks |
|---|---|---|
| gpu | 1 | Diarization, YAMNet/BEATs/AST event detection, emotion detection, speaker embedding extraction |
| sam | 1 | SAM Audio text-guided source separation (warm-start preloading, chunked processing) |
| enhance | 4 | DeepFilterNet noise reduction |
| cpu | 8 | Silero VAD, cross-correlation alignment, LibROSA feature extraction, speaker clustering, audio metrics |
| export | 2 | Region/speaker/event/collection/multi-region exports |
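This separation is standard Celery routing plus per-queue worker processes; a minimal sketch under assumed module paths (the real configuration lives in `backend/app/core/`):

```python
from celery import Celery

app = Celery("openaudio", broker="redis://redis:6379/0")

# Route each task family to its own queue so GPU work never
# competes with CPU-bound or export jobs.
app.conf.task_routes = {
    "workers.gpu_tasks.*": {"queue": "gpu"},
    "workers.sam_tasks.*": {"queue": "sam"},
    "workers.enhance_tasks.*": {"queue": "enhance"},
    "workers.cpu_tasks.*": {"queue": "cpu"},
    "workers.export_tasks.*": {"queue": "export"},
}

# Fetch one task at a time: a prefetched GPU job would otherwise
# sit holding memory while the current one runs.
app.conf.worker_prefetch_multiplier = 1
```

Each worker container then consumes a single queue at the concurrency listed above (e.g. `-Q gpu -c 1` for the torch worker).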
OpenAudio uses a "Lite Mode" for Meta's SAM Audio model that removes unused video-related components to enable audio-only inference on consumer GPUs:
| Component Removed | VRAM Saved | Why Not Needed |
|---|---|---|
| Vision Encoder | ~2 GB | Video frame encoding |
| Visual Ranker | ~2 GB | Video-based source ranking |
| Text Ranker | ~2 GB | Text-based reranking |
| Span Predictor | ~1-2 GB | Temporal span prediction |
Total reduction: 11 GB → 4-5 GB (~55% savings), enough to run on RTX 3060/4060-class (6 GB) GPUs.
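Mechanically this is ordinary PyTorch surgery: drop the unused submodules before inference. A generic illustration (attribute names are invented; see the upstream issue credited at the end of this README for the actual patch):

```python
import torch

def to_lite(model: torch.nn.Module) -> torch.nn.Module:
    """Remove video-only components so audio inference fits in ~4-5 GB.
    Attribute names here are illustrative, not the real sam-audio ones."""
    for name in ("vision_encoder", "visual_ranker", "text_ranker", "span_predictor"):
        if hasattr(model, name):
            setattr(model, name, None)  # drop the parameters with the reference
    torch.cuda.empty_cache()            # return freed blocks to the CUDA allocator
    return model
```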
A further runtime optimization: Celery prefetching is disabled (`worker_prefetch_multiplier=1`), which prevents GPU memory contention.

Interactive API documentation is available at the FastAPI defaults:
- Swagger UI: http://localhost:5474/docs
- ReDoc: http://localhost:5474/redoc

(Both are disabled in production mode, i.e. when `ENVIRONMENT=production`.)
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/v1/analysis/models` | List available ML models (filter by type) |
| POST | `/api/v1/analysis/full` | Run full pipeline on collection (VAD + diarization + 3x event detection + emotion) |
| POST | `/api/v1/analysis/diarize/{file_id}` | Speaker diarization (PyAnnote v4) |
| POST | `/api/v1/analysis/detect-events/{file_id}` | Sound event detection (YAMNet, BEATs, or AST) |
| POST | `/api/v1/analysis/detect-emotions/{file_id}` | Emotion detection (Emotion2Vec+) |
| POST | `/api/v1/analysis/vad/{file_id}` | Voice activity detection (Silero VAD) |
| POST | `/api/v1/analysis/separate/{file_id}` | Text-guided source separation (SAM Audio) |
| POST | `/api/v1/analysis/enhance/{file_id}` | Audio enhancement (DeepFilterNet) |
| POST | `/api/v1/analysis/features/{file_id}` | Audio feature extraction (LibROSA) |
| POST | `/api/v1/analysis/cluster-speakers` | Cross-file speaker clustering |
| POST | `/api/v1/analysis/alignment` | Automatic multi-track alignment |
| POST | `/api/v1/analysis/search` | Semantic search (event labels + SAM separation) |
| GET | `/api/v1/analysis/diarization/{file_id}/results` | Get speaker segments and overlaps |
| GET | `/api/v1/analysis/events/{file_id}/results` | Get detected events (filter by model) |
| GET | `/api/v1/analysis/emotions/{file_id}/results` | Get emotion segments |
| GET | `/api/v1/analysis/vad/{file_id}/results` | Get speech/silence segments |
| GET | `/api/v1/analysis/separation/{file_id}/results` | Get separation audio (presigned URLs) |
| GET | `/api/v1/analysis/enhancement/{file_id}/results` | Get enhanced audio (presigned URLs) |
| DELETE | `/api/v1/analysis/separation/{task_id}` | Delete separation result and audio |
| DELETE | `/api/v1/analysis/enhancement/{task_id}` | Delete enhancement result and audio |
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/v1/collections` | Create collection |
| GET | `/api/v1/collections` | List collections (paginated, searchable) |
| GET | `/api/v1/collections/{id}` | Get collection with files, speakers, events |
| PUT | `/api/v1/collections/{id}` | Update collection |
| DELETE | `/api/v1/collections/{id}` | Delete collection |
| POST | `/api/v1/collections/{id}/files` | Upload audio files |
| GET | `/api/v1/files/{id}` | Get file metadata |
| DELETE | `/api/v1/files/{id}` | Delete file |
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/v1/export/region/{file_id}` | Export time region (WAV/MP3/FLAC/OGG) |
| POST | `/api/v1/export/speaker/{speaker_id}` | Export speaker segments (merged or separate) |
| POST | `/api/v1/export/event/{event_id}` | Export event audio with padding |
| POST | `/api/v1/export/collection/{collection_id}` | Export full collection as ZIP |
| POST | `/api/v1/export/multi-region` | Export same region from multiple files (separate or mixed) |
| GET | `/api/v1/export/download/{task_id}` | Download completed export |
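Exports run asynchronously; a Python sketch that requests a region export, waits for it, and downloads the file (the task and status field names are assumptions):

```python
import time
import requests

BASE = "http://localhost:5474/api/v1"
file_id = "your-file-id"

# Request a 20-second MP3 export of seconds 10-30
task = requests.post(f"{BASE}/export/region/{file_id}",
                     params={"start_time": 10, "end_time": 30, "format": "mp3"}).json()

# Wait for the export task to finish (status values are assumptions)
tid = task["task_id"]
while requests.get(f"{BASE}/tasks/{tid}").json().get("status") not in ("completed", "failed"):
    time.sleep(1)

# Download the finished export
audio = requests.get(f"{BASE}/export/download/{tid}")
with open("region.mp3", "wb") as out:
    out.write(audio.content)
```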
| Method | Endpoint | Description |
|---|---|---|
| GET/PUT | `/api/v1/speakers/{id}` | Get/update speaker (rename, merge) |
| POST | `/api/v1/speakers/merge` | Merge multiple speakers |
| GET | `/api/v1/tasks/{id}` | Get task status and progress |
| DELETE | `/api/v1/tasks/{id}` | Cancel a running task |
| POST | `/api/v1/auth/register` | User registration |
| POST | `/api/v1/auth/login` | User login (JWT) |
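When `AUTH_ENABLED=true`, API calls need a JWT bearer token. A Python sketch (the request payloads and the `access_token` field follow common FastAPI conventions and are assumptions here):

```python
import requests

BASE = "http://localhost:5474/api/v1"

# Register once, then log in to obtain a token
requests.post(f"{BASE}/auth/register",
              json={"email": "user@example.com", "password": "change-me"})
login = requests.post(f"{BASE}/auth/login",
                      json={"email": "user@example.com", "password": "change-me"}).json()

# Attach the token to subsequent API calls
headers = {"Authorization": f"Bearer {login['access_token']}"}
collections = requests.get(f"{BASE}/collections", headers=headers).json()
```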
OpenAudio/
├── backend/
│ ├── app/
│ │ ├── api/v1/ # API endpoints (analysis, collections, export, files, speakers, tasks, auth)
│ │ ├── core/ # Configuration, security, Celery, constants
│ │ ├── db/models/ # SQLAlchemy models (collection, audio_file, speaker, speaker_segment,
│ │ │ # sound_event, emotion_segment, vad_segment, task)
│ │ ├── ml/models/ # ML model wrappers (pyannote_v4, sam_audio, yamnet, beats, ast,
│ │ │ # emotion2vec, deepfilter, silero_vad, audio_features)
│ │ ├── ml/registry.py # Model registry with VRAM tracking and auto-unloading
│ │ ├── schemas/ # Pydantic request/response schemas
│ │ └── services/ # Business logic (collection, alignment, minio, opensearch)
│ ├── workers/
│ │ ├── gpu_tasks.py # GPU: diarization, YAMNet, BEATs, AST, emotion, embeddings
│ │ ├── sam_tasks.py # SAM: source separation with warm-start preloading
│ │ ├── enhance_tasks.py # Enhancement: DeepFilterNet noise reduction
│ │ ├── cpu_tasks.py # CPU: VAD, alignment, features, clustering, metrics
│ │ └── export_tasks.py # Export: region, speaker, event, collection, multi-region
│ ├── alembic/ # Database migrations
│ └── tests/ # Backend tests (unit, auth, ML pipeline integration)
├── frontend/
│ ├── src/
│ │ ├── components/
│ │ │ ├── waveform/ # MultiTrackEditor, WaveformTrack, SeparationTrack, EnhanceTrack,
│ │ │ │ # RegionSelector, AlignmentControls, SyncDialog, TimeRuler
│ │ │ ├── timeline/ # UnifiedTimeline, AnalysisRow, SearchRow, SpeakerLabelEditor
│ │ │ ├── speakers/ # SpeakerList, SpeakerTimeline, SpeakerBadge
│ │ │ ├── analysis/ # AnalysisPanel, ModelSelector, SeparationPromptInput
│ │ │ ├── export/ # ExportDialog
│ │ │ └── common/ # Header, Modal, Toast, Button, ProgressBar, AuthGuard
│ │ ├── lib/
│ │ │ ├── stores/ # Svelte 5 rune stores (playback, collections, timeline, selection,
│ │ │ │ # tasks, files, separations, enhancements, auth, settings)
│ │ │ ├── api/ # API clients (analysis, collections, export, files, auth)
│ │ │ ├── types/ # TypeScript types (collection, task, speaker, separation, enhancement)
│ │ │ └── utils/ # Timeline, time, transform, waveform, audioExtraction utilities
│ │ └── routes/ # Pages: landing, login, register, library, collection workspace
│ └── static/ # Static assets
├── scripts/ # Shell scripts (setup, model download, image build/push)
├── models/ # ML model cache (host-mounted)
├── data/ # Uploaded/processed files
├── docker-compose.yml # Base Docker services
├── docker-compose.override.yml # Development overrides
├── docker-compose.prod.yml # Production overrides
├── docker-compose.gpu.yml # GPU runtime overlay
├── docker-compose.offline.yml # Offline mode
├── openaudio.sh # Management CLI
└── setup-openaudio.sh # First-time setup
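One notable piece above is `ml/registry.py`, the model registry with VRAM tracking and auto-unloading. A minimal sketch of that pattern (illustrative only, not the actual implementation):

```python
import torch

class ModelRegistry:
    """Sketch: track loaded models and their VRAM cost, evicting the
    oldest entries when a new load would exceed the budget."""

    def __init__(self, budget_gb: float = 10.0) -> None:
        self.budget_gb = budget_gb
        self._loaded: dict[str, tuple[torch.nn.Module, float]] = {}

    def _used_gb(self) -> float:
        return sum(cost for _, cost in self._loaded.values())

    def register(self, name: str, model: torch.nn.Module, vram_gb: float) -> None:
        # Evict in insertion order (oldest first) until the new model fits.
        while self._loaded and self._used_gb() + vram_gb > self.budget_gb:
            evicted, _ = self._loaded.pop(next(iter(self._loaded)))
            del evicted
            torch.cuda.empty_cache()  # hand the freed blocks back to CUDA
        self._loaded[name] = (model, vram_gb)
```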
| Layer | Technology |
|---|---|
| Frontend | SvelteKit 2 + Svelte 5 (runes), TypeScript, Tailwind CSS, WaveSurfer.js |
| Backend | Python 3.11+, FastAPI, SQLAlchemy, Pydantic, Alembic |
| Task Queue | Celery + Redis, specialized worker pools |
| Database | PostgreSQL 17 (pgvector), OpenSearch (HNSW vector index) |
| Storage | MinIO (S3-compatible), presigned URLs for streaming |
| ML | PyTorch, ONNX, CUDA 11.8+, bfloat16 mixed precision |
| DevOps | Docker Compose, pre-commit hooks, Ruff, MyPy, Prettier |
# Backend tests (inside Docker)
docker compose exec backend pytest tests/ -v
# ML pipeline integration tests (per worker)
docker compose exec celery-torch-worker pytest tests/test_ml_pipeline.py -m torch -v -s --timeout=600
docker compose exec celery-sam-worker pytest tests/test_ml_pipeline.py -m sam -v -s --timeout=600
docker compose exec celery-enhance-worker pytest tests/test_ml_pipeline.py -m enhance -v -s --timeout=300
docker compose exec celery-cpu-worker pytest tests/test_ml_pipeline.py -m cpu -v -s --timeout=300
# Unit/auth tests
docker compose exec backend pytest tests/ -m "not (torch or sam or enhance)" -v --timeout=120
# Frontend type check
cd frontend && npm run check
# Lint all code
pre-commit run --all-files
This project uses pre-commit hooks for code quality:
# Install pre-commit
pip install pre-commit
# Install hooks
pre-commit install
# Run manually
pre-commit run --all-files
Configured tools include Ruff (Python linting/formatting), MyPy (type checking), and Prettier (frontend formatting).
Key configuration options in .env:
# Database
POSTGRES_PASSWORD=secure-password # Auto-generated by setup
# Storage
MINIO_ROOT_PASSWORD=secure-password # Auto-generated by setup
# Redis
REDIS_PASSWORD=secure-password # Auto-generated by setup
# Security
JWT_SECRET_KEY=hex-string # Auto-generated by setup
ENCRYPTION_KEY=hex-string # Auto-generated by setup
AUTH_ENABLED=false # Set true to require login
# GPU
USE_GPU=auto # auto, true, false
TORCH_DEVICE=auto # auto, cuda, mps, cpu
GPU_MEMORY_FRACTION=0.9 # Max GPU memory fraction
# Models
MODEL_CACHE_DIR=./models
HF_TOKEN=your-huggingface-token # Required for gated models
DEEPFILTER_ATTEN_LIMIT=100 # Default attenuation limit (dB)
# Ports (all in 547X range)
FRONTEND_PORT=5473
BACKEND_PORT=5474
See .env.example for the complete list of configuration options.
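For reference, a sketch of how a `TORCH_DEVICE=auto` style setting can resolve to a concrete device (illustrative; the backend's actual selection logic may differ):

```python
import torch

def resolve_device(preference: str = "auto") -> torch.device:
    """Map a TORCH_DEVICE-style setting onto an available torch device."""
    if preference != "auto":
        return torch.device(preference)   # explicit cuda / mps / cpu
    if torch.cuda.is_available():
        return torch.device("cuda")       # NVIDIA GPU
    if torch.backends.mps.is_available():
        return torch.device("mps")        # Apple Silicon
    return torch.device("cpu")
```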
For GPU acceleration:
Install NVIDIA Container Toolkit:
# Ubuntu/Debian
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Start with GPU support:
./openaudio.sh start gpu
| File | Purpose |
|---|---|
| `docker-compose.yml` | Base services (all containers) |
| `docker-compose.override.yml` | Development mode (hot reload, debug, builds from source) |
| `docker-compose.prod.yml` | Production mode (pulls from `davidamacey/openaudio-*` on DockerHub) |
| `docker-compose.gpu.yml` | GPU runtime overlay (NVIDIA) |
| `docker-compose.offline.yml` | Offline mode (no image pulls, no model downloads) |
Docker out of memory
# Increase Docker memory limit in Docker Desktop settings
# Or reduce concurrent workers in .env
GPU not detected
# Check NVIDIA driver
nvidia-smi
# Check container toolkit
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
CUDA OOM during analysis
- Keep `use_lite=True` (default) for SAM Audio
- Keep `worker_concurrency=1` for GPU queues

Model download fails
# Set HuggingFace token
export HF_TOKEN=your-token
# Re-run download (uses Docker containers by default)
./scripts/download-models.sh
Celery tasks stuck
# Check Redis connection
./openaudio.sh logs redis
# Check worker status
./openaudio.sh logs celery-torch-worker
# View Flower dashboard at http://localhost:5475
# All logs
./openaudio.sh logs
# Specific service
./openaudio.sh logs backend
./openaudio.sh logs celery-torch-worker
./openaudio.sh logs celery-sam-worker
./openaudio.sh logs celery-cpu-worker
1. Create a feature branch: `git checkout -b feature/amazing-feature`
2. Install the pre-commit hooks: `pre-commit install`
3. Run the tests: `pytest`
4. Commit your changes: `git commit -m 'feat(scope): add amazing feature'`
5. Push the branch: `git push origin feature/amazing-feature`

This project uses Conventional Commits:
feat(api): add semantic search endpoint
fix(worker): handle empty audio files
docs(readme): update installation guide
This project is licensed under the AGPL-3.0 License - see the LICENSE file for details.
The SAM Audio VRAM optimization ("Lite Mode") is based on the contribution by NilanEkanayake in facebookresearch/sam-audio Issue #24. This optimization removes unused video-related components to enable audio-only inference on consumer GPUs.