AI-Powered Transcription and Media Analysis Platform
OpenTranscribe is a powerful, containerized web application for transcribing and analyzing audio/video files using state-of-the-art AI models. Built with modern technologies and designed for scalability, it provides an end-to-end solution for speech-to-text conversion, speaker identification, and content analysis.
Note: This application is 99.9% written by AI using frontier models from commercial providers, demonstrating the power of AI-assisted development.
📸 Quick Look
Complete workflow: Login → Upload → Process → Transcribe → Speaker Identification → AI Tags & Collections
📖 For detailed screenshots and visual guides, see the Complete Documentation
✨ Key Features
🎧 Advanced Transcription
- High-Accuracy Speech Recognition: Powered by WhisperX with faster-whisper backend
- Word-Level Timestamps: Precise timing for every word using WAV2VEC2 alignment
- Multi-Language Support: Transcribe in multiple languages with automatic English translation
- Batch Processing: 70x realtime speed with large-v2 model on GPU
- Audio Waveform Visualization: Interactive waveform player with precise timing and click-to-seek
- Browser Recording: Built-in microphone recording with real-time audio level monitoring
- Recording Controls: Pause/resume recording with duration tracking and quality settings
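These features build on WhisperX's two-stage pipeline: batched transcription with the faster-whisper backend, then WAV2VEC2 forced alignment for word-level timestamps. A minimal sketch of that flow, with an illustrative file name and the default settings from this README (this is not OpenTranscribe's internal code):

```python
import whisperx

device = "cuda"  # use "cpu" on machines without a GPU
audio = whisperx.load_audio("interview.mp3")

# Stage 1: batched transcription (large-v2, float16, batch of 16)
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# Stage 2: WAV2VEC2 alignment for word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in aligned["segments"]:
    for word in segment["words"]:
        if "start" in word:  # a few tokens (e.g. bare digits) may lack timings
            print(f"{word['start']:.2f}s-{word['end']:.2f}s {word['word']}")
```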
👥 Smart Speaker Management
- Automatic Speaker Diarization: Identify different speakers using PyAnnote.audio
- Cross-Video Speaker Recognition: AI-powered voice fingerprinting to identify speakers across different media files
- Speaker Profile System: Create and manage global speaker profiles that persist across all transcriptions
- Intelligent Speaker Suggestions: Consolidated speaker identification with confidence scoring and automatic profile matching
- LLM-Enhanced Speaker Recognition: Content-based speaker identification using conversational context analysis
- Profile Embedding Service: Advanced voice similarity matching using vector embeddings for cross-video speaker linking
- Smart Speaker Status Tracking: Comprehensive speaker verification status with computed fields for UI optimization
- Auto-Profile Creation: Automatic speaker profile creation and assignment when speakers are labeled
- Retroactive Speaker Matching: Cross-video speaker matching with automatic label propagation for high-confidence matches
- Custom Speaker Labels: Edit and manage speaker names and information with intelligent suggestions
- Speaker Analytics: View speaking time distribution, cross-media appearances, and interaction patterns
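Diarization of this kind is driven by a PyAnnote pipeline. A minimal sketch, assuming the gated pyannote/speaker-diarization-3.1 model, a placeholder token, and the speaker limits configured later in this README:

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",  # gated model - agreement must be accepted
    use_auth_token="hf_xxx",             # your HuggingFace token
)

diarization = pipeline("interview.wav", min_speakers=1, max_speakers=20)

# Each turn carries start/end times and an anonymous label (SPEAKER_00, ...)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s-{turn.end:.1f}s: {speaker}")
```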
🎬 Comprehensive Media Support
- Universal Format Support: Audio (MP3, WAV, FLAC, M4A) and Video (MP4, MOV, AVI, MKV)
- YouTube Integration: Direct URL processing with automatic playlist support
- YouTube Playlist Processing: Extract and queue all videos from playlists for batch transcription
- Large File Support: Upload files up to 4GB for GoPro and high-quality video content
- Interactive Media Player: Click transcript to navigate playback
- Custom File Titles: Edit display names for media files with real-time search index updates
- Advanced Upload Manager: Floating, draggable upload manager with real-time progress tracking
- Concurrent Upload Processing: Multiple file uploads with queue management and retry logic
- Intelligent Upload System: Duplicate detection, hash verification, and automatic recovery
- Metadata Extraction: Comprehensive file information using ExifTool
- Subtitle Export: Generate SRT/VTT files for accessibility
- File Reprocessing: Re-run AI analysis while preserving user comments and annotations
- Auto-Recovery System: Intelligent detection and recovery of stuck or failed file processing
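To illustrate the duplicate-detection idea, a streamed content hash lets multi-gigabyte uploads be compared without loading them into memory. The exact hashing scheme OpenTranscribe uses is an assumption here:

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MB chunks so 4 GB uploads don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

known_hashes: set[str] = set()  # in practice, loaded from the database
if file_sha256(Path("gopro_footage.mp4")) in known_hashes:
    print("Duplicate upload detected - skipping reprocessing")
```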
🔍 Powerful Search & Discovery
- Hybrid Search: Combine keyword and semantic search capabilities
- Full-Text Indexing: Lightning-fast content search with OpenSearch 3.3.1 (Apache Lucene 10)
- 9.5x Faster Vector Search: Significantly improved semantic search performance
- 25% Faster Queries: Enhanced full-text search with lower latency
- Advanced Filtering: Filter by speaker, date, tags, duration, and more with searchable dropdowns
- Smart Tagging: Organize content with custom tags and categories
- Collections System: Group related media files into organized collections for better project management
- Speaker Usage Counts: See which speakers appear most frequently across your media library
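Hybrid search typically combines a keyword clause with a k-NN vector clause in a single OpenSearch query. A sketch using the opensearch-py client, with hypothetical index and field names (transcripts, text, embedding, title) and a stand-in vector:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

query_vector = [0.0] * 384  # stand-in for a real sentence embedding

body = {
    "query": {
        "bool": {
            "should": [
                # Keyword relevance from the full-text index
                {"match": {"text": "quarterly budget review"}},
                # Semantic similarity from the vector index
                {"knn": {"embedding": {"vector": query_vector, "k": 10}}},
            ]
        }
    }
}
response = client.search(index="transcripts", body=body)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```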
📊 Analytics & Insights
- Advanced Content Analysis: Comprehensive speaker analytics including talk time, interruption detection, and turn-taking patterns
- Speaker Performance Metrics: Speaking pace (WPM), question frequency, and conversation flow analysis
- Meeting Efficiency Analytics: Silence ratio analysis and participation balance tracking
- Real-Time Analytics Computation: Server-side analytics computation with automatic refresh capabilities
- Cross-Video Speaker Analytics: Track speaker patterns and participation across multiple recordings
- AI-Powered Summarization: Generate summaries with flexible JSON schemas from custom prompts
- BLUF Format Support: Default Bottom Line Up Front structured summaries with action items
- Custom Summary Formats: Create unlimited AI prompts with ANY JSON structure
- Flexible Schema Storage: JSONB storage supporting multiple prompt types simultaneously
- Multi-Provider LLM Support: Use local vLLM, OpenAI, Ollama, Claude, or OpenRouter for AI features
- Intelligent Section Processing: Automatically handles transcripts of any length using section-by-section analysis
- Custom AI Prompts: Create and manage custom summarization prompts for different content types
- LLM Configuration Management: User-specific LLM settings with encrypted API key storage
- Provider Testing: Test LLM connections and validate configurations before use
- Real-Time Topic Extraction: AI-powered topic extraction with granular progress notifications
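To make the speaker metrics above concrete, here is a back-of-envelope version of talk-time share and speaking pace (WPM). The tuple shape of the segments is an assumption; the real analytics service computes far more:

```python
from collections import defaultdict

segments = [  # (speaker, start_sec, end_sec, word_count)
    ("SPEAKER_00", 0.0, 12.5, 38),
    ("SPEAKER_01", 12.5, 20.0, 19),
    ("SPEAKER_00", 20.0, 31.0, 30),
]

talk_time: dict[str, float] = defaultdict(float)
word_counts: dict[str, int] = defaultdict(int)
for speaker, start, end, n_words in segments:
    talk_time[speaker] += end - start
    word_counts[speaker] += n_words

total = sum(talk_time.values())
for speaker, seconds in talk_time.items():
    wpm = word_counts[speaker] / (seconds / 60)
    print(f"{speaker}: {seconds:.0f}s ({seconds / total:.0%} of talk time), {wpm:.0f} WPM")
```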
💬 Collaboration Features
- Time-Stamped Comments: Add annotations at specific moments
- User Management: Role-based access control (admin/user) with personalized settings
- Recording Settings Management: User-specific audio recording preferences with quality controls
- Export Options: Download transcripts in multiple formats
- Real-Time Updates: Live progress tracking with detailed WebSocket notifications
- Enhanced Progress Tracking: 13 granular processing stages with descriptive messages
- Smart Notification System: Persistent notifications with unread count badges and progress updates
- WebSocket Integration: Real-time updates for transcription, summarization, and upload progress
- Collection Management: Create, organize, and share collections of related media files
- Smart Error Recovery: User-friendly error messages with specific guidance and auto-recovery options
- Full-Screen Transcript View: Dedicated modal for reading and searching long transcripts
- Auto-Refresh Systems: Background updates for file status without manual refreshing
🎙️ Recording & Audio Features
- Browser-Based Recording: Direct microphone recording with no plugins required
- Real-Time Audio Level Monitoring: Visual audio level feedback during recording
- Multi-Device Support: Choose from available microphone devices
- Recording Quality Control: Configurable bitrate and format settings
- Pause/Resume Recording: Full recording session control with duration tracking
- Background Upload Processing: Seamless integration with upload queue system
- Recording Session Management: Persistent recording state with navigation warnings
🤖 AI-Powered Features
- Comprehensive LLM Integration: Support for 6+ providers (OpenAI, Claude, vLLM, Ollama, etc.)
- Custom Prompt Management: Create and manage AI prompts for different content types
- Encrypted Configuration Storage: Secure API key storage with user-specific settings
- Provider Connection Testing: Validate LLM configurations before use
- Intelligent Content Processing: Context-aware summarization with section-by-section analysis
- BLUF Format Summaries: Bottom Line Up Front structured summaries with action items
- Multi-Model Support: Works with models from 3B to 200B+ parameters
- Local & Cloud Processing: Support for both local (privacy-first) and cloud AI providers
- Multi-GPU Worker Scaling: Optional parallel processing on dedicated GPUs for high-throughput systems
- Specialized Worker Queues: GPU (transcription), Download (YouTube), CPU (waveform), NLP (AI features)
- Parallel Waveform Processing: CPU-based waveform generation runs simultaneously with GPU transcription
- Non-Blocking Architecture: LLM tasks don't delay next transcription (45-75s faster per 3-hour file)
- Configurable Concurrency: GPU(1-4), CPU(8), Download(3), NLP(4) workers for optimal resource utilization
- Enhanced Speaker Detection: Support for 20+ speakers (can scale to 50+ for large conferences)
- Accurate GPU Monitoring: nvidia-smi integration for real-time system-wide memory tracking
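The queue separation above maps naturally onto Celery task routing. A sketch with illustrative task module paths; the queue names mirror the Backend stack list below, not necessarily OpenTranscribe's exact configuration:

```python
from celery import Celery

app = Celery("opentranscribe", broker="redis://redis:6379/0")

# Route each task family to its dedicated worker pool so a long LLM call
# never blocks the next GPU transcription.
app.conf.task_routes = {
    "app.tasks.transcribe_media": {"queue": "gpu"},       # WhisperX + diarization
    "app.tasks.download_youtube": {"queue": "download"},  # yt-dlp fetches
    "app.tasks.generate_waveform": {"queue": "cpu"},      # runs alongside GPU work
    "app.tasks.summarize_transcript": {"queue": "nlp"},   # LLM API calls
}
```

Each queue then gets its own worker pool, e.g. `celery -A app worker -Q gpu --concurrency=1` for the GPU pool.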
📱 Enhanced User Experience
- Progressive Web App: Installable app experience with offline capabilities
- Responsive Design: Optimized for desktop, tablet, and mobile devices
- Interactive Waveform Player: Click-to-seek audio visualization with precise timing
- Floating Upload Manager: Draggable upload interface with real-time progress
- Smart Modal System: Consistent modal design with improved accessibility
- Enhanced Data Formatting: Server-side formatting service for consistent display of dates, durations, and file sizes
- Error Categorization: Intelligent error classification with user-friendly suggestions and retry guidance
- Smart Status Management: Comprehensive file and task status tracking with formatted display text
- Auto-Refresh Systems: Background data updates without manual page refreshing
- Theme Support: Seamless dark/light mode switching
- Keyboard Shortcuts: Efficient navigation and control via hotkeys
🛠️ Technology Stack
Frontend
- Svelte - Reactive UI framework with excellent performance
- TypeScript - Type-safe development with modern JavaScript and comprehensive ESLint integration
- Progressive Web App - Offline capabilities and native-like experience
- Responsive Design - Seamless experience across all devices
- Advanced UI Components - Draggable upload manager, modal consistency, and real-time status updates
- Code Quality Tooling - ESLint, TypeScript strict mode, and automated formatting
Backend
- FastAPI - High-performance async Python web framework
- SQLAlchemy 2.0 - Modern ORM with type safety
- Celery + Redis - Multi-queue distributed task processing for AI workloads
- GPU Queue (concurrency=1-4): GPU-intensive transcription and diarization
- Download Queue (concurrency=3): Parallel YouTube video/playlist downloads
- CPU Queue (concurrency=8): Waveform generation and audio processing
- NLP Queue (concurrency=4): LLM API calls and AI features
- Utility Queue (concurrency=2): Health checks and maintenance tasks
- WebSocket - Real-time communication for live updates
AI/ML Stack
- WhisperX - Advanced speech recognition with alignment
- PyAnnote.audio - Speaker diarization and voice analysis
- Faster-Whisper - Optimized inference engine
- Multi-Provider LLM Integration - Support for vLLM, OpenAI, Ollama, Anthropic Claude, and OpenRouter
- Local LLM Support - Privacy-focused processing with vLLM and Ollama
- Intelligent Context Processing - Section-by-section analysis handles unlimited transcript lengths
- Universal Model Compatibility - Works with any model size from 3B to 200B+ parameters
Infrastructure
- PostgreSQL - Reliable relational database with JSONB support for flexible schemas
- MinIO - S3-compatible object storage
- OpenSearch 3.3.1 - Full-text and vector search engine with Apache Lucene 10
- 9.5x faster vector search performance
- 25% faster queries with lower latency
- 75% lower p90 latency for aggregations
- Docker - Containerized deployment with multi-stage builds
- NGINX - Production web server
- Complete Offline Support - Full airgapped/offline deployment capability
🚀 Quick Start
Prerequisites
# Required
- Docker and Docker Compose
- 8GB+ RAM (16GB+ recommended)
# Recommended for optimal performance
- NVIDIA GPU with CUDA support
Quick Installation (Using Docker Hub Images)
Run this one-liner to download and set up OpenTranscribe using our pre-built Docker Hub images:
curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenTranscribe/master/setup-opentranscribe.sh | bash
Then follow the on-screen instructions. The setup script will:
- Detect your hardware (NVIDIA GPU, Apple Silicon, or CPU)
- Download the production Docker Compose file
- Configure environment variables with optimal settings for your hardware
- Prompt for your HuggingFace token (required for speaker diarization)
- Automatically download and cache AI models (~2.5GB) if token is provided
- Set up the management script (opentranscribe.sh)
⚠️ IMPORTANT - HuggingFace Setup:
The script will prompt you for your HuggingFace token during setup. BEFORE running the installer:
- Get a FREE token: Visit https://huggingface.co/settings/tokens
- Accept BOTH gated model agreements (required for speaker diarization):
  - Segmentation: https://huggingface.co/pyannote/segmentation
  - Speaker Diarization: https://huggingface.co/pyannote/speaker-diarization
- Enter your token when prompted by the installer
If you provide a valid token with both model agreements accepted, AI models will be downloaded and cached before Docker starts, ensuring the app is ready to use immediately. If you skip this step, models will download on first use (10-30 minute delay).
Once setup is complete, start OpenTranscribe with:
cd opentranscribe
./opentranscribe.sh start
The Docker images are available on Docker Hub as separate repositories:
davidamacey/opentranscribe-backend: Backend service (also used for celery-worker and flower)
davidamacey/opentranscribe-frontend: Frontend service
Access the web interface at http://localhost:5173
Manual Installation (From Source)
Clone the Repository
git clone https://github.com/davidamacey/OpenTranscribe.git
cd OpenTranscribe
# Make utility script executable
chmod +x opentr.sh
Environment Configuration
# Copy environment template
cp .env.example .env
# Edit .env file with your settings (optional for development)
# Key variables:
# - HUGGINGFACE_TOKEN (required for speaker diarization)
# - GPU settings for optimal performance
Start OpenTranscribe
# Start in development mode (with hot reload)
./opentr.sh start dev
# Or start in production mode
./opentr.sh start prod
Access the Application
- Web interface: http://localhost:5173
- Backend API: http://localhost:5174
📋 OpenTranscribe Utility Commands
The opentr.sh script provides comprehensive management for all application operations:
Basic Operations
# Start the application
./opentr.sh start [dev|prod] # Start in development or production mode
./opentr.sh start dev --gpu-scale # Start with multi-GPU scaling (optional)
./opentr.sh stop # Stop all services
./opentr.sh status # Show container status
./opentr.sh logs [service] # View logs (all or specific service)
Multi-GPU Scaling (Optional)
For systems with multiple GPUs, enable parallel GPU workers for significantly increased transcription throughput:
# Configure in .env
GPU_SCALE_ENABLED=true # Enable multi-GPU scaling
GPU_SCALE_DEVICE_ID=2 # Which GPU to use (default: 2)
GPU_SCALE_WORKERS=4 # Number of parallel workers (default: 4)
# Start with GPU scaling
./opentr.sh start dev --gpu-scale
./opentr.sh reset dev --gpu-scale
# Example hardware setup:
# GPU 0: LLM model (vLLM, Ollama)
# GPU 1: Default single worker (disabled when scaling)
# GPU 2: 4 parallel workers (processes 4 videos simultaneously)
Performance: Process 4 transcriptions simultaneously on a high-end GPU, significantly reducing total processing time for batch uploads.
Development Workflow
# Service management
./opentr.sh restart-backend # Restart API and workers without database reset
./opentr.sh restart-frontend # Restart frontend only
./opentr.sh restart-all # Restart all services without data loss
# Container rebuilding (after code changes)
./opentr.sh rebuild-backend # Rebuild backend with new code
./opentr.sh rebuild-frontend # Rebuild frontend with new code
./opentr.sh build # Rebuild all containers
Database Management
# Data operations (⚠️ DESTRUCTIVE)
./opentr.sh reset [dev|prod] # Complete reset - deletes ALL data!
./opentr.sh init-db # Initialize database without container reset
# Backup and restore
./opentr.sh backup # Create timestamped database backup
./opentr.sh restore [file] # Restore from backup file
System Administration
# Maintenance
./opentr.sh clean # Remove unused containers and images
./opentr.sh health # Check service health status
./opentr.sh shell [service] # Open shell in container
# Available services: backend, frontend, postgres, redis, minio, opensearch, celery-worker
Monitoring and Debugging
# View specific service logs
./opentr.sh logs backend # API server logs
./opentr.sh logs celery-worker # AI processing logs
./opentr.sh logs frontend # Frontend development logs
./opentr.sh logs postgres # Database logs
# Follow logs in real-time
./opentr.sh logs backend -f
🎯 Usage Guide
Getting Started
User Registration
- Navigate to http://localhost:5173
- Create an account or use default admin credentials
- Set up your profile and preferences
Upload or Record Content
- File Upload: Click "Upload Files" or drag-and-drop media files (up to 4GB)
- Direct Recording: Use the microphone button in the navbar for browser-based recording
- URL Processing: Paste YouTube or media URLs for automatic processing
- Supported formats: MP3, WAV, MP4, MOV, and more
- Files are automatically queued for concurrent processing
Monitor Processing
- Watch detailed real-time progress with 13 processing stages
- Use the floating upload manager for multi-file progress tracking
- View task status in Flower monitor or notifications panel
- Receive live WebSocket notifications for all status changes
Explore Your Content
- Interactive Transcript: Click on transcript text to navigate media playback
- Waveform Player: Click on audio waveform for precise seeking
- Custom Titles: Edit file display names for better organization and searchability
- Speaker Management: Edit speaker names and add custom labels
- AI Summaries: Generate BLUF format summaries with custom prompts
- Comments: Add time-stamped comments and annotations
- Collections: Organize files into themed collections
- Full-Screen View: Use transcript modal for detailed reading and searching
Configure AI Features (Optional)
- Set up LLM providers in User Settings for AI summarization
- Create custom prompts for different content types
- Test provider connections before processing
Advanced Features
Recording Workflow
🎙️ Device Selection → 📊 Level Monitoring → ⏸️ Session Control → ⬆️ Background Upload
- Choose from available microphone devices
- Monitor real-time audio levels during recording
- Pause/resume recording sessions with duration tracking
- Seamless integration with background upload processing
AI-Powered Processing
🤖 LLM Configuration → 📝 Custom Prompts → 🔍 Content Analysis → 📋 BLUF Summaries
- Configure multiple LLM providers (OpenAI, Claude, vLLM, Ollama, etc.)
- Create custom prompts for different content types (meetings, interviews, podcasts)
- Test provider connections and validate configurations
- Generate structured summaries with action items and key decisions
Speaker Management
👥 Automatic Detection → 🤖 AI Recognition → 🏷️ Profile Management → 📊 Cross-Media Tracking
- Speakers are automatically detected and assigned labels using advanced AI diarization
- AI suggests speaker identities based on voice fingerprinting across your media library
- Create global speaker profiles that persist across all your transcriptions
- Accept or reject AI suggestions with confidence scores to improve accuracy over time
- Track speaker appearances across multiple media files with detailed analytics
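The cross-media matching step can be pictured as a nearest-neighbor lookup over stored voice embeddings. A toy sketch with random vectors standing in for real embeddings; the 0.8 confidence threshold is an assumption:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

profiles = {  # global speaker profile -> stored voice embedding
    "Alice": np.random.rand(256),
    "Bob": np.random.rand(256),
}
new_speaker = np.random.rand(256)  # embedding from the newly processed file

best = max(profiles, key=lambda name: cosine_similarity(profiles[name], new_speaker))
score = cosine_similarity(profiles[best], new_speaker)
if score >= 0.8:  # high-confidence matches can be propagated automatically
    print(f"Suggest labeling this speaker as {best} (confidence {score:.2f})")
```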
Advanced Upload Management
⬆️ Concurrent Uploads → 📊 Progress Tracking → 🔄 Retry Logic → 📋 Queue Management
- Floating, draggable upload manager with real-time progress
- Multiple file uploads with intelligent queue processing
- Automatic retry logic for failed uploads with exponential backoff
- Duplicate detection with hash verification
Search and Discovery
🔍 Keyword Search → 🧠 Semantic Search → 🏷️ Smart Filtering → 🎯 Waveform Navigation
- Search transcript content with advanced filters
- Use semantic search to find related concepts
- Click-to-seek navigation via interactive waveform visualization
- Organize content with custom tags and categories
Collections Management
📁 Create Collections → 📂 Organize Files → 🏷️ Bulk Operations → 🎯 Inline Editing
- Group related media files into named collections
- Inline collection editing with tag-style interface
- Filter library view by specific collections
- Bulk add/remove files from collections with drag-and-drop support
Real-Time Notifications
📊 Progress Updates → 📋 Status Tracking → 🔌 WebSocket Integration → ✅ Completion Alerts
- Persistent notification panel with unread count badges
- Real-time updates for transcription, summarization, and upload progress
- WebSocket integration for instant status updates
- Smart notification grouping and auto-refresh systems
Export and Integration
📄 Multiple Formats → 📺 Subtitle Files → 🔗 API Access → 🎬 Media Downloads
- Export transcripts as TXT, JSON, or CSV
- Generate SRT/VTT subtitle files with embedded timing
- Access data programmatically via comprehensive REST API
- Download media files with embedded subtitles
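For a sense of what subtitle export involves, here is a minimal SRT writer. The segment data and function are a sketch, not OpenTranscribe's actual exporter:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamps SRT requires."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

segments = [
    (0.0, 3.2, "Welcome to the meeting."),
    (3.2, 7.9, "Let's review the agenda."),
]

with open("transcript.srt", "w", encoding="utf-8") as f:
    for i, (start, end, text) in enumerate(segments, start=1):
        f.write(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n\n")
```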
📁 Project Structure
OpenTranscribe/
├── 📁 backend/                # Python FastAPI backend
│   ├── 📁 app/                # Application modules
│   │   ├── 📁 api/            # REST API endpoints
│   │   ├── 📁 models/         # Database models
│   │   ├── 📁 services/       # Business logic
│   │   ├── 📁 tasks/          # Background AI processing
│   │   ├── 📁 utils/          # Common utilities
│   │   └── 📁 db/             # Database configuration
│   ├── 📁 scripts/            # Admin and maintenance scripts
│   ├── 📁 tests/              # Comprehensive test suite
│   └── 📄 README.md           # Backend documentation
├── 📁 frontend/               # Svelte frontend application
│   ├── 📁 src/                # Source code
│   │   ├── 📁 components/     # Reusable UI components
│   │   ├── 📁 routes/         # Page components
│   │   ├── 📁 stores/         # State management
│   │   └── 📁 styles/         # CSS and themes
│   └── 📄 README.md           # Frontend documentation
├── 📁 database/               # Database initialization
├── 📁 models_ai/              # AI model storage (runtime)
├── 📁 scripts/                # Utility scripts
├── 📄 docker-compose.yml      # Container orchestration
├── 📄 opentr.sh               # Main utility script
└── 📄 README.md               # This file
🔧 Configuration
Environment Variables
Core Application
# Database
DATABASE_URL=postgresql://postgres:password@postgres:5432/opentranscribe
# Security
SECRET_KEY=your-super-secret-key-here
JWT_SECRET_KEY=your-jwt-secret-key
# Object Storage
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin
MINIO_BUCKET_NAME=transcribe-app
AI Processing
# Required for speaker diarization - see setup instructions below
HUGGINGFACE_TOKEN=your_huggingface_token_here
# Model configuration
WHISPER_MODEL=large-v2 # large-v2, medium, small, base
COMPUTE_TYPE=float16 # float16, int8
BATCH_SIZE=16 # Reduce if GPU memory limited
# Speaker detection
MIN_SPEAKERS=1 # Minimum speakers to detect
MAX_SPEAKERS=20 # Maximum speakers to detect (can be increased to 50+ for large conferences)
# Model caching (recommended)
MODEL_CACHE_DIR=./models # Directory to store downloaded AI models
LLM Configuration (AI Features)
OpenTranscribe offers flexible AI deployment options. Choose the approach that best fits your infrastructure:
🔧 Quick Setup Options:
Cloud-Only (Recommended for Most Users)
# Configure for OpenAI in .env
LLM_PROVIDER=openai
OPENAI_API_KEY=your_openai_key
OPENAI_MODEL_NAME=gpt-4o-mini
# Start without local LLM
./opentr.sh start dev
Local vLLM (Self-Hosted)
# Deploy vLLM server separately, then configure in .env
LLM_PROVIDER=vllm
VLLM_BASE_URL=http://your-vllm-server:8000/v1
VLLM_MODEL_NAME=gpt-oss-20b
# Start OpenTranscribe
./opentr.sh start dev
Local Ollama (Self-Hosted)
# Deploy Ollama server separately, then configure in .env
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://your-ollama-server:11434
OLLAMA_MODEL_NAME=llama3.2:3b-instruct-q4_K_M
# Start OpenTranscribe
./opentr.sh start dev
📋 Complete Provider Configuration:
# Cloud Providers (configure in .env)
LLM_PROVIDER=openai # openai, anthropic, custom (openrouter)
OPENAI_API_KEY=your_openai_key # OpenAI GPT models
ANTHROPIC_API_KEY=your_claude_key # Anthropic Claude models
OPENROUTER_API_KEY=your_or_key # OpenRouter (multi-provider)
# Local Providers (requires additional Docker services)
LLM_PROVIDER=vllm # Local vLLM server
LLM_PROVIDER=ollama # Local Ollama server
🎯 Deployment Scenarios:
- 💰 Cost-Effective: OpenRouter with Claude Haiku (~$0.25/1M tokens)
- 🔒 Privacy-First: Local vLLM or Ollama (no data leaves your server)
- ⚡ Performance: OpenAI GPT-4o-mini (fastest cloud option)
- 📱 Small Models: Even 3B Ollama models can handle hours of content via intelligent sectioning
- 🚫 No LLM: Leave LLM_PROVIDER empty for transcription-only mode
See LLM_DEPLOYMENT_OPTIONS.md for detailed setup instructions.
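One practical note: OpenAI, vLLM, and recent Ollama versions all expose an OpenAI-compatible chat API, so a single client can target any of the setups above by swapping the base URL. A sketch reusing the values from the vLLM example; the prompt is illustrative:

```python
from openai import OpenAI

# For OpenAI itself, omit base_url and pass your real API key instead.
client = OpenAI(
    base_url="http://your-vllm-server:8000/v1",  # VLLM_BASE_URL from the example
    api_key="not-needed-for-local",
)

response = client.chat.completions.create(
    model="gpt-oss-20b",  # VLLM_MODEL_NAME from the example
    messages=[
        {"role": "system", "content": "Summarize in BLUF format with action items."},
        {"role": "user", "content": "<one transcript section>"},
    ],
)
print(response.choices[0].message.content)
```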
🗄️ Model Caching
OpenTranscribe automatically downloads and caches AI models for optimal performance. Models are saved locally and reused across container restarts.
Default Setup:
- All models are cached to the ./models/ directory in your project folder
- Models persist between Docker container restarts
- No re-downloading required after initial setup
Directory Structure:
./models/
├── huggingface/           # PyAnnote + WhisperX models
│   ├── hub/               # WhisperX transcription models (~1.5GB)
│   └── transformers/      # PyAnnote transformer models
├── torch/                 # PyTorch cache
│   └── hub/checkpoints/   # Wav2Vec2 alignment model (~360MB)
└── pyannote/              # PyAnnote diarization models (~500MB)
Custom Cache Location:
# Set custom directory in your .env file
MODEL_CACHE_DIR=/path/to/your/models
# Examples:
MODEL_CACHE_DIR=~/ai-models # Home directory
MODEL_CACHE_DIR=/mnt/storage/models # Network storage
MODEL_CACHE_DIR=./cache # Project subdirectory
Storage Requirements:
- WhisperX Models: ~1.5GB (depends on model size)
- PyAnnote Models: ~500MB (diarization + embedding)
- Alignment Model: ~360MB (Wav2Vec2)
- Total: ~2.5GB for complete setup
🔑 HuggingFace Token Setup
OpenTranscribe requires a HuggingFace token for speaker diarization and voice fingerprinting features. Follow these steps:
1. Generate HuggingFace Token
- Visit HuggingFace Settings > Access Tokens
- Click "New token" and select "Read" access
- Copy the generated token
2. Accept Model User Agreements ⚠️ CRITICAL - MUST ACCEPT BOTH!
You MUST accept the user agreements for BOTH PyAnnote models or speaker diarization will fail:
Segmentation Model (Required): https://huggingface.co/pyannote/segmentation
Speaker Diarization Model (Required): https://huggingface.co/pyannote/speaker-diarization
⚠️ Common Issue: If you only accept one model agreement, downloads will fail with a 'NoneType' object has no attribute 'eval' error. You MUST accept BOTH agreements.
3. Add your token to the environment configuration:
For Production Installation:
# The setup script will prompt you for your token
curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenTranscribe/master/setup-opentranscribe.sh | bash
For Manual Installation:
# Add to .env file
echo "HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" >> .env
Note: Without a valid HuggingFace token, speaker diarization will be disabled and speakers will not be automatically detected or identified across different media files.
Hardware & Resources
# GPU settings
USE_GPU=true # Enable GPU acceleration
CUDA_VISIBLE_DEVICES=0 # GPU device selection
# Resource limits
MAX_UPLOAD_SIZE=4GB # Maximum file size (supports GoPro videos)
CELERY_WORKER_CONCURRENCY=2 # Concurrent tasks
Production Deployment
For production use, make sure to cover the following areas:
Security Configuration
# Generate strong secrets
openssl rand -hex 32 # For SECRET_KEY
openssl rand -hex 32 # For JWT_SECRET_KEY
# Set strong database passwords
# Configure proper firewall rules
# Set up SSL/TLS certificates
Performance Optimization
# Use production environment
NODE_ENV=production
# Configure resource limits
# Set up monitoring and logging
# Configure backup strategies
Reverse Proxy Setup
# Example NGINX configuration
server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://localhost:5173;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /api {
        proxy_pass http://localhost:5174;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        # WebSocket upgrades for live progress updates
        # (adjust if your WebSocket endpoint lives elsewhere)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
🧪 Development
Development Environment
# Start development with hot reload
./opentr.sh start dev
# Backend development
cd backend/
pip install -r requirements.txt
pytest tests/ # Run tests
black app/ # Format code
flake8 app/ # Lint code
# Frontend development
cd frontend/
npm install
npm run dev # Development server
npm run test # Run tests
npm run lint # Lint code
Testing
# Backend tests
./opentr.sh shell backend
pytest tests/ # All tests
pytest tests/api/ # API tests only
pytest --cov=app tests/ # With coverage
# Frontend tests
cd frontend/
npm run test # Unit tests
npm run test:e2e # End-to-end tests
npm run test:components # Component tests
Contributing
We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines.
🔍 Troubleshooting
Common Issues
GPU Not Detected
# Check GPU availability
nvidia-smi
# Verify Docker GPU support
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
# Set CPU-only mode if needed
echo "USE_GPU=false" >> .env
Permission Errors (Model Cache / yt-dlp)
Symptoms:
- Error: Permission denied: '/home/appuser/.cache/huggingface/hub'
- Error: Permission denied: '/home/appuser/.cache/yt-dlp'
- YouTube downloads fail with permission errors
- Models fail to download or save
Cause: Docker creates model cache directories with root ownership, but containers run as non-root user (UID 1000) for security.
Solution:
# Option 1: Run the automated permission fix script (recommended)
cd opentranscribe # Or your installation directory
./scripts/fix-model-permissions.sh
# Option 2: Manual fix using Docker
docker run --rm -v ./models:/models busybox chown -R 1000:1000 /models
# Option 3: Manual fix using sudo (if available)
sudo chown -R 1000:1000 ./models
sudo chmod -R 755 ./models
Prevention for New Installations:
- Run the setup script as your regular (non-root) user so the models/ directory is created with the correct ownership
Why This Happens:
- Different Linux users have different UIDs (e.g., 1001, 1002)
- Running setup as root creates root-owned directories
- Docker version differences affect directory creation behavior
- The containers run as UID 1000 for security (non-root user)
Verification:
# Check directory ownership (should show UID 1000 or your user)
ls -la models/
# Test write permissions
touch models/huggingface/test.txt && rm models/huggingface/test.txt
Memory Issues
# Reduce model size
echo "WHISPER_MODEL=medium" >> .env
echo "BATCH_SIZE=8" >> .env
echo "COMPUTE_TYPE=int8" >> .env
# Monitor memory usage
docker stats
Slow Transcription
- Use GPU acceleration (USE_GPU=true)
- Reduce model size (WHISPER_MODEL=medium)
- Increase batch size if you have GPU memory
- Split large files into smaller segments
Database Connection Issues
# Reset database
./opentr.sh reset dev
# Check database logs
./opentr.sh logs postgres
# Verify database is running
./opentr.sh shell postgres psql -U postgres -l
Container Issues
# Check service status
./opentr.sh status
# Clean up resources
./opentr.sh clean
# Full reset (⚠️ deletes all data)
./opentr.sh reset dev
Getting Help
- 📚 Documentation: Check README files in each component directory
- 🐛 Issues: Report bugs on GitHub Issues
- 💬 Discussions: Ask questions in GitHub Discussions
- 📊 Monitoring: Use the Flower dashboard for task debugging
Hardware Recommendations
Minimum Requirements
- 8GB RAM
- 4 CPU cores
- 50GB disk space
- Any modern GPU (optional but recommended)
Recommended Configuration
- 16GB+ RAM
- 8+ CPU cores
- 100GB+ SSD storage
- NVIDIA GPU with 8GB+ VRAM (RTX 3070 or better)
- High-speed internet for model downloads
Production Scale
- 32GB+ RAM
- 16+ CPU cores
- Multiple GPUs for parallel processing
- Fast NVMe storage
- Load balancer for multiple instances
Performance Tuning
# GPU optimization
COMPUTE_TYPE=float16 # Use half precision
BATCH_SIZE=32 # Increase for more GPU memory
WHISPER_MODEL=large-v2 # Best accuracy
# CPU optimization (if no GPU)
COMPUTE_TYPE=int8 # Use quantization
BATCH_SIZE=1 # Reduce memory usage
WHISPER_MODEL=base # Faster processing
🔒 Security Considerations
Data Privacy
- Transcription and analysis run locally by default - no audio or transcripts leave your server unless you opt into a cloud LLM provider
- Optional: Disable external model downloads for air-gapped environments
- User data is encrypted at rest and in transit
- Configurable data retention policies
Access Control
- Role-based permissions (admin/user)
- File ownership validation
- API rate limiting
- Secure session management
Network Security
- All services run in isolated Docker network
- Configurable firewall rules
- Optional SSL/TLS termination
- Secure default configurations
📄 License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0) - see the LICENSE file for details.
The AGPL-3.0 license ensures that:
- The source code remains open and accessible to everyone
- Any modifications to the software must be made available to users
- Network use (SaaS) requires source code availability
- The open source community is protected against proprietary forks
🙏 Acknowledgments
- OpenAI Whisper - Foundation speech recognition model
- WhisperX - Enhanced alignment and diarization
- PyAnnote.audio - Speaker diarization capabilities
- FastAPI - Modern Python web framework
- Svelte - Reactive frontend framework
- Docker - Containerization platform
🔗 Useful Links
Built with ❤️ using AI assistance and modern open-source technologies.
OpenTranscribe demonstrates the power of AI-assisted development while maintaining full local control over your data and processing.