AI-Powered Transcription and Media Analysis Platform
OpenTranscribe is a powerful, containerized web application for transcribing and analyzing audio/video files using state-of-the-art AI models. Built with modern technologies and designed for scalability, it provides an end-to-end solution for speech-to-text conversion, speaker identification, and content analysis.
Note: This application is 99.9% created by AI using Windsurf and various commercial LLMs, demonstrating the power of AI-assisted development.
Key Features
Advanced Transcription
- High-Accuracy Speech Recognition: Powered by WhisperX with faster-whisper backend
- Word-Level Timestamps: Precise timing for every word using WAV2VEC2 alignment
- Multi-Language Support: Transcribe in multiple languages with automatic English translation
- Batch Processing: 70x realtime speed with large-v2 model on GPU
- Audio Waveform Visualization: Interactive waveform player with precise timing and click-to-seek
- Browser Recording: Built-in microphone recording with real-time audio level monitoring
- Recording Controls: Pause/resume recording with duration tracking and quality settings
Smart Speaker Management
- Automatic Speaker Diarization: Identify different speakers using PyAnnote.audio
- Cross-Video Speaker Recognition: AI-powered voice fingerprinting to identify speakers across different media files
- Speaker Profile System: Create and manage global speaker profiles that persist across all transcriptions
- Intelligent Speaker Suggestions: Consolidated speaker identification with confidence scoring and automatic profile matching
- LLM-Enhanced Speaker Recognition: Content-based speaker identification using conversational context analysis
- Profile Embedding Service: Advanced voice similarity matching using vector embeddings for cross-video speaker linking
- Smart Speaker Status Tracking: Comprehensive speaker verification status with computed fields for UI optimization
- Auto-Profile Creation: Automatic speaker profile creation and assignment when speakers are labeled
- Retroactive Speaker Matching: Cross-video speaker matching with automatic label propagation for high-confidence matches
- Custom Speaker Labels: Edit and manage speaker names and information with intelligent suggestions
- Speaker Analytics: View speaking time distribution, cross-media appearances, and interaction patterns
Comprehensive Media Management
- Universal Format Support: Audio (MP3, WAV, FLAC, M4A) and Video (MP4, MOV, AVI, MKV)
- Large File Support: Upload files up to 4GB for GoPro and high-quality video content
- Interactive Media Player: Click transcript to navigate playback
- Custom File Titles: Edit display names for media files with real-time search index updates
- Advanced Upload Manager: Floating, draggable upload manager with real-time progress tracking
- Concurrent Upload Processing: Multiple file uploads with queue management and retry logic
- Intelligent Upload System: Duplicate detection, hash verification, and automatic recovery
- Metadata Extraction: Comprehensive file information using ExifTool
- Subtitle Export: Generate SRT/VTT files for accessibility (see the sample SRT after this list)
- File Reprocessing: Re-run AI analysis while preserving user comments and annotations
- Auto-Recovery System: Intelligent detection and recovery of stuck or failed file processing
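For reference, a minimal SRT file (the format produced by the subtitle export above) looks like the following; the timings and cue text are made up for illustration:

1
00:00:01,000 --> 00:00:04,200
Welcome to the project kickoff meeting.

2
00:00:04,300 --> 00:00:07,500
Let's start with a quick status update.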
Powerful Search & Discovery
- Hybrid Search: Combine keyword and semantic search capabilities
- Full-Text Indexing: Lightning-fast content search with OpenSearch
- Advanced Filtering: Filter by speaker, date, tags, duration, and more
- Smart Tagging: Organize content with custom tags and categories
- Collections System: Group related media files into organized collections for better project management
Analytics & Insights
- Advanced Content Analysis: Comprehensive speaker analytics including talk time, interruption detection, and turn-taking patterns
- Speaker Performance Metrics: Speaking pace (WPM), question frequency, and conversation flow analysis
- Meeting Efficiency Analytics: Silence ratio analysis and participation balance tracking
- Real-Time Analytics Computation: Server-side analytics computation with automatic refresh capabilities
- Cross-Video Speaker Analytics: Track speaker patterns and participation across multiple recordings
- AI-Powered Summarization: Generate BLUF (Bottom Line Up Front) format summaries with meeting insights
- Multi-Provider LLM Support: Use local vLLM, OpenAI, Ollama, Claude, or OpenRouter for AI features
- Intelligent Section Processing: Automatically handles transcripts of any length using section-by-section analysis
- Custom AI Prompts: Create and manage custom summarization prompts for different content types
- LLM Configuration Management: User-specific LLM settings with encrypted API key storage
- Provider Testing: Test LLM connections and validate configurations before use
Collaboration Features
- Time-Stamped Comments: Add annotations at specific moments
- User Management: Role-based access control (admin/user) with personalized settings
- Recording Settings Management: User-specific audio recording preferences with quality controls
- Export Options: Download transcripts in multiple formats
- Real-Time Updates: Live progress tracking with detailed WebSocket notifications
- Enhanced Progress Tracking: 13 granular processing stages with descriptive messages
- Smart Notification System: Persistent notifications with unread count badges and progress updates
- WebSocket Integration: Real-time updates for transcription, summarization, and upload progress
- Collection Management: Create, organize, and share collections of related media files
- Smart Error Recovery: User-friendly error messages with specific guidance and auto-recovery options
- Full-Screen Transcript View: Dedicated modal for reading and searching long transcripts
- Auto-Refresh Systems: Background updates for file status without manual refreshing
Recording & Audio Features
- Browser-Based Recording: Direct microphone recording with no plugins required
- Real-Time Audio Level Monitoring: Visual audio level feedback during recording
- Multi-Device Support: Choose from available microphone devices
- Recording Quality Control: Configurable bitrate and format settings
- Pause/Resume Recording: Full recording session control with duration tracking
- Background Upload Processing: Seamless integration with upload queue system
- Recording Session Management: Persistent recording state with navigation warnings
AI-Powered Features
- Comprehensive LLM Integration: Support for 6+ providers (OpenAI, Claude, vLLM, Ollama, etc.)
- Custom Prompt Management: Create and manage AI prompts for different content types
- Encrypted Configuration Storage: Secure API key storage with user-specific settings
- Provider Connection Testing: Validate LLM configurations before use
- Intelligent Content Processing: Context-aware summarization with section-by-section analysis
- BLUF Format Summaries: Bottom Line Up Front structured summaries with action items
- Multi-Model Support: Works with models from 3B to 200B+ parameters
- Local & Cloud Processing: Support for both local (privacy-first) and cloud AI providers
Enhanced User Experience
- Progressive Web App: Installable app experience with offline capabilities
- Responsive Design: Optimized for desktop, tablet, and mobile devices
- Interactive Waveform Player: Click-to-seek audio visualization with precise timing
- Floating Upload Manager: Draggable upload interface with real-time progress
- Smart Modal System: Consistent modal design with improved accessibility
- Enhanced Data Formatting: Server-side formatting service for consistent display of dates, durations, and file sizes
- Error Categorization: Intelligent error classification with user-friendly suggestions and retry guidance
- Smart Status Management: Comprehensive file and task status tracking with formatted display text
- Auto-Refresh Systems: Background data updates without manual page refreshing
- Theme Support: Seamless dark/light mode switching
- Keyboard Shortcuts: Efficient navigation and control via hotkeys
Technology Stack
Frontend
- Svelte - Reactive UI framework with excellent performance
- TypeScript - Type-safe development with modern JavaScript and comprehensive ESLint integration
- Progressive Web App - Offline capabilities and native-like experience
- Responsive Design - Seamless experience across all devices
- Advanced UI Components - Draggable upload manager, modal consistency, and real-time status updates
- Code Quality Tooling - ESLint, TypeScript strict mode, and automated formatting
Backend
- FastAPI - High-performance async Python web framework
- SQLAlchemy 2.0 - Modern ORM with type safety
- Celery + Redis - Distributed task processing for AI workloads
- WebSocket - Real-time communication for live updates
AI/ML Stack
- WhisperX - Advanced speech recognition with alignment
- PyAnnote.audio - Speaker diarization and voice analysis
- Faster-Whisper - Optimized inference engine
- Multi-Provider LLM Integration - Support for vLLM, OpenAI, Ollama, Anthropic Claude, and OpenRouter
- Local LLM Support - Privacy-focused processing with vLLM and Ollama
- Intelligent Context Processing - Section-by-section analysis handles unlimited transcript lengths
- Universal Model Compatibility - Works with any model size from 3B to 200B+ parameters
Infrastructure
- PostgreSQL - Reliable relational database
- MinIO - S3-compatible object storage
- OpenSearch - Full-text and vector search engine
- Docker - Containerized deployment
- NGINX - Production web server
Quick Start
Prerequisites
# Required
- Docker and Docker Compose
- 8GB+ RAM (16GB+ recommended)
# Recommended for optimal performance
- NVIDIA GPU with CUDA support
Quick Installation (Using Docker Hub Images)
Run this one-liner to download and set up OpenTranscribe using our pre-built Docker Hub images:
curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenTranscribe/master/setup-opentranscribe.sh | bash
Then follow the on-screen instructions. The setup script will:
- Download the production Docker Compose file
- Configure environment variables including GPU support (default GPU device ID: 2)
- Help you set up your Hugging Face token (required for speaker diarization)
- Set up the management script (opentranscribe.sh)
Once setup is complete, start OpenTranscribe with:
cd opentranscribe
./opentranscribe.sh start
The Docker images are available on Docker Hub as separate repositories:
- davidamacey/opentranscribe-backend: Backend service (also used for celery-worker and flower)
- davidamacey/opentranscribe-frontend: Frontend service
Access the web interface at http://localhost:5173
Manual Installation (From Source)
Clone the Repository
git clone https://github.com/davidamacey/OpenTranscribe.git
cd OpenTranscribe
# Make utility script executable
chmod +x opentr.sh
Environment Configuration
# Copy environment template
cp .env.example .env
# Edit .env file with your settings (optional for development)
# Key variables:
# - HUGGINGFACE_TOKEN (required for speaker diarization)
# - GPU settings for optimal performance
Start OpenTranscribe
# Start in development mode (with hot reload)
./opentr.sh start dev
# Or start in production mode
./opentr.sh start prod
Access the Application
Open http://localhost:5173 in your browser and log in or create an account.
OpenTranscribe Utility Commands
The opentr.sh script provides comprehensive management for all application operations:
Basic Operations
# Start the application
./opentr.sh start [dev|prod] # Start in development or production mode
./opentr.sh stop # Stop all services
./opentr.sh status # Show container status
./opentr.sh logs [service] # View logs (all or specific service)
Development Workflow
# Service management
./opentr.sh restart-backend # Restart API and workers without database reset
./opentr.sh restart-frontend # Restart frontend only
./opentr.sh restart-all # Restart all services without data loss
# Container rebuilding (after code changes)
./opentr.sh rebuild-backend # Rebuild backend with new code
./opentr.sh rebuild-frontend # Rebuild frontend with new code
./opentr.sh build # Rebuild all containers
Database Management
# Data operations (⚠️ DESTRUCTIVE)
./opentr.sh reset [dev|prod] # Complete reset - deletes ALL data!
./opentr.sh init-db # Initialize database without container reset
# Backup and restore
./opentr.sh backup # Create timestamped database backup
./opentr.sh restore [file] # Restore from backup file
System Administration
# Maintenance
./opentr.sh clean # Remove unused containers and images
./opentr.sh health # Check service health status
./opentr.sh shell [service] # Open shell in container
# Available services: backend, frontend, postgres, redis, minio, opensearch, celery-worker
Monitoring and Debugging
# View specific service logs
./opentr.sh logs backend # API server logs
./opentr.sh logs celery-worker # AI processing logs
./opentr.sh logs frontend # Frontend development logs
./opentr.sh logs postgres # Database logs
# Follow logs in real-time
./opentr.sh logs backend -f
Usage Guide
Getting Started
User Registration
- Navigate to http://localhost:5173
- Create an account or use default admin credentials
- Set up your profile and preferences
Upload or Record Content
- File Upload: Click "Upload Files" or drag-and-drop media files (up to 4GB)
- Direct Recording: Use the microphone button in the navbar for browser-based recording
- URL Processing: Paste YouTube or media URLs for automatic processing
- Supported formats: MP3, WAV, MP4, MOV, and more
- Files are automatically queued for concurrent processing
Monitor Processing
- Watch detailed real-time progress with 13 processing stages
- Use the floating upload manager for multi-file progress tracking
- View task status in Flower monitor or notifications panel
- Receive live WebSocket notifications for all status changes
Explore Your Content
- Interactive Transcript: Click on transcript text to navigate media playback
- Waveform Player: Click on audio waveform for precise seeking
- Custom Titles: Edit file display names for better organization and searchability
- Speaker Management: Edit speaker names and add custom labels
- AI Summaries: Generate BLUF format summaries with custom prompts
- Comments: Add time-stamped comments and annotations
- Collections: Organize files into themed collections
- Full-Screen View: Use transcript modal for detailed reading and searching
Configure AI Features (Optional)
- Set up LLM providers in User Settings for AI summarization
- Create custom prompts for different content types
- Test provider connections before processing
Advanced Features
Recording Workflow
Device Selection → Level Monitoring → Session Control → Background Upload
- Choose from available microphone devices
- Monitor real-time audio levels during recording
- Pause/resume recording sessions with duration tracking
- Seamless integration with background upload processing
AI-Powered Processing
LLM Configuration → Custom Prompts → Content Analysis → BLUF Summaries
- Configure multiple LLM providers (OpenAI, Claude, vLLM, Ollama, etc.)
- Create custom prompts for different content types (meetings, interviews, podcasts)
- Test provider connections and validate configurations
- Generate structured summaries with action items and key decisions
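As an illustration of the BLUF structure, a generated summary leads with the conclusion and then lists supporting detail; the content below is a made-up example, not actual application output:

BOTTOM LINE: The team agreed to ship the beta on March 3rd, pending final QA sign-off.

Key Decisions:
- Feature freeze takes effect this Friday
- Marketing owns the launch announcement

Action Items:
- Alice: finalize the release notes by Tuesday
- Bob: schedule the QA regression run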
Speaker Management
Automatic Detection → AI Recognition → Profile Management → Cross-Media Tracking
- Speakers are automatically detected and assigned labels using advanced AI diarization
- AI suggests speaker identities based on voice fingerprinting across your media library
- Create global speaker profiles that persist across all your transcriptions
- Accept or reject AI suggestions with confidence scores to improve accuracy over time
- Track speaker appearances across multiple media files with detailed analytics
Advanced Upload Management
Concurrent Uploads → Progress Tracking → Retry Logic → Queue Management
- Floating, draggable upload manager with real-time progress
- Multiple file uploads with intelligent queue processing
- Automatic retry logic for failed uploads with exponential backoff
- Duplicate detection with hash verification
Search and Discovery
Keyword Search → Semantic Search → Smart Filtering → Waveform Navigation
- Search transcript content with advanced filters
- Use semantic search to find related concepts
- Click-to-seek navigation via interactive waveform visualization
- Organize content with custom tags and categories
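To verify that content has actually been indexed, you can query the OpenSearch service directly. This is a debugging sketch that assumes port 9200 is exposed to your host, which depends on the port mappings in your compose file:

# List indices and document counts (assumes OpenSearch is reachable on localhost:9200)
curl -s "http://localhost:9200/_cat/indices?v"
# Run a quick full-text query across all indices
curl -s "http://localhost:9200/_search?q=budget&pretty" | head -n 40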
Collections Management
Create Collections → Organize Files → Bulk Operations → Inline Editing
- Group related media files into named collections
- Inline collection editing with tag-style interface
- Filter library view by specific collections
- Bulk add/remove files from collections with drag-and-drop support
Real-Time Notifications
Progress Updates → Status Tracking → WebSocket Integration → Completion Alerts
- Persistent notification panel with unread count badges
- Real-time updates for transcription, summarization, and upload progress
- WebSocket integration for instant status updates
- Smart notification grouping and auto-refresh systems
Export and Integration
Multiple Formats → Subtitle Files → API Access → Media Downloads
- Export transcripts as TXT, JSON, or CSV
- Generate SRT/VTT subtitle files with embedded timing
- Access data programmatically via comprehensive REST API
- Download media files with embedded subtitles
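As a sketch of programmatic access, the example below assumes a JSON login endpoint and a file-listing endpoint under /api (the backend port 5174 is taken from the reverse-proxy example later in this README); verify the exact paths against the interactive FastAPI docs of your instance. Requires jq:

# Authenticate and capture a JWT (endpoint paths and field names are assumptions)
TOKEN=$(curl -s -X POST http://localhost:5174/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email": "admin@example.com", "password": "your-password"}' | jq -r '.access_token')

# List your media files using the token
curl -s http://localhost:5174/api/files -H "Authorization: Bearer $TOKEN" | jq .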
Project Structure
OpenTranscribe/
├── backend/                  # Python FastAPI backend
│   ├── app/                  # Application modules
│   │   ├── api/              # REST API endpoints
│   │   ├── models/           # Database models
│   │   ├── services/         # Business logic
│   │   ├── tasks/            # Background AI processing
│   │   ├── utils/            # Common utilities
│   │   └── db/               # Database configuration
│   ├── scripts/              # Admin and maintenance scripts
│   ├── tests/                # Comprehensive test suite
│   └── README.md             # Backend documentation
├── frontend/                 # Svelte frontend application
│   ├── src/                  # Source code
│   │   ├── components/       # Reusable UI components
│   │   ├── routes/           # Page components
│   │   ├── stores/           # State management
│   │   └── styles/           # CSS and themes
│   └── README.md             # Frontend documentation
├── database/                 # Database initialization
├── models_ai/                # AI model storage (runtime)
├── scripts/                  # Utility scripts
├── docker-compose.yml        # Container orchestration
├── opentr.sh                 # Main utility script
└── README.md                 # This file
Configuration
Environment Variables
Core Application
# Database
DATABASE_URL=postgresql://postgres:password@postgres:5432/opentranscribe
# Security
SECRET_KEY=your-super-secret-key-here
JWT_SECRET_KEY=your-jwt-secret-key
# Object Storage
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin
MINIO_BUCKET_NAME=transcribe-app
AI Processing
# Required for speaker diarization - see setup instructions below
HUGGINGFACE_TOKEN=your_huggingface_token_here
# Model configuration
WHISPER_MODEL=large-v2 # large-v2, medium, small, base
COMPUTE_TYPE=float16 # float16, int8
BATCH_SIZE=16 # Reduce if GPU memory limited
# Speaker detection
MIN_SPEAKERS=1 # Minimum speakers to detect
MAX_SPEAKERS=10 # Maximum speakers to detect
# Model caching (recommended)
MODEL_CACHE_DIR=./models # Directory to store downloaded AI models
LLM Configuration (AI Features)
OpenTranscribe offers flexible AI deployment options. Choose the approach that best fits your infrastructure:
Quick Setup Options:
Cloud-Only (Recommended for Most Users)
# Configure for OpenAI in .env
LLM_PROVIDER=openai
OPENAI_API_KEY=your_openai_key
OPENAI_MODEL_NAME=gpt-4o-mini
# Start without local LLM
./opentr.sh start dev
Local vLLM (High-Performance GPUs)
# Configure for vLLM in .env
LLM_PROVIDER=vllm
VLLM_MODEL_NAME=gpt-oss-20b
# Start with vLLM service (requires 16GB+ VRAM)
docker compose -f docker-compose.yml -f docker-compose.vllm.yml up
Local Ollama (Consumer GPUs)
# Configure for Ollama in .env
LLM_PROVIDER=ollama
OLLAMA_MODEL_NAME=llama3.2:3b-instruct-q4_K_M
# Edit docker-compose.vllm.yml and uncomment ollama service
# Then start with both compose files
docker compose -f docker-compose.yml -f docker-compose.vllm.yml up
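If the model is not already present in the Ollama container, you can pull it into the running service first; this assumes the compose service is named ollama, as in docker-compose.vllm.yml:

# Pull the configured model inside the Ollama service (service name is an assumption)
docker compose -f docker-compose.yml -f docker-compose.vllm.yml exec ollama \
  ollama pull llama3.2:3b-instruct-q4_K_M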
Complete Provider Configuration:
# Cloud Providers (configure in .env)
LLM_PROVIDER=openai # openai, anthropic, custom (openrouter)
OPENAI_API_KEY=your_openai_key # OpenAI GPT models
ANTHROPIC_API_KEY=your_claude_key # Anthropic Claude models
OPENROUTER_API_KEY=your_or_key # OpenRouter (multi-provider)
# Local Providers (requires additional Docker services)
LLM_PROVIDER=vllm # Local vLLM server
LLM_PROVIDER=ollama # Local Ollama server
Deployment Scenarios:
- Cost-Effective: OpenRouter with Claude Haiku (~$0.25/1M tokens)
- Privacy-First: Local vLLM or Ollama (no data leaves your server)
- Performance: OpenAI GPT-4o-mini (fastest cloud option)
- Small Models: Even 3B Ollama models can handle hours of content via intelligent sectioning
- No LLM: Leave LLM_PROVIDER empty for transcription-only mode
See LLM_DEPLOYMENT_OPTIONS.md for detailed setup instructions.
Model Caching
OpenTranscribe automatically downloads and caches AI models for optimal performance. Models are saved locally and reused across container restarts.
Default Setup:
- All models are cached to the ./models/ directory in your project folder
- Models persist between Docker container restarts
- No re-downloading required after initial setup
Directory Structure:
./models/
├── huggingface/          # PyAnnote + WhisperX models
│   ├── hub/              # WhisperX transcription models (~1.5GB)
│   └── transformers/     # PyAnnote transformer models
├── torch/                # PyTorch cache
│   └── hub/checkpoints/  # Wav2Vec2 alignment model (~360MB)
└── pyannote/             # PyAnnote diarization models (~500MB)
Custom Cache Location:
# Set custom directory in your .env file
MODEL_CACHE_DIR=/path/to/your/models
# Examples:
MODEL_CACHE_DIR=~/ai-models # Home directory
MODEL_CACHE_DIR=/mnt/storage/models # Network storage
MODEL_CACHE_DIR=./cache # Project subdirectory
Storage Requirements:
- WhisperX Models: ~1.5GB (depends on model size)
- PyAnnote Models: ~500MB (diarization + embedding)
- Alignment Model: ~360MB (Wav2Vec2)
- Total: ~2.5GB for complete setup
HuggingFace Token Setup
OpenTranscribe requires a HuggingFace token for speaker diarization and voice fingerprinting features. Follow these steps:
1. Generate HuggingFace Token
- Visit HuggingFace Settings > Access Tokens
- Click "New token" and select "Read" access
- Copy the generated token
2. Accept Model User Agreements
You must accept the user agreements for the gated PyAnnote models (diarization and embedding) on their HuggingFace model pages.
3. Configure Your Token
Add your token to the environment configuration:
For Production Installation:
# The setup script will prompt you for your token
curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenTranscribe/master/setup-opentranscribe.sh | bash
For Manual Installation:
# Add to .env file
echo "HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" >> .env
Note: Without a valid HuggingFace token, speaker diarization will be disabled and speakers will not be automatically detected or identified across different media files.
Performance Tuning
# GPU settings
USE_GPU=true # Enable GPU acceleration
CUDA_VISIBLE_DEVICES=0 # GPU device selection
# Resource limits
MAX_UPLOAD_SIZE=4GB # Maximum file size (supports GoPro videos)
CELERY_WORKER_CONCURRENCY=2 # Concurrent tasks
Production Deployment
For production use, address the following:
Security Configuration
# Generate strong secrets
openssl rand -hex 32 # For SECRET_KEY
openssl rand -hex 32 # For JWT_SECRET_KEY
# Set strong database passwords
# Configure proper firewall rules
# Set up SSL/TLS certificates
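If you want to apply the generated values directly, here is a small sketch, assuming SECRET_KEY and JWT_SECRET_KEY lines already exist in your .env:

# Replace placeholder secrets in .env with freshly generated values
sed -i "s/^SECRET_KEY=.*/SECRET_KEY=$(openssl rand -hex 32)/" .env
sed -i "s/^JWT_SECRET_KEY=.*/JWT_SECRET_KEY=$(openssl rand -hex 32)/" .env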
Performance Optimization
# Use production environment
NODE_ENV=production
# Configure resource limits
# Set up monitoring and logging
# Configure backup strategies
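One simple backup strategy is a nightly cron job that calls the documented backup command; the install path and log location below are assumptions to adjust for your server:

# crontab -e: run a timestamped database backup every night at 02:00
0 2 * * * cd /opt/OpenTranscribe && ./opentr.sh backup >> /var/log/opentranscribe-backup.log 2>&1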
Reverse Proxy Setup
# Example NGINX configuration
server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://localhost:5173;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /api {
        proxy_pass http://localhost:5174;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
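Because live progress updates travel over WebSockets, the proxy also needs to forward upgrade headers. A minimal sketch to add inside the server block above, assuming the WebSocket endpoint is exposed under a /ws path on the backend (verify the exact path for your deployment):

location /ws {
    proxy_pass http://localhost:5174;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
}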
Development
Development Environment
# Start development with hot reload
./opentr.sh start dev
# Backend development
cd backend/
pip install -r requirements.txt
pytest tests/ # Run tests
black app/ # Format code
flake8 app/ # Lint code
# Frontend development
cd frontend/
npm install
npm run dev # Development server
npm run test # Run tests
npm run lint # Lint code
Testing
# Backend tests
./opentr.sh shell backend
pytest tests/ # All tests
pytest tests/api/ # API tests only
pytest --cov=app tests/ # With coverage
# Frontend tests
cd frontend/
npm run test # Unit tests
npm run test:e2e # End-to-end tests
npm run test:components # Component tests
Contributing
We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines.
Troubleshooting
Common Issues
GPU Not Detected
# Check GPU availability
nvidia-smi
# Verify Docker GPU support
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
# Set CPU-only mode if needed
echo "USE_GPU=false" >> .env
Memory Issues
# Reduce model size
echo "WHISPER_MODEL=medium" >> .env
echo "BATCH_SIZE=8" >> .env
echo "COMPUTE_TYPE=int8" >> .env
# Monitor memory usage
docker stats
Slow Transcription
- Use GPU acceleration (USE_GPU=true)
- Reduce model size (WHISPER_MODEL=medium)
- Increase batch size if you have GPU memory
- Split large files into smaller segments
Database Connection Issues
# Reset database
./opentr.sh reset dev
# Check database logs
./opentr.sh logs postgres
# Verify database is running
./opentr.sh shell postgres psql -U postgres -l
Container Issues
# Check service status
./opentr.sh status
# Clean up resources
./opentr.sh clean
# Full reset (⚠️ deletes all data)
./opentr.sh reset dev
Getting Help
- Documentation: Check README files in each component directory
- Issues: Report bugs on GitHub Issues
- Discussions: Ask questions in GitHub Discussions
- Monitoring: Use the Flower dashboard for task debugging
Hardware Recommendations
Minimum Requirements
- 8GB RAM
- 4 CPU cores
- 50GB disk space
- Any modern GPU (optional but recommended)
Recommended Configuration
- 16GB+ RAM
- 8+ CPU cores
- 100GB+ SSD storage
- NVIDIA GPU with 8GB+ VRAM (RTX 3070 or better)
- High-speed internet for model downloads
Production Scale
- 32GB+ RAM
- 16+ CPU cores
- Multiple GPUs for parallel processing
- Fast NVMe storage
- Load balancer for multiple instances
Optimization Settings
# GPU optimization
COMPUTE_TYPE=float16 # Use half precision
BATCH_SIZE=32 # Increase for more GPU memory
WHISPER_MODEL=large-v2 # Best accuracy
# CPU optimization (if no GPU)
COMPUTE_TYPE=int8 # Use quantization
BATCH_SIZE=1 # Reduce memory usage
WHISPER_MODEL=base # Faster processing
Security Considerations
Data Privacy
- All transcription processing happens locally; no data is sent to external services unless you opt into a cloud LLM provider
- Optional: Disable external model downloads for air-gapped environments
- User data is encrypted at rest and in transit
- Configurable data retention policies
Access Control
- Role-based permissions (admin/user)
- File ownership validation
- API rate limiting
- Secure session management
Network Security
- All services run in isolated Docker network
- Configurable firewall rules
- Optional SSL/TLS termination
- Secure default configurations
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- OpenAI Whisper - Foundation speech recognition model
- WhisperX - Enhanced alignment and diarization
- PyAnnote.audio - Speaker diarization capabilities
- FastAPI - Modern Python web framework
- Svelte - Reactive frontend framework
- Docker - Containerization platform
Built with ❤️ using AI assistance and modern open-source technologies.
OpenTranscribe demonstrates the power of AI-assisted development while maintaining full local control over your data and processing.