Analyze 341K+ Spotify play events through a 22-module Python pipeline with K-means clustering, graph algorithms, and NLP enrichment. Interactive Svelte 5 frontend with D3.js + GSAP visualizations.
Pipeline & Data Processing
ML & Statistical Analysis
Frontend Visualization
Architecture
Spotify Streaming History (JSON)
↓
[ingest.py] → SQLite plays table (341K+ records)
↓
[enrich_*] → Parallel enrichment (Last.fm, MusicBrainz, Wikipedia, Deezer, LrcLib)
↓
[pipeline/analysis/a01–a22] → 22 analysis modules
├─ a01: Overview stats & heatmaps
├─ a02-a06: Sessions, skips, circadian patterns, platforms, replays
├─ a07-a08: Albums, mood timeline
├─ a09: Audio clusters (K-means)
├─ a10-a15: Audio feature arcs, genres, artist lifecycle, networks, age, popularity
├─ a16-a17: Lyrics themes & lines
└─ a19-a22: Personality, seasonal patterns, absences
↓
[JSON outputs → build/]
↓
[Svelte Frontend] → 102 components, 6 interactive "acts"
↓
HTML/CSS/JS (static site)
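As a rough illustration of the ingest step in the flow above, the sketch below loads Spotify history JSON files into a SQLite `plays` table in WAL mode. It is not the project's `ingest.py`: the `ingest` function, the four-column schema, and the choice of fields are assumptions made for this example. The JSON keys follow Spotify's extended streaming history export; other export formats use different keys.

```python
import json
import sqlite3
from pathlib import Path

# Hypothetical schema for illustration; the real plays table likely has more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS plays (
    ts        TEXT NOT NULL,
    track     TEXT,
    artist    TEXT,
    ms_played INTEGER NOT NULL
)
"""


def ingest(history_dir, db_path):
    """Load every *.json file in history_dir into plays; return the row count inserted."""
    conn = sqlite3.connect(db_path)
    try:
        # WAL lets readers keep working while a writer appends rows.
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute(SCHEMA)
        inserted = 0
        for path in sorted(Path(history_dir).glob("*.json")):
            events = json.loads(path.read_text(encoding="utf-8"))
            # Assumed keys from the extended streaming history export.
            rows = [
                (
                    e["ts"],
                    e.get("master_metadata_track_name"),
                    e.get("master_metadata_album_artist_name"),
                    e.get("ms_played", 0),
                )
                for e in events
            ]
            conn.executemany("INSERT INTO plays VALUES (?, ?, ?, ?)", rows)
            inserted += len(rows)
        conn.commit()
        return inserted
    finally:
        conn.close()
```

Under these assumptions, `ingest("streaming_history", "music_recap.db")` would return the number of play events written.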
| Module | Category | Description |
|---|---|---|
| a01_overview | Stats | Total plays, unique tracks/artists, listening hours, year range |
| a02_sessions | Sessions | Listening session detection, duration patterns |
| a03_skips | Behavior | Skip analysis, early exit rates |
| a04_circadian | Patterns | Hour-of-day, day-of-week patterns |
| a05_platforms | Context | Device/platform breakdown (web, mobile, desktop) |
| a06_replays | Behavior | Replay frequency, favorites |
| a07_albums | Collections | Album stats, top albums |
| a08_mood_timeline | Trends | Mood sentiment over time |
| a09_audio_clusters | ML | K-means clusters on audio features |
| a10_feature_arcs | Trends | Audio feature evolution |
| a11_genres | Collections | Genre distribution, trends |
| a12_artist_lifecycle | Trends | Artist discovery, activity, decline |
| a13_network | Graph | Co-listening network, community detection |
| a14_musical_age | Stats | Artist debut, listener tenure |
| a15_popularity | Stats | Popularity scores, Spotify metrics |
| a16_lyrics_themes | NLP | TF-IDF themes, emotion dictionaries |
| a17_lyrics_lines | NLP | Lyric line extraction & sentiment |
| a19_personality | NLP | Listening personality profile |
| a20_seasonal | Patterns | Seasonal trends |
| a22_absences | Behavior | Gaps in listening, offline periods |
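To make the session detection in `a02_sessions` more concrete, here is a minimal sketch of one common approach: start a new session whenever the gap between consecutive plays exceeds a threshold. The `detect_sessions` function and the 30-minute default are assumptions for illustration, not the module's actual logic, and the gap is measured between play timestamps without accounting for track length.

```python
from datetime import datetime, timedelta


def detect_sessions(timestamps, max_gap=timedelta(minutes=30)):
    """Group play timestamps into sessions, splitting where the gap exceeds max_gap."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap:
            sessions[-1].append(ts)  # close enough to the previous play: same session
        else:
            sessions.append([ts])    # first play, or gap too large: new session
    return sessions


plays = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 4),
    datetime(2024, 1, 1, 14, 0),
]
print([len(s) for s in detect_sessions(plays)])  # [2, 1]
```

Session duration patterns can then be derived from the first and last timestamp in each group.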
Prerequisites: Python 3.9+, Node.js 18+
# 1. Install Python dependencies
pip install -r requirements.txt
# 2. Ingest Spotify history (requires JSON files in streaming_history/)
python -m pipeline.ingest --db music_recap.db
# 3. Enrich metadata (Last.fm, MusicBrainz, etc.)
python -m pipeline.enrich_artists --db music_recap.db
# 4. Run analysis pipeline
python -m pipeline.analysis --db music_recap.db --build build
# 5. Start frontend dev server
cd frontend && npm install && npm run dev
# Open http://localhost:5173
Commands
make ingest # Run ingestion
make enrich-artists # Enrich artist metadata
make analyze # Run full analysis pipeline
make test # Run pytest
make clean # Remove database
| Layer | Technology | Role |
|---|---|---|
| Backend | Python 3.9+ | Pipeline orchestration |
| Database | SQLite3 (WAL) | 341K+ play events, metadata cache |
| ML/Stats | NumPy, SciPy | K-means, t-tests, correlations |
| Enrichment | Requests, parallel | Last.fm, MusicBrainz, Wikipedia, Deezer, LrcLib APIs |
| Frontend | Svelte 5, SvelteKit | Reactive component framework |
| Visualization | D3.js, GSAP | Charts, animations, interactions |
| Styling | Tailwind CSS | Responsive design |
| Testing | Pytest | Unit tests for pipelines |
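For readers unfamiliar with the K-means step listed under ML/Stats, the sketch below shows the basic algorithm (Lloyd's iteration). It is written in plain Python so it runs without dependencies; the project itself lists NumPy and SciPy, and its real implementation may differ. The `kmeans` function, its parameters, and the idea of passing each track as a tuple of audio features (for example energy and valence) are assumptions for illustration.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm over equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical (energy, valence) pairs forming two obvious groups.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centroids, labels = kmeans(features, k=2)
print(labels)
```

The printed labels place the first three points in one cluster and the last three in the other; which cluster gets index 0 depends on the random initialization.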
Created for santifer.io. Portfolio: cv-santiago