Oxygen

ML Dataset Manager — Scrape, filter, convert and augment AI training datasets. Built with Tauri + Svelte 5 + Python.

ML Dataset Manager — A powerful desktop tool for preparing, filtering, and augmenting machine learning datasets.


✨ Features

| Tab | Function |
| --- | --- |
| 🌐 Scraper | Download datasets from HuggingFace, GitHub, or direct URLs with anti-blocking |
| ✂️ Split | Split large JSON/JSONL files into smaller chunks by MB size |
| 🔄 Convert | Convert Parquet → JSON and any format → Alpaca fine-tuning format |
| 🔗 Merge | Merge multiple Alpaca JSON files, auto-removing duplicates |
| 🔍 Filter | Filter by domain, quality, word count, required/excluded keywords |
| 🧠 Analyze | Smart Parse: auto-detect format, analyze structure, get filter recommendations |
| 🤖 Augment | Generate variations using Claude API to multiply your dataset |
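
To make the Split tab's "chunks by MB size" concrete, here is a minimal sketch of size-based JSONL splitting. It is not taken from `split_jsonl.py`; the function name `split_jsonl`, the `max_mb` parameter, and the `_partNNN` file naming are assumptions made for illustration.

```python
import os


def split_jsonl(path, out_dir, max_mb=1.0):
    """Split a JSONL file into chunks of at most max_mb megabytes each.

    Lines are never cut in half, so a single line larger than the limit
    ends up alone in its own (oversized) chunk.
    """
    limit = int(max_mb * 1024 * 1024)
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(path))[0]
    chunks = []          # paths of the chunk files written so far
    out, size = None, 0  # current output handle and its size in bytes
    try:
        with open(path, "rb") as src:
            for line in src:
                # Start a new chunk when the current one would overflow.
                if out is None or (size > 0 and size + len(line) > limit):
                    if out is not None:
                        out.close()
                    name = f"{base}_part{len(chunks) + 1:03d}.jsonl"
                    chunk_path = os.path.join(out_dir, name)
                    out = open(chunk_path, "wb")
                    chunks.append(chunk_path)
                    size = 0
                out.write(line)
                size += len(line)
    finally:
        if out is not None:
            out.close()
    return chunks
```

Working in bytes rather than characters keeps the size check accurate for non-ASCII text, and concatenating the chunks in order reproduces the input file.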

🚀 Typical Workflow

🌐 Scrape → 🧠 Analyze → 🔍 Filter → 🔄 Convert to Alpaca → 🔗 Merge → 🏋️ Train!

📋 Requirements

  • Node.js 18+
  • Rust (for Tauri)
  • Python 3.10+
  • pip packages (see below)

🛠️ Installation

# 1. Clone the repo
git clone https://github.com/YOUR_USERNAME/oxygen.git
cd oxygen

# 2. Install Node dependencies
npm install

# 3. Install Python dependencies
pip install datasets huggingface_hub anthropic requests tqdm pandas pyarrow

# 4. Run in development mode
npm run tauri dev

# 5. Build for production
npm run tauri build

🌐 Scraper — Supported Sources

| Source | Example |
| --- | --- |
| HuggingFace Dataset ID | iamtarun/python_code_instructions_18k_alpaca |
| HuggingFace URL | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
| Direct file URL | https://example.com/data.jsonl |
| GitHub raw file | https://github.com/user/repo/blob/main/data.json |
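
The GitHub example above is a `blob` page URL, which serves an HTML page rather than the file itself. One plausible way to handle it is to rewrite it to the `raw.githubusercontent.com` form before downloading. The helper below is a sketch of that rewrite; `to_raw_github_url` is a hypothetical name, and this README does not say how `smart_scraper.py` actually resolves such URLs.

```python
from urllib.parse import urlparse


def to_raw_github_url(url):
    """Rewrite a github.com 'blob' page URL to its raw.githubusercontent.com form.

    URLs that do not match the blob pattern are returned unchanged.
    """
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    # Expected path shape: <user>/<repo>/blob/<ref>/<path...>
    if parsed.netloc == "github.com" and len(parts) >= 5 and parts[2] == "blob":
        user, repo, _, ref = parts[:4]
        file_path = "/".join(parts[4:])
        return f"https://raw.githubusercontent.com/{user}/{repo}/{ref}/{file_path}"
    return url


print(to_raw_github_url("https://github.com/user/repo/blob/main/data.json"))
```

The sketch treats the first path segment after `blob` as the branch or tag, so a branch name that itself contains a slash would be split incorrectly.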

Anti-blocking features:

  • 🔄 User-Agent rotation (4 different browsers)
  • ⏱️ Retry with exponential backoff
  • 📦 HuggingFace streaming mode for large datasets
  • 🔑 HF Token support for private datasets
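
The first two items in the list above can be sketched together: each attempt uses the next User-Agent, and the wait doubles after every failure. This is an illustration rather than the code in `smart_scraper.py`. The `USER_AGENTS` strings are placeholders (the four that Oxygen ships are not listed here), and `fetch_with_retry`, `fetch`, `max_retries`, and `base_delay` are assumed names.

```python
import itertools
import time

# Placeholder User-Agent strings, one per browser family (assumed values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Edg/120.0",
]


def fetch_with_retry(url, fetch, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url, headers) until it succeeds.

    The User-Agent header rotates on every attempt, and the delay between
    attempts grows as base_delay * 2**attempt (1s, 2s, 4s, ...).
    The last failure is re-raised once max_retries attempts are used up.
    """
    agents = itertools.cycle(USER_AGENTS)
    for attempt in range(max_retries):
        headers = {"User-Agent": next(agents)}
        try:
            return fetch(url, headers)
        except Exception:
            # A broad catch keeps the sketch short; real code would narrow it
            # to network and HTTP errors.
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

With the `requests` package from the installation step, `fetch` could be a small wrapper around `requests.get(url, headers=headers, timeout=30)` that calls `raise_for_status()` on the response. Passing `fetch` and `sleep` in as arguments also makes the retry logic testable without network access.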

🔍 Filter — Supported Domains

svelte5, python, coding, webdev, blender, zbrush, unreal, no filter
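
Besides the domain presets, the Filter tab accepts word-count limits and required/excluded keywords. A rough sketch of how such a check might look for one Alpaca-style record follows. `keep_record` and its parameters are hypothetical names, and how `filter_dataset.py` maps a domain such as `python` to keywords is not documented here, so domains are left out.

```python
def keep_record(record, min_words=0, max_words=None, required=(), excluded=()):
    """Return True if an Alpaca-style record passes the word-count and keyword checks.

    All three text fields are joined before checking; keyword matching is a
    case-insensitive substring test.
    """
    text = " ".join(str(record.get(key, "")) for key in ("instruction", "input", "output"))
    lowered = text.lower()
    n_words = len(text.split())
    if n_words < min_words:
        return False
    if max_words is not None and n_words > max_words:
        return False
    # Every required keyword must appear somewhere in the record.
    if any(kw.lower() not in lowered for kw in required):
        return False
    # No excluded keyword may appear.
    if any(kw.lower() in lowered for kw in excluded):
        return False
    return True


records = [
    {"instruction": "Write a Python function", "input": "", "output": "def f(): pass"},
    {"instruction": "Say hi", "input": "", "output": "hi"},
]
kept = [r for r in records if keep_record(r, min_words=5, required=["python"])]
```

Substring matching is simple but coarse: the keyword `java` would also match `javascript`, so a stricter filter might compare whole words instead.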

🤖 Augment — Variation Styles

| Style | Description |
| --- | --- |
| 🎲 Mixed | Random combination of all styles |
| ✏️ Rephrase | Reword the instruction differently |
| 💪 Harder | Add constraints and edge cases |
| 🌱 Simpler | Make more beginner-friendly |
| 🔀 Different | Create related but distinct task |
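
As a rough illustration of how the styles above could drive augmentation, the sketch below turns a style name into a prompt string. The prompt wording, the `STYLE_INSTRUCTIONS` mapping, and `build_variation_prompt` are all assumptions, and "Mixed" is interpreted here as picking one of the four concrete styles at random per example. The prompts that `generate_variations.py` really sends may differ.

```python
import random

# Guidance text per style, paraphrased from the style descriptions above.
STYLE_INSTRUCTIONS = {
    "rephrase": "Reword the instruction differently.",
    "harder": "Add constraints and edge cases.",
    "simpler": "Make it more beginner-friendly.",
    "different": "Create a related but distinct task.",
}


def build_variation_prompt(instruction, style="mixed", rng=random):
    """Return (resolved_style, prompt) for one source instruction.

    'mixed' resolves to a randomly chosen concrete style; any other value
    must be a key of STYLE_INSTRUCTIONS.
    """
    if style == "mixed":
        style = rng.choice(sorted(STYLE_INSTRUCTIONS))
    guidance = STYLE_INSTRUCTIONS[style]
    prompt = (
        "Rewrite the following instruction. "
        f"{guidance}\n\nInstruction: {instruction}"
    )
    return style, prompt
```

The resulting prompt would then go to the Claude API as a single user message, for instance through the `anthropic` package's `client.messages.create(...)`. Model choice and token limits are left open here.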

📁 Project Structure

oxygen/
├── src/                    # Svelte 5 frontend
│   └── routes/
│       └── +page.svelte    # Main UI (7 tabs)
├── src-tauri/              # Tauri/Rust backend
│   └── src/
│       └── lib.rs          # Tauri commands
├── python/                 # Python scripts
│   ├── smart_scraper.py    # Dataset downloader
│   ├── split_jsonl.py      # JSONL splitter
│   ├── split_json.py       # JSON splitter
│   ├── parquet_to_json.py  # Parquet converter
│   ├── convert_to_alpaca.py # Alpaca formatter
│   ├── merge_alpaca.py     # Dataset merger
│   ├── filter_dataset.py   # Dataset filter
│   ├── smart_parse.py      # Format analyzer
│   └── generate_variations.py # AI augmentation
└── README.md

🧠 Supported Input Formats

Oxygen auto-detects and handles:

  • Alpaca — instruction / input / output
  • Messages — messages[user/assistant]
  • HuggingFace Conversations — conversations[]
  • Problem/Solution — problem / code / reasoning
  • Prompt/Completion — prompt / completion
  • Question/Answer — question / answer
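
To show what auto-detection might involve, here is a hedged sketch that recognises each layout listed above by its keys and maps it to an Alpaca record. It does not reproduce `smart_parse.py` or `convert_to_alpaca.py`. The inner key names are assumptions based on common conventions (`role`/`content` for messages, `from`/`value` with `human`/`gpt` for conversations). Only the first user/assistant exchange is kept, and for Problem/Solution the `code` field is used as the output.

```python
def _first_exchange(turns, role_key, text_key, user_roles, assistant_roles):
    """Return the first user text and the first assistant text that follows it."""
    user = assistant = None
    for turn in turns:
        role, text = turn.get(role_key), turn.get(text_key, "")
        if user is None and role in user_roles:
            user = text
        elif user is not None and role in assistant_roles:
            assistant = text
            break
    return user, assistant


def to_alpaca(record):
    """Map one record in a recognised layout to an Alpaca dict, or return None."""
    if "instruction" in record and "output" in record:
        return {
            "instruction": record["instruction"],
            "input": record.get("input", ""),
            "output": record["output"],
        }
    if "messages" in record:
        user, assistant = _first_exchange(
            record["messages"], "role", "content", {"user"}, {"assistant"})
    elif "conversations" in record:
        user, assistant = _first_exchange(
            record["conversations"], "from", "value",
            {"human", "user"}, {"gpt", "assistant"})
    elif "prompt" in record and "completion" in record:
        user, assistant = record["prompt"], record["completion"]
    elif "question" in record and "answer" in record:
        user, assistant = record["question"], record["answer"]
    elif "problem" in record and "code" in record:
        user, assistant = record["problem"], record["code"]
    else:
        return None  # unknown layout
    if user is None or assistant is None:
        return None  # recognised layout, but no complete exchange found
    return {"instruction": user, "input": "", "output": assistant}
```

Returning `None` for unknown or incomplete records lets a caller count and report skipped rows instead of failing on the first unexpected shape.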

📄 License

This project is licensed under the GNU General Public License v3.0 — see the LICENSE file for details.


Built with ❤️ using Tauri + Svelte 5 + Python
