ML Dataset Manager: a powerful desktop tool for preparing, filtering, and augmenting machine learning datasets.
| Tab | Function |
|---|---|
| Scraper | Download datasets from HuggingFace, GitHub, or direct URLs with anti-blocking |
| Split | Split large JSON/JSONL files into smaller chunks by MB size |
| Convert | Convert Parquet → JSON and any format → Alpaca fine-tuning format |
| Merge | Merge multiple Alpaca JSON files, auto-removing duplicates |
| Filter | Filter by domain, quality, word count, required/excluded keywords |
| Analyze | Smart Parse: auto-detect format, analyze structure, get filter recommendations |
| Augment | Generate variations using Claude API to multiply your dataset |
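The Convert tab is described as mapping any supported format to Alpaca. A minimal sketch of how such a mapping could look, using the key names Oxygen lists among its auto-detected formats. `to_alpaca` is a hypothetical helper, not the contents of `convert_to_alpaca.py`, and the `role`/`content` fields inside `messages` are an assumption about the chat layout.

```python
from typing import Any, Optional


def to_alpaca(record: dict[str, Any]) -> Optional[dict[str, str]]:
    """Map one record to Alpaca keys, or return None if the layout is not recognized."""
    # Already Alpaca-shaped: normalize a missing or null input to an empty string.
    if "instruction" in record and "output" in record:
        return {
            "instruction": str(record["instruction"]),
            "input": str(record.get("input") or ""),
            "output": str(record["output"]),
        }
    # Simple two-field layouts.
    for src, dst in (("prompt", "completion"), ("question", "answer")):
        if src in record and dst in record:
            return {"instruction": str(record[src]), "input": "", "output": str(record[dst])}
    # Chat layout (assumed role/content fields): take the first user turn
    # and the first assistant turn, ignoring any later turns.
    messages = record.get("messages")
    if isinstance(messages, list):
        turns = [m for m in messages if isinstance(m, dict)]
        user = next((m.get("content") for m in turns if m.get("role") == "user"), None)
        reply = next((m.get("content") for m in turns if m.get("role") == "assistant"), None)
        if user is not None and reply is not None:
            return {"instruction": str(user), "input": "", "output": str(reply)}
    return None
```

Returning `None` for unrecognized records lets the caller count or log skipped rows instead of silently writing malformed entries.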
Scrape → Analyze → Filter → Convert to Alpaca → Merge → Train!
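The Merge step is described as removing duplicates automatically. A minimal sketch of one way to do that, assuming two records count as duplicates when their instruction, input, and output match after whitespace and case normalization. That duplicate rule and the `merge_alpaca` function are assumptions for illustration, not the behavior of `merge_alpaca.py`.

```python
import json
from typing import Iterable


def merge_alpaca(datasets: Iterable[list[dict]]) -> list[dict]:
    """Concatenate Alpaca record lists, keeping the first copy of each duplicate."""
    seen: set[str] = set()
    merged: list[dict] = []
    for records in datasets:
        for rec in records:
            # Assumed duplicate key: the three Alpaca fields with whitespace
            # collapsed and case folded, serialized so field boundaries are kept.
            key = json.dumps(
                [" ".join(str(rec.get(k, "")).split()).casefold()
                 for k in ("instruction", "input", "output")]
            )
            if key not in seen:
                seen.add(key)
                merged.append(rec)
    return merged
```

Keeping the first occurrence preserves the order of the input files, so the result is deterministic for a given file order.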
```bash
# 1. Clone the repo
git clone https://github.com/YOUR_USERNAME/oxygen.git
cd oxygen

# 2. Install Node dependencies
npm install

# 3. Install Python dependencies
pip install datasets huggingface_hub anthropic requests tqdm pandas pyarrow

# 4. Run in development mode
npm run tauri dev

# 5. Build for production
npm run tauri build
```
| Source | Example |
|---|---|
| HuggingFace Dataset ID | iamtarun/python_code_instructions_18k_alpaca |
| HuggingFace URL | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
| Direct file URL | https://example.com/data.jsonl |
| GitHub raw file | https://github.com/user/repo/blob/main/data.json |
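The GitHub example in the table is a `blob` page URL, which serves an HTML page rather than the file itself. One common way to fetch the file is to rewrite the URL to its `raw.githubusercontent.com` form. Whether `smart_scraper.py` does this is not stated here; `github_raw_url` below is a hypothetical helper.

```python
from urllib.parse import urlparse


def github_raw_url(url: str) -> str:
    """Rewrite a github.com 'blob' page URL to its raw.githubusercontent.com form."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    # Expected shape: <user>/<repo>/blob/<ref>/<path...>
    if parsed.netloc == "github.com" and len(parts) >= 5 and parts[2] == "blob":
        user, repo, _, *rest = parts
        return f"https://raw.githubusercontent.com/{user}/{repo}/{'/'.join(rest)}"
    # Anything else (direct file URLs, HuggingFace links) is passed through unchanged.
    return url
```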
Anti-blocking features:
`svelte5`, `python`, `coding`, `webdev`, `blender`, `zbrush`, `unreal`, `no filter`
| Style | Description |
|---|---|
| Mixed | Random combination of all styles |
| Rephrase | Reword the instruction differently |
| Harder | Add constraints and edge cases |
| Simpler | Make more beginner-friendly |
| Different | Create related but distinct task |
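A rough sketch of how the styles in the table above could be turned into a prompt before it is sent to the Claude API. The prompt wording, the `STYLES` mapping, and `build_prompt` are all assumptions for illustration; in particular, "Mixed" is interpreted here as picking one style at random per call, which may differ from what `generate_variations.py` does.

```python
import random
from typing import Optional

# Style descriptions taken from the table; the dictionary keys are hypothetical.
STYLES = {
    "rephrase": "Reword the instruction differently.",
    "harder": "Add constraints and edge cases.",
    "simpler": "Make it more beginner-friendly.",
    "different": "Create a related but distinct task.",
}


def build_prompt(instruction: str, style: str = "mixed",
                 rng: Optional[random.Random] = None) -> str:
    """Build the text prompt for one variation of `instruction`."""
    rng = rng or random.Random()
    if style == "mixed":
        # Assumed meaning of "Mixed": choose one concrete style at random.
        style = rng.choice(sorted(STYLES))
    if style not in STYLES:
        raise ValueError(f"unknown style: {style!r}")
    return (
        f"{STYLES[style]}\n\n"
        f"Original instruction:\n{instruction}\n\n"
        "Return only the new instruction."
    )
```

Passing a seeded `random.Random` makes the "mixed" choice reproducible, which helps when regenerating a dataset.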
```
oxygen/
├── src/                          # Svelte 5 frontend
│   └── routes/
│       └── +page.svelte          # Main UI (7 tabs)
├── src-tauri/                    # Tauri/Rust backend
│   └── src/
│       └── lib.rs                # Tauri commands
├── python/                       # Python scripts
│   ├── smart_scraper.py          # Dataset downloader
│   ├── split_jsonl.py            # JSONL splitter
│   ├── split_json.py             # JSON splitter
│   ├── parquet_to_json.py        # Parquet converter
│   ├── convert_to_alpaca.py      # Alpaca formatter
│   ├── merge_alpaca.py           # Dataset merger
│   ├── filter_dataset.py         # Dataset filter
│   ├── smart_parse.py            # Format analyzer
│   └── generate_variations.py    # AI augmentation
└── README.md
```
Oxygen auto-detects and handles:
- `instruction / input / output`
- `messages[user/assistant]`
- `conversations[]`
- `problem / code / reasoning`
- `prompt / completion`
- `question / answer`

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.