Generate 800K question-answer pairs for training small VLMs to detect GUI elements. The pipeline transforms 80,000 base GUI elements into a comprehensive training dataset using LLM paraphrasing.
1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Configure environment:

   ```bash
   cp .env.example .env
   ```

   Then add your API keys to `.env` (at least one required):

   ```
   CEREBRAS_API_KEY=your_cerebras_key
   GROQ_API_KEY=your_groq_key
   ```

3. Prepare the dataset (if not already downloaded):

   ```bash
   python prepare_dataset.py  # Download and combine the wave-ui dataset
   ```

4. Run the main process:

   ```bash
   python main.py  # Interactive paraphrase generation
   ```
When you run main.py, you'll see interactive menus:
LLM Provider Selection:
- Cerebras - Fast, high-throughput inference
- Groq - Alternative provider with competitive speeds
Dataset Processing Options:
- 10 rows - Quick test (recommended for first run)
- 100 rows - Medium test
- 10,000 rows - Large test
- All rows - Full dataset processing
- Custom - Specify any number of rows
HuggingFace Hub Upload:
- Choose whether to upload the processed dataset to HuggingFace Hub
- Automatic upload with descriptive commit messages
- Configurable repository via the `HF_REPO_ID` environment variable (a minimal upload sketch follows this list)
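The upload step follows the standard `datasets` push flow; a minimal sketch, assuming the generated pairs already live in a `datasets.Dataset` (the example rows and commit message are illustrative):

```python
from datasets import Dataset

# Hypothetical: a tiny stand-in for the generated question-answer pairs.
dataset = Dataset.from_dict({
    "question": ["Where is the submit button?"],
    "answer": ["bottom-right corner"],
})

# Push to the repository named in HF_REPO_ID; requires `huggingface-cli login`
# or an HF_TOKEN in the environment.
dataset.push_to_hub(
    "your_username/realGUI-800K",
    commit_message="Add paraphrase-generated QA pairs",
)
```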
Key Features:
- Multi-Provider Support: Choose between the Cerebras and Groq LLM providers
- Robust Retry Logic: 5-attempt retry with exponential backoff to minimize data loss (see the sketch after this list)
- English-Only Output: All generated questions are in English, regardless of source language
- Checkpoint Management: Automatic progress saving and resume capability
- Rate Limit Handling: Intelligent retry logic for API rate limits
- Flexible Processing: Choose exactly how many rows to process
- Direct Hub Integration: Seamless upload to HuggingFace Hub
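The retry behavior described above is a standard exponential-backoff loop; a minimal sketch of the pattern (the function name and delays are illustrative, not the project's actual code):

```python
import time

def call_with_retries(fn, max_attempts=5, base_delay=1.0):
    """Retry fn() up to max_attempts times, doubling the delay each time."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # e.g. rate-limit or transient network errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error instead of losing data
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s, ...
```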
For individual operations, you can import and use functions directly from the core modules:
```python
from core.dataset_loader import load_and_combine_dataset
from core.hub_uploader import upload_dataset_to_hub
```
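A minimal usage sketch built on those imports; the exact signatures and arguments are assumptions, so check each module's docstrings:

```python
# Hypothetical calls; argument names are assumptions.
dataset = load_and_combine_dataset()  # download and merge the wave-ui splits
upload_dataset_to_hub(dataset, repo_id="your_username/realGUI-800K")
```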
Project Highlights:
- 📊 Scale: Transform 80,000 GUI elements → 800,000 question-answer pairs
- 🎯 Purpose: Train small VLMs for GUI element detection
- 🤖 Method: LLM-powered paraphrase generation for data augmentation (see the sketch after this list)
- 🌐 Open Source: Public dataset for community use
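To make the method concrete, here is a sketch of the kind of paraphrase request the pipeline issues, shown with Groq's chat completions API; the prompt wording and response parsing are illustrative, not the actual implementation:

```python
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

element_text = "Click the blue 'Submit' button at the bottom of the form."
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{
        "role": "user",
        "content": "Rewrite the following GUI instruction in 10 different "
                   f"ways, in English, one per line:\n{element_text}",
    }],
    temperature=0.6,
)
paraphrases = response.choices[0].message.content.strip().splitlines()
```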
Project Structure:

```
realGUI-800K/
├── main.py                  # Main paraphrase generation (interactive)
├── prepare_dataset.py       # Dataset preparation script
├── requirements.txt         # Dependencies
├── .env.example             # Environment template
├── README.md                # This file
├── core/                    # Core functionality
│   ├── dataset_loader.py    # Dataset operations
│   ├── llm_client.py        # LLM interaction
│   └── hub_uploader.py      # HuggingFace operations
└── utils/                   # Utilities
    ├── config.py            # Configuration
    ├── logger.py            # Logging
    └── checkpoint_manager.py  # Resume functionality (sketched below)
```
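`checkpoint_manager.py` is what lets an interrupted run resume; a minimal sketch of the save/resume pattern it implies (the file name and JSON format are assumptions):

```python
import json
import os

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical path

def save_checkpoint(last_index: int, results: list) -> None:
    """Persist progress so a crashed or interrupted run can pick up here."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"last_index": last_index, "results": results}, f)

def load_checkpoint() -> dict:
    """Return prior progress, or a fresh state if no checkpoint exists."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"last_index": -1, "results": []}
```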
Environment variables (set in .env):
```bash
# Required (at least one)
CEREBRAS_API_KEY=your_cerebras_api_key
GROQ_API_KEY=your_groq_api_key

# Optional - Processing Configuration
NUM_PARAPHRASES=10    # Number of paraphrases per element
SAVE_INTERVAL=50      # Save progress every N items
TEST_LIMIT=10         # Default test limit (overridden by interactive menu)

# Optional - Dataset Configuration
INPUT_DATASET=agentsea/wave-ui
OUTPUT_DATASET_LOCAL=realGUI_800K_dataset
COMBINED_DATASET_LOCAL=wave_ui_combined

# Optional - HuggingFace Hub
HF_REPO_ID=your_username/realGUI-800K

# Optional - Model Configuration
CEREBRAS_MODEL_NAME=gpt-oss-120b       # Cerebras model
GROQ_MODEL_NAME=openai/gpt-oss-120b    # Groq model
MODEL_TEMPERATURE=0.6
MODEL_MAX_TOKENS=65536
MODEL_MAX_RETRIES=2
```
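These variables are typically read once at startup; a minimal loading sketch, assuming python-dotenv (the variable names match the block above, but the validation logic is illustrative):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # read .env into the process environment

# At least one provider key must be present.
if not (os.getenv("CEREBRAS_API_KEY") or os.getenv("GROQ_API_KEY")):
    raise SystemExit("Set CEREBRAS_API_KEY or GROQ_API_KEY in .env")

num_paraphrases = int(os.getenv("NUM_PARAPHRASES", "10"))
save_interval = int(os.getenv("SAVE_INTERVAL", "50"))
hf_repo_id = os.getenv("HF_REPO_ID")  # optional: enables Hub upload
```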