Generate 800K question-answer pairs for training small VLMs to detect GUI elements. The pipeline transforms 80,000 base GUI elements into a comprehensive training dataset using LLM paraphrasing.
1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Configure environment:

   ```bash
   cp .env.example .env
   ```

   Then add your API keys to `.env` (at least one required):

   ```
   CEREBRAS_API_KEY=your_cerebras_key
   GROQ_API_KEY=your_groq_key
   ```

3. Prepare the dataset (if not already downloaded):

   ```bash
   python prepare_dataset.py  # Download and combine the wave-ui dataset
   ```

4. Run the main process:

   ```bash
   python main.py  # Interactive paraphrase generation
   ```
When you run main.py, you'll see interactive menus:
LLM Provider Selection:
- Cerebras - Fast, high-throughput inference
- Groq - Alternative provider with competitive speeds
Dataset Processing Options:
- 10 rows - Quick test (recommended for first run)
- 100 rows - Medium test
- 10,000 rows - Large test
- All rows - Full dataset processing
- Custom - Specify any number of rows
HuggingFace Hub Upload:
- Choose whether to upload the processed dataset to HuggingFace Hub
- Automatic upload with descriptive commit messages
- Configurable repository via the `HF_REPO_ID` environment variable (a minimal upload sketch follows this list)
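The upload step follows the standard `datasets` push flow; a minimal sketch, assuming the generated pairs already live in a `datasets.Dataset` (the example rows and commit message are illustrative):

```python
from datasets import Dataset

# Hypothetical: a tiny stand-in for the generated question-answer pairs.
dataset = Dataset.from_dict({
    "question": ["Where is the submit button?"],
    "answer": ["bottom-right corner"],
})

# Push to the repository named in HF_REPO_ID; requires `huggingface-cli login`
# or an HF_TOKEN in the environment.
dataset.push_to_hub(
    "your_username/realGUI-800K",
    commit_message="Add paraphrase-generated QA pairs",
)
```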
Key Features:
- Multi-Provider Support: Choose between the Cerebras and Groq LLM providers
- Robust Retry Logic: 5-attempt retry with exponential backoff to minimize data loss (see the sketch after this list)
- English-Only Output: All generated questions are in English, regardless of source language
- Checkpoint Management: Automatic progress saving and resume capability
- Rate Limit Handling: Intelligent retry logic for API rate limits
- Flexible Processing: Choose exactly how many rows to process
- Direct Hub Integration: Seamless upload to HuggingFace Hub
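The retry behavior described above is a standard exponential-backoff loop; a minimal sketch of the pattern (the function name and delays are illustrative, not the project's actual code):

```python
import time

def call_with_retries(fn, max_attempts=5, base_delay=1.0):
    """Retry fn() up to max_attempts times, doubling the delay each time."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # e.g. rate-limit or transient network errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error instead of losing data
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s, ...
```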
For individual operations, you can import and use functions directly from the core modules:
```python
from core.dataset_loader import load_and_combine_dataset
from core.hub_uploader import upload_dataset_to_hub
```
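A minimal usage sketch built on those imports; the exact signatures and arguments are assumptions, so check each module's docstrings:

```python
# Hypothetical calls; argument names are assumptions.
dataset = load_and_combine_dataset()  # download and merge the wave-ui splits
upload_dataset_to_hub(dataset, repo_id="your_username/realGUI-800K")
```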
Project Highlights:
- 📊 Scale: Transform 80,000 GUI elements → 800,000 question-answer pairs
- 🎯 Purpose: Train small VLMs for GUI element detection
- 🤖 Method: LLM-powered paraphrase generation for data augmentation (see the sketch after this list)
- 🌐 Open Source: Public dataset for community use
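To make the method concrete, here is a sketch of the kind of paraphrase request the pipeline issues, shown with Groq's chat completions API; the prompt wording and response parsing are illustrative, not the actual implementation:

```python
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

element_text = "Click the blue 'Submit' button at the bottom of the form."
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{
        "role": "user",
        "content": "Rewrite the following GUI instruction in 10 different "
                   f"ways, in English, one per line:\n{element_text}",
    }],
    temperature=0.6,
)
paraphrases = response.choices[0].message.content.strip().splitlines()
```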
Project Structure:

```
realGUI-800K/
├── main.py                  # Main paraphrase generation (interactive)
├── prepare_dataset.py       # Dataset preparation script
├── requirements.txt         # Dependencies
├── .env.example             # Environment template
├── README.md                # This file
├── core/                    # Core functionality
│   ├── dataset_loader.py    # Dataset operations
│   ├── llm_client.py        # LLM interaction
│   └── hub_uploader.py      # HuggingFace operations
└── utils/                   # Utilities
    ├── config.py            # Configuration
    ├── logger.py            # Logging
    └── checkpoint_manager.py  # Resume functionality (sketched below)
```
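`checkpoint_manager.py` is what lets an interrupted run resume; a minimal sketch of the save/resume pattern it implies (the file name and JSON format are assumptions):

```python
import json
import os

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical path

def save_checkpoint(last_index: int, results: list) -> None:
    """Persist progress so a crashed or interrupted run can pick up here."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"last_index": last_index, "results": results}, f)

def load_checkpoint() -> dict:
    """Return prior progress, or a fresh state if no checkpoint exists."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"last_index": -1, "results": []}
```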
Environment variables (set in .env):
```bash
# Required (at least one)
CEREBRAS_API_KEY=your_cerebras_api_key
GROQ_API_KEY=your_groq_api_key

# Optional - Processing Configuration
NUM_PARAPHRASES=10    # Number of paraphrases per element
SAVE_INTERVAL=50      # Save progress every N items
TEST_LIMIT=10         # Default test limit (overridden by interactive menu)

# Optional - Dataset Configuration
INPUT_DATASET=agentsea/wave-ui
OUTPUT_DATASET_LOCAL=realGUI_800K_dataset
COMBINED_DATASET_LOCAL=wave_ui_combined

# Optional - HuggingFace Hub
HF_REPO_ID=your_username/realGUI-800K

# Optional - Model Configuration
CEREBRAS_MODEL_NAME=gpt-oss-120b       # Cerebras model
GROQ_MODEL_NAME=openai/gpt-oss-120b    # Groq model
MODEL_TEMPERATURE=0.6
MODEL_MAX_TOKENS=65536
MODEL_MAX_RETRIES=2
```
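These variables are typically read once at startup; a minimal loading sketch, assuming python-dotenv (the variable names match the block above, but the validation logic is illustrative):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # read .env into the process environment

# At least one provider key must be present.
if not (os.getenv("CEREBRAS_API_KEY") or os.getenv("GROQ_API_KEY")):
    raise SystemExit("Set CEREBRAS_API_KEY or GROQ_API_KEY in .env")

num_paraphrases = int(os.getenv("NUM_PARAPHRASES", "10"))
save_interval = int(os.getenv("SAVE_INTERVAL", "50"))
hf_repo_id = os.getenv("HF_REPO_ID")  # optional: enables Hub upload
```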