MaharshPatelX/realGUI-800K

realGUI-800K

Generate 800K question-answer pairs for training small vision-language models (VLMs) to detect GUI elements. The pipeline transforms 80,000 base GUI elements into a comprehensive training dataset using LLM paraphrasing.

Setup

  1. Install dependencies:

    pip install -r requirements.txt
  2. Configure environment:

    cp .env.example .env
    # Add your API keys (at least one required):
    # CEREBRAS_API_KEY=your_cerebras_key
    # GROQ_API_KEY=your_groq_key  
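
Since at least one API key is required, it can help to check for one before starting a long run. The snippet below is a minimal sketch, not code from this repository: `PROVIDER_KEYS`, `available_providers`, and `require_provider` are hypothetical names, and it assumes the values from `.env` have already been loaded into the process environment.

```python
import os

# Provider names mapped to the environment variables listed in .env.example.
# The dictionary itself is an illustrative assumption, not part of the project.
PROVIDER_KEYS = {
    "cerebras": "CEREBRAS_API_KEY",
    "groq": "GROQ_API_KEY",
}


def available_providers(env=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var, "").strip()]


def require_provider(env=None):
    """Raise a clear error if no provider key is configured."""
    providers = available_providers(env)
    if not providers:
        raise RuntimeError(
            "Set CEREBRAS_API_KEY or GROQ_API_KEY in .env before running main.py"
        )
    return providers
```

Calling `require_provider()` at startup would fail fast with a readable message instead of an authentication error partway through processing.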

Usage

Quick Start

  1. Prepare the dataset (if not already downloaded):

    python prepare_dataset.py   # Download and combine wave-ui dataset
  2. Run the main process:

    python main.py              # Interactive paraphrase generation

Interactive Features

When you run main.py, you'll see interactive menus:

LLM Provider Selection:

  • Cerebras - Fast inference with high performance
  • Groq - Alternative provider with competitive speeds

Dataset Processing Options:

  • 10 rows - Quick test (recommended for first run)
  • 100 rows - Medium test
  • 10,000 rows - Large test
  • All rows - Full dataset processing
  • Custom - Specify any number of rows

HuggingFace Hub Upload:

  • Choose whether to upload the processed dataset to HuggingFace Hub
  • Automatic upload with descriptive commit messages
  • Configurable repository via the HF_REPO_ID environment variable

Key Features

  • Multi-Provider Support: Choose between Cerebras and Groq LLM providers
  • Robust Retry Logic: 5-attempt retry with exponential backoff to minimize data loss
  • English-Only Output: All generated questions are in English, regardless of source language
  • Checkpoint Management: Automatic progress saving and resume capability
  • Rate Limit Handling: Intelligent retry logic for API rate limits
  • Flexible Processing: Choose exactly how many rows to process
  • Direct Hub Integration: Seamless upload to HuggingFace Hub
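
To illustrate the retry behaviour listed above, here is a minimal sketch of a 5-attempt retry with exponential backoff. It is not the project's implementation: `call_with_retry` and its parameters are hypothetical, and the delay values and jitter are assumptions chosen for the example.

```python
import random
import time


def call_with_retry(fn, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(); after each failure wait base_delay * 2**(attempt - 1) seconds
    (capped at max_delay, plus up to 10% jitter) and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))
```

With the defaults, the waits between attempts are roughly 1, 2, 4, and 8 seconds. A real client would probably catch only the provider's rate-limit and transient network errors rather than every `Exception`.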

Manual Operations

For individual operations, you can import and use functions directly from the core modules:

from core.dataset_loader import load_and_combine_dataset
from core.hub_uploader import upload_dataset_to_hub

Project Goals

  • 📊 Scale: Transform 80,000 GUI elements → 800,000 question-answer pairs
  • 🎯 Purpose: Train small VLMs for GUI element detection
  • 🤖 Method: LLM-powered paraphrase generation for data augmentation
  • 🌐 Open Source: Public dataset for community use

Structure

realGUI-800K/
├── main.py                   # Main paraphrase generation (interactive)
├── prepare_dataset.py        # Dataset preparation script
├── requirements.txt          # Dependencies
├── .env.example              # Environment template
├── README.md                 # This file
├── core/                     # Core functionality
│   ├── dataset_loader.py     # Dataset operations
│   ├── llm_client.py         # LLM interaction
│   └── hub_uploader.py       # HuggingFace operations
└── utils/                    # Utilities
    ├── config.py             # Configuration
    ├── logger.py             # Logging
    └── checkpoint_manager.py # Resume functionality
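
The resume functionality in `utils/checkpoint_manager.py` is not documented here. As a rough sketch of how interval-based checkpointing and resuming could work, consider the example below. The function names, the JSON layout, and the `checkpoint.json` file name are all hypothetical and are not taken from the repository.

```python
import json
import os
import tempfile

SAVE_INTERVAL = 50  # same value as the documented SAVE_INTERVAL setting


def load_checkpoint(path):
    """Return the index of the next row to process, or 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        return json.load(f)["next_index"]


def save_checkpoint(path, next_index):
    """Write to a temp file and rename it, so an interrupted save cannot corrupt the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def process(rows, handle, path="checkpoint.json", save_interval=SAVE_INTERVAL):
    """Process rows from the saved index onward, checkpointing every save_interval rows."""
    start = load_checkpoint(path)
    for i in range(start, len(rows)):
        handle(rows[i])
        if (i + 1) % save_interval == 0:
            save_checkpoint(path, i + 1)
    save_checkpoint(path, len(rows))  # record completion
```

In this sketch, rows handled after the last checkpoint are processed again on resume, so a smaller interval trades more disk writes for less repeated work.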

Configuration

Environment variables (set in .env):

# Required (at least one)
CEREBRAS_API_KEY=your_cerebras_api_key
GROQ_API_KEY=your_groq_api_key

# Optional - Processing Configuration
NUM_PARAPHRASES=10           # Number of paraphrases per element
SAVE_INTERVAL=50             # Save progress every N items
TEST_LIMIT=10                # Default test limit (overridden by interactive menu)

# Optional - Dataset Configuration  
INPUT_DATASET=agentsea/wave-ui
OUTPUT_DATASET_LOCAL=realGUI_800K_dataset
COMBINED_DATASET_LOCAL=wave_ui_combined

# Optional - HuggingFace Hub
HF_REPO_ID=your_username/realGUI-800K

# Optional - Model Configuration
CEREBRAS_MODEL_NAME=gpt-oss-120b           # Cerebras model
GROQ_MODEL_NAME=openai/gpt-oss-120b       # Groq model
MODEL_TEMPERATURE=0.6
MODEL_MAX_TOKENS=65536
MODEL_MAX_RETRIES=2
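
As an illustration of how these variables might be read with typed values, here is a minimal sketch. It treats the values shown above as defaults, which is an assumption, and `load_config` is a hypothetical helper rather than the contents of `utils/config.py`.

```python
import os


def load_config(env=None):
    """Read the optional settings, falling back to the values shown in this README."""
    env = os.environ if env is None else env
    return {
        "num_paraphrases": int(env.get("NUM_PARAPHRASES", "10")),
        "save_interval": int(env.get("SAVE_INTERVAL", "50")),
        "test_limit": int(env.get("TEST_LIMIT", "10")),
        "input_dataset": env.get("INPUT_DATASET", "agentsea/wave-ui"),
        "temperature": float(env.get("MODEL_TEMPERATURE", "0.6")),
        "max_tokens": int(env.get("MODEL_MAX_TOKENS", "65536")),
        "max_retries": int(env.get("MODEL_MAX_RETRIES", "2")),
    }
```

Environment variables are always strings, so the explicit `int` and `float` conversions surface a malformed value as a `ValueError` at startup rather than later in the run.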
