EthGPT: Instruction-Tuned LLM for Ethereum

An instruction-tuned language model specialized for the Ethereum ecosystem, built by fine-tuning FLAN-T5 on a curated corpus of Ethereum documentation.

Overview

EthGPT addresses the challenge of answering Ethereum-specific technical questions by fine-tuning a smaller language model (FLAN-T5-small) on high-quality Ethereum documentation. The project compares full fine-tuning vs. LoRA (Low-Rank Adaptation) approaches.

Key Results

  • 21% lower training loss with full fine-tuning compared to LoRA
  • Improved Ethereum-specific question answering accuracy over general-purpose LLMs
  • Successfully handles complex queries about EIPs, protocol changes, and Ethereum concepts

Dataset

A curated Ethereum corpus containing:

  • Ethereum Improvement Proposals (EIPs)
  • Protocol documentation
  • Technical whitepapers
  • Ethereum.org content

Format: instruction-following records with instruction, input (context), and output fields

Sample Data Structure:

{
    "instruction": "What voting strategies does EIP-225 recommend?",
    "input": "This document EIP-225, belongs to the Core category...",
    "output": "Voting strategies. Reorgs can drop singleton votes..."
}
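
A minimal sketch of loading records in this format with the Hugging Face datasets library; the filename eth_corpus.json is a placeholder, not the repository's actual data file:

from datasets import load_dataset

# Load instruction / input / output records from a JSON file (placeholder filename)
dataset = load_dataset("json", data_files="eth_corpus.json", split="train")
print(dataset[0]["instruction"])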

Architecture

Base Model

  • FLAN-T5-small from Google
  • Sequence-to-sequence architecture
  • Pre-trained on instruction-following tasks
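
For reference, the base checkpoint can be pulled directly from the Hugging Face Hub; a minimal sketch (the training scripts may load it differently):

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the pre-trained FLAN-T5-small checkpoint that fine-tuning starts from
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")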

Two Training Approaches

1. Full Fine-Tuning (EthGPT.py)

  • Updates all model parameters
  • Higher computational cost
  • Best performance: 21% lower training loss

2. LoRA Fine-Tuning (EthGPT-LoRA.py)

  • Parameter-efficient training
  • Updates only low-rank decomposition matrices
  • Configuration:
    • r=16 (rank)
    • lora_alpha=32
    • lora_dropout=0.1
    • Target modules: q, v (query and value projections)
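
A sketch of how this configuration maps onto the peft API; details may differ from EthGPT-LoRA.py:

from peft import LoraConfig, TaskType, get_peft_model
from transformers import T5ForConditionalGeneration

base_model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

# Low-rank adapters on the query and value projections of each attention block
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q", "v"],
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # prints the small fraction of parameters being trained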

Training Configuration

Training Hyperparameters:
- Epochs: 5
- Batch size: 2 per device
- Gradient accumulation: 4 steps
- Learning rate: 5e-4
- Weight decay: 0.01
- Max sequence length: 256 tokens
- Warmup ratio: 0.1
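
These settings translate roughly to the Seq2SeqTrainingArguments below, assuming the scripts use the Hugging Face Trainer API as described in the training pipeline section; the output directory is illustrative and argument names can vary across transformers versions:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./eth-model",          # illustrative path
    num_train_epochs=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-4,
    weight_decay=0.01,
    warmup_ratio=0.1,
    eval_strategy="steps",             # "evaluation_strategy" in older transformers releases
    eval_steps=50,
    save_steps=100,
    load_best_model_at_end=True,       # needed for early stopping on validation loss
    metric_for_best_model="eval_loss",
)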

Implementation Details

Data Preprocessing

  1. Format inputs as: "Question: {instruction} Context: {input}"
  2. Tokenize with truncation at 256 tokens
  3. 80/20 train/validation split
  4. Dynamic padding via DataCollatorForSeq2Seq
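
A minimal sketch of this preprocessing, assuming the tokenizer and dataset objects from the earlier snippets; variable names are illustrative:

def preprocess(example):
    # Combine instruction and context into the prompt format used for training
    prompt = f"Question: {example['instruction']} Context: {example['input']}"
    model_inputs = tokenizer(prompt, max_length=256, truncation=True)
    labels = tokenizer(text_target=example["output"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# 80/20 train/validation split, then tokenize both splits
splits = dataset.train_test_split(test_size=0.2)
tokenized = splits.map(preprocess, remove_columns=splits["train"].column_names)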

Training Pipeline

  • Uses HuggingFace Seq2SeqTrainer
  • Evaluation every 50 steps
  • Checkpoint saving every 100 steps
  • Early stopping based on validation loss
  • Automatic model saving on interruption
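
Putting the pieces together, a hedged sketch of the trainer wiring; the early-stopping patience and save path are assumptions, not values taken from the scripts:

from transformers import DataCollatorForSeq2Seq, EarlyStoppingCallback, Seq2SeqTrainer

# Dynamic padding per batch instead of padding every example to 256 tokens
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience is illustrative
)

try:
    trainer.train()
finally:
    # Save the model even if training is interrupted
    trainer.save_model("./eth-model-final")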

Usage

Full Fine-Tuning

python EthGPT.py

LoRA Fine-Tuning

python EthGPT-LoRA.py

Inference Example

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the fine-tuned model and tokenizer from the training output directory
model = T5ForConditionalGeneration.from_pretrained("./eth-model-final")
tokenizer = T5Tokenizer.from_pretrained("./eth-model-final")

# Build the prompt in the same "Question: ... Context: ..." style used for training (context omitted here)
input_text = "Question: What is EIP-1559?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=256)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)

Key Technical Decisions

Why FLAN-T5?

  • Pre-trained on instruction-following tasks (aligned with our use case)
  • Good balance of performance and computational efficiency
  • Strong few-shot learning capabilities

Why Compare Full Fine-Tuning vs. LoRA?

  • Full fine-tuning: Maximum adaptation to domain-specific knowledge
  • LoRA: Resource-efficient alternative for deployment scenarios
  • Empirical comparison shows a 21% lower final training loss with full fine-tuning

Data Quality Over Quantity

  • Focused on high-quality, authoritative Ethereum sources
  • Structured instruction-response format for better alignment
  • Manual curation and quality control of corpus

Results & Insights

Performance Comparison

Approach         | Training Loss (Final) | Relative Improvement
-----------------|-----------------------|---------------------
LoRA             | Baseline              | -
Full Fine-Tuning | 21% lower             | +21%

Qualitative Improvements

  • Better handling of technical EIP-specific queries
  • Accurate protocol change explanations
  • Contextually appropriate responses for Ethereum terminology

Dependencies

torch
transformers
peft (for LoRA)
datasets
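
These can be installed with pip, for example:

pip install torch transformers peft datasets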

Future Work

  • Scale to larger FLAN-T5 variants (base, large)
  • Expand corpus with smart contract examples
  • Add retrieval-augmented generation (RAG) for real-time EIP updates
  • Benchmark against GPT-3.5/GPT-4 on Ethereum-specific tasks
  • Deploy as API for developer tools integration
