An instruction-tuned language model specialized for the Ethereum ecosystem, built by fine-tuning FLAN-T5 on a curated corpus of Ethereum documentation.
EthGPT addresses the challenge of answering Ethereum-specific technical questions by fine-tuning a smaller language model (FLAN-T5-small) on high-quality Ethereum documentation. The project compares full fine-tuning vs. LoRA (Low-Rank Adaptation) approaches.
- 21% lower training loss with full fine-tuning compared to LoRA
- Improved Ethereum-specific question answering accuracy over general-purpose LLMs
- Successfully handles complex queries about EIPs, protocol changes, and Ethereum concepts
Curated Ethereum Corpus containing:
- Ethereum Improvement Proposals (EIPs)
- Protocol documentation
- Technical whitepapers
- Ethereum.org content
Format: instruction-following records with `instruction`, `input` (context), and `output` fields
Sample Data Structure:
```json
{
  "instruction": "What voting strategies does EIP-225 recommend?",
  "input": "This document EIP-225, belongs to the Core category...",
  "output": "Voting strategies. Reorgs can drop singleton votes..."
}
```

Base Model:
- FLAN-T5-small from Google
- Sequence-to-sequence architecture
- Pre-trained on instruction-following tasks
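
For reference, the base checkpoint is available on the Hugging Face Hub as `google/flan-t5-small`; a minimal loading sketch:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the pre-trained FLAN-T5-small checkpoint used as the starting
# point for both fine-tuning approaches.
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")
```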

Full Fine-Tuning:
- Updates all model parameters
- Higher computational cost
- Best performance: 21% lower training loss

LoRA (Low-Rank Adaptation):
- Parameter-efficient training
- Updates only low-rank decomposition matrices
- Configuration:
  - `r=16` (rank)
  - `lora_alpha=32`
  - `lora_dropout=0.1`
- Target modules: `q`, `v` (query and value projections)
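
This configuration maps onto `peft`'s `LoraConfig`; a minimal sketch of attaching the adapter (variable names here are illustrative, not necessarily those used in `EthGPT-LoRA.py`):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import T5ForConditionalGeneration

base_model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

# LoRA configuration mirroring the values listed above
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,   # sequence-to-sequence LM (T5)
    r=16,                              # rank of the low-rank update matrices
    lora_alpha=32,                     # scaling factor
    lora_dropout=0.1,
    target_modules=["q", "v"],         # query and value projections
)

# Wrap the base model; only the LoRA matrices remain trainable
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```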
Training Hyperparameters:
- Epochs: 5
- Batch size: 2 per device
- Gradient accumulation: 4 steps
- Learning rate: 5e-4
- Weight decay: 0.01
- Max sequence length: 256 tokens
- Warmup ratio: 0.1

Data Processing:
- Format inputs as `"Question: {instruction} Context: {input}"`
- Tokenize with truncation at 256 tokens
- 80/20 train/validation split
- Dynamic padding via `DataCollatorForSeq2Seq`
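
A sketch of this preprocessing with Hugging Face `datasets`, using the sample record from above; the `preprocess` helper and the tiny in-memory dataset are illustrative only:

```python
from datasets import Dataset
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")

# Tiny in-memory corpus (the sample record repeated so the split below
# has something to work with); the real script loads the curated corpus.
records = 5 * [{
    "instruction": "What voting strategies does EIP-225 recommend?",
    "input": "This document EIP-225, belongs to the Core category...",
    "output": "Voting strategies. Reorgs can drop singleton votes...",
}]
dataset = Dataset.from_list(records)

def preprocess(example):
    # Combine instruction and context into a single prompt
    prompt = f"Question: {example['instruction']} Context: {example['input']}"
    model_inputs = tokenizer(prompt, max_length=256, truncation=True)
    labels = tokenizer(text_target=example["output"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, remove_columns=dataset.column_names)

# 80/20 train/validation split
splits = tokenized.train_test_split(test_size=0.2)
train_ds, eval_ds = splits["train"], splits["test"]
```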

Training Setup:
- Uses the HuggingFace `Seq2SeqTrainer`
- Evaluation every 50 steps
- Checkpoint saving every 100 steps
- Early stopping based on validation loss
- Automatic model saving on interruption
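
Putting the pieces together, a sketch of the trainer wiring that mirrors the hyperparameters and settings listed above; it continues from the preprocessing sketch (reusing `model`, `tokenizer`, `train_ds`, and `eval_ds`), and the output paths and early-stopping patience are illustrative:

```python
from transformers import (
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./eth-model",            # illustrative checkpoint directory
    num_train_epochs=5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-4,
    weight_decay=0.01,
    warmup_ratio=0.1,
    eval_strategy="steps",               # "evaluation_strategy" in older transformers
    eval_steps=50,
    save_steps=100,
    load_best_model_at_end=True,         # required for early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),  # dynamic padding
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience is illustrative
)

try:
    trainer.train()
finally:
    # Save the model even if training is interrupted
    trainer.save_model("./eth-model-final")
```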
To run training:

```bash
# Full fine-tuning
python EthGPT.py

# LoRA fine-tuning
python EthGPT-LoRA.py
```

To run inference with the fine-tuned model:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the fine-tuned model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("./eth-model-final")
tokenizer = T5Tokenizer.from_pretrained("./eth-model-final")

# Ask an Ethereum-specific question
input_text = "Question: What is EIP-1559?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=256)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)
```

Why FLAN-T5:
- Pre-trained on instruction-following tasks (aligned with our use case)
- Good balance of performance and computational efficiency
- Strong few-shot learning capabilities

Fine-Tuning Strategy:
- Full fine-tuning: Maximum adaptation to domain-specific knowledge
- LoRA: Resource-efficient alternative for deployment scenarios
- Empirical comparison shows 21% training loss improvement with full fine-tuning

Corpus Design:
- Focused on high-quality, authoritative Ethereum sources
- Structured instruction-response format for better alignment
- Manual curation and quality control of corpus
| Approach | Training Loss (Final) | Relative Improvement |
|---|---|---|
| LoRA | Baseline | - |
| Full Fine-Tuning | 21% lower | +21% |
- Better handling of technical EIP-specific queries
- Accurate protocol change explanations
- Contextually appropriate responses for Ethereum terminology
Requirements:
- torch
- transformers
- peft (for LoRA)
- datasets

Future Work:
- Scale to larger FLAN-T5 variants (base, large)
- Expand corpus with smart contract examples
- Add retrieval-augmented generation (RAG) for real-time EIP updates
- Benchmark against GPT-3.5/GPT-4 on Ethereum-specific tasks
- Deploy as API for developer tools integration