🚀 EarthShock

🎯 Problem Statement

What are we trying to solve?

  • Generate realistic synthetic data to compensate for missing or flawed existing datasets.
  • Use LLM capabilities to generate data intelligently from metadata or prompts.

🛠️ Solution Approach

How we plan to tackle this:

  • Build an LLM-powered agent that intelligently generates synthetic data.
  • Leverage LLMs to understand context and metadata for informed data creation.
  • Create a system capable of producing realistic, meaningful data from prompts or schema information (a minimal sketch follows this list).
  • Explore the potential benefits of using multiple LLMs.
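
To make the approach concrete, here is a minimal sketch of the core generation call, assuming an OpenAI-compatible chat endpoint; the function, prompt wording, and model choice are illustrative rather than our exact implementation:

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK (>=1.0) and an API key in the env

client = OpenAI()

def generate_rows(table_schema: dict, n_rows: int = 5) -> list[dict]:
    """Ask the model for n_rows of realistic data matching a JSON table schema."""
    prompt = (
        f"Generate {n_rows} rows of realistic synthetic data for this table schema. "
        "Return only a JSON array of row objects, no prose.\n\n"
        + json.dumps(table_schema, indent=2)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    # Real responses may need markdown-fence stripping before parsing.
    return json.loads(response.choices[0].message.content)

rows = generate_rows({"table": "customers",
                      "columns": {"name": "string", "signup_date": "date"}})
```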

📋 Implementation

What we built:

While we developed more features than listed here, these points directly relate to our core goal.

  • Key Features/Components:
    • AI-powered data generation for small datasets (5-10 rows).
    • Intelligent filename generation and CSV saving.
  • Architecture Decisions:
    • Followed the "build from scratch" philosophy, avoiding external frameworks.
    • Used hand-fed JSON table schemas as input.
  • Code Structure Overview:
    • The overall structure mirrors Fireball's approach: tools are registered in a tools class so the AI agent can autonomously decide which tool to invoke (see the sketch after this list).
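
An illustrative sketch of that tool-registry pattern; the class, tool names, and the `llm` helper are hypothetical stand-ins, not the actual EarthShock code:

```python
import csv
import json

def llm(prompt: str) -> str:
    """Stand-in for a chat-completion call (see the earlier sketch)."""
    raise NotImplementedError

class Tools:
    """Registry of callables the agent can choose between by name."""

    def generate_data(self, schema: str, n_rows: int = 5) -> str:
        """Ask the model for n_rows of synthetic data matching a JSON schema."""
        return llm(f"Generate {n_rows} rows for this schema as a JSON array:\n{schema}")

    def save_csv(self, rows_json: str, filename: str) -> str:
        """Write generated rows to a CSV file (the filename itself can be LLM-suggested)."""
        rows = json.loads(rows_json)
        with open(filename, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
        return f"wrote {len(rows)} rows to {filename}"

    def dispatch(self, name: str, **kwargs) -> str:
        """Invoke the tool the agent selected; unknown names fail loudly."""
        return getattr(self, name)(**kwargs)

# An agent loop asks the LLM to reply with {"tool": ..., "arguments": {...}}
# drawn from this registry, then runs Tools().dispatch(choice["tool"], **choice["arguments"]).
```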

📊 Retrospective

✅ What Worked Well

  • Rapid Onboarding & Code Portability: Marco was able to contribute within 60 minutes of starting, building his own tool and integrating it into the agent workflow. This demonstrates excellent code design and ease of use.
  • Successful Proof-of-Concept (PoC): We hit our primary goal, generating realistic data for any given schema, before lunchtime. This freed the afternoon for exploration and learning.
  • Agentic Workflow Understanding: Our clear understanding of agentic workflows, combined with our self-built code base, enabled rapid deployment.

❌ What Didn't Work

  • Inconsistencies at Scale: Generating large amounts of data solely through LLM calls resulted in noticeable inconsistencies.
  • Limited Scalability: Our build-from-scratch approach hinders scalability.
  • Debugging Challenges: Pasting JSON into the terminal triggered an LLM call for each pasted line, causing unexpected delays and lost debugging time (a buffered-input sketch follows this list).
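
One likely fix, sketched below: buffer stdin until the pasted text parses as a complete JSON document, and only then hand it downstream, so a multi-line paste costs a single LLM call:

```python
import json
import sys

def read_json_block() -> dict:
    """Accumulate pasted lines until they form one complete JSON document
    (assumes an object at the top level, as with our table schemas)."""
    buffer = ""
    for line in sys.stdin:
        buffer += line
        try:
            return json.loads(buffer)  # complete document: stop reading
        except json.JSONDecodeError:
            continue  # still mid-paste: keep accumulating
    raise ValueError("stdin closed before a complete JSON document arrived")
```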

🔄 What We'd Do Differently

  • Prioritize Framework Adoption: Begin with a framework designed for scale from the outset.
  • Refactor Data Generation Tool: Develop a more Pythonic data generation tool powered by LLMs.
  • Implement Orchestration: Add more LLM orchestration so a router can call specialized agents, improving scalability and customization (see the routing sketch after this list).
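
A minimal routing sketch of that idea; the specialist agents and the `llm` helper are hypothetical:

```python
def llm(prompt: str) -> str:
    """Stand-in chat call, as in the earlier sketches."""
    raise NotImplementedError

def schema_agent(task: str) -> str:
    """Specialist with its own prompt (and possibly model) for schema analysis."""
    raise NotImplementedError

def generation_agent(task: str) -> str:
    """Specialist tuned for bulk row generation."""
    raise NotImplementedError

ROUTES = {"schema": schema_agent, "generation": generation_agent}

def orchestrate(task: str) -> str:
    """Ask a router LLM which specialist fits the task, then dispatch to it."""
    choice = llm(f"Answer with exactly one of {sorted(ROUTES)} for: {task}").strip()
    return ROUTES[choice](task)
```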

🎉 What We Learned

Key takeaways from this project:

  • While our self-build approach was valuable for knowledge acquisition, we should transition to a framework to accelerate development and improve scalability.
  • Data generation is manageable for LLMs on a small scale with schema input; however, scaling to larger datasets (1,000-10,000 rows) requires further investigation.
  • Orchestration and specialized agents are crucial for achieving greater scale and bespoke code/prompt handling.

Next Steps

  • Framework Evaluation: Research the pros and cons of the PydanticAI and DSPy frameworks (a hypothetical DSPy sketch follows this list).
  • Scalable Solution Planning: Design a more scalable AI Agent solution, considering both code complexity and task complexity. This should incorporate our existing features.
  • Future Hack Day: Schedule another hack day once planning is complete.
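
As a starting point for that evaluation, a hypothetical sketch of the row generator as a DSPy signature, written against the DSPy 2.5+ API from memory; the model string and field names are illustrative:

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # illustrative model choice

class GenerateRows(dspy.Signature):
    """Generate realistic synthetic rows for a JSON table schema."""
    schema: str = dspy.InputField(desc="JSON table schema")
    n_rows: int = dspy.InputField(desc="how many rows to generate")
    rows: list[dict[str, str]] = dspy.OutputField(desc="one dict per generated row")

generate = dspy.Predict(GenerateRows)
result = generate(schema='{"columns": {"name": "string", "age": "int"}}', n_rows=5)
print(result.rows)
```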

Made with 💻 and lots of ☕. Let's keep throwing spaghetti! 🍝
