What are we trying to solve?
- Generate realistic synthetic data to address the absence of suitable datasets, or quality issues with existing ones.
- Enable intelligent data generation based on metadata or prompts using LLM capabilities.
How we plan to tackle this:
- Build an LLM-powered agent that intelligently generates synthetic data.
- Leverage LLMs to understand context and metadata for informed data creation.
- Create a system capable of producing realistic, meaningful data based on prompts or schema information.
- Explore the potential benefits of utilizing multiple LLMs.
What we built:
While we developed more features than listed here, these points directly relate to our core goal.
- Key Features/Components:
- AI-powered data generation for small datasets (5-10 rows).
- Intelligent filename generation and CSV saving.
- Architecture Decisions:
- Followed the "build from scratch" philosophy, avoiding external frameworks.
- Utilized hand-fed JSON table schemas as input (an illustrative example follows this list).
- Code Structure Overview:
- The overall structure mirrors Fireball's approach, with tools added to a tools class that allows the AI agent to make decisions and select tools autonomously (a minimal sketch of this pattern also follows below).
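The hand-fed schemas were plain JSON table descriptions. As an illustration only (these field names are hypothetical, not necessarily our exact format):

```json
{
  "table": "customers",
  "columns": [
    {"name": "customer_id", "type": "integer", "description": "unique identifier"},
    {"name": "full_name", "type": "string", "description": "realistic person name"},
    {"name": "signup_date", "type": "date", "description": "date within the last two years"}
  ]
}
```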
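The tools-class pattern can be sketched roughly as below. The `Tools` class, `generate_rows`, `save_csv`, and `dispatch` names are illustrative, not our actual identifiers; the idea is that the LLM picks a tool by name and the agent dispatches the call:

```python
import csv

class Tools:
    """Registry of callable tools. The agent advertises the tool names
    and docstrings to the LLM, which decides which one to invoke."""

    def generate_rows(self, schema: dict, n: int) -> list[dict]:
        """Ask the LLM for n rows matching the schema (stubbed here)."""
        return [
            {col["name"]: f"<{col['type']}>" for col in schema["columns"]}
            for _ in range(n)
        ]

    def save_csv(self, rows: list[dict], filename: str) -> str:
        """Write generated rows to a CSV file and return the filename."""
        with open(filename, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        return filename

def dispatch(tools: Tools, action: dict):
    """Run the tool the LLM selected, e.g. an action parsed from its
    response such as {"tool": "save_csv", "args": {...}}."""
    return getattr(tools, action["tool"])(**action["args"])
```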
What went well:
- Rapid Onboarding & Code Portability: Marco was able to contribute within 60 minutes of starting, building his own tool and integrating it into the agent workflow. This demonstrates excellent code design and ease of use.
- Successful Proof-of-Concept (PoC): We achieved our primary goal before lunchtime – generating realistic data for any given schema. This allowed us to focus on exploration and learning in the afternoon.
- Agentic Workflow Understanding: Our clear understanding of agentic workflows, combined with our self-built code base, enabled rapid deployment.
What didn't go well:
- Inconsistencies at Scale: Generating large amounts of data solely through LLM calls resulted in noticeable inconsistencies.
- Limited Scalability: Our custom build approach hinders scalability.
- Debugging Challenges: Pasting JSON into the terminal triggered an LLM call for each pasted line, causing unexpected delays and consuming debugging time (a workaround sketch follows below).
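One way to avoid the per-line trigger is to buffer stdin until a complete JSON document parses, and only then invoke the agent once. A minimal sketch, assuming a single hypothetical `agent.run(schema)` entry point:

```python
import json
import sys

def read_json_from_stdin() -> dict:
    """Accumulate pasted lines until they parse as one complete JSON
    document, so the agent runs once per schema rather than per line."""
    buffer = ""
    for line in sys.stdin:
        buffer += line
        try:
            return json.loads(buffer)  # succeeds only once the paste is complete
        except json.JSONDecodeError:
            continue  # still mid-paste; keep reading
    raise ValueError("stdin closed before a complete JSON document arrived")

schema = read_json_from_stdin()
# agent.run(schema)  # hypothetical: one LLM invocation for the whole schema
```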
What we'd do differently:
- Prioritize Framework Adoption: Begin with a framework designed for scale from the outset.
- Refactor Data Generation Tool: Develop a more Pythonic data generation tool powered by LLMs.
- Implement Orchestration: Increase LLM orchestration to enable calls to specialized agents for improved scalability and customization (a rough sketch follows below).
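To make the orchestration idea concrete, here is a rough sketch under stated assumptions: the specialized agents and the keyword-based routing rule are stand-ins, and in a real system the routing decision would itself come from an LLM call:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical specialized agents; each would wrap its own prompts and model.
def schema_agent(task: str) -> str:
    return f"[schema agent] interpreted: {task}"

def generation_agent(task: str) -> str:
    return f"[generation agent] generated rows for: {task}"

@dataclass
class Orchestrator:
    """Routes each task to a specialized agent. A simple keyword rule
    stands in for the LLM routing call here."""
    agents: dict[str, Callable[[str], str]]

    def route(self, task: str) -> str:
        name = "schema" if "schema" in task.lower() else "generation"
        return self.agents[name](task)

orchestrator = Orchestrator(
    agents={"schema": schema_agent, "generation": generation_agent}
)
print(orchestrator.route("Parse this table schema"))
print(orchestrator.route("Generate 100 rows of customer data"))
```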
Key takeaways from this project:
- While our self-build approach was valuable for knowledge acquisition, we should transition to a framework to accelerate development and improve scalability.
- Data generation is manageable for LLMs on a small scale with schema input; however, scaling to larger datasets (1,000 to 10,000 rows) requires further investigation (one candidate approach is sketched after these takeaways).
- Orchestration and specialized agents are crucial for achieving greater scale and bespoke code/prompt handling.
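One candidate approach to probe that scale gap is batched generation: request rows in the small chunk sizes the PoC handled reliably (5-10 per call) and concatenate the results. A hypothetical sketch; `llm_generate_rows` is a stand-in for the actual LLM call:

```python
import csv

def llm_generate_rows(schema: dict, n: int) -> list[dict]:
    """Placeholder for an LLM call that returns n rows matching the
    schema; a real implementation would prompt the model with the
    schema and parse its JSON response."""
    return [
        {col["name"]: f"{col['name']}_{i}" for col in schema["columns"]}
        for i in range(n)
    ]

def generate_dataset(schema: dict, total: int, batch: int = 10) -> list[dict]:
    """Build a large dataset from many small requests, since small
    (5-10 row) calls proved reliable in the PoC."""
    rows: list[dict] = []
    while len(rows) < total:
        rows.extend(llm_generate_rows(schema, min(batch, total - len(rows))))
    return rows

schema = {
    "table": "customers",
    "columns": [{"name": "full_name"}, {"name": "email"}, {"name": "signup_date"}],
}
dataset = generate_dataset(schema, total=1000)
with open("customers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=[c["name"] for c in schema["columns"]])
    writer.writeheader()
    writer.writerows(dataset)
```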
Next steps:
- Framework Evaluation: Research the pros and cons of the PydanticAI and DSPy frameworks.
- Scalable Solution Planning: Design a more scalable AI Agent solution, considering both code complexity and task complexity. This should incorporate our existing features.
- Future Hack Day: Schedule another hack day once planning is complete.
Made with 💻 and lots of ☕ Let's keep throwing spaghetti! 🍝