-
Notifications
You must be signed in to change notification settings - Fork 3
feat: Implement SchemaFormatter #148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Implement SchemaFormatter #148
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements a schema formatting system for LLM prompts, enabling structured representation of database tables with their schemas and sample data.
Key Changes:
- Introduces
SchemaFormatterabstract base class andSimpleSchemaFormatterimplementation for formatting table schemas - Adds
HeadSamplerfor basic data sampling functionality - Includes comprehensive test coverage for the new formatting components
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| agentune/core/formatter/base.py | Adds SchemaFormatter abstract class with helper methods for schema serialization and sample data formatting |
| agentune/core/formatter/schema.py | Implements SimpleSchemaFormatter that formats all tables with schemas and sample data in markdown format |
| agentune/core/sampler/base.py | Adds HeadSampler class for selecting the first N rows from a dataset |
| tests/agentune/core/formatter/test_schema.py | Provides test fixtures and test cases for SimpleSchemaFormatter functionality |
| tests/agentune/core/formatter/init.py | Adds module docstring for formatter test package |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
agentune/core/formatter/base.py
Outdated
| # Convert Dtype to simple string representation | ||
| dtype_str = repr(field.dtype.polars_type) |
Copilot
AI
Jan 8, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using repr() on polars types may produce verbose output like <class 'polars.datatypes.Int32'> instead of cleaner names like Int32. Consider using str(field.dtype.polars_type) or extracting just the type name for more readable LLM prompts.
| # Convert Dtype to simple string representation | |
| dtype_str = repr(field.dtype.polars_type) | |
| # Convert Dtype to simple, readable string representation | |
| polars_type = field.dtype.polars_type | |
| dtype_str = getattr(polars_type, "__name__", str(polars_type)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you see this in practice when looking into Opik?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why Polars type? It need the be the SQL type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because it is always defined, and it works in practice, the types look like:
Int32
Float64
Date
Boolean
Enum(categories=['category A', 'category B'])
String
agentune/core/formatter/base.py
Outdated
| # Convert Dtype to simple string representation | ||
| dtype_str = repr(field.dtype.polars_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you see this in practice when looking into Opik?
agentune/core/formatter/schema.py
Outdated
| """ | ||
| sections = [] | ||
|
|
||
| # Format primary table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatting would be clearer with a helper function I think please consolidate these.
| formatter = SimpleSchemaFormatter(num_samples=3) | ||
| result = formatter.format_all_tables(primary_dataset, tables_with_strategies, conn) | ||
|
|
||
| # Print the actual output for inspection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we would want to have prints inside tests. This is fine when you debug but I'm not sure we would want to merge it.
leonidb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatter assumes we always use markdown for prompts. I think we can make such assumption, by I wouldn't assume a certain level of heading, I think it should accept the level under which the schema is nested
agentune/core/formatter/schema.py
Outdated
| num_samples: int = 5 | ||
| sampler: DataSampler = HeadSampler() | ||
|
|
||
| def _serialize_schema_and_samples(self, schema: Schema, sample_data: Dataset) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_serialize_schema_and_samples - format, not serialize
It could be a TableFormatter class actually, injected into the Schema Formatter constructor and used for every table sample.
@attrs.frozen
class TableFormatter(ABC, UseTypeTag):
"""Formats a single table/dataset to string."""
@abstractmethod
def format_table(self, dataset: Dataset) -> str:
"""Format a dataset to string representation."""
...
It can also deal with table schema and samples formatting.
Table format needs to deal with corner cases. There are a lot of them and we can define them separately later, but I think we should at least truncate strings to avoid breaking everything if there is one column with long text. I don't think csv formatter does that by default. Can you check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed all of it to TablesFormatter, it makes more sense anyway, and I made the format_table part of the abs api.
agentune/core/formatter/base.py
Outdated
| # Convert Dtype to simple string representation | ||
| dtype_str = repr(field.dtype.polars_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why Polars type? It need the be the SQL type
7f172ba to
cd0cdae
Compare
|
I think we should truncate string values in samples, to lets say a 100 chars (configurable in table formatter) |
agentune/core/formatter/tables.py
Outdated
| if field.dtype.polars_type in (pl.String, pl.Utf8): | ||
| # Truncate long strings | ||
| select_exprs.append( | ||
| pl.when(pl.col(col_name).str.len_bytes() > self.max_str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use str.len_chars() instead of str.len_bytes(), comparison can be incorrect, depending on the encoding of the strings.
agentune/core/formatter/tables.py
Outdated
| for field in dataset.schema.cols: | ||
| col_name = field.name | ||
| # Check if column is a string type | ||
| if field.dtype.polars_type in (pl.String, pl.Utf8): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: We only use a specified set of types, and pl.Utf8 is not one of them, so it's enough to check pl.String
@danarmak
probably even better to rely on the field.dtype directly right? With something like field.dtype==types.string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, field.dtype can be tested directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testing field.dtype in (types.string, types.json_dtype) to avoid large strings and large jsons (we may want a different truncation for json in the future)
f125c36 to
be4f8cb
Compare
afefa2e to
0cdc2ab
Compare
be4f8cb to
abd4a31
Compare
5e9e904 to
8a6ca93
Compare
0e825c5 to
a1a9ec2
Compare
…r for schema formatting and sampling
8a6ca93 to
2fb61a2
Compare
Implement SchemaFormatter, SimpleSchemaFormatter and HeadSampler for schema formatting and sampling
Changes
Related Issues
closes https://github.com/SparkBeyond/ao-core/issues/91