
Conversation

@yotam319-sparkbeyond (Contributor)

Implement SchemaFormatter, SimpleSchemaFormatter and HeadSampler for schema formatting and sampling

Changes

  • Implement the SchemaFormatter abstract class.
  • Implement SimpleSchemaFormatter, which formats all available tables with their schemas and sample data into a single string.
  • Add HeadSampler as a basic DataSampler (used by SimpleSchemaFormatter); see the sketch below.
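
A minimal sketch of what a head-based sampler can look like, written against plain polars. The class name HeadSampler comes from this PR, but the attribute, method name, and signature here are assumptions rather than the project's actual DataSampler API:

    import attrs
    import polars as pl

    @attrs.frozen
    class HeadSampler:
        """Illustrative sketch: take the first `n` rows of a dataset."""

        n: int = 5  # assumed default; the real default may differ

        def sample(self, df: pl.DataFrame) -> pl.DataFrame:
            # "Head" sampling is just the first n rows in stored order.
            return df.head(self.n)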

Related Issues

Closes https://github.com/SparkBeyond/ao-core/issues/91

Copilot AI left a comment

Pull request overview

This PR implements a schema formatting system for LLM prompts, enabling structured representation of database tables with their schemas and sample data.

Key Changes:

  • Introduces SchemaFormatter abstract base class and SimpleSchemaFormatter implementation for formatting table schemas
  • Adds HeadSampler for basic data sampling functionality
  • Includes comprehensive test coverage for the new formatting components

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file

  • agentune/core/formatter/base.py: Adds the SchemaFormatter abstract class with helper methods for schema serialization and sample data formatting
  • agentune/core/formatter/schema.py: Implements SimpleSchemaFormatter, which formats all tables with their schemas and sample data in markdown
  • agentune/core/sampler/base.py: Adds the HeadSampler class for selecting the first N rows from a dataset
  • tests/agentune/core/formatter/test_schema.py: Provides test fixtures and test cases for SimpleSchemaFormatter functionality
  • tests/agentune/core/formatter/__init__.py: Adds a module docstring for the formatter test package


Comment on lines 81 to 82
# Convert Dtype to simple string representation
dtype_str = repr(field.dtype.polars_type)

Copilot AI Jan 8, 2026

Using repr() on polars types may produce verbose output like <class 'polars.datatypes.Int32'> instead of cleaner names like Int32. Consider using str(field.dtype.polars_type) or extracting just the type name for more readable LLM prompts.

Suggested change
-# Convert Dtype to simple string representation
-dtype_str = repr(field.dtype.polars_type)
+# Convert Dtype to simple, readable string representation
+polars_type = field.dtype.polars_type
+dtype_str = getattr(polars_type, "__name__", str(polars_type))

Contributor

Did you see this in practice when looking into Opik?

Collaborator

Why the Polars type? It needs to be the SQL type.

Contributor Author

Because it is always defined, and it works in practice. The types look like:
Int32
Float64
Date
Boolean
Enum(categories=['category A', 'category B'])
String
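
For reference, a quick way to reproduce these names; the outputs noted in the comments are the ones quoted above, though exact repr formatting can vary between polars versions:

    import polars as pl

    # repr() of polars dtypes yields short names rather than full class paths.
    print(repr(pl.Int32))    # Int32
    print(repr(pl.Boolean))  # Boolean
    print(repr(pl.Enum(["category A", "category B"])))
    # Enum(categories=['category A', 'category B'])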

"""
sections = []

# Format primary table
Contributor

Formatting would be clearer with a helper function, I think. Please consolidate these.
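
A self-contained sketch of the kind of helper being suggested, written against plain polars rather than the project's Dataset/Schema types (the function and parameter names are assumptions):

    import polars as pl

    def format_table_section(name: str, df: pl.DataFrame, num_samples: int = 5) -> str:
        """Render one table (name plus sample rows) as a markdown section."""
        sample = df.head(num_samples)
        return f"## {name}\n\n{sample!s}"

    # The same helper serves the primary table and every secondary table.
    tables = {"customers": pl.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})}
    print("\n\n".join(format_table_section(n, df) for n, df in tables.items()))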

formatter = SimpleSchemaFormatter(num_samples=3)
result = formatter.format_all_tables(primary_dataset, tables_with_strategies, conn)

# Print the actual output for inspection
Contributor

Not sure we would want to have prints inside tests. This is fine when you debug, but I'm not sure we would want to merge it.

@leonidb (Collaborator) left a comment

The formatter assumes we always use markdown for prompts. I think we can make that assumption, but I wouldn't assume a certain heading level; it should accept the level under which the schema is nested.
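
A small sketch of what accepting the nesting level could look like; the heading_level attribute and the attrs-based shape are assumptions, not the PR's actual interface:

    import attrs

    @attrs.frozen
    class SimpleSchemaFormatter:
        """Illustrative only: the caller decides the markdown heading level tables are nested under."""

        heading_level: int = 2  # assumed attribute; 2 renders table names as "## <name>"

        def _table_heading(self, table_name: str) -> str:
            return f"{'#' * self.heading_level} {table_name}"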

num_samples: int = 5
sampler: DataSampler = HeadSampler()

def _serialize_schema_and_samples(self, schema: Schema, sample_data: Dataset) -> str:
Collaborator

_serialize_schema_and_samples - format, not serialize

It could actually be a TableFormatter class, injected into the SchemaFormatter constructor and used for every table sample.

@attrs.frozen
class TableFormatter(ABC, UseTypeTag):
    """Formats a single table/dataset to string."""

    @abstractmethod
    def format_table(self, dataset: Dataset) -> str:
        """Format a dataset to string representation."""
        ...

It can also handle formatting the table schema and samples.

The table formatter needs to deal with corner cases. There are a lot of them and we can define them separately later, but I think we should at least truncate strings to avoid breaking everything if there is one column with long text. I don't think the CSV formatter does that by default. Can you check?

Contributor Author

I renamed all of it to TablesFormatter, which makes more sense anyway, and I made format_table part of the abstract API.


@yotam319-sparkbeyond force-pushed the feat/431-Create-SchemaFormatter-and-SimpleSchemaFormatter branch from 7f172ba to cd0cdae on January 12, 2026 12:44
@yotam319-sparkbeyond changed the base branch from main to feat/435-create-duckdb-samplers on January 12, 2026 12:46
@leonidb (Collaborator) commented Jan 12, 2026

I think we should truncate string values in samples to, let's say, 100 chars (configurable in the table formatter).

if field.dtype.polars_type in (pl.String, pl.Utf8):
    # Truncate long strings
    select_exprs.append(
        pl.when(pl.col(col_name).str.len_bytes() > self.max_str)
Collaborator

Use str.len_chars() instead of str.len_bytes(); the comparison can be incorrect depending on the encoding of the strings.

for field in dataset.schema.cols:
    col_name = field.name
    # Check if column is a string type
    if field.dtype.polars_type in (pl.String, pl.Utf8):
Collaborator

Nit: We only use a specified set of types, and pl.Utf8 is not one of them, so it's enough to check pl.String.

@danarmak

Probably even better to rely on field.dtype directly, right? With something like field.dtype == types.string?

Collaborator

Yes, field.dtype can be tested directly.

Contributor Author

Testing field.dtype in (types.string, types.json_dtype) to avoid large strings and large JSONs (we may want a different truncation for JSON in the future).
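
Putting this thread together, a self-contained sketch of character-based truncation in plain polars; max_str, the ellipsis suffix, and the column handling are assumptions, and the PR's actual implementation works against the project's Dataset and types wrappers rather than raw DataFrames:

    import polars as pl

    def truncate_string_columns(df: pl.DataFrame, max_str: int = 100) -> pl.DataFrame:
        """Truncate long string values so one wide column cannot blow up the prompt."""
        exprs = []
        for name, dtype in df.schema.items():
            if dtype == pl.String:
                # Compare by character count, not bytes, so multi-byte text is handled correctly.
                exprs.append(
                    pl.when(pl.col(name).str.len_chars() > max_str)
                    .then(pl.col(name).str.slice(0, max_str) + "...")
                    .otherwise(pl.col(name))
                    .alias(name)
                )
            else:
                exprs.append(pl.col(name))
        return df.select(exprs)

    df = pl.DataFrame({"id": [1, 2], "text": ["short", "x" * 500]})
    print(truncate_string_columns(df, max_str=10))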

@yotam319-sparkbeyond force-pushed the feat/435-create-duckdb-samplers branch from f125c36 to be4f8cb on January 13, 2026 08:53
@yotam319-sparkbeyond force-pushed the feat/431-Create-SchemaFormatter-and-SimpleSchemaFormatter branch from afefa2e to 0cdc2ab on January 13, 2026 08:59
@yotam319-sparkbeyond force-pushed the feat/435-create-duckdb-samplers branch from be4f8cb to abd4a31 on January 13, 2026 15:59
@yotam319-sparkbeyond force-pushed the feat/431-Create-SchemaFormatter-and-SimpleSchemaFormatter branch from 5e9e904 to 8a6ca93 on January 14, 2026 09:48
@yotam319-sparkbeyond force-pushed the feat/435-create-duckdb-samplers branch from 0e825c5 to a1a9ec2 on January 14, 2026 11:18
Base automatically changed from feat/435-create-duckdb-samplers to main on January 14, 2026 12:57
@leonidb force-pushed the feat/431-Create-SchemaFormatter-and-SimpleSchemaFormatter branch from 8a6ca93 to 2fb61a2 on January 14, 2026 13:01
@leonidb merged commit b3efcf3 into main on Jan 14, 2026 (1 check passed)
@leonidb deleted the feat/431-Create-SchemaFormatter-and-SimpleSchemaFormatter branch on January 14, 2026 13:15