
Conversation

@yotam319-sparkbeyond (Contributor)

Implement SchemaFormatter, SimpleSchemaFormatter and HeadSampler for schema formatting and sampling

Changes

  • Implement the SchemaFormatter abstract class.
  • Implement SimpleSchemaFormatter, which formats all available tables with their schemas and sample data into a single string.
  • Add HeadSampler as a basic DataSampler (used by SimpleSchemaFormatter); see the sketch below.
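
A minimal sketch of what a head-based sampler can look like, written against plain polars. The class name HeadSampler comes from this PR, but the attribute, method name, and signature here are assumptions rather than the project's actual DataSampler API:

    import attrs
    import polars as pl

    @attrs.frozen
    class HeadSampler:
        """Illustrative sketch: take the first `n` rows of a dataset."""

        n: int = 5  # assumed default; the real default may differ

        def sample(self, df: pl.DataFrame) -> pl.DataFrame:
            # "Head" sampling is just the first n rows in stored order.
            return df.head(self.n)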

Related Issues

Closes https://github.com/SparkBeyond/ao-core/issues/91

Copilot AI left a comment

Pull request overview

This PR implements a schema formatting system for LLM prompts, enabling structured representation of database tables with their schemas and sample data.

Key Changes:

  • Introduces SchemaFormatter abstract base class and SimpleSchemaFormatter implementation for formatting table schemas
  • Adds HeadSampler for basic data sampling functionality
  • Includes comprehensive test coverage for the new formatting components

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file

  • agentune/core/formatter/base.py: Adds the SchemaFormatter abstract class with helper methods for schema serialization and sample data formatting
  • agentune/core/formatter/schema.py: Implements SimpleSchemaFormatter, which formats all tables with their schemas and sample data in markdown
  • agentune/core/sampler/base.py: Adds the HeadSampler class for selecting the first N rows from a dataset
  • tests/agentune/core/formatter/test_schema.py: Provides test fixtures and test cases for SimpleSchemaFormatter functionality
  • tests/agentune/core/formatter/__init__.py: Adds a module docstring for the formatter test package


Comment on lines 81 to 82
# Convert Dtype to simple string representation
dtype_str = repr(field.dtype.polars_type)

Copilot AI Jan 8, 2026

Using repr() on polars types may produce verbose output like <class 'polars.datatypes.Int32'> instead of cleaner names like Int32. Consider using str(field.dtype.polars_type) or extracting just the type name for more readable LLM prompts.

Suggested change
-# Convert Dtype to simple string representation
-dtype_str = repr(field.dtype.polars_type)
+# Convert Dtype to simple, readable string representation
+polars_type = field.dtype.polars_type
+dtype_str = getattr(polars_type, "__name__", str(polars_type))

Contributor

Did you see this in practice when looking into Opik?

Collaborator

Why the Polars type? It needs to be the SQL type.

Contributor Author

Because it is always defined, and it works in practice. The types look like:
Int32
Float64
Date
Boolean
Enum(categories=['category A', 'category B'])
String
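
For reference, a quick way to reproduce these names; the outputs noted in the comments are the ones quoted above, though exact repr formatting can vary between polars versions:

    import polars as pl

    # repr() of polars dtypes yields short names rather than full class paths.
    print(repr(pl.Int32))    # Int32
    print(repr(pl.Boolean))  # Boolean
    print(repr(pl.Enum(["category A", "category B"])))
    # Enum(categories=['category A', 'category B'])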

"""
sections = []

# Format primary table
Contributor

Formatting would be clearer with a helper function, I think. Please consolidate these.
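
A self-contained sketch of the kind of helper being suggested, written against plain polars rather than the project's Dataset/Schema types (the function and parameter names are assumptions):

    import polars as pl

    def format_table_section(name: str, df: pl.DataFrame, num_samples: int = 5) -> str:
        """Render one table (name plus sample rows) as a markdown section."""
        sample = df.head(num_samples)
        return f"## {name}\n\n{sample!s}"

    # The same helper serves the primary table and every secondary table.
    tables = {"customers": pl.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})}
    print("\n\n".join(format_table_section(n, df) for n, df in tables.items()))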

formatter = SimpleSchemaFormatter(num_samples=3)
result = formatter.format_all_tables(primary_dataset, tables_with_strategies, conn)

# Print the actual output for inspection
Contributor

Not sure we would want to have prints inside tests. This is fine when you debug, but I'm not sure we would want to merge it.

@leonidb (Collaborator) left a comment

The formatter assumes we always use markdown for prompts. I think we can make that assumption, but I wouldn't assume a certain heading level; it should accept the level under which the schema is nested.
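
A small sketch of what accepting the nesting level could look like; the heading_level attribute and the attrs-based shape are assumptions, not the PR's actual interface:

    import attrs

    @attrs.frozen
    class SimpleSchemaFormatter:
        """Illustrative only: the caller decides the markdown heading level tables are nested under."""

        heading_level: int = 2  # assumed attribute; 2 renders table names as "## <name>"

        def _table_heading(self, table_name: str) -> str:
            return f"{'#' * self.heading_level} {table_name}"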

num_samples: int = 5
sampler: DataSampler = HeadSampler()

def _serialize_schema_and_samples(self, schema: Schema, sample_data: Dataset) -> str:
Collaborator

_serialize_schema_and_samples - format, not serialize

It could actually be a TableFormatter class, injected into the SchemaFormatter constructor and used for every table sample.

@attrs.frozen
class TableFormatter(ABC, UseTypeTag):
    """Formats a single table/dataset to string."""

    @abstractmethod
    def format_table(self, dataset: Dataset) -> str:
        """Format a dataset to string representation."""
        ...

It can also handle formatting the table schema and samples.

The table formatter needs to deal with corner cases. There are a lot of them and we can define them separately later, but I think we should at least truncate strings to avoid breaking everything if there is one column with long text. I don't think the CSV formatter does that by default. Can you check?

Contributor Author

I renamed all of it to TablesFormatter, which makes more sense anyway, and I made format_table part of the abstract API.


@yotam319-sparkbeyond force-pushed the feat/431-Create-SchemaFormatter-and-SimpleSchemaFormatter branch from 7f172ba to cd0cdae on January 12, 2026 12:44
@yotam319-sparkbeyond changed the base branch from main to feat/435-create-duckdb-samplers on January 12, 2026 12:46
@leonidb (Collaborator) commented Jan 12, 2026

I think we should truncate string values in samples to, let's say, 100 chars (configurable in the table formatter).

if field.dtype.polars_type in (pl.String, pl.Utf8):
    # Truncate long strings
    select_exprs.append(
        pl.when(pl.col(col_name).str.len_bytes() > self.max_str)
Collaborator

Use str.len_chars() instead of str.len_bytes(); the comparison can be incorrect depending on the encoding of the strings.

for field in dataset.schema.cols:
    col_name = field.name
    # Check if column is a string type
    if field.dtype.polars_type in (pl.String, pl.Utf8):
Collaborator

Nit: We only use a specified set of types, and pl.Utf8 is not one of them, so it's enough to check pl.String.

@danarmak

Probably even better to rely on field.dtype directly, right? With something like field.dtype == types.string?

Collaborator

Yes, field.dtype can be tested directly.

Contributor Author

Testing field.dtype in (types.string, types.json_dtype) to avoid large strings and large JSONs (we may want a different truncation for JSON in the future).
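
Putting this thread together, a self-contained sketch of character-based truncation in plain polars; max_str, the ellipsis suffix, and the column handling are assumptions, and the PR's actual implementation works against the project's Dataset and types wrappers rather than raw DataFrames:

    import polars as pl

    def truncate_string_columns(df: pl.DataFrame, max_str: int = 100) -> pl.DataFrame:
        """Truncate long string values so one wide column cannot blow up the prompt."""
        exprs = []
        for name, dtype in df.schema.items():
            if dtype == pl.String:
                # Compare by character count, not bytes, so multi-byte text is handled correctly.
                exprs.append(
                    pl.when(pl.col(name).str.len_chars() > max_str)
                    .then(pl.col(name).str.slice(0, max_str) + "...")
                    .otherwise(pl.col(name))
                    .alias(name)
                )
            else:
                exprs.append(pl.col(name))
        return df.select(exprs)

    df = pl.DataFrame({"id": [1, 2], "text": ["short", "x" * 500]})
    print(truncate_string_columns(df, max_str=10))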

@yotam319-sparkbeyond force-pushed the feat/435-create-duckdb-samplers branch from f125c36 to be4f8cb on January 13, 2026 08:53
@yotam319-sparkbeyond force-pushed the feat/431-Create-SchemaFormatter-and-SimpleSchemaFormatter branch from afefa2e to 0cdc2ab on January 13, 2026 08:59
@yotam319-sparkbeyond force-pushed the feat/435-create-duckdb-samplers branch from be4f8cb to abd4a31 on January 13, 2026 15:59
@yotam319-sparkbeyond force-pushed the feat/431-Create-SchemaFormatter-and-SimpleSchemaFormatter branch from 5e9e904 to 8a6ca93 on January 14, 2026 09:48
@yotam319-sparkbeyond force-pushed the feat/435-create-duckdb-samplers branch from 0e825c5 to a1a9ec2 on January 14, 2026 11:18
Base automatically changed from feat/435-create-duckdb-samplers to main on January 14, 2026 12:57
@leonidb force-pushed the feat/431-Create-SchemaFormatter-and-SimpleSchemaFormatter branch from 8a6ca93 to 2fb61a2 on January 14, 2026 13:01
@leonidb merged commit b3efcf3 into main on Jan 14, 2026 (1 check passed)
@leonidb deleted the feat/431-Create-SchemaFormatter-and-SimpleSchemaFormatter branch on January 14, 2026 13:15