
Conversation

@yotam319-sparkbeyond
Contributor

What does this PR do?

Implements HeadTableSampler and RandomStartTableSampler for table sampling.

Changes

  • Create the TableSampler abstract class
  • Implement HeadTableSampler and RandomStartTableSampler
  • Add tests for both samplers

Related Issues

Fixes https://github.com/SparkBeyond/ao-core/issues/87

Copilot AI left a comment

Pull request overview

This PR implements two new table sampling strategies: HeadTableSampler for retrieving the first N rows of a table, and RandomStartTableSampler for sampling consecutive rows from a random starting position.

Changes:

  • Introduced abstract TableSampler base class for table sampling operations
  • Implemented HeadTableSampler to sample first N rows from a table
  • Implemented RandomStartTableSampler to sample consecutive rows from a random starting point

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • agentune/core/sampler/base.py: Added abstract TableSampler base class defining the interface for table sampling
  • agentune/core/sampler/table_samples.py: Implemented HeadTableSampler and RandomStartTableSampler concrete classes
  • tests/agentune/core/sampler/test_table_samplers.py: Comprehensive test suite for both sampler implementations


Dataset containing the first sample_size rows
"""
table_name = str(table.table.name)
sql_query = f'SELECT * FROM {table_name} LIMIT {sample_size}'
Copilot AI Jan 12, 2026

SQL query is constructed using f-string interpolation without sanitization. While table_name comes from a DuckdbName object which should be safe, sample_size is a user-provided integer parameter that should be validated or parameterized to prevent potential SQL injection.

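One way to address this is to validate the integer up front and bind it as a query parameter instead of interpolating it into the SQL text. The sketch below uses the stdlib sqlite3 driver as a stand-in; DuckDB's Python API accepts the same `?` placeholders through `conn.execute`, though whether a given clause like LIMIT can be parameterized should be checked against the DuckDB version in use. The function name is illustrative, not the PR's actual code.

```python
import sqlite3

def sample_head(conn, table_name: str, sample_size: int):
    """Fetch the first sample_size rows, binding the limit as a parameter."""
    # Validate the user-provided size before it gets anywhere near the SQL text.
    if not isinstance(sample_size, int) or sample_size < 0:
        raise ValueError(f'sample_size must be a non-negative int, got {sample_size!r}')
    # Identifiers (table names) cannot be bound as parameters, so table_name
    # must still come from a trusted source such as a DuckdbName object.
    return conn.execute(f'SELECT * FROM {table_name} LIMIT ?', (sample_size,)).fetchall()

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (id INTEGER)')
conn.executemany('INSERT INTO t VALUES (?)', [(i,) for i in range(100)])
rows = sample_head(conn, 't', 20)
```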
table_name = str(table.table.name)

# Get table size
count_query = f'SELECT COUNT(*) as count FROM {table_name}'
Copilot AI Jan 12, 2026

SQL queries use f-string interpolation for start_rowid and sample_size parameters. These values should be validated or use parameterized queries to ensure they are safe integers and prevent potential SQL injection vectors.


# Select consecutive rows starting from the random rowid
# Using DuckDB's built-in rowid pseudocolumn for deterministic and efficient filtering
sql_query = f'SELECT * FROM {table_name} WHERE rowid >= {start_rowid} LIMIT {sample_size}'
Copilot AI Jan 12, 2026

SQL queries use f-string interpolation for start_rowid and sample_size parameters. These values should be validated or use parameterized queries to ensure they are safe integers and prevent potential SQL injection vectors.


# Select consecutive rows starting from the random rowid
# Using DuckDB's built-in rowid pseudocolumn for deterministic and efficient filtering
sql_query = f'SELECT * FROM {table_name} WHERE rowid >= {start_rowid} LIMIT {sample_size}'
Collaborator


Here too, best to write 'order by rowid' explicitly whenever we rely on it.

Also, I agree with copilot: we should use parameterized queries whenever possible, as a general best practice.
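Both suggestions together could look like the sketch below. It uses the stdlib sqlite3 module (whose rowid column is 1-based, whereas DuckDB's rowid pseudocolumn starts at 0): the start offset and limit are bound as parameters, and the row ordering is stated explicitly rather than relied on implicitly. Names and the seeding scheme are illustrative.

```python
import random
import sqlite3

def sample_from_random_start(conn, table_name: str, sample_size: int, seed: int = 42):
    """Sample consecutive rows from a random start, deterministically per seed."""
    rng = random.Random(seed)
    n_rows = conn.execute(f'SELECT COUNT(*) FROM {table_name}').fetchone()[0]
    start_rowid = rng.randrange(1, n_rows + 1)  # sqlite rowids are 1-based
    # ORDER BY rowid makes the "consecutive rows" contract explicit instead of
    # relying on the engine's default scan order; the numeric values are bound
    # as parameters rather than interpolated into the SQL text.
    return conn.execute(
        f'SELECT * FROM {table_name} WHERE rowid >= ? ORDER BY rowid LIMIT ?',
        (start_rowid, sample_size),
    ).fetchall()

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (id INTEGER)')
conn.executemany('INSERT INTO t VALUES (?)', [(i,) for i in range(1000)])
rows = sample_from_random_start(conn, 't', 10)
```

Because the generator is reseeded on every call, the same seed always yields the same window, which keeps tests reproducible.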

@leonidb leonidb removed their request for review January 12, 2026 10:30
@yotam319-sparkbeyond yotam319-sparkbeyond force-pushed the feat/435-create-duckdb-samplers branch from f125c36 to be4f8cb on January 13, 2026 08:53
class TableSampler(ABC):
"""Abstract base class for data sampling from TableWithJoinStrategies."""
@abstractmethod
def sample(self, table: TableWithJoinStrategies, conn: DuckDBPyConnection, sample_size: int, random_seed: int | None = None) -> Dataset:
Collaborator


The interface should accept just a DuckdbName; it doesn't need a DuckdbTable, and certainly not the one that's WithStrategies.

Contributor Author


In the future we may want to use information from the join strategies to customize the examples from each table. For example, for conversations we may want a few complete conversations, and for a lookup we may want a specific selection according to the lookup field.

Contributor Author


After talking to Leonid, I changed it to DuckdbName | str.

class TableSampler(ABC):
"""Abstract base class for data sampling from TableWithJoinStrategies."""
@abstractmethod
def sample(self, table: TableWithJoinStrategies, conn: DuckDBPyConnection, sample_size: int, random_seed: int | None = None) -> Dataset:
Collaborator


I think the default for random seeds should be a constant (traditionally 42) rather than None. The default behavior should be deterministic. (This also applies to the existing DataSampler. @leonidb WDYT?)

Collaborator


I prefer deterministic defaults too. The user can override it with None, if needed.

Contributor Author


Changed the default to 42 for TableSampler; the existing DataSampler can be done in another PR.

This sampler executes a simple SELECT * query with a LIMIT clause
to retrieve the head of the table. The order of rows is determined
by the table's natural order (or index if present).
Collaborator


What is this 'or index if present'? I know of no such thing.

Contributor Author


I updated the docstring

})

# Create actual table in DuckDB (not just register)
qualified_name = DuckdbName.qualify(table_name, conn)
Collaborator


Tip: you can skip all this code by working with higher-level APIs. Given a dataframe data,

DatasetSink.into_unqualified_duckdb_table(table_name).write(Dataset.from_polars(data).as_source(), conn)

(It's also a one-liner when using the RunContext API)

BTW, the code as you wrote it isn't quite correct: to preserve the schema correctly for an arbitrary dataframe, you need to call restore_relation_types, which, at the API level you were using, means registering the Relation returned by Dataset.to_duckdb rather than the dataframe.

Contributor Author


Great! I updated the code.

result = sampler.sample(table, conn, sample_size=20)

# Validate result
assert result.data.height == 20
Collaborator


Tip: it's both shorter and more comprehensive (includes dtypes) to compare dataframes: result.data.equals(expected_data.head(20)). However, you'd need the expected_data dataframe.

Contributor Author


True, I now read expected = conn.table(str(table.table.name)).pl(); I think this is the correct way to do it for these small-scale tests.

# Different seeds should (very likely) produce different starting points
ids1 = result1.data['id'].to_list()
ids2 = result2.data['id'].to_list()
assert ids1[0] != ids2[0] or ids1 != ids2
Collaborator


Either side of the or should be enough, why test both?

Contributor Author


True, I changed to only compare the first id values

@yotam319-sparkbeyond yotam319-sparkbeyond force-pushed the feat/435-create-duckdb-samplers branch from be4f8cb to abd4a31 on January 13, 2026 15:59
@yotam319-sparkbeyond yotam319-sparkbeyond force-pushed the feat/435-create-duckdb-samplers branch from 0e825c5 to a1a9ec2 on January 14, 2026 11:18
@leonidb leonidb requested a review from danarmak January 14, 2026 12:17
@leonidb leonidb dismissed danarmak’s stale review January 14, 2026 12:56

Reviewed by me, according to comments

@leonidb leonidb merged commit e808a6d into main Jan 14, 2026
1 check passed
@leonidb leonidb deleted the feat/435-create-duckdb-samplers branch January 14, 2026 12:57