7 changes: 7 additions & 0 deletions CHANGELOG.md
Expand Up @@ -21,6 +21,13 @@ Types of changes:
- Allow setting `monitor_only` on check bundles and on individual checks
- Map provider/table-not-found errors to a unified `table_exists` metric
- Support identifier-type filters without explicit column/value for naming; add `identifier_format` option (`identifier`, `filter_name`, `column_name`) to control the result identifier column
- Treat identifier filters with missing/null `value` as a configurable placeholder for logging and naming (defaults to `ALL`)
- Add `identifier_placeholder` option to configure the placeholder value used when identifier filters lack a value; defaults to `ALL` and is applied to the result IDENTIFIER column and logging for clearer partition naming.

### Fixed

- Quote table identifiers in bulk SELECTs when loading data into DuckDB memory to avoid BigQuery binder errors for identifiers that look like project IDs (e.g., `EC0601`). Added an integration test covering the quoting behavior.
- Ensure MatchRateCheck only requires the check column from the left table; the right table now only contributes join and filter columns to avoid unnecessary column selection and errors.

## [0.9.0] - 2026-01-16

6 changes: 6 additions & 0 deletions README.md
Expand Up @@ -193,6 +193,12 @@ Koality supports an `identifier` filter type which can be used to mark the field

If an identifier-type filter is defined without a concrete `column` or `value` (for example in global `defaults`), it is treated as a naming-only hint and will not be turned into a WHERE clause; this is useful when you only want to control the result identifier column name (e.g., `SHOP_ID`) across checks.
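
A minimal sketch of the naming-only rule described above, assuming filters are plain dictionaries with optional `column`, `value`, and `type` keys. The helper `build_where_clause` is hypothetical and is not Koality's actual implementation; it only illustrates that a filter without a concrete `column` or `value` contributes no predicate.

```python
from typing import Any


def build_where_clause(filters: dict[str, dict[str, Any]]) -> str:
    """Build a WHERE clause, skipping naming-only filters (hypothetical helper)."""
    predicates = []
    for config in filters.values():
        column = config.get("column")
        value = config.get("value")
        # A filter without a concrete column or value is only a naming hint.
        if not column or value is None:
            continue
        # Illustration only: real code should escape or parameterize values.
        predicates.append(f"{column} = '{value}'")
    return "WHERE " + " AND ".join(predicates) if predicates else ""


filters = {
    "shop_id": {"type": "identifier"},  # naming-only: no column, no value
    "country": {"column": "country_code", "value": "DE"},  # assumed example filter
}
print(build_where_clause(filters))
```

Only the `country` filter ends up in the clause here; the `shop_id` entry still names the identifier column but does not restrict the data.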

Behavior for missing identifier values

When an identifier-type filter is present but its `value` is missing or explicitly `null`, Koality substitutes a configurable placeholder for logging and naming so that `None` does not appear in metric messages. The placeholder is set with `defaults.identifier_placeholder` (default: `ALL`) and can be overridden at the bundle level (in the bundle's `defaults`) or on an individual check.
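
The fallback can be pictured with a small standalone function. This is a sketch under the assumption that a filter is a dictionary whose `value` key may be absent or `None`; `resolve_identifier_value` is a hypothetical name used only for illustration.

```python
from typing import Any


def resolve_identifier_value(filter_config: dict[str, Any], placeholder: str = "ALL") -> Any:
    """Return the filter value, or the placeholder when it is missing or None."""
    value = filter_config.get("value")
    return placeholder if value is None else value


print(resolve_identifier_value({"column": "shop_code", "value": "SHOP01"}))
print(resolve_identifier_value({"column": "shop_code", "value": None}))
print(resolve_identifier_value({"column": "shop_code"}, placeholder="UNKNOWN"))
```

A missing key and an explicit `None` are treated the same way, while any other value (including an empty string) is passed through unchanged in this sketch.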

Additional docs: see docs/identifier_placeholder.md for usage examples and configuration details.

### Filter Properties

| Property | Description |
33 changes: 33 additions & 0 deletions docs/checks/matchrate.md
@@ -0,0 +1,33 @@
# MatchRateCheck — check_column placement

Guidance:

- The `check_column` for `MatchRateCheck` only needs to exist in the left-hand table.
- The right-hand table only has to provide its join columns and filter columns; it does not need to contain `check_column`.

Configuration example:

```yaml
- defaults:
    check_type: MatchRateCheck
    check_column: product_number  # must exist on the left table
    join_columns_left:
      - BQ_PARTITIONTIME
      - shopId
      - product_number
    join_columns_right:
      - BQ_PARTITIONTIME
      - value.shopId
      - product_number
  checks:
    - left_table: project.dataset.left_table
      right_table: project.dataset.right_table
      filters:
        shop_id:
          value: SHOP01
```

Notes:

- The executor only requests the `check_column` from the left table during bulk loading; the right table will only be queried for its join and filter columns.
- This avoids errors when the right table does not contain the `check_column` or when its identifier resembles a BigQuery project ID.
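
The column selection described in the notes can be sketched outside the executor. The `MatchRateSpec` dataclass and `required_columns` function below are hypothetical stand-ins, not Koality classes; they assume filters are dictionaries with an optional `column` key, and the column `product_name` is invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class MatchRateSpec:
    """Hypothetical stand-in for the MatchRateCheck fields used here."""

    left_table: str
    right_table: str
    check_column: str
    join_columns_left: list[str]
    join_columns_right: list[str]
    filters_left: dict[str, dict] = field(default_factory=dict)
    filters_right: dict[str, dict] = field(default_factory=dict)


def required_columns(spec: MatchRateSpec) -> dict[str, set[str]]:
    """Collect the columns to load per table for one match rate check."""
    columns: defaultdict[str, set[str]] = defaultdict(set)
    # Left table: check column, join columns, and filter columns.
    if spec.check_column and spec.check_column != "*":
        columns[spec.left_table].add(spec.check_column)
    columns[spec.left_table].update(spec.join_columns_left)
    columns[spec.left_table].update(f["column"] for f in spec.filters_left.values() if "column" in f)
    # Right table: join and filter columns only, never the check column.
    columns[spec.right_table].update(spec.join_columns_right)
    columns[spec.right_table].update(f["column"] for f in spec.filters_right.values() if "column" in f)
    return dict(columns)


spec = MatchRateSpec(
    left_table="project.dataset.left_table",
    right_table="project.dataset.right_table",
    check_column="product_name",  # hypothetical column that exists only on the left
    join_columns_left=["shopId", "product_number"],
    join_columns_right=["value.shopId", "product_number"],
    filters_right={"shop_id": {"column": "value.shopId", "value": "SHOP01"}},
)
requirements = required_columns(spec)
print(sorted(requirements["project.dataset.right_table"]))
```

The right table's set contains only its join and filter columns, so a bulk SELECT built from it never references `product_name`.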
2 changes: 2 additions & 0 deletions docs/configuration.md
Expand Up @@ -406,6 +406,8 @@ filters:

The identifier value appears in check results and failure messages. How it's formatted depends on the `identifier_format` global setting.

For more details about naming-only identifier filters and the `identifier_placeholder` option, see the guide: [Identifier filters and naming](../identifier_filters.md).

### Date Filters

When `type: date` is set, the value is automatically parsed as a date. Supported formats:
66 changes: 66 additions & 0 deletions docs/identifier_placeholder.md
@@ -0,0 +1,66 @@
Identifier Placeholder

When an identifier-type filter (`type: identifier`) is defined but its value is missing or explicitly `null`, Koality fills the result IDENTIFIER column and log messages with a configurable placeholder string. This avoids `None` or empty identifiers in persisted results and monitoring UIs.

Configuration

You can set the placeholder in the following locations (more specific levels override less specific ones):

- `defaults.identifier_placeholder` (global default)
- `check_bundles.<bundle>.defaults.identifier_placeholder` (bundle-level)
- `check_bundles.<bundle>.checks.<i>.identifier_placeholder` (check-level)

If not set, the placeholder defaults to `ALL`.
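
The precedence rule above (check over bundle over global, then `ALL`) can be sketched as a lookup over plain dictionaries. `resolve_placeholder` is a hypothetical helper, and the way Koality actually merges configuration levels may differ.

```python
from typing import Any


def resolve_placeholder(
    global_defaults: dict[str, Any],
    bundle_defaults: dict[str, Any],
    check_config: dict[str, Any],
    fallback: str = "ALL",
) -> str:
    """Pick the most specific identifier_placeholder (hypothetical helper)."""
    # Walk from the most specific level to the least specific one.
    for level in (check_config, bundle_defaults, global_defaults):
        placeholder = level.get("identifier_placeholder")
        if placeholder is not None:
            return placeholder
    return fallback


print(resolve_placeholder({"identifier_placeholder": "UNKNOWN"}, {}, {}))
print(resolve_placeholder(
    {"identifier_placeholder": "UNKNOWN"},
    {"identifier_placeholder": "ALL_SHOPS"},
    {"identifier_placeholder": "SHOP_UNKNOWN"},
))
print(resolve_placeholder({}, {}, {}))
```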

Examples

1) Global default placeholder:

```yaml
defaults:
  identifier_placeholder: UNKNOWN
  filters:
    shop_id:
      column: shop_code
      type: identifier
      value: null
```

Result: IDENTIFIER uses `UNKNOWN` for checks that rely on the `shop_id` identifier when no concrete value is provided.

2) Bundle-level override:

```yaml
check_bundles:
  - name: my_bundle
    defaults:
      identifier_placeholder: ALL_SHOPS
      filters:
        shop_id:
          column: shop_code
          type: identifier
          value: null
```

3) Check-level override:

```yaml
check_bundles:
  - name: my_bundle
    defaults:
      filters:
        shop_id:
          column: shop_code
          type: identifier
    checks:
      - check_type: CountCheck
        identifier_placeholder: SHOP_UNKNOWN
        filters:
          shop_id:
            value: null
```

Notes

- The placeholder is used only for naming and logging. An identifier filter that lacks both a `value` and a `column` does not produce a WHERE clause.
- Use a descriptive placeholder (e.g., `ALL`, `UNKNOWN`, `ALL_SHOPS`) to make results easier to interpret in dashboards and logs.
26 changes: 25 additions & 1 deletion src/koality/checks.py
Expand Up @@ -41,6 +41,7 @@ def __init__(
*,
filters: dict[str, Any] | None = None,
identifier_format: str = "identifier",
identifier_placeholder: str = "ALL",
date_info: str | None = None,
extra_info: str | None = None,
monitor_only: bool = False,
Expand All @@ -62,6 +63,7 @@ def __init__(

# Identifier format configuration
self.identifier_format = identifier_format
self.identifier_placeholder = identifier_placeholder

# for where filter handling
self.filters = self.get_filters(filters or {})
Expand All @@ -70,7 +72,11 @@ def __init__(
        identifier_filter_result = self.get_identifier_filter(self.filters)
        if identifier_filter_result:
            filter_name, filter_config = identifier_filter_result
            value = filter_config.get("value", "ALL")
            # If value key is missing or explicitly None, treat as identifier_placeholder (meaning "no specific value")
            if "value" in filter_config and filter_config["value"] is not None:
                value = filter_config["value"]
            else:
                value = self.identifier_placeholder
            column = filter_config.get("column", "")

            if self.identifier_format == "identifier":
Expand Down Expand Up @@ -465,6 +471,7 @@ def __init__(
*,
filters: dict[str, Any] | None = None,
identifier_format: str = "identifier",
identifier_placeholder: str = "ALL",
date_info: str | None = None,
extra_info: str | None = None,
monitor_only: bool = False,
Expand All @@ -481,6 +488,7 @@ def __init__(
upper_threshold=upper_threshold,
filters=filters,
identifier_format=identifier_format,
identifier_placeholder=identifier_placeholder,
date_info=date_info,
extra_info=extra_info,
monitor_only=monitor_only,
Expand Down Expand Up @@ -563,6 +571,7 @@ def __init__(
*,
filters: dict[str, Any] | None = None,
identifier_format: str = "identifier",
identifier_placeholder: str = "ALL",
date_info: str | None = None,
extra_info: str | None = None,
monitor_only: bool = False,
Expand All @@ -578,6 +587,7 @@ def __init__(
upper_threshold=upper_threshold,
filters=filters,
identifier_format=identifier_format,
identifier_placeholder=identifier_placeholder,
date_info=date_info,
extra_info=extra_info,
monitor_only=monitor_only,
Expand Down Expand Up @@ -628,6 +638,7 @@ def __init__(
*,
filters: dict[str, Any] | None = None,
identifier_format: str = "identifier",
identifier_placeholder: str = "ALL",
date_info: str | None = None,
extra_info: str | None = None,
monitor_only: bool = False,
Expand All @@ -645,6 +656,7 @@ def __init__(
upper_threshold=upper_threshold,
filters=filters,
identifier_format=identifier_format,
identifier_placeholder=identifier_placeholder,
date_info=date_info,
extra_info=extra_info,
monitor_only=monitor_only,
Expand Down Expand Up @@ -699,6 +711,7 @@ def __init__(
extra_info: str | None = None,
filters: dict[str, Any] | None = None,
identifier_format: str = "identifier",
identifier_placeholder: str = "ALL",
date_info: str | None = None,
) -> None:
"""Initialize the values in set check."""
Expand All @@ -718,6 +731,7 @@ def __init__(
upper_threshold=upper_threshold,
filters=filters,
identifier_format=identifier_format,
identifier_placeholder=identifier_placeholder,
date_info=date_info,
extra_info=extra_info,
monitor_only=monitor_only,
Expand Down Expand Up @@ -855,6 +869,7 @@ def __init__(
*,
filters: dict[str, Any] | None = None,
identifier_format: str = "identifier",
identifier_placeholder: str = "ALL",
date_info: str | None = None,
extra_info: str | None = None,
monitor_only: bool = False,
Expand All @@ -870,6 +885,7 @@ def __init__(
upper_threshold=upper_threshold,
filters=filters,
identifier_format=identifier_format,
identifier_placeholder=identifier_placeholder,
date_info=date_info,
extra_info=extra_info,
monitor_only=monitor_only,
Expand Down Expand Up @@ -921,6 +937,7 @@ def __init__(
distinct: bool = False,
filters: dict[str, Any] | None = None,
identifier_format: str = "identifier",
identifier_placeholder: str = "ALL",
date_info: str | None = None,
extra_info: str | None = None,
monitor_only: bool = False,
Expand All @@ -942,6 +959,7 @@ def __init__(
upper_threshold=upper_threshold,
filters=filters,
identifier_format=identifier_format,
identifier_placeholder=identifier_placeholder,
date_info=date_info,
extra_info=extra_info,
monitor_only=monitor_only,
Expand Down Expand Up @@ -1235,6 +1253,7 @@ def __init__(
filters_left: dict[str, Any] | None = None,
filters_right: dict[str, Any] | None = None,
identifier_format: str = "identifier",
identifier_placeholder: str = "ALL",
date_info: str | None = None,
) -> None:
"""Initialize the match rate check."""
Expand Down Expand Up @@ -1270,6 +1289,7 @@ def __init__(
upper_threshold=upper_threshold,
filters=filters,
identifier_format=identifier_format,
identifier_placeholder=identifier_placeholder,
date_info=date_info,
extra_info=extra_info,
monitor_only=monitor_only,
Expand Down Expand Up @@ -1405,6 +1425,7 @@ def __init__(
*,
filters: dict[str, Any] | None = None,
identifier_format: str = "identifier",
identifier_placeholder: str = "ALL",
date_info: str | None = None,
extra_info: str | None = None,
monitor_only: bool = False,
Expand Down Expand Up @@ -1434,6 +1455,7 @@ def __init__(
upper_threshold=upper_threshold,
filters=filters,
identifier_format=identifier_format,
identifier_placeholder=identifier_placeholder,
date_info=date_info,
extra_info=extra_info,
monitor_only=monitor_only,
Expand Down Expand Up @@ -1569,6 +1591,7 @@ def __init__(
*,
filters: dict[str, Any] | None = None,
identifier_format: str = "identifier",
identifier_placeholder: str = "ALL",
date_info: str | None = None,
extra_info: str | None = None,
monitor_only: bool = False,
Expand Down Expand Up @@ -1611,6 +1634,7 @@ def __init__(
upper_threshold=math.inf,
filters=filters,
identifier_format=identifier_format,
identifier_placeholder=identifier_placeholder,
date_info=date_info,
extra_info=extra_info,
monitor_only=monitor_only,
52 changes: 41 additions & 11 deletions src/koality/executor.py
Expand Up @@ -215,6 +215,33 @@ def get_data_requirements(self) -> defaultdict[str, defaultdict[str, set]]: # n
        """
        data_requirements = defaultdict(lambda: defaultdict(set))
        for check in self.checks:
            # Skip synthetic JOIN table entries created for MatchRateCheck; handle left/right tables explicitly
            if isinstance(check, MatchRateCheck):
                # Add columns and filter columns for left table
                if check.check_column and check.check_column != "*":
                    data_requirements[check.left_table]["columns"].add(check.check_column)
                for _filter in check.filters_left.values():
                    if "column" in _filter:
                        data_requirements[check.left_table]["columns"].add(_filter["column"])
                data_requirements[check.left_table]["columns"].update(check.join_columns_left)

                # Add only filter and join columns for right table (check_column is only from left table)
                for _filter in check.filters_right.values():
                    if "column" in _filter:
                        data_requirements[check.right_table]["columns"].add(_filter["column"])
                data_requirements[check.right_table]["columns"].update(check.join_columns_right)

                # Store unique filter configurations for both tables
                filter_key_left = frozenset(
                    (name, frozenset(config.items())) for name, config in check.filters_left.items()
                )
                filter_key_right = frozenset(
                    (name, frozenset(config.items())) for name, config in check.filters_right.items()
                )
                data_requirements[check.left_table]["filters"].add(filter_key_left)
                data_requirements[check.right_table]["filters"].add(filter_key_right)
                continue

            table_name = check.table
            check_filters = check.filters
            # Add check-specific columns and filter columns to the requirements
Expand All @@ -224,16 +251,12 @@ def get_data_requirements(self) -> defaultdict[str, defaultdict[str, set]]: # n
                if "column" in _filter:
                    data_requirements[table_name]["columns"].add(_filter["column"])

            # For MatchRateCheck, add columns from both left and right tables
            if isinstance(check, MatchRateCheck):
                data_requirements[check.left_table]["columns"].update(check.join_columns_left)
                data_requirements[check.right_table]["columns"].update(check.join_columns_right)
                for _filter in check.filters_left.values():
                    if "column" in _filter:
                        data_requirements[check.left_table]["columns"].add(_filter["column"])
                for _filter in check.filters_right.values():
                    if "column" in _filter:
                        data_requirements[check.right_table]["columns"].add(_filter["column"])
            if isinstance(check, IqrOutlierCheck):
                check_filters = {k: v for k, v in check.filters.items() if v.get("type") != "date"}

            # Store unique filter configurations for each table
            filter_key = frozenset((name, frozenset(config.items())) for name, config in check_filters.items())
            data_requirements[table_name]["filters"].add(filter_key)

            if isinstance(check, IqrOutlierCheck):
                check_filters = {k: v for k, v in check.filters.items() if v.get("type") != "date"}
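
The filter-key construction in this hunk relies on `frozenset` being hashable and order-independent, so identical filter configurations collapse into a single set entry. A small sketch, assuming filter configs contain only hashable values (a list-valued entry would raise `TypeError`); `make_filter_key` is a hypothetical helper that mirrors the expression used above.

```python
from typing import Any


def make_filter_key(filters: dict[str, dict[str, Any]]) -> frozenset:
    """Turn a filter mapping into a hashable, order-independent key."""
    return frozenset((name, frozenset(config.items())) for name, config in filters.items())


a = {"shop_id": {"column": "shop_code", "value": "SHOP01"}}
b = {"shop_id": {"value": "SHOP01", "column": "shop_code"}}  # same config, different key order
c = {"shop_id": {"column": "shop_code", "value": "SHOP02"}}

# a and b produce the same key, so the set keeps two entries instead of three.
unique_filters = {make_filter_key(a), make_filter_key(b), make_filter_key(c)}
print(len(unique_filters))
```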
Expand Down Expand Up @@ -271,10 +294,16 @@ def fetch_data_into_memory(self, data_requirements: defaultdict[str, defaultdict
            if all_filters_sql:
                final_where_clause = "WHERE " + " OR ".join(all_filters_sql)

            # Determine appropriate table quoting depending on database provider
            if self.database_provider and getattr(self.database_provider, "type", "").lower() == "bigquery":
                table_ref = f"`{table}`"
            else:
                table_ref = f'"{table}"'

            # Construct the bulk SELECT query
            select_query = f"""
            SELECT {columns}
            FROM {table}
            FROM {table_ref}
            {final_where_clause}
            """  # noqa: S608
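
The provider-dependent quoting in this hunk can be isolated into a small function for illustration. `quote_table` is a hypothetical helper, not part of Koality; it assumes the provider type arrives as an optional string and does not escape quote characters that might appear inside the table name.

```python
from typing import Optional


def quote_table(table: str, provider_type: Optional[str]) -> str:
    """Quote a table reference: backticks for BigQuery, double quotes otherwise."""
    if (provider_type or "").lower() == "bigquery":
        return f"`{table}`"
    return f'"{table}"'


print(quote_table("EC0601.dataset.orders", "BigQuery"))
print(quote_table("orders", None))
```

Treating a missing provider type as an empty string keeps the `.lower()` call safe when no provider is configured.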

Expand Down Expand Up @@ -318,6 +347,7 @@ def execute_checks(self) -> None:
check_kwargs["database_accessor"] = self.config.database_accessor
check_kwargs["database_provider"] = self.database_provider
check_kwargs["identifier_format"] = self.config.defaults.identifier_format
check_kwargs["identifier_placeholder"] = self.config.defaults.identifier_placeholder
check_instance = check_factory(**check_kwargs)
self.checks.append(check_instance)
