57 new preprocessor step for whitelisting of CSV columns #58

BBertram-hex · 2026-01-28T15:16:38Z

Introduce new keywords in the havocompare config file:

rules:
  CSV:
    Preprocessing:

- KeepColumnsByName
- KeepColumnsByNameG
- DeleteColumnByNameG

DeleteColumnByName (without the 'G') already exists as a preprocessor step that takes a single string. It performs an exact match and deletes the first column whose header exactly matches that string, from both the actual and the nominal CSV file. In the later comparison, such columns are treated as if they had never been in actual.csv or nominal.csv (or both).

KeepColumnsByName

KeepColumnsByName takes a list of strings that are compared exactly to the extracted table headers. Only columns whose header exactly matches at least one string from that list are kept, e.g.

        - KeepColumnsByName:
            - "Center x [mm]"
            - "Center y [mm]"

Applied to a CSV like

Center x [mm]	Center y [mm]	Center z [mm]	any other string	Center x [mm]	Center x
1	2	3	4	5	6
....

would keep only the columns whose header matches one of the strings in the config and mark all other columns as deleted:

Center x [mm]	Center y [mm]	DELETED	DELETED	Center x [mm]	DELETED
1	2	DELETED	DELETED	5	DELETED
....
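The whitelist behavior shown in the table above can be sketched in a few lines of Rust. This is only an illustration, not the code from this PR: the function name `keep_headers_exact` and the use of a `DELETED` string as marker are assumptions taken from the example table.

```rust
/// Marker for dropped columns. Assumption: borrowed from the example table,
/// not from the havocompare source.
const DELETED: &str = "DELETED";

/// Hypothetical helper: returns the header row after an exact-match whitelist.
/// Headers equal to any entry in `keep` stay, all others become `DELETED`.
fn keep_headers_exact(headers: &[&str], keep: &[&str]) -> Vec<String> {
    headers
        .iter()
        .map(|h| {
            if keep.contains(h) {
                (*h).to_string()
            } else {
                DELETED.to_string()
            }
        })
        .collect()
}
```

Called with the six headers from the example and the two configured strings, both "Center x [mm]" columns and the "Center y [mm]" column survive, while "Center z [mm]", "any other string" and "Center x" are marked as deleted.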

Globbing variants (New)

Use the suffix 'G' in the preprocessor step name; then '*', '**' and '?' can be used as wildcards in the list of pattern strings.
Preprocessor steps that support globbing:

- KeepColumnsByNameG
- DeleteColumnByNameG

How globbing works with KeepColumnsByNameG:

- First, an exact match of the header is tried (like the non-globbing variant above).
  - If the strings are exactly equal, the column is kept.
- If the string from the config doesn't match the header exactly, it is interpreted as a glob pattern.
  - If the glob matches, the column is kept.
- If a column in the CSV matches the config pattern neither exactly nor as a glob, the column is marked as deleted in the nominal and actual file (and thus ignored).

Note: Square brackets, as in '[um]', also denote a wildcard that matches 'u' OR 'm'. That's why an exact match is tried first, so that the glob variant always matches (and therefore compares) everything the exact variant would, and more.

        - KeepColumnsByNameG:
            - "Center *"

matches in CSV:

"Center x [mm]", "Center y [mm]",
but also a header that reads exactly "Center *" (here '*' is not a wildcard)

KeepColumnsByNameG with the pattern "Center ? [mm]", however, would not match "Center x [mm]": the strings are not equal char-by-char, and as a glob "[mm]" is a character set, so the pattern only matches "Center <any character> m". The glob "Center [xyz] [[]mm[]]" would do what is intended (exact unit, spaces, and x, y, or z in the right place), but is maybe too complicated to read.
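The bracket pitfall and the "exact match first, glob as fall-back" rule can be tried out with a small stand-alone sketch. The matcher below is a deliberately minimal stand-in written for this illustration. It is not the glob library used in the PR and only handles `*`, `?` and simple character sets like `[xyz]` (no ranges, no negation, no special meaning for `**`). The names `glob_match` and `header_matches` are hypothetical.

```rust
/// Minimal glob matcher, for illustration only (assumption: simplified rules).
/// Supports `*`, `?` and plain character sets such as `[xyz]`, `[[]` or `[]]`.
fn glob_match(pattern: &[char], text: &[char]) -> bool {
    let (p, rest) = match pattern.split_first() {
        Some((&p, rest)) => (p, rest),
        None => return text.is_empty(),
    };
    match p {
        // `*` matches any run of characters, including the empty one.
        '*' => (0..=text.len()).any(|i| glob_match(rest, &text[i..])),
        // `?` matches exactly one character.
        '?' => !text.is_empty() && glob_match(rest, &text[1..]),
        // A set is closed by the first `]` after at least one member,
        // so `[]]` is a set that contains only `]`.
        '[' => match rest.iter().skip(1).position(|&c| c == ']') {
            Some(pos) => {
                let close = pos + 1;
                let (members, after) = (&rest[..close], &rest[close + 1..]);
                !text.is_empty()
                    && members.contains(&text[0])
                    && glob_match(after, &text[1..])
            }
            // No closing bracket: treat `[` as a literal character.
            None => text.first() == Some(&'[') && glob_match(rest, &text[1..]),
        },
        // Any other character must match literally.
        c => text.first() == Some(&c) && glob_match(rest, &text[1..]),
    }
}

/// Hypothetical rule check: exact comparison first, glob only as fall-back.
fn header_matches(rule: &str, header: &str) -> bool {
    if rule == header {
        return true;
    }
    let p: Vec<char> = rule.chars().collect();
    let t: Vec<char> = header.chars().collect();
    glob_match(&p, &t)
}
```

With this sketch, `header_matches("Center x [mm]", "Center x [mm]")` is true only because of the exact comparison; read purely as a glob, the same string would not match itself. `header_matches("Center ? [mm]", "Center x [mm]")` is false, while `header_matches("Center [xyz] [[]mm[]]", "Center y [mm]")` is true.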

closes #57

src/csv/preprocessing.rs

TheAdiWijaya · 2026-01-29T06:56:41Z

src/csv/preprocessing.rs

+        ))
+    })?;
+
+    if let Some(c) = table.columns.iter_mut().find(|col| {


shouldn't we use filter() here? find will only catch 1 result, and we want more?

extending the unit test might also be a good idea
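To make the difference behind this question concrete, here is a small sketch of the two iterator styles. The `Column` struct with a `deleted` flag is a hypothetical stand-in for this example and is not the type used in src/csv/preprocessing.rs.

```rust
/// Hypothetical column type, reduced to what the example needs.
struct Column {
    header: String,
    deleted: bool,
}

/// `find` stops at the first hit, so at most one column is marked.
fn delete_first(columns: &mut [Column], name: &str) {
    if let Some(c) = columns.iter_mut().find(|col| col.header == name) {
        c.deleted = true;
    }
}

/// `filter` visits every hit, so all columns with that header are marked.
fn delete_all(columns: &mut [Column], name: &str) {
    columns
        .iter_mut()
        .filter(|col| col.header == name)
        .for_each(|c| c.deleted = true);
}
```

For headers `a, b, a`, `delete_first` marks only the first `a`, whereas `delete_all` marks both. A unit test with duplicate headers would pin down whichever behavior is intended.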

BBertram-hex · 2026-02-11

I thought about it for a while and came up with 3 candidate behaviors to fix the inconsistencies between the new filter steps and "DeleteColumnByName", which is already out there in version 0.8.

Reasoning with pros (+) and cons (-):

- fix_A: change the old behavior of "DeleteColumnByName", so that all matching columns are deleted.
  - (+) only 3 new keywords: DeleteColumnByNameG, KeepColumnsByName, KeepColumnsByNameG.
  - (+) DeleteColumnByNameG consistently deletes all matching columns as well, from the next release on.
  - (-) in edge cases where CSVs have several identical headers, this will silently make tests less sensitive when upgrading havocompare.
  - (-) the singular "Column" somewhat implies that DeleteColumnByName only deletes a single column.
- fix_B: 4 new keywords, all using the plural "Columns"; the new keywords delete/keep all matching columns.
  - deprecate DeleteColumnByName: do not change its detection behavior, but print a warning when it is used, saying that it will be removed in the future.
  - new DeleteColumnsByName: delete all columns matching the string exactly.
  - new DeleteColumnsByNameG: delete all columns matching the string, either exactly or interpreted as a glob pattern.
  - new KeepColumnsByName: keep all columns that match any of the strings in the list exactly.
  - new KeepColumnsByNameG: keep all columns that match any of the strings in the list, either exactly or interpreted as a glob pattern.
  - (+) no compat problem
  - (+) consistent naming (only plural)
  - (-) some debt: removing DeleteColumnByName at some point (when, how?)
  - (+) simpler behavior, because delete and keep work similarly, deleting/keeping all columns that match
  - (-) more code
- fix_C: singular keywords "DeleteColumnByName(G)" but plural keywords "KeepColumnsByName(G)".
  - DeleteColumnByName works as in 0.8.
  - DeleteColumnByNameG consistently deletes only the first match (i.e. the first column that matches either way, exactly or as a glob pattern).
  - several (N) "DeleteColumnByName" steps with the same string should match and delete the first N columns that match the string. N "DeleteColumnByNameG" steps delete the first N columns that match either exactly or as a glob pattern.
  - (+) no compat problem
  - (+) consistent naming, using singular vs plural
  - (-) more complex behavior, because delete and keep work differently, affecting the first vs all columns
  - (-) scenario: I want 2 rules, rule_1 for the "Special" columns and rule_2 for all the other columns (default). I would use KeepColumnsByNameG "Special ?" in rule_1 and DeleteColumnByNameG "Special ?" in rule_2. But the 2 rules do not complement each other when several "Special " columns occur, because the delete step only removes the first one. If (alternatively) Delete + glob also worked on all matches, the inconsistency would be having both "DeleteColumnByName" (singular) and "DeleteColumnsByName" (plural) in the new set of commands.

However:
=> because we want to be on the safe side with a test tool, do not implement fix_A for now.
=> the current state of the branch is closer to fix_C, so let's complete that variant first.
=> implement fix_B later.
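The fix_C semantics (each delete step removes only the first remaining match, so N identical steps remove the first N matches) might look roughly like the sketch below. It is an illustration under assumptions, not the code on the branch: the `Column` struct, the `deleted` flag and `delete_first_match` are hypothetical, and the matching rule is passed in as a closure so that it can stand for either the exact or the glob variant.

```rust
/// Hypothetical column type, reduced to what the example needs.
struct Column {
    header: String,
    deleted: bool,
}

/// One delete step in the fix_C sense: mark the first column that is not yet
/// deleted and whose header satisfies `matches`. Returns whether one was found.
fn delete_first_match(columns: &mut [Column], matches: impl Fn(&str) -> bool) -> bool {
    match columns
        .iter_mut()
        .find(|c| !c.deleted && matches(c.header.as_str()))
    {
        Some(c) => {
            c.deleted = true;
            true
        }
        None => false,
    }
}
```

Skipping columns that are already deleted is what lets a second identical step reach the second match. Tracking deletion in a separate flag, instead of overwriting the header with a marker string, also keeps a broad pattern from matching the marker itself. Whether the branch solves it this way is not stated in the thread.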

TheAdiWijaya · 2026-01-29T07:00:20Z

src/csv/preprocessing.rs

    use crate::csv::{Column, Delimiters, Error};
    use std::fs::File;

+    macro_rules! string_vec {


feels like this is overkill :D ... might use vec!["".to_owned()] or change the arguments in keep_columns_matching_any_names to use Vec<&str>

BBertram-hex · 2026-02-11

Taking Vec<&str> as an argument introduces a superfluous dependency on Vec just for testing.
-> clippy also complains

I still have to understand the vec![...] you proposed.
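One way to reconcile both points, sketched here under assumptions: a function that takes a slice of anything string-like accepts a plain array literal in tests and a `vec!["...".to_owned()]` elsewhere, without a helper macro. The function name `matches_any_name` and its signature are hypothetical and not the signature of keep_columns_matching_any_names in this PR.

```rust
/// Hypothetical helper: true if `header` equals any entry of `names`.
/// `names` is a slice, so callers may pass arrays, Vec<String> or Vec<&str>.
fn matches_any_name<S: AsRef<str>>(header: &str, names: &[S]) -> bool {
    names.iter().any(|n| n.as_ref() == header)
}

/// The `vec![...]` form from the review comment builds a Vec<String>
/// directly from string literals.
fn owned_names() -> Vec<String> {
    vec!["Center x [mm]".to_owned(), "Center y [mm]".to_owned()]
}
```

In a test, `matches_any_name("a", &["a", "b"])` works with a bare array, and `matches_any_name("Center x [mm]", &owned_names())` works with the owned vector, so neither the macro nor a Vec parameter is needed.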


BBertram-hex requested a review from TheAdiWijaya January 28, 2026 15:16

BBertram-hex self-assigned this Jan 28, 2026

TheAdiWijaya reviewed Jan 29, 2026


BBertram-hex added 7 commits February 11, 2026 16:36

57: Whitelisting with KeepColumnByName

27784a6

- exact match of the header string - other columns will be deleted.

57: Whitelisting with KeepColumnByName accepts glob patterns

25613d5

57: DeleteColumnByName accepts glob pattern

a64fa17

- as fall-back, if CSV header is not exactly the same as the string in the rule

57: cleanup

28ae5c8

- fixed wrong error string
- clippy -> use slice instead of specific Container

57: unit tests:

70bc3bd

- create a Table from CSV string content for testing
- test/doc current behavior of extract_headers()

57: implement explicit 'G' variants that use globbing as fall-back

3ad619b

- KeepColumnsByNameG
- DeleteColumnByNameG
- TODO: fix ambiguity

57: fix DeleteColumnByNameG case

de22c83

- two columns matching the glob should be removed - but the DELETED marker matches the glob pattern.

BBertram-hex force-pushed the 57_csv_preprocessor_whitlisting_columns branch from b992ae0 to de22c83 Compare February 11, 2026 17:35
