Collaborator

@BBertram-hex BBertram-hex commented Jan 28, 2026

Introduce new keywords in the havocompare config file:

rules:
  CSV:
    Preprocessing:
  • KeepColumnsByName
  • KeepColumnsByNameG
  • DeleteColumnByNameG

DeleteColumnByName: "" (without the 'G') already exists as a preprocessor step. It does an exact match and deletes the first column whose header exactly matches the configured string, from both the actual and the nominal CSV file. In the later comparison, such columns are treated as if they had never been in actual.csv or nominal.csv (or both).

KeepColumnsByName

KeepColumnsByName takes a list of strings that are compared exactly to the extracted table headers. Only columns whose header exactly matches at least one string from that list are kept, e.g.

        - KeepColumnsByName:
            - "Center x [mm]"
            - "Center y [mm]"

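Applied to a CSV like

| Center x [mm] | Center y [mm] | Center z [mm] | any other string | Center x [mm] | Center x |
|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 |
| ... | ... | ... | ... | ... | ... |

this would keep every column whose header matches any of the strings in the config and delete all the others:

| Center x [mm] | Center y [mm] | DELETED | DELETED | Center x [mm] | DELETED |
|---|---|---|---|---|---|
| 1 | 2 | DELETED | DELETED | 5 | DELETED |
| ... | ... | ... | ... | ... | ... |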
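The exact-match behavior above can be sketched as follows. This is only an illustration: the `Column` struct, the `DELETED` constant and the function names are hypothetical stand-ins, not havocompare's actual types or API.

```rust
/// Hypothetical stand-in for a CSV column (not havocompare's real type).
#[derive(Debug, Clone, PartialEq)]
struct Column {
    header: String,
    rows: Vec<String>,
}

/// Assumed marker for deleted content, mirroring the tables above.
const DELETED: &str = "DELETED";

/// Keep every column whose header equals one of `names` exactly;
/// overwrite header and cells of all other columns with the marker.
fn keep_columns_by_name(columns: &mut [Column], names: &[&str]) {
    for col in columns.iter_mut() {
        if !names.contains(&col.header.as_str()) {
            col.header = DELETED.to_owned();
            for cell in col.rows.iter_mut() {
                *cell = DELETED.to_owned();
            }
        }
    }
}

/// Builds the six-column example table from the description (one data row).
fn example_table() -> Vec<Column> {
    [
        "Center x [mm]",
        "Center y [mm]",
        "Center z [mm]",
        "any other string",
        "Center x [mm]",
        "Center x",
    ]
    .iter()
    .enumerate()
    .map(|(i, h)| Column {
        header: h.to_string(),
        rows: vec![(i + 1).to_string()],
    })
    .collect()
}
```

Note that both "Center x [mm]" columns survive, because keeping applies to all matches, while "Center x" (without the unit) is deleted.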

Globbing variants (New)

Use the suffix 'G' on the preprocessor step; then '*', '**' and '?' can be used as wildcards in the list of pattern strings.
Preprocessor steps that support globbing:

  • KeepColumnsByNameG
  • DeleteColumnByNameG

How globbing works for KeepColumnsByNameG:

  • first an exact match of the header is tried (like the non-globbing variant above)
    • if the strings are exactly equal, then the column is kept.
  • if the string from the config doesn't match the header exactly, then it is interpreted as a glob pattern.
    • if the glob matches, the column is kept.
  • if a column in the CSV matches the config pattern neither exactly nor as a glob, then the column is marked as deleted in nominal and actual file (thus ignored).

Note: Square brackets, as in '[um]', also denote a wildcard that matches 'u' OR 'm'. That's why an exact match is tried as well, so that the glob variant always matches (and therefore compares) everything the exact variant would compare, and more.

        - KeepColumnsByNameG:
            - "Center *"

matches in CSV:

  • "Center x [mm]", "Center y [mm]",
  • but also "Center " ( not a wildcard

KeepColumnsByNameG: ... - "Center ? [mm]", however, would not match "Center x [mm]": the two strings are not equal char-by-char, and as a glob the pattern "Center ? [mm]" matches only "Center <any character> m". The glob "Center [xyz] [[]mm[]]" would do what's intended (exact unit, spaces, and x, y, or z at the right place) but may be too complicated to read.
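To make the bracket pitfall concrete, here is a minimal sketch of the "exact match first, then glob" rule. The hand-written `glob_match` is a hypothetical stand-in for whatever glob library the branch actually uses; it only supports `*`, `?` and plain `[...]` sets (no ranges, no negation, no `**`), and `header_matches` is likewise an assumed name.

```rust
/// Exact comparison first; only if that fails is `pattern` treated as a glob.
fn header_matches(pattern: &str, header: &str) -> bool {
    pattern == header || glob_match(pattern, header)
}

/// Tiny illustrative glob matcher: `*` = any run of characters,
/// `?` = exactly one character, `[abc]` = one character out of the set.
fn glob_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    match_from(&p, &t)
}

fn match_from(p: &[char], t: &[char]) -> bool {
    let first = match p.first() {
        Some(&c) => c,
        None => return t.is_empty(), // pattern exhausted: text must be too
    };
    if first == '*' {
        // Let `*` swallow 0..=len characters and try the rest of the pattern.
        return (0..=t.len()).any(|skip| match_from(&p[1..], &t[skip..]));
    }
    let ch = match t.first() {
        Some(&c) => c,
        None => return false, // pattern needs a character, text has none
    };
    match first {
        '?' => match_from(&p[1..], &t[1..]),
        '[' => match class_end(p) {
            Some(end) => p[1..end].contains(&ch) && match_from(&p[end + 1..], &t[1..]),
            // Unclosed '[' is treated as a literal bracket.
            None => ch == '[' && match_from(&p[1..], &t[1..]),
        },
        literal => literal == ch && match_from(&p[1..], &t[1..]),
    }
}

/// Index of the `]` closing the set that opens at p[0].
/// A `]` directly after `[` counts as a member, so the search starts at 2.
fn class_end(p: &[char]) -> Option<usize> {
    (2..p.len()).find(|&i| p[i] == ']')
}
```

With this sketch, `glob_match("Center x [mm]", "Center x [mm]")` is false because `[mm]` is read as a character set, while `header_matches` with the same arguments is true thanks to the exact comparison. That is the reason for trying the exact match before the glob.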

closes #57

@BBertram-hex BBertram-hex self-assigned this Jan 28, 2026
))
})?;

if let Some(c) = table.columns.iter_mut().find(|col| {
Collaborator

Shouldn't we use filter() here? find() only returns the first match, and we want all of them?

extending the unit test might also be a good idea
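For reference, the difference between the two iterator adaptors can be sketched like this. The `Column` struct with a `deleted` flag and the helper names are hypothetical and only serve the illustration; they are not the types used in the branch.

```rust
/// Hypothetical, simplified column: just a header and a deletion flag.
#[derive(Debug)]
struct Column {
    header: String,
    deleted: bool,
}

/// `find` stops at the first matching column, so at most one is deleted.
fn delete_first(columns: &mut [Column], name: &str) {
    if let Some(c) = columns.iter_mut().find(|col| col.header == name) {
        c.deleted = true;
    }
}

/// `filter` yields every matching column, so all of them are deleted.
fn delete_all(columns: &mut [Column], name: &str) {
    for c in columns.iter_mut().filter(|col| col.header == name) {
        c.deleted = true;
    }
}

/// Test helper: build columns from a list of headers.
fn columns_from(headers: &[&str]) -> Vec<Column> {
    headers
        .iter()
        .map(|h| Column { header: h.to_string(), deleted: false })
        .collect()
}

/// Test helper: collect the deletion flags in column order.
fn deleted_flags(columns: &[Column]) -> Vec<bool> {
    columns.iter().map(|c| c.deleted).collect()
}
```

A unit test with a duplicated header (for example "x", "y", "x") distinguishes the two behaviors.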

Collaborator Author

I thought about it for a while and came up with 3 candidate behaviors, to fix the inconsistencies between new filter steps and the "DeleteColumnByName" which already is out there in version 0.8.

Reasoning with pros (+) and cons (-):

- fix_A: change the old behavior of "DeleteColumnByName" so that all matching columns are deleted.
  - (+) only 3 new keywords: DeleteColumnByNameG, KeepColumnsByName, KeepColumnsByNameG.
  - (+) DeleteColumnByNameG consistently deletes all matching columns as well, from the next release on.
  - (-) in edge cases where CSVs have several identical headers, this will silently make tests less sensitive when upgrading havocompare.
  - (-) the singular "Column" somewhat implies that DeleteColumnByName only deletes a single column.

- fix_B: 4 new keywords, all using the plural "Columns"; the new keywords delete/keep all matching columns.
  - deprecate DeleteColumnByName: do not change its detection behavior, but print a warning when it is used, saying that it will be removed in the future.
  - new DeleteColumnsByName: delete all columns matching the string exactly.
  - new DeleteColumnsByNameG: delete all columns matching the string, either exactly or interpreted as a glob pattern.
  - new KeepColumnsByName: keep all columns that match any of the strings in the list exactly.
  - new KeepColumnsByNameG: keep all columns that match any of the strings in the list, either exactly or interpreted as a glob pattern.
  - (+) no compat problem
  - (+) consistent naming (only plural)
  - (-) some debt: DeleteColumnByName has to be removed at some point (when, how?)
  - (+) simpler behavior because delete and keep work similarly, deleting/keeping all columns that match
  - (-) more code

- fix_C: singular keywords "DeleteColumnByName(G)" but plural keywords "KeepColumnsByName(G)"
  - DeleteColumnByName works as in 0.8.
  - DeleteColumnByNameG consistently deletes only the first match (i.e. the first column that matches either exactly or as a glob pattern).
  - several (N) "DeleteColumnByName" steps with the same string should match and delete the first N columns that match the string. N "DeleteColumnByNameG" steps delete the first N columns that match either exactly or as a glob pattern.
  - (+) no compat problem
  - (+) consistent naming, using singular vs. plural
  - (-) more complex behavior because delete and keep work differently, affecting the first vs. all matching columns
  - (-) scenario: I want to make 2 rules, rule_1 for the column "Special" and rule_2 for all the other columns (default). I would use KeepColumnsByNameG "Special ?" in rule_1 and DeleteColumnByNameG "Special ?" in rule_2. But the 2 rules do not complement each other when several "Special " columns occur.
    If (alternatively) Delete + glob also worked on all matches, then the inconsistency would be having "DeleteColumnByName" (singular) and "DeleteColumnsByName" (plural) in the new set of commands.

However:
=> because we want to be on the safe side with a test tool, do not implement fix_A for now.
=> the current state of the branch is closer to fix_C, so let's complete that variant first.
=> implement fix_B later.
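The fix_C semantics and the rule_1/rule_2 problem described above can be sketched on plain header lists. The function names are hypothetical, and for brevity the sketch compares names exactly instead of globbing; it returns the indices of the columns that survive each kind of step.

```rust
/// Keep-all semantics (KeepColumnsByName): every matching column survives.
fn surviving_after_keep_all(headers: &[&str], name: &str) -> Vec<usize> {
    (0..headers.len()).filter(|&i| headers[i] == name).collect()
}

/// fix_C delete semantics: each of the `steps` identical delete steps removes
/// only the first matching column that has not been deleted yet.
fn surviving_after_delete_first(headers: &[&str], name: &str, steps: usize) -> Vec<usize> {
    let mut deleted = vec![false; headers.len()];
    for _ in 0..steps {
        let first = (0..headers.len()).find(|&i| !deleted[i] && headers[i] == name);
        if let Some(i) = first {
            deleted[i] = true;
        }
    }
    (0..headers.len()).filter(|&i| !deleted[i]).collect()
}
```

For the headers "Special", "other", "Special", keep-all leaves columns 0 and 2, while a single delete step leaves columns 1 and 2. Column 2 is then compared by both rules. Only with two delete steps (N = 2) do the rules complement each other, which requires knowing the number of duplicates in advance.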

use crate::csv::{Column, Delimiters, Error};
use std::fs::File;

macro_rules! string_vec {
Collaborator

feels like this is overkill :D ... might use vec!["".to_owned()] or change the arguments in keep_columns_matching_any_names to use Vec<&str>

Collaborator Author

Vec<&str> as an argument introduces a superfluous dependency on the concrete container type Vec, just for testing.
-> clippy also complains

I still have to understand the vec![...] you proposed.
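A small sketch of the options discussed here, assuming a helper that only needs to read the names. `matches_any_name` and `matches_any_name_generic` are hypothetical names, not the signature of keep_columns_matching_any_names in the branch.

```rust
/// Slice parameter: callers can pass `&Vec<String>`, an array, or a sub-slice,
/// so the function does not depend on a specific container.
fn matches_any_name(header: &str, names: &[String]) -> bool {
    names.iter().any(|n| n == header)
}

/// Generic variant: accepts slices of `String` as well as slices of `&str`,
/// which lets tests pass string literals without a helper macro.
fn matches_any_name_generic<S: AsRef<str>>(header: &str, names: &[S]) -> bool {
    names.iter().any(|n| n.as_ref() == header)
}
```

In a test, `vec!["Center x [mm]".to_owned()]` builds an owned `Vec<String>` that can be passed as `&names`, while the generic variant also accepts `&["Center x [mm]", "Center y [mm]"]` directly.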

- exact match of the header string
- other columns will be deleted.
- as fall-back, if CSV header is not exactly the same as the string in the rule
- fixed wrong error string
- clippy -> use slice instead of specific Container
- create a Table from CSV string content for testing
- test/doc current behavior of extract_headers()
- KeepColumnsByNameG
- DeleteColumnByNameG
- TODO: fix ambiguity
- two columns matching the glob should be removed
- but the DELETED marker matches the glob pattern.
@BBertram-hex BBertram-hex force-pushed the 57_csv_preprocessor_whitlisting_columns branch from b992ae0 to de22c83 Compare February 11, 2026 17:35
Development

Successfully merging this pull request may close these issues.

Support whitelisting header names when preprocessing a CSV
