-
Notifications
You must be signed in to change notification settings - Fork 2
57 new preprocessor step for whitlisting of CSV columns #58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| )) | ||
| })?; | ||
|
|
||
| if let Some(c) = table.columns.iter_mut().find(|col| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we use filter() here? find will only catch 1 result, and we want more?
extending the unit test might also be a good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about it for a while and came up with 3 candidate behaviors, to fix the inconsistencies between new filter steps and the "DeleteColumnByName" which already is out there in version 0.8.
Reasoning with pros (+) and with cons (-)
- fix_A: change old behavior of "DeleteColumnByName", so that all matching columns are deleted.
- (+) only 3 new keywords DeleteColumnByNameG, KeepColumnsByName, KeepColumnsByNameG.
- (+) the DeleteColumnByNameG consistently deletes all matching columns as well from next release on.
- (-) in edge cases where CSVs have several identical headers, this will silently make tests less sensitive when upgrading havocompare.
- (-) singular "Column" somewhat implies that DeleteColumnByName only deletes a single column.
- fix_B: 4 new keywords, all using plural "Columns", the new keywords delete/keep all matching columns.
- deprecate DeleteColumnByName, do not change its detection behavior, but print a "warning" info when it is used, that it will be deleted in the future.
- new DeleteColumnsByName: delete all columns matching the string exactly.
- new DeleteColumnsByNameG: delete all columns matching the string, either exactly or interpreted as a glob pattern.
- new KeepColumnsByName: keep all columns that match any of the strings in the list exactly
- new KeepColumnsByName: keep all columns that match any of the strings in the list, either exactly or interpreted as a glob pattern.
- (+) no compat problem
- (+) consistent naming (only plural)
- (-) some dept, removing the DeleteColumnByName at some point (when, how?)
- (+) simpler behavior because delete and keep work similar, deleting/keeping all columns that match
- (-) more code
- fix_C: singlular keywords "DeleteColumnByName(G)" but plural keywors for "KeepColumnsByName(G)"
- DeleteColumnByName works as in 0.8,
- DeleteColumnByNameG consistently only deletes the first match (i.e. the first column that matches either way, exactly or the glob pattern).
- several (N) steps "DeleteColumnByName" with the same string should match and delete the first N columns that match the string. N steps "DeleteColumnByNameG" delete the first N columns that match either exactly or the glob pattern.
- (+) no compat problem
- (+) consistent naming, using singular vs plural
- (-) more complex behavior because delete and keep work differently, affecting the first vs all columns
- (-) scenario where I want to make 2 rules, rule_1 for column "Special" and 1 for all the other columns (default) I would use KeepColumnsByNameG "Special ?" in rule_1 and DeleteColumnByNameG "Special ?" in rule_2. But the 2 rules do not complement in the case, where several "Special " columns can occur.
If (alternatively) Delete + glob would also work on all matches, then the inconsistency would be to have
"DeleteColumnByName" (singular) and "DeleteColumnsByName" (plural) in the new set of commands.
However:
=> because we want to be on the safe side with test tool, do not implement fix_A for now.
=> the current state of the branch is closer to fix_C, so lets complete that variant first.
=> implement fix_B later.
| use crate::csv::{Column, Delimiters, Error}; | ||
| use std::fs::File; | ||
|
|
||
| macro_rules! string_vec { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feels like this is overkill :D ... might use vec!["".to_owned()] or change the arguments in keep_columns_matching_any_names to use Vec<&str>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vec<&str> as an argument introduces a superfluous class dependency to Vec just because for testing.
-> clippy also complains
Have to understand the vec![...] you proposed
- exact match of the header string - other columns will be deleted.
- as fall-back, if CSV header is not exactly the same as the string in the rule
- fixed wrong error string - clippy -> use slice instead of specific Container
- create a Table from CSV string content for testing - test/doc current behavior of extract_headers()
- KeepColumnsByNameG - DeleteColumnByNameG - TODO: fix ambiguity
- two columns matching the glob should be removed - but the DELETED marker matches the glob pattern.
b992ae0 to
de22c83
Compare
Introduce new Keywords in the havocompare config file:
DeleteColumnByName: "" (without the 'G') already exists as a preprocessor step. It does an exact match that deletes the first column from both actual and nominal csv files, whos header exactly matches . So later at the comparison, such columns are treated as if they were never in the actual.csv or nominal.csv (or both).
KeepColumnsByName
KeepColumnByName has a list of strings that are exactly compared to the extracted table headers. Only those headers that have at least one exact match from that list, are kept, e.g.
Applied to a CSV like
would delete all matches to any of the strings in the config:
Globbing variants (New)
use suffix 'G' in the preprocessor step, then '*', '**', '?' can be used in the list of pattern strings as wildcards.
Preprocessor steps, that support globbing:
How globbing with
Note: Square brackets, like '[um]' also denote a wildcard, that matches 'u' OR 'm' => Thats why an exact match is also tried, so that the glob variant always matches (and therefore compares) what the exact variant would compare and more.
matches in CSV:
KeepColumnsByNameG: ... - "Center ? [mm]"however would not match "Center x [mm]" because it does not match char-by-char and the pattern"Center ? [mm]"matches only"Center <any character> m". The glob"Center [xyz] [[]mm[]]"` would be doing whats intended (exact unit, spaces and x, y, or z at the right place) but maybe too complicated to read.closes #57