-
Notifications
You must be signed in to change notification settings - Fork 2
DX-902: Add a non-ASCII guard with documentation and tests #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: adtran
Are you sure you want to change the base?
Conversation
rzuckerm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to discuss this when Trevor gets back
|
This PR needs to be made against the |
…date tests accordingly
1c1fed6 to
490d43f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, although I'd like to see some comments in the code that detail in plain english what each of the ASCII values are, like:
# ASCII values correspond to:
# 0x09 - horizontal tab, \t
# 0x0A - carriage return, \r
# 0x0D - newline, \n
# 0x20-0x7E - printable characters: letters (upper, lower), digits (0-9), punctuation (!, @, #, $. %, ^, &), and space
You could break it down more, if you like, but my main concern was that I knew some special values are control characters, which should not be allowed, but I could not remember what they were without looking them up.
A demo would be nice too. 😄
| - `ascii-only`: allow only tab/lf/cr and `0x20-0x7E`; block everything else. | ||
| - Examples: `--mode balanced` (default); `--mode visible-plus` (allow 😀, 🚀, etc. but still block zero-width joiners); `--mode ascii-only` (paranoid mode, blocks all non-ASCII). | ||
| - `--include-range RANGE` - override allowed byte ranges (comma-separated, decimal or hex, supports `START-END`). Can be repeated. | ||
| - Examples: `--include-range 0x09,0x0A,0x0D,0x20-0x7E` (default printable ASCII); `--include-range 0-255` (allow all bytes); `--include-range 0x20-0x7E,0xA0` (allow NBSP too). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming that 0-255 means Unicode 0 to 255. Otherwise, this would basically disable this checker.
| print( | ||
| f'Fixing {filename}: ' f'disallowed bytes {_format_offenders(offenders)}', | ||
| ) | ||
| retv = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be removed since this is not the check_only case?
| install_requires = | ||
| ruamel.yaml>=0.15 | ||
| grapheme>=0.6.0 | ||
| tomli>=1.1.0;python_version<"3.11" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be removed since that is handled in the requirements-dev.txt.
| ch = chr(b) | ||
| # use repr to surface escapes for backslash/quote while staying ASCII | ||
| ch_repr = repr(ch)[1:-1] | ||
| return f"0x{b:02x}('{ch_repr}')@{pos}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Showing the byte number does not seem like the best way to handle this. It seems like there should be a line number and character number instead.
| continue | ||
|
|
||
| if args.check_only: | ||
| print(f'{filename}: disallowed bytes {_format_offenders(offenders)}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be shown regardless of check_only?
|
|
||
|
|
||
| def test_visible_plus_blocks_bidi_in_cluster(tmp_path, capsys): | ||
| """Test line 107: bidi override check in _cluster_allowed_visible_plus""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please either remove this docstring. It is unnecessary, and the line number will change when the code changes. The same goes for all other tests with docstrings.
Summary
detect-non-ascii-charactershook to prevent prompt injection attacks via malicious Unicode contentbalanced(default, allows ASCII + Latin-1),visible-plus(allows ASCII + emoji), andascii-only(strict)grapheme>=0.6.0to properly handle emoji clusters and ZWJ sequences--include-range,--allow-chars,--files-glob,--files-include,--files-exclude, and--check-onlyoptions