Skip to content

Conversation

@azeinela
Copy link
Collaborator

@azeinela azeinela commented Dec 15, 2025

Summary

  • add a security-focused detect-non-ascii-characters hook to prevent prompt injection attacks via malicious Unicode content
  • implement three configurable modes: balanced (default, allows ASCII + Latin-1), visible-plus (allows ASCII + emoji), and ascii-only (strict)
  • add grapheme-aware scanning using grapheme>=0.6.0 to properly handle emoji clusters and ZWJ sequences
  • support flexible filtering via --include-range, --allow-chars, --files-glob, --files-include, --files-exclude, and --check-only options
  • document the new hook with usage examples and register its console script entry

Copy link
Collaborator

@rzuckerm rzuckerm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to discuss this when Trevor gets back

@rzuckerm rzuckerm requested a review from trevorbowen January 21, 2026 21:30
@azeinela azeinela changed the base branch from main to adtran January 21, 2026 21:45
@azeinela azeinela changed the base branch from adtran to main January 21, 2026 21:51
@rzuckerm
Copy link
Collaborator

This PR needs to be made against the adtran branch, not main.

@azeinela azeinela changed the base branch from main to adtran January 21, 2026 21:54
Copy link
Collaborator

@trevorbowen trevorbowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, although I'd like to see some comments in the code that detail in plain english what each of the ASCII values are, like:

# ASCII values correspond to:
#   0x09 - horizontal tab, \t
#   0x0A - carriage return, \r
#   0x0D - newline, \n
#   0x20-0x7E - printable characters: letters (upper, lower), digits (0-9), punctuation (!, @, #, $. %, ^, &), and space

You could break it down more, if you like, but my main concern was that I knew some special values are control characters, which should not be allowed, but I could not remember what they were without looking them up.


A demo would be nice too. 😄

- `ascii-only`: allow only tab/lf/cr and `0x20-0x7E`; block everything else.
- Examples: `--mode balanced` (default); `--mode visible-plus` (allow 😀, 🚀, etc. but still block zero-width joiners); `--mode ascii-only` (paranoid mode, blocks all non-ASCII).
- `--include-range RANGE` - override allowed byte ranges (comma-separated, decimal or hex, supports `START-END`). Can be repeated.
- Examples: `--include-range 0x09,0x0A,0x0D,0x20-0x7E` (default printable ASCII); `--include-range 0-255` (allow all bytes); `--include-range 0x20-0x7E,0xA0` (allow NBSP too).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming that 0-255 means Unicode 0 to 255. Otherwise, this would basically disable this checker.

print(
f'Fixing {filename}: ' f'disallowed bytes {_format_offenders(offenders)}',
)
retv = 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be removed since this is not the check_only case?

Comment on lines +21 to +24
install_requires =
ruamel.yaml>=0.15
grapheme>=0.6.0
tomli>=1.1.0;python_version<"3.11"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be removed since that is handled in the requirements-dev.txt.

ch = chr(b)
# use repr to surface escapes for backslash/quote while staying ASCII
ch_repr = repr(ch)[1:-1]
return f"0x{b:02x}('{ch_repr}')@{pos}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Showing the byte number does not seem like the best way to handle this. It seems like there should be a line number and character number instead.

continue

if args.check_only:
print(f'{filename}: disallowed bytes {_format_offenders(offenders)}')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be shown regardless of check_only?



def test_visible_plus_blocks_bidi_in_cluster(tmp_path, capsys):
"""Test line 107: bidi override check in _cluster_allowed_visible_plus"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please either remove this docstring. It is unnecessary, and the line number will change when the code changes. The same goes for all other tests with docstrings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants