@paolopas paolopas commented Jan 7, 2026

Of the 13 regexps in build/toolset.jam (a number they say brings bad luck) I would keep only 3 as they are. In reality there are only 6 distinct regexps, and only 2 of them are free of defects.
Now I'll show you how fragile a regexp can be, especially when you don't look at it critically or, worse yet, don't test it enough.
Furthermore, the regex engine used by JAM is certainly not without flaws (no implementation is), and you need to be aware of them to avoid designing a regular expression that doesn't work the way you want.

In build/toolset.jam these regexps all involve manipulating flags or module/rule identifiers.

".*([.]).*"

We find it on lines 96 and 162. You don't need to search the git log to understand that it is used to test whether a value contains the "." character.

Since the pattern is unanchored (it contains neither ^ nor $) and captures only the ".", it makes no difference what comes before or after it, so the right way to write this regexp is simply

"([.])" or "(\\.)"

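A quick way to convince yourself is to run both patterns over a few values. The sketch below uses Python's re module as a stand-in, on the assumption that JAM's engine treats such a simple pattern the same way. re.search plays the role of the unanchored match, and the sample values are invented for illustration.

```python
import re

# Assumption: Python's re stands in for JAM's regex engine here.
LONG = re.compile(r".*([.]).*")
SHORT = re.compile(r"([.])")

def captures(pattern, value):
    """Return the captured groups, or None when the value does not match."""
    m = pattern.search(value)
    return m.groups() if m else None

# Hypothetical sample values, with and without a dot.
samples = ["gcc.compile", "a.b.c", ".", "nodot", ""]

for value in samples:
    # Both patterns agree on whether there is a match and on what is captured.
    assert captures(LONG, value) == captures(SHORT, value)
    print(repr(value), "->", captures(SHORT, value))
```
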
"([^.]*).*"

On lines 108, 174 and 211. This time a look at the code helps: the pattern is applied to values that are already known to contain a "." and it extracts the field that precedes it.

Here too, as in the previous case, there is no need for the .* at the end of the pattern. So the correct expression is

"([^.]*)"

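The same kind of check works here. Again this is only a sketch in Python's re module, assumed to behave like JAM's engine for this pattern. The helper name first_field and the sample values are made up for the example.

```python
import re

# Assumption: Python's re stands in for JAM's regex engine here.
LONG = re.compile(r"([^.]*).*")
SHORT = re.compile(r"([^.]*)")

def first_field(pattern, value):
    """Return group 1 of the first unanchored match.

    Both patterns can match the empty string, so search() never returns None.
    """
    return pattern.search(value).group(1)

# Hypothetical sample values.
samples = ["gcc.compile.c++", "toolset.flags", ".leading", "nodot"]

for value in samples:
    # Dropping the trailing .* does not change what is captured.
    assert first_field(LONG, value) == first_field(SHORT, value)
    print(repr(value), "->", repr(first_field(SHORT, value)))
```
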
"[^.]*\.(.*)"

On lines 621, 649. This is used to extract the field after the first "." of the value.

Since the first part of the pattern greedily matches the longest possible run of characters other than ".", the next character, if there is one, can only be a ".". The backslash is therefore not only badly written (it should have been \\) but also useless, and the correct version is simply the one without the \, i.e.

"[^.]*.(.*)"

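To compare the escaped and unescaped forms, here is a sketch in Python's re module, assumed here to be a reasonable stand-in for JAM's engine. The helper name rest and the sample values are invented. For values that contain a dot the two patterns capture the same thing. In Python, at least, they part ways only on a value with no dot at all, a case the reasoning above does not cover.

```python
import re

# Assumption: Python's re stands in for JAM's regex engine here.
ESCAPED = re.compile(r"[^.]*\.(.*)")   # the intended form, with an escaped dot
PLAIN = re.compile(r"[^.]*.(.*)")      # the form without the backslash

def rest(pattern, value):
    """Return what follows the first dot, or None when there is no match."""
    m = pattern.search(value)
    return m.group(1) if m else None

# Hypothetical values that all contain at least one dot.
for value in ["gcc.compile.c++", "toolset.flags", ".leading", "trailing."]:
    assert rest(ESCAPED, value) == rest(PLAIN, value)
    print(repr(value), "->", repr(rest(PLAIN, value)))

# Without a dot the unescaped form backtracks and still matches in Python.
assert rest(ESCAPED, "nodot") is None
assert rest(PLAIN, "nodot") == ""
```
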
"^(.+)\\.([^\\.])*"

Lines 379 and 423. Here the comment in the code comes to our aid:

# Strip away last dot separated part and recurse.

So does the command git log -L 379,380:src/build/toolset.jam, which shows that the previous version of the pattern was

^(.+)\\..*

later replaced in commit 58f0dbb.

This helps us understand what happened to the poor pattern. Unlike the previous cases, here the pattern explicitly requires at least one character before the ".". The problem is what it wants to find after the ".".
In the original version, .* was fine, but it was then converted to ([^\\.])* which doesn't do the same thing at all.
Inside a character class, special characters lose their special meaning, so there is no need to escape the "."; in fact, the escape only drags the \ into the class, where it is not wanted. More insidious still is that the * sits outside the capturing group rather than inside it: the group matches a single character per repetition, so it can never hold the whole field. In my opinion this is the biggest flaw in this case.
In conclusion, the correct version of this last expression is

"^(.+)\\.([^.]*)"

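The difference between the three versions is easy to see on a sample value. This is a sketch in Python's re module, not in JAM. As an assumption made for the example, the defective class is written [^\\.] so that in Python it excludes both the backslash and the dot. Which repetition a repeated group retains is engine-specific, and the result shown is Python's. The sample values are hypothetical.

```python
import re

# Assumption: Python's re stands in for JAM's regex engine here.
OLD = re.compile(r"^(.+)\..*")             # version before commit 58f0dbb
BROKEN = re.compile(r"^(.+)\.([^\\.])*")   # class excludes "\" and ".", * outside the group
FIXED = re.compile(r"^(.+)\.([^.]*)")      # proposed version

value = "toolset.gcc.compile"  # hypothetical dotted identifier

# All three strip the last dot-separated part into group 1.
assert OLD.search(value).groups() == ("toolset.gcc",)

# With * outside the group, group 2 holds a single character
# (in Python, the one matched by the last repetition).
assert BROKEN.search(value).groups() == ("toolset.gcc", "e")

# With * inside the group, group 2 holds the whole last field.
assert FIXED.search(value).groups() == ("toolset.gcc", "compile")

# A backslash in the last field stops the defective class early.
tricky = "a.b\\c"  # the five characters a . b \ c
assert BROKEN.search(tricky).group(2) == "b"
assert FIXED.search(tricky).group(2) == "b\\c"
```
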
There are at least 350 more to check. What do you say we try with artificial intelligence?

@grafikrobot grafikrobot merged commit 5dcb3c8 into bfgroup:main Jan 9, 2026
109 of 110 checks passed
@paolopas paolopas deleted the irregular-1 branch January 9, 2026 05:16