build/toolset.jam regexp review #493
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Of the 13 (they say bring bad luck) regexps in
build/toolset.jamI only save 3. In reality there are only 6 different regexps but only 2 are free of defects.And now I'll show you how fragile a regexp can be. Especially when you don't look at it critically, or worse yet, don't test it enough.
Furthermore, the regex used by JAM is certainly not without flaws (no implementation is), and you need to be aware of them to avoid designing a regular expression that doesn't work as you want.
In
build/toolset.jamthese regexps all involve manipulating flags or module/rule identifiers.".*([.]).*"We find it on lines 96, 162. You don't need to search the git log to understand that they are used to find if there is the "." character in a value.
Since the pattern is unanchored (doesn't contain ^ or $) and only catches the "." it makes no difference what comes before or after, so the right way to write this regexp is simply
"([.])"or(\\.)"([^.]*).*"On lines 108, 174, 211. This time a look at the code helps to understand that it is used on values that are already known to contain the "." and we want to extract the field that precedes it.
Here too, as in the previous case, there is no need for the .* at the end of the pattern. So the correct expression is
"([^.]*)""[^.]*\.(.*)"On lines 621, 649. This is used to extract the field after the first "." of the value.
Now since the first part of the pattern hooks onto the longest possible part that does not contain the "." after this there can only be a ".". The backslash is therefore not only badly written (it should have been
\\) but also useless and the correct version is simply the one without the \ i.e."[^.]*.(.*)""^(.+)\\.([^\\.])*"Lines 379, 423. Here the commentary comes to our aid.
# Strip away last dot separated part and recurse.but also the command
git log -L 379,380:src/build/toolset.jamwhich shows us that the previously used version of the pattern was^(.+)\\..*then replaced in commit 58f0dbb
This helps us understand what happened to the poor pattern. Unlike the previous cases, here the pattern explicitly requires that there be at least one character before the "." But the problem is what it wants to find after the ".".
In the original version,
.*was fine, but then everything was converted to([^\\.])*which doesn't do the same thing at all.When placed in a class, special characters lose their status and therefore there is no need to escape the "."; in fact, this causes the unwanted capture of the
\. But more insidious is the fact that the * is outside the captured group rather than inside it, which in my opinion represents the biggest flaw in this case.In conclusion, the correct version of this last expression is
"^(.+)\\.([^.]*)"There are at least 350 more to check, what do you say we try with artificial intelligence?