-
Notifications
You must be signed in to change notification settings - Fork 4
Open
Description
The gist of this idea is that with some package repositories we have links to the source code used to build the binary packages. And we can leverage our work with source code parsing to automatically extract features that should appear in the compiled binary, and possibly use AI to automatically create regular expressions for recognizing version strings.
Potential issues:
- Identifying good strings (or other data in binaries) -- what is unique, and ideally has version identifiers that will work cross-platform
- Symbol names could be interesting, but need good demangler that can handle multiple decompilers (this might be more useful for a check once we think we know a particular library is present to confirm that there is actual code in a file)
- Having a way to efficiently check thousands of patterns (10k+ minimum, potentially 100k+)
- A fast pattern matching implementation was tested in Surfactant -- it is able to recognize 15k+ patterns in around 5.9 sec, and 100k+ in about 35 seconds. Most of the increase in time is building the Aho-Corasick automaton, which could be optimized/cached for subsequent runs. For comparison checking 15k+ regexes one at a time took 5min 18sec (100k+ was not tested because it would be over 30min).
Metadata
Metadata
Assignees
Labels
No labels