Evaluate feasibility of automated file content pattern creation

The gist of this idea is that with some package repositories we have links to the source code used to build the binary packages. And we can leverage our work with source code parsing to automatically extract features that should appear in the compiled binary, and possibly use AI to automatically create regular expressions for recognizing version strings.

Potential issues:
* Identifying good strings (or other data in binaries) -- what is unique, and ideally has version identifiers that will work cross-platform
* Symbol names could be interesting, but need good demangler that can handle multiple decompilers (this might be more useful for a check once we think we know a particular library is present to confirm that there is actual code in a file)
* Having a way to efficiently check thousands of patterns (10k+ minimum, potentially 100k+)
  - A fast pattern matching implementation was tested in Surfactant -- it is able to recognize 15k+ patterns in around 5.9 sec, and 100k+ in about 35 seconds. Most of the increase in time is building the Aho-Corasick automaton, which could be optimized/cached for subsequent runs. For comparison checking 15k+ regexes one at a time took 5min 18sec (100k+ was not tested because it would be over 30min).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Evaluate feasibility of automated file content pattern creation #162

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Evaluate feasibility of automated file content pattern creation #162

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions