Evaluate feasibility of automated file content pattern creation #162

@nightlark

The gist of this idea is that some package repositories link to the source code used to build their binary packages. We can leverage our source code parsing work to automatically extract features that should appear in the compiled binary, and possibly use AI to automatically generate regular expressions for recognizing version strings.
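As a rough illustration of the non-AI part of this idea, the sketch below pulls double-quoted string literals out of C-like source text and turns any literal containing a dotted version number into a regex with the version as a capture group. Everything here is an assumption for illustration: the function name `version_patterns`, the sample `examplelib` banner, and the simplification that string literals are plain double-quoted strings with no macro expansion or concatenation. It is not how Surfactant parses source code.

```python
import re

# Simple double-quoted C string literals (assumption: no macros, no concatenation).
STRING_LITERAL = re.compile(r'"((?:[^"\\\n]|\\.)*)"')
# A dotted version number such as 1.2 or 1.2.11.
VERSION = re.compile(r"\d+(?:\.\d+)+")
VERSION_GROUP = r"(\d+(?:\.\d+)+)"


def version_patterns(source: str) -> list[str]:
    """Return one regex per string literal that embeds a version number."""
    patterns = []
    for literal in STRING_LITERAL.findall(source):
        if not VERSION.search(literal):
            continue
        # Escape the fixed text around each version, then generalize the version itself.
        fixed_parts = VERSION.split(literal)
        patterns.append(VERSION_GROUP.join(re.escape(part) for part in fixed_parts))
    return patterns


# Hypothetical source snippet; "examplelib" is not a real package.
sample = 'const char *banner = "examplelib 2.4.1 release";\nconst char *name = "hello";'
for pattern in version_patterns(sample):
    print(pattern)
```

A pattern generated from one release can then be tried against strings from binaries built from other releases. Whether the fixed text around the version is unique enough to identify the library is the open question raised under the potential issues.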

Potential issues:

  • Identifying good strings (or other data in binaries) -- what is unique, and ideally has version identifiers that will work cross-platform
  • Symbol names could be interesting, but need a good demangler that can handle multiple compilers (this might be more useful as a confirmation check, once we think a particular library is present, that there is actual code in the file)
  • Having a way to efficiently check thousands of patterns (10k+ minimum, potentially 100k+)
    • A fast pattern matching implementation was tested in Surfactant: it recognizes 15k+ patterns in around 5.9 seconds, and 100k+ in about 35 seconds. Most of the increase in time comes from building the Aho-Corasick automaton, which could be optimized or cached for subsequent runs. For comparison, checking 15k+ regexes one at a time took 5 minutes 18 seconds (100k+ was not tested because it would take over 30 minutes).
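For readers unfamiliar with Aho-Corasick, here is a minimal pure-Python sketch of the technique: all literal patterns are compiled into one automaton, and the input is scanned once regardless of how many patterns there are. This is not the implementation tested in Surfactant. The function names `build_automaton` and `find_all` are made up for this example, and it handles literal byte strings only, so regex patterns would need a literal fragment extracted first (an assumption about how the two could be combined).

```python
from collections import deque


def build_automaton(patterns):
    """Build goto/fail/output tables for a list of bytes patterns."""
    goto, fail, out = [{}], [0], [[]]  # state 0 is the root
    for index, pattern in enumerate(patterns):
        state = 0
        for byte in pattern:
            nxt = goto[state].get(byte)
            if nxt is None:
                nxt = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][byte] = nxt
            state = nxt
        out[state].append(index)

    # Breadth-first pass: compute failure links and merge outputs along them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for byte, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and byte not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(byte, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(data, automaton):
    """Return (end_offset, pattern_index) for every match in data."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for pos, byte in enumerate(data):
        while state and byte not in goto[state]:
            state = fail[state]
        state = goto[state].get(byte, 0)
        for index in out[state]:
            hits.append((pos, index))
    return hits


patterns = [b"he", b"she", b"his", b"hers"]
automaton = build_automaton(patterns)  # the costly step; a candidate for caching
print(sorted(find_all(b"ushers", automaton)))
```

Because the automaton depends only on the pattern set, it could be built once and serialized (for example with `pickle`) so that subsequent runs skip the construction cost noted above.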