Conversation

@jonstewart (Collaborator)

This is currently a draft. The idea is to link against the new yara-x library with its C API (https://virustotal.github.io/yara-x/docs/api/c/c-/) and then run rules against 4KB-sized chunks contained in each sbuf. Like lightgrep, yara invokes a callback when there's a match, and the name of the matching rule is written out to the feature recorder along with the pos_t of the 4KB block (and, oops, just realized I have a bug here, oh well, it's a draft).
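
In outline, the mechanism looks something like the sketch below. This is a minimal illustration, not the actual scanner code, and it assumes yara-x C API entry points along the lines of yrx_scanner_create, yrx_scanner_on_matching_rule, yrx_scanner_scan, and yrx_rule_identifier (check yara_x.h for the exact names and signatures):

    #include <yara_x.h>   // yara-x C API header
    #include <cstdint>
    #include <cstdio>

    // Invoked by yara-x for each rule that matches a chunk. user_data points
    // at the current chunk offset so a real scanner could turn it into a pos_t
    // for the feature recorder; here we just print it.
    static void on_match(const YRX_RULE* rule, void* user_data) {
        const size_t chunk_offset = *static_cast<const size_t*>(user_data);
        const uint8_t* ident = nullptr;
        size_t ident_len = 0;
        yrx_rule_identifier(rule, &ident, &ident_len);
        std::printf("rule %.*s matched at offset %zu\n",
                    static_cast<int>(ident_len), ident, chunk_offset);
    }

    // Scan a buffer in fixed 4KB chunks, as the draft scanner does.
    static void scan_chunks(const YRX_RULES* rules, const uint8_t* data, size_t len) {
        constexpr size_t CHUNK = 4096;
        YRX_SCANNER* scanner = nullptr;
        if (yrx_scanner_create(rules, &scanner) != 0) return; // assume 0 == success
        size_t offset = 0;
        yrx_scanner_on_matching_rule(scanner, on_match, &offset);
        for (; offset < len; offset += CHUNK) {
            const size_t n = (len - offset < CHUNK) ? (len - offset) : CHUNK;
            yrx_scanner_scan(scanner, data + offset, n);
        }
        yrx_scanner_destroy(scanner);
    }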

What's left to do:

  • take yara rules location from a CLI arg
  • load yara rules from the CLI arg (finding all .yar files recursively if the path is a directory)
  • fix pos_t bug above
  • some unit testing
  • more error-handling
  • outputting the "namespace" of a matching rule along with its identifier
  • outputting some more context??
  • adding an option to change the block size from 4KB?? (an old Python script called page_brute used 4KB)
  • documentation??????

…ed rule that is unlikely to generate matches and it also does not use feature-recorders to record any matches. But it does get called and pass tests with the address_sanitizer. I will need to think of some ways to unit test this.
…s with the block/forensic path of the data block that triggered the rule.
@jonstewart jonstewart marked this pull request as draft March 22, 2025 21:21
@jonstewart (Collaborator, Author)

Also, the yara_x.h header has a bug. It's valid C, but it uses a C++ reserved keyword as a parameter name in a function declaration. It's trivial to fix, but yara-x does not work out of the box from C++ at present. I reported the issue to the project.
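
A hypothetical illustration of this class of problem (not the actual yara_x.h declaration): a parameter named with a C++ keyword compiles fine as C but breaks any C++ translation unit that includes the header.

    /* Hypothetical example only -- not the actual yara_x.h contents. */
    /* Valid C, but `new` is a keyword in C++, so this line fails to
       compile when the header is included from C++ code: */
    int yrx_example(struct YRX_RULES* rules, int new);

    /* The trivial fix is renaming the parameter (or dropping the name): */
    int yrx_example(struct YRX_RULES* rules, int count);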

else
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
fi
fi
@jonstewart (Collaborator, Author)

I want to consolidate this pkg-config block with the lightgrep one above it, to avoid duplication.

@simsong (Owner)

simsong commented Mar 22, 2025 via email

@jonstewart (Collaborator, Author)

> Why 4kb chunks and not 16kb or even 64kb?

Fair question. Yara rules are designed to be run against files, not raw data. The rule language reminds me of a certain infamous Richard Gabriel essay... A typical yara rule is propelled by multi-pattern string search, but then has conditions that evaluate any discovered matches. Those conditions may contain assertions about matches being at various offsets, about byte values at various offsets (independent of string/pattern search), and about file sizes.

It seems common practice for yara rules to assert byte values at/around offset 0 as a poor man's file signature/magic matching routine (there are many rules asserting that the file starts with "MZ" so as to apply only to Windows executables). It also seems like common practice to use file size to exclude rule matches on very large files (i.e., most assertions are size < X).

So, there's a fundamental mismatch here between yara rules (file-oriented) and bulk_extractor (data-oriented). I think choosing the relatively small 4KB size offers the best chance of discovering potential matches as it's the typical sector size and memory page size. However, it could be that some rules making assertions about file signatures are also expecting their pattern matches much further into the file. We may need to run the rules duplicatively in ~1MB chunks, too, to stand the best chance of triggering rule matches.

I hope to get a better idea of these tradeoffs by analyzing a few large open source yara rulebases.

I believe it is also possible to get yara to report the raw pattern matches in data, even if the rules don't match. A separate feature recorder can be made for these, so that even if a rule doesn't match (due to condition assertions) a user can inspect the pattern matches and potentially discover fragments of malware.

…st the much larger bulk_extractor page offset (typically 16MB).
@simsong (Owner)

simsong commented Mar 23, 2025

> Why 4kb chunks and not 16kb or even 64kb?
>
> Fair question. Yara rules are designed to be run against files, not raw data. […]

1. I think that you should make the page size a parameter so it can be tuned at runtime.
2. bulk_extractor's feature recorder system already de-duplicates features that are reported twice; this is what the entire "margin" system is about.
3. Did you look at my implementation for the legacy regular expression matching system? It does most of what you are describing above: breaking the image into chunks to give to the RE engine.

@jonstewart (Collaborator, Author)

jonstewart commented Mar 23, 2025 via email

@simsong (Owner)

simsong commented Mar 23, 2025

> I don't disagree about the need to specify a different block size at runtime. One question: is there a way with a feature recorder to provide a string (for example, the associated rule name with a pattern match) but have b_e carve surrounding context automatically?

Yes. Feature recorders can be set up to automatically carve when features are detected.
See:
https://github.com/simsong/be20_api/blob/872be6233f97db650c85ea13f379e4b9d12bad2b/feature_recorder.h#L70

    /**
     * Carving support.
     *
     * Carving writes the filename to the feature file and the contents to a file in the directory.
     * The second field of the feature file is the file's hash using the provided function.
     * Automatically de-duplicates.
     * Carving is implemented in the abstract class so all feature recorders have access to it.
     * Carve mode parameters were previously set in the INIT phase of each of the scanners. Now it is set for
     * all feature recorders in the scanner_set after the scanners are initialized and the feature recorders are created.
     * See scanner_set::apply_scanner_commands
     */
    enum carve_mode_t {
        CARVE_NONE = 0,    // don't carve at all, even if the carve function is called
        CARVE_ENCODED = 1, // only carve if the data being carved is encoded (e.g. BASE64 or GZIP is in path)
        CARVE_ALL = 2      // carve whenever the carve function is called.
    } default_carve_mode {CARVE_ALL};
    size_t min_carve_size {200};
    size_t max_carve_size {16*1024*1024};

However, you may need more flexibility than these three options.

@jonstewart (Collaborator, Author)

Sorry, "carve" is an overloaded word. I simply mean the context extraction in the third column. I'm not sure that would be useful, though.

@simsong (Owner)

simsong commented Mar 23, 2025

Sorry, "carve" is an overloaded word. I simply mean the context extraction in the third column. I'm not sure that would be useful, though.

It's the file format.

…yar files containing rules to be loaded. Uses these for scanning 4KB chunks at a time. Added a semi-duplicative second scanning pass over the sbuf in one go. Will need to check out various rule bases and see what makes the most sense (and an option for control). But I was able to get a number of hits on known malware.
@jonstewart (Collaborator, Author)

This will need better error handling and reporting. Probably the most likely use case is for users to point it at a large repository of .yar files, each containing one to many yara rules; some of those will have syntax errors and will not get included.
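
One way to keep a single bad file from sinking the whole set is to compile each .yar file separately and report failures individually. A sketch, assuming yrx_compile and yrx_last_error from the yara-x C API (check yara_x.h for exact signatures); a real implementation would need one scanner per compiled rule set, or to concatenate the good sources:

    #include <yara_x.h>
    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <vector>

    // Recursively find .yar files, compiling each one on its own so that a
    // syntax error is reported per-file instead of failing the entire load.
    std::vector<YRX_RULES*> load_rules(const std::filesystem::path& dir) {
        std::vector<YRX_RULES*> good;
        for (auto const& ent : std::filesystem::recursive_directory_iterator(dir)) {
            if (ent.path().extension() != ".yar") continue;
            std::ifstream in(ent.path());
            std::ostringstream src;
            src << in.rdbuf();
            YRX_RULES* rules = nullptr;
            if (yrx_compile(src.str().c_str(), &rules) == 0) { // assume 0 == success
                good.push_back(rules);
            } else {
                std::cerr << ent.path() << ": " << yrx_last_error() << "\n";
            }
        }
        return good;
    }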

@scudette

I just wanted to add some context here which might be useful. In Velociraptor we have a similar problem: file access is abstracted via an accessor (e.g., we read the file from the NTFS parser, or from memory, etc.), but a lot of Yara rules rely on matches that are very far apart.

There does not seem to be a way to tell Yara "this is a reader object, please use that to read the file." The library operates either on a buffer or on a file. In the case of a file, the library mmaps the entire file, which allows it to efficiently match strings that are very far apart.

We have resorted to a shortcut: when we scan files that are real, we delegate them to the library for mmap; otherwise many rules that are expected to match don't.

I agree that a rule has to be written with the context of the scan in mind. People try to use rules intended for files on memory, and that fails in exactly the same way as described above.

But I think people don't really want to think about it; they just want to throw random rules at the sample and hope it works.

We have implemented buffer size choice as well, for cases when we need to parse in buffers, but it's actually not that useful, because people don't know their rules in advance and don't really understand what that choice means:

https://docs.velociraptor.app/vql_reference/parsers/yara/

@jonstewart (Collaborator, Author)

Thank you, scudette. I agree that most users will want to point this at a directory of rules they've gotten from other sources, and not explore buffer sizes, etc. The present implementation first scans in separate 4KB chunks and then does a second scan over the entire 16MB chunk (with an additional 1MB of shingle).
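
Building on the scan_chunks() sketch earlier in this thread, the two-pass scheme looks roughly like this (a sketch with hypothetical helper names; sizes are the draft defaults mentioned above):

    // Pass 1: aligned 4KB chunks (catches offset-0 / signature-style rules).
    // Pass 2: one scan over the whole sbuf, whose buffer already extends 1MB
    // past the 16MB page (the shingle/margin), so widely separated patterns
    // within a page can still satisfy a rule's condition.
    void scan_two_pass(const YRX_RULES* rules, const uint8_t* buf, size_t len) {
        scan_chunks(rules, buf, len);        // from the earlier sketch
        YRX_SCANNER* scanner = nullptr;
        if (yrx_scanner_create(rules, &scanner) != 0) return; // assume 0 == success
        yrx_scanner_scan(scanner, buf, len); // whole-buffer pass
        yrx_scanner_destroy(scanner);
    }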

I am worried about offset(0) usage requiring alignment, hence 4KB. But you make a good point about requisite patterns being located relatively far apart. It's probably going to require some empiricism to get to the best approach.

It may also be possible to record the raw pattern matches, even if rules don't match. This may be noisier (depending on the rules) but also could turn up something where a rule wouldn't have matched.

@scudette

This is exactly the problem we encountered: some rules use the offset as a header check and then look for patterns deep in the file. This works when the file is mmaped, but it only works on the first buffer if the file is read in buffered mode.

This is the behaviour that was confusing users the most.

@scudette

Recording the individual hits is easy: replace the rule's condition clause with "any of them".

We are starting to apply linting to Yara rules in order to remove low-quality rules from large rule sets.

The problem is that many rules are broken. If a user gives us a file with thousands of rules, we can pass the entire thing to the library to compile, and it basically returns an error; we have no idea which rule is broken.

So now we actually lint individual rules and do some sanity checking as well:

https://docs.velociraptor.app/vql_reference/parsers/yara_lint/

@simsong (Owner)

simsong commented Nov 7, 2025

@jonstewart & @scudette — it seems like there are a lot of bad rules out there. Perhaps we should use AI to figure out the rule and then rewrite it?

@scudette

scudette commented Nov 7, 2025

This issue can shed more light on the present discussion: VirusTotal/yara-x#470

The interesting thing here is that a rule may be broken depending on the mode in which the scanning occurs. If a rule refers to certain functions (like uintXX in the above discussion), it will be broken in buffer mode but will work in file mode.

The above issue introduced a FileScan mode for the yara-x Golang implementation to make such rules work. But this should also apply to the C++ binding, depending on the scanning mode.

@jonstewart (Collaborator, Author)

jonstewart commented Nov 7, 2025 via email

@scudette

scudette commented Nov 7, 2025

The issue here is not specifically about regexes but about yara functions, which are evaluated on the data, and about how the data itself is managed.

Yara itself has the same issue, because the problem is not specific to Velociraptor or an endpoint scenario but rather about how to actually scan a file: ultimately we have to read the file in buffers, then apply the rule on each buffer. How large do we make the buffer? What happens at the buffer's edge? And finally, how do we evaluate yara functions which require access to specific offsets in the buffer (like uint16(offset), which needs to read a fixed offset in the file)?

Yara X does not have a way to provide arbitrary data to its function evaluator, so it is impossible to evaluate uint16() on a buffer (because the buffer may be in the middle of a file). So Yara X just turns off these functions when scanning in buffer mode.

As a special case, when scanning in "file" mode, YaraX can in fact read arbitrary offsets into the file, so then, yes, these functions do work.

My point above is that evaluating whether a rule is valid or not requires that we know how it is scanned as well. Scanning by buffers (which I would imagine a carver like BE will have to do) will break pretty much every rule that uses yara functions and modules (like pe, lnk, etc., which are disabled when scanning in buffers).

This may be unexpected to users, but it does make sense: for example, how would a carver evaluate a rule like uint16(0)? There is no file header to speak of.

As an extension, for BE, you can throw away any and all rules that use yara functions; this is easy to do now because we have an AST parser. This can reduce the number of rules substantially and actually clarify matters a lot.
