Run yara-x rules against small blocks of sbufs #487
Conversation
Also, the …

Review comment on the pkg-config block in configure.ac:

    else
        export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
    fi
    fi

I want to consolidate this pkg-config block with the one above it for lightgrep to avoid duplication.
Why 4kb chunks and not 16kb or even 64kb?
On Sat, Mar 22, 2025 at 5:20 PM Jon Stewart wrote:
This is currently draft. The idea is to link against the new yara-x library with its C API (https://virustotal.github.io/yara-x/docs/api/c/c-/) and then run rules against 4KB-sized chunks contained in each sbuf. Like lightgrep, yara invokes a callback when there's a match on a file, and the name of the matching rule is written out to the feature recorder along with the pos_t of the 4KB block (and, oops, just realized I have a bug here, oh well, it's draft). A sketch of this flow follows the to-do list below.

What's left to do:

- take yara rules location from a CLI arg
- load yara rules from the CLI arg (finding all .yar files recursively if the path is a directory)
- fix pos_t bug above
- some unit testing
- more error-handling
- outputting the "namespace" of a matching rule along with its identifier
- outputting some more context??
- adding an option to change the block size from 4KB?? (an old Python script called page_brute used 4KB)
- documentation??????
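As referenced above, here is a minimal sketch of what the 4KB chunked scan could look like. It is an illustration, not the PR's scan_yarax.cpp, and it assumes the yara-x C API entry points named in the linked docs (yrx_compile, yrx_scanner_create, yrx_scanner_on_matching_rule, yrx_scanner_scan, yrx_rule_identifier); the exact signatures, header name, and error codes should be checked against the installed yara_x.h, and the feature-recorder write is only hinted at in a comment.

```cpp
// Sketch only: scan a buffer in 4KB chunks with yara-x and report matching rule
// names with the chunk offset. Error handling is minimal; a real scanner would
// write to a bulk_extractor feature recorder instead of printf.
#include <yara_x.h>   // assumed header name for the yara-x C API
#include <cstdint>
#include <cstdio>

struct MatchCtx { size_t chunk_offset; };

// Callback invoked by yara-x for each rule that matches the scanned data.
static void on_match(const YRX_RULE* rule, void* user_data) {
    const uint8_t* ident = nullptr;
    size_t len = 0;
    if (yrx_rule_identifier(rule, &ident, &len) == 0) {   // assuming 0 means success
        const MatchCtx* ctx = static_cast<const MatchCtx*>(user_data);
        // Real code: the feature recorder gets the rule name plus the pos_t of the chunk.
        std::printf("%zu\t%.*s\n", ctx->chunk_offset, (int)len, (const char*)ident);
    }
}

// 'rules' would come from yrx_compile(rule_source, &rules) after loading .yar files.
static void scan_chunks(YRX_RULES* rules, const uint8_t* buf, size_t buflen) {
    constexpr size_t CHUNK = 4096;
    YRX_SCANNER* scanner = nullptr;
    if (yrx_scanner_create(rules, &scanner) != 0) return;
    MatchCtx ctx{0};
    yrx_scanner_on_matching_rule(scanner, on_match, &ctx);
    for (size_t off = 0; off < buflen; off += CHUNK) {
        ctx.chunk_offset = off;
        const size_t n = (buflen - off < CHUNK) ? (buflen - off) : CHUNK;
        yrx_scanner_scan(scanner, buf + off, n);
    }
    yrx_scanner_destroy(scanner);
}
```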
Commit Summary
- a8e6941: configure should check for existence of yara_x_capi and link against it if enabled and present
- f079e26: create a stub scanner for scan_yarax, with the config symbol being HAVE_YARAX
- af88d0e: basic scanning of sbufs with yara_x implemented. This uses a hard-coded rule that is unlikely to generate matches and it also does not use feature-recorders to record any matches. But it does get called and pass tests with the address_sanitizer. I will need to think of some ways to unit test this.
- bbd5cdf: Create a yara-x feature recorder and write the names of matching rules with the block/forensic path of the data block that triggered the rule.

File Changes (4 files)
- M configure.ac (33)
- M src/Makefile.am (1)
- M src/bulk_extractor_scanners.h (3)
- A src/scan_yarax.cpp (111)

Patch Links
- https://github.com/simsong/bulk_extractor/pull/487.patch
- https://github.com/simsong/bulk_extractor/pull/487.diff
Fair question. Yara rules are designed to be run against files, not raw data. The rule language reminds me of a certain infamous Richard Gabriel essay... A typical yara rule is propelled by multipattern string search, but then has conditions to evaluate any discovered matches. Those conditions may contain assertions about matches being at various offsets, byte values at various offsets (independent of string/pattern search), and file sizes.

It seems common practice for yara rules to make assertions of byte values at/around offset 0 as a poor man's file signature/magic matching routine (there are many rules asserting that the file starts with "PE" so as to apply only to Windows executables). It also seems like common practice to use file size to exclude rule matches on very large files (i.e., most assertions are size < X).

So, there's a fundamental mismatch here between yara rules (file-oriented) and bulk_extractor (data-oriented). I think choosing the relatively small 4KB size offers the best chance of discovering potential matches as it's the typical sector size and memory page size. However, it could be that some rules making assertions about file signatures are also expecting their pattern matches much further into the file. We may need to run the rules duplicatively in ~1MB chunks, too, to stand the best chance of triggering rule matches.

I hope to get a better idea of these tradeoffs by analyzing a few large open source yara rulebases.

I believe it is also possible to get yara to report the raw pattern matches in data, even if the rules don't match. A separate feature recorder can be made for these, so that even if a rule doesn't match (due to condition assertions) a user can inspect the pattern matches and potentially discover fragments of malware.
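To make the file-oriented assumptions concrete, here is a hypothetical rule of the kind described above (an illustration, not taken from any real rule set), written as a C++ raw string the way a hard-coded rule in scan_yarax.cpp might be. The offset-0 header check and the filesize cap are exactly the conditions an arbitrary 4KB chunk from the middle of an image will rarely satisfy; only the string patterns can still hit.

```cpp
// Hypothetical file-oriented rule: "MZ" header check at offset 0, a filesize cap,
// and string patterns that must all be present. On a 4KB chunk without a real
// file header, uint16(0) and filesize make the condition effectively unsatisfiable.
static const char* EXAMPLE_FILE_ORIENTED_RULE = R"yar(
rule example_windows_dropper {
    strings:
        $cmd  = "cmd.exe /c " ascii
        $pool = { 6A 40 68 00 30 00 00 }
    condition:
        uint16(0) == 0x5A4D and filesize < 2MB and all of them
}
)yar";
```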
…st the much larger bulk_extractor page offset (typically 16MB).
I think that you should make the page size a parameter so it can be tuned at runtime. Bulk_extractor's feature recorder system already does de-duplication of features that are reported twice. This is what the entire 'margin' system is about.

Did you look at my implementation for the legacy regular expression matching system? It does most of what you are describing above --- breaking the image into chunks to give to the RE engine.
I don't disagree about the need to specify a different block size at runtime. One question: is there a way with a feature recorder to provide a string (for example, the associated rule name with a pattern match) but have b_e carve surrounding context automatically?
Yes. Feature recorders can be set up to automatically carve when features are detected. However, you may need more flexibility than these three options.
Sorry, "carve" is an overloaded word. I simply mean the context extraction in the third column. I'm not sure that would be useful, though.
It's the file format.
…yar files containing rules to be loaded. Uses these for scanning 4KB chunks at a time. Added a semi-duplicative second scanning pass over the sbuf in one go. Will need to check out various rule bases and see what makes the most sense (and an option for control). But I was able to get a number of hits on known malware.
This will need better error handling and reporting. Probably the most likely use case is for users to point it at a large repository of .yar files, each containing one to many yara rules, and some of those will have syntax errors and not get included.
I just wanted to add some context here which might be useful. In Velociraptor we have a similar problem: file access is abstracted via an accessor (e.g. we read the file from the NTFS parser or from memory), but a lot of Yara rules rely on matches that are very far apart. There does not seem to be a way to tell Yara "this is a reader object, please use that to read the file": the library either operates on a buffer or on a file. In the case of a file, the library mmaps the entire file, and that allows it to efficiently match strings very far apart. We have resorted to a shortcut: when we scan files that are real, we delegate them to the library for mmap; otherwise many rules that are expected to match don't.

I agree that the rule has to be written with the context of the scan in mind. People try to use rules intended for files on memory, and that fails in exactly the same way as described above. But I think people don't really want to think about it; they just want to throw random rules at the sample and hope it works. We have implemented buffer size choice as well for cases when we need to parse in buffers, but it's actually not that useful because people don't know their rules in advance and don't really understand what that choice means.
Thank you, scudette. I agree that most users will want to point this at a directory of rules they've gotten from other sources and not explore buffer sizes, etc. The present implementation first scans in separate 4KB chunks and then does a second scan over the entire 16MB chunk (with an additional 1MB of shingle). I am worried about …

It may also be possible to record the raw pattern matches, even if rules don't match. This may be noisier (depending on the rules) but also could turn up something where a rule wouldn't have matched.
This is exactly the problem we encountered: some rules use the offset as a header check and then look for patterns deep in the file. This works when the file is mmapped, but it only works on the first buffer if the file is read in buffered mode. This is the behaviour that was confusing users the most.
Recording the individual hits is easy by replacing the condition clause with "any of them". We are starting to apply linting to Yara rules in order to remove low quality rules from large rule sets. The problem is that many rules are broken, and if a user gives us a file with thousands of rules, we can pass the entire thing to the library to compile and it basically returns an error. We have no idea which rule is broken. So now we actually lint individual rules and do some sanity checking as well: https://docs.velociraptor.app/vql_reference/parsers/yara_lint/
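A sketch of that relaxation, reusing the hypothetical rule from the earlier illustration with its condition replaced by "any of them"; the rule loses its precision, but every raw pattern hit becomes visible in the output.

```cpp
// Hypothetical relaxed variant: same strings, but any single hit reports, even
// though the original header/filesize assertions could never hold for a chunk
// taken from the middle of an image.
static const char* EXAMPLE_RELAXED_RULE = R"yar(
rule example_windows_dropper_relaxed {
    strings:
        $cmd  = "cmd.exe /c " ascii
        $pool = { 6A 40 68 00 30 00 00 }
    condition:
        any of them
}
)yar";
```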
@jonstewart & @scudette — it seems like there are a lot of bad rules out there. Perhaps we should use AI to figure out the rule and then rewrite it?
This issue can shed more light on the present discussion: VirusTotal/yara-x#470

The interesting thing here is that a rule may be broken depending on the mode in which the scanning occurs. If a rule refers to certain functions (like uintXX in the above discussion), it will be broken in buffer mode but work in file mode. The above issue introduced a FileScan mode for the yara-x Golang implementation to make such rules work, but this should also apply to the C++ binding, depending on the scanning mode.
Thanks, @scudette, that was an interesting and relevant read. If I can characterize your primary concern, it’s that you need to manage memory usage strictly on the user side—makes sense for an endpoint system! And then the secondary concern is, of course, performance.
The proclivity for authors to cargo-cult rules and for investigators to run large corpora of rules “just in case” makes yara an attractive nuisance. A whole bunch of regexps (fixed strings are also regexps; yes yes, I know Aho-Corasick…) get smashed together into one big ball of dismal logic that leaves a processor stalled on memory access and mispredicted branches. It might be a good idea to limit how many rules a user can run in Velociraptor in one hunt, as a forcing-function. It is kind of user-hostile, but avoids bad outcomes.
Clearly I need to finish my llama tool for forensics scanning along with its yara rule converter… The solution is a more-structured rule DSL and the ability to evaluate atomic features beyond brittle Boolean conditions.
Jon
The issue here is not specifically about regex but about yara functions, which are evaluated on the data, and about how the data itself is managed. Yara itself has the same issue, because this is not specifically about Velociraptor or an endpoint scenario but rather about how to actually scan a file: ultimately we have to read the file in buffers and then apply the rule to each buffer. How large do we make the buffer? What happens at the buffer's edge? And finally, how do we evaluate yara functions which require access to specific offsets in the buffer (like uint16(offset), which needs to read a fixed offset in the file)?

Yara X does not have a way to provide arbitrary data to its function evaluator, so it is impossible to evaluate uint16() on a buffer (because the buffer may be in the middle of a file). So Yara X just turns off these functions when scanning in buffer mode. As a special case, when scanning in "file" mode, Yara X can in fact read arbitrary offsets into the file, so then, yes, these functions do work.

My point above is that to evaluate whether a rule is valid or not requires that we know how it is scanned as well. Scanning by buffers (which I would imagine a carver like BE will have to do) will break pretty much every rule that uses yara functions and modules (like pe, lnk, etc., which are disabled when scanning in buffers). This may be unexpected to users, but it does make sense: for example, how would a carver evaluate a rule like uint16(0)? There is no file header to speak of.

As an extension, for BE, you can throw away any and all rules that use yara functions. This is easy to do now because we have an AST parser. This can reduce the number of rules substantially and actually clarify matters a lot.
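A rough way to apply that last suggestion in bulk_extractor, short of the AST-based linting described above, might be a textual pre-filter over each rule's source before compiling it. This is a sketch with an illustrative pattern list; it will misfire on rules that merely contain these tokens inside strings or comments, which is exactly the imprecision an AST-based check avoids.

```cpp
// Sketch: crude textual check for rules that use file-oriented yara functions or
// module imports (pe, lnk, ...), which buffer-mode scanning cannot evaluate.
// Not the AST-based linting described above; just a pre-filter heuristic.
#include <regex>
#include <string>

static bool looks_buffer_safe(const std::string& rule_source) {
    // uint8/uint16/uint32 (and the big-endian variants) read absolute offsets;
    // 'import "..."' pulls in modules that expect a whole file.
    static const std::regex file_oriented(
        R"re(\b(uint(8|16|32)(be)?\s*\(|import\s+"))re");
    return !std::regex_search(rule_source, file_oriented);
}
```

Rules that fail the check could be skipped (and reported) before handing the surviving set to the compiler.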