feat(parse): implement string deduplication with metadata preservation #117
- Added a new module for string deduplication that groups duplicate strings by (text, encoding) while preserving all occurrence metadata.
- Introduced `CanonicalString` and `StringOccurrence` structures to represent deduplicated strings and their occurrences.
- Enhanced the extraction process to include deduplication options in the `ExtractionConfig`, allowing users to enable/disable deduplication and set thresholds for deduplication.
- Updated documentation to reflect the new deduplication features and provided examples for usage.
- Added integration tests to validate the deduplication functionality and ensure metadata preservation across different scenarios.

This enhancement significantly improves the library's ability to manage and analyze extracted strings, facilitating better binary analysis.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
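The grouping step described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: only the names `CanonicalString` and `StringOccurrence` come from the commit message, while the `Encoding` enum, every field, and the `group_by_text_and_encoding` helper are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical encoding tag; the real crate's type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Encoding {
    Ascii,
    Utf16Le,
}

/// One place where a string was found (fields are illustrative assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct StringOccurrence {
    pub offset: u64,
    pub section: Option<String>,
    pub tags: Vec<String>,
}

/// A deduplicated string together with every occurrence that produced it.
#[derive(Debug, Clone, PartialEq)]
pub struct CanonicalString {
    pub text: String,
    pub encoding: Encoding,
    pub occurrences: Vec<StringOccurrence>,
    pub merged_tags: Vec<String>,
}

/// Groups raw hits by (text, encoding). All occurrences are kept, tags are
/// merged without duplicates, and groups appear in first-seen order.
pub fn group_by_text_and_encoding(
    hits: Vec<(String, Encoding, StringOccurrence)>,
) -> Vec<CanonicalString> {
    let mut index: HashMap<(String, Encoding), usize> = HashMap::new();
    let mut groups: Vec<CanonicalString> = Vec::new();

    for (text, encoding, occurrence) in hits {
        let key = (text.clone(), encoding);
        // Find the existing group for this key, or open a new one.
        let slot = match index.get(&key) {
            Some(&i) => i,
            None => {
                groups.push(CanonicalString {
                    text,
                    encoding,
                    occurrences: Vec::new(),
                    merged_tags: Vec::new(),
                });
                index.insert(key, groups.len() - 1);
                groups.len() - 1
            }
        };

        let group = &mut groups[slot];
        // Merge tags, skipping any the group already carries.
        for tag in &occurrence.tags {
            if !group.merged_tags.contains(tag) {
                group.merged_tags.push(tag.clone());
            }
        }
        // Preserve the full occurrence record rather than discarding it.
        group.occurrences.push(occurrence);
    }
    groups
}
```

Keying on the pair rather than on the text alone means the same text found as ASCII and as UTF-16LE stays in two separate groups, which matches the "(text, encoding)" wording above.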
- Bumped `cargo-dist-version` from 0.30.2 to 0.30.3 for the latest features.
- Updated GitHub Actions dependencies:
  - `actions/checkout` from v5 to v6
  - `actions/download-artifact` from v6 to v7
  - `actions/upload-artifact` from v5 to v6

These updates enhance CI/CD performance and ensure compatibility with the latest features.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
…I/CD workflows

- Updated `actions/checkout` from v5 to v6 across multiple workflows for enhanced performance and compatibility.
- Adjusted `actions/download-artifact` from v6 to v7 and `actions/upload-artifact` from v5 to v6 to leverage new features.
- Ensured consistency in the usage of the latest versions across all workflows.

These updates enhance the reliability and efficiency of the CI/CD processes.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Bumped `clap` to version 4.5.54 for improved functionality.
- Updated `goblin` to version 0.10.4 for better compatibility.
- Upgraded `serde_json` to version 1.0.148 to incorporate the latest features and fixes.
- Updated `insta` to version 1.46.0 and `tempfile` to version 3.24.0 for enhanced testing capabilities.

These updates ensure the project utilizes the latest versions of dependencies, improving overall stability and performance.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Walkthrough
This PR adds a new string deduplication system (preserving per-occurrence metadata, tag merging, and combined scoring), integrates it into the extraction API/flow, adds tests and docs, bumps several crate and CI action versions, and updates release/CI workflows and workspace release settings.
Sequence Diagram
sequenceDiagram
participant Binary as Binary Input
participant Extractor as BasicExtractor / StringExtractor
participant Dedup as Deduplication Module
participant Scorer as Scoring Engine
participant Output as FoundString / CanonicalString
Binary->>Extractor: extract()
activate Extractor
Extractor->>Extractor: scan sections -> emit FoundString list
alt enable_deduplication == true
Extractor->>Dedup: deduplicate(found_strings, threshold, preserve_all)
activate Dedup
Dedup->>Dedup: group by (text, encoding)
Dedup->>Dedup: merge tags, collect occurrences
Dedup->>Scorer: calculate_combined_score(group)
activate Scorer
Note over Scorer: base + occurrence_bonus\n+ cross_section_bonus\n+ multi_source_bonus\n+ confidence_boost
Scorer-->>Dedup: combined_score
deactivate Scorer
Dedup->>Dedup: sort groups by combined_score DESC
Dedup-->>Extractor: Vec<CanonicalString>
deactivate Dedup
Extractor->>Output: convert CanonicalString -> FoundString (for backward compat)
else
Extractor-->>Output: Vec<FoundString> (raw)
end
deactivate Extractor
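The scoring and ordering steps in the diagram can be illustrated with a small sketch. The five term names (base, `occurrence_bonus`, `cross_section_bonus`, `multi_source_bonus`, `confidence_boost`) come from the diagram's note; the `Occurrence` struct, the choice of the highest per-occurrence score as the base, and every numeric weight below are hypothetical, since the PR text does not state them.

```rust
use std::cmp::Reverse;
use std::collections::HashSet;

/// Illustrative occurrence record; all field names are assumptions.
#[derive(Debug, Clone)]
pub struct Occurrence {
    pub section: String,
    pub source: String,
    pub score: i32,
    pub confidence: f32,
}

/// Sums the five terms named in the diagram. The weights (5, 10, 15, 5) and
/// the 0.9 confidence cutoff are placeholders, not values from the PR.
pub fn combined_score(group: &[Occurrence]) -> i32 {
    if group.is_empty() {
        return 0;
    }
    // Base: assumed here to be the best individual score in the group.
    let base = group.iter().map(|o| o.score).max().unwrap_or(0);
    // Each extra occurrence beyond the first adds a fixed bonus.
    let occurrence_bonus = 5 * (group.len() as i32 - 1);
    // Bonus when the string shows up in more than one section.
    let sections: HashSet<&str> = group.iter().map(|o| o.section.as_str()).collect();
    let cross_section_bonus = if sections.len() > 1 { 10 } else { 0 };
    // Bonus when more than one extraction source reported the string.
    let sources: HashSet<&str> = group.iter().map(|o| o.source.as_str()).collect();
    let multi_source_bonus = if sources.len() > 1 { 15 } else { 0 };
    // Boost when at least one occurrence is high-confidence.
    let max_confidence = group.iter().map(|o| o.confidence).fold(0.0_f32, f32::max);
    let confidence_boost = if max_confidence >= 0.9 { 5 } else { 0 };

    base + occurrence_bonus + cross_section_bonus + multi_source_bonus + confidence_boost
}

/// Sorts groups so the highest combined score comes first (DESC).
pub fn sort_by_combined_score_desc(groups: &mut [Vec<Occurrence>]) {
    groups.sort_by_key(|g| Reverse(combined_score(g)));
}
```

Under these placeholder weights, a string seen twice across two sections with a best score of 30 scores 30 + 5 + 10 = 45, so it outranks a single low-confidence hit scored 40.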
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ✅ 3 passed
… unused preprocessor alerts

- Removed the `multilingual` setting from the book configuration as it is no longer needed.
- Deleted the unused `[preprocessor.alerts]` section to clean up the configuration file.

These changes streamline the book configuration, improving clarity and maintainability.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
This pull request focuses on updating dependencies and improving the deduplication system for string extraction, along with several workflow and documentation enhancements. The main changes are upgrading GitHub Actions and Rust dependencies, updating the deduplication logic and documentation, and aligning configuration files with these updates.
Dependency and workflow updates:
- Updated `actions/checkout` to `v6`, `actions/upload-artifact` to `v6`, and `actions/download-artifact` to `v7` across all GitHub Actions workflow files for improved security and compatibility.
- Updated dependencies in `Cargo.toml` (e.g., `clap`, `goblin`, `serde_json`, `insta`, `tempfile`) to their latest patch versions for bug fixes and improved features.
- Updated `dist-workspace.toml` to use `cargo-dist` version `0.30.3` and aligned GitHub Action versions in the config.
- … `cargo-dist-installer.sh`.

Deduplication system improvements:
- Added string deduplication that groups strings by `(text, encoding)`, preserving all occurrence metadata, merging tags, and calculating combined scores with occurrence-based bonuses.
- Updated `docs/src/architecture.md` and module-level docs in `src/extraction/mod.rs` to describe the new deduplication strategy and scoring system.

Other improvements:
- Updated `README.md` for formatting.

These changes collectively modernize the project's dependencies, improve CI reliability, and provide a more robust and well-documented string deduplication pipeline.