@unclesp1d3r unclesp1d3r commented Jan 4, 2026

This pull request updates dependencies and adds a deduplication system for string extraction, along with several workflow and documentation changes. The main changes are upgrading GitHub Actions and Rust dependencies, adding the deduplication logic and its documentation, and aligning configuration files with these updates.

Dependency and workflow updates:

  • Upgraded all usages of actions/checkout to v6, actions/upload-artifact to v6, and actions/download-artifact to v7 across all GitHub Actions workflow files for improved security and compatibility.
  • Updated Rust crate dependencies in Cargo.toml (e.g., clap, goblin, serde_json, insta, tempfile) to their latest patch or minor releases for bug fixes and improved features.
  • Updated dist-workspace.toml to use cargo-dist version 0.30.3 and aligned GitHub Action versions in the config.
  • Updated the installer script in the release workflow to use the latest cargo-dist-installer.sh.

Deduplication system improvements:

  • Added a deduplication system to the string extraction module that groups strings by (text, encoding), preserves all occurrence metadata, merges tags, and calculates combined scores with occurrence-based bonuses.
  • Updated documentation in docs/src/architecture.md and module-level docs in src/extraction/mod.rs to describe the new deduplication strategy and scoring system.
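
The grouping step described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in src/extraction/dedup.rs: the `Found`, `Group`, and `Encoding` types and the `group_by_text_and_encoding` function are hypothetical stand-ins, and the crate's real `FoundString` likely carries more metadata than shown here.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for the crate's encoding type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Encoding {
    Ascii,
    Utf16Le,
}

/// Hypothetical, simplified extracted string (one value per occurrence).
#[derive(Debug, Clone)]
pub struct Found {
    pub text: String,
    pub encoding: Encoding,
    pub offset: u64,
    pub section: Option<String>,
    pub tags: Vec<String>,
}

/// Hypothetical group: one entry per distinct (text, encoding) pair.
#[derive(Debug)]
pub struct Group {
    pub text: String,
    pub encoding: Encoding,
    /// Every (offset, section) pair seen for this string, in input order.
    pub occurrences: Vec<(u64, Option<String>)>,
    /// Union of the tags from all occurrences.
    pub tags: BTreeSet<String>,
}

pub fn group_by_text_and_encoding(found: Vec<Found>) -> Vec<Group> {
    // Maps each key to its position in `groups` so first-seen order is kept.
    let mut index: HashMap<(String, Encoding), usize> = HashMap::new();
    let mut groups: Vec<Group> = Vec::new();

    for f in found {
        let key = (f.text.clone(), f.encoding);
        let slot = *index.entry(key).or_insert_with(|| {
            groups.push(Group {
                text: f.text.clone(),
                encoding: f.encoding,
                occurrences: Vec::new(),
                tags: BTreeSet::new(),
            });
            groups.len() - 1
        });
        let group = &mut groups[slot];
        // Keep the location of every occurrence instead of dropping duplicates.
        group.occurrences.push((f.offset, f.section));
        // Merge tags; the set removes repeats.
        group.tags.extend(f.tags);
    }
    groups
}
```

Because the key includes the encoding, the same text found once as ASCII and once as UTF-16LE stays in two separate groups in this sketch.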

Other improvements:

  • Minor cleanup in README.md for formatting.

These changes collectively modernize the project's dependencies, improve CI reliability, and provide a more robust and well-documented string deduplication pipeline.

- Added a new module for string deduplication that groups duplicate strings by (text, encoding) while preserving all occurrence metadata.
- Introduced `CanonicalString` and `StringOccurrence` structures to represent deduplicated strings and their occurrences.
- Enhanced the extraction process to include deduplication options in the `ExtractionConfig`, allowing users to enable/disable deduplication and set thresholds for deduplication.
- Updated documentation to reflect the new deduplication features and provided examples for usage.
- Added integration tests to validate the deduplication functionality and ensure metadata preservation across different scenarios.

This enhancement significantly improves the library's ability to manage and analyze extracted strings, facilitating better binary analysis.
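
As a rough sketch of the configuration options mentioned above: the field names `enable_deduplication`, `dedup_threshold`, and `preserve_all_occurrences` come from the review walkthrough later in this thread, which also says deduplication and occurrence preservation default to on. Everything else here is an assumption for illustration only: the `DedupOptions` struct, the field types, the `should_collapse` helper, and in particular the meaning of the threshold, which the PR text does not define. The sketch treats it as a minimum occurrence count.

```rust
/// Hypothetical mirror of the deduplication options on `ExtractionConfig`.
#[derive(Debug, Clone, PartialEq)]
pub struct DedupOptions {
    pub enable_deduplication: bool,
    /// ASSUMPTION: minimum number of occurrences before a string is
    /// collapsed into one entry; `None` means "always collapse".
    pub dedup_threshold: Option<usize>,
    pub preserve_all_occurrences: bool,
}

impl Default for DedupOptions {
    fn default() -> Self {
        Self {
            enable_deduplication: true,
            dedup_threshold: None,
            preserve_all_occurrences: true,
        }
    }
}

/// Hypothetical helper: should a string seen `count` times be collapsed?
pub fn should_collapse(opts: &DedupOptions, count: usize) -> bool {
    if !opts.enable_deduplication {
        return false;
    }
    match opts.dedup_threshold {
        Some(min) => count >= min,
        None => true,
    }
}
```

A caller that wants the raw, duplicate-containing output would set `enable_deduplication` to `false` and leave the other fields at their defaults.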

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Bumped `cargo-dist-version` from 0.30.2 to 0.30.3 for the latest features.
- Updated GitHub Actions dependencies:
  - `actions/checkout` from v5 to v6
  - `actions/download-artifact` from v6 to v7
  - `actions/upload-artifact` from v5 to v6

These updates enhance CI/CD performance and ensure compatibility with the latest features.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
…I/CD workflows

- Updated `actions/checkout` from v5 to v6 across multiple workflows for enhanced performance and compatibility.
- Adjusted `actions/download-artifact` from v6 to v7 and `actions/upload-artifact` from v5 to v6 to leverage new features.
- Ensured consistency in the usage of the latest versions across all workflows.

These updates enhance the reliability and efficiency of the CI/CD processes.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Bumped `clap` to version 4.5.54 for improved functionality.
- Updated `goblin` to version 0.10.4 for better compatibility.
- Upgraded `serde_json` to version 1.0.148 to incorporate the latest features and fixes.
- Updated `insta` to version 1.46.0 and `tempfile` to version 3.24.0 for enhanced testing capabilities.

These updates ensure the project utilizes the latest versions of dependencies, improving overall stability and performance.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@unclesp1d3r unclesp1d3r linked an issue Jan 4, 2026 that may be closed by this pull request
13 tasks
@unclesp1d3r unclesp1d3r self-assigned this Jan 4, 2026
coderabbitai bot commented Jan 4, 2026

Caution

Review failed

Failed to post review comments

Summary by CodeRabbit

  • New Features

    • String deduplication: groups duplicate strings, preserves all occurrence metadata, merges tags, and ranks results using an improved scoring scheme.
  • Chores

    • Bumped several dependencies and developer tools.
    • Upgraded workflow actions and release tooling across CI/CD.
    • Minor documentation and configuration cleanups (formatting and release settings).

Walkthrough

This PR adds a new string deduplication system (preserving per-occurrence metadata, tag merging, and combined scoring), integrates it into the extraction API/flow, adds tests and docs, bumps several crate and CI action versions, and updates release/CI workflows and workspace release settings.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Core Deduplication Module: `src/extraction/dedup.rs` | New module: `CanonicalString`, `StringOccurrence`, `deduplicate()`, `found_string_to_occurrence()`, `CanonicalString::to_found_string()`, scoring/merge helpers, and comprehensive unit tests implementing grouping by (text, encoding), occurrence preservation, tag merging, and combined scoring. |
| Extraction Pipeline Integration: `src/extraction/mod.rs`, `src/lib.rs` | Exposed `pub mod dedup` and re-exports; added `enable_deduplication`, `dedup_threshold`, `preserve_all_occurrences` to `ExtractionConfig` (defaults enabled/preserve); extended `StringExtractor` trait with `extract_canonical()`; updated `BasicExtractor::extract()` to optionally run dedup and convert canonical results back to `FoundString`; added public re-exports in `src/lib.rs`. |
| Integration Tests: `tests/test_deduplication.rs` | New integration tests covering extractor behavior with/without dedup, metadata preservation across sections, fixture-based comparison, combined-score bonus calculations, and canonical vs deduped occurrence preservation. |
| Documentation: `docs/src/architecture.md`, `README.md`, `docs/book.toml` | Added deduplication step and detailed dedup system description (grouping, occurrence preservation, tag merging, scoring) to architecture docs; minor README formatting cleanup; removed empty/explicit flags in book config. |
| CI & Release Workflows: `.github/workflows/...` (`audit.yml`, `ci.yml`, `codeql.yml`, `copilot-setup-steps.yml`, `docs.yml`, `release.yml`, `security.yml`) | Bumped `actions/checkout` v5→v6 across workflows; release workflow also bumped `upload-artifact` v5→v6, `download-artifact` v6→v7, cargo-dist installer URL/version; `copilot-setup-steps.yml` swapped a just installer action for `extractions/setup-just@v3`. |
| Manifests / Workspace: `Cargo.toml`, `dist-workspace.toml` | Dependency bumps: clap (4.5.51→4.5.54), goblin (0.10.3→0.10.4), serde_json (1.0.145→1.0.148); dev deps insta (1.43.2→1.46.0), tempfile (3.23.0→3.24.0); cargo-dist-version 0.30.2→0.30.3 and corresponding action version updates in workspace config. |
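
To illustrate the backward-compatibility conversion mentioned in the summary (canonical results converted back to `FoundString`), here is a minimal sketch. The type names `CanonicalString`, `StringOccurrence`, and `FoundString` and the method name `to_found_string` appear in the summary; the fields, the return type, and the choice of the first occurrence as the representative are assumptions for illustration and may not match the merged code.

```rust
/// Hypothetical, simplified shape of one occurrence.
#[derive(Debug, Clone, PartialEq)]
pub struct StringOccurrence {
    pub offset: u64,
    pub section: Option<String>,
}

/// Hypothetical, simplified shape of a deduplicated string.
#[derive(Debug, Clone, PartialEq)]
pub struct CanonicalString {
    pub text: String,
    pub occurrences: Vec<StringOccurrence>,
    pub combined_score: i32,
}

/// Hypothetical, simplified shape of the legacy per-occurrence result.
#[derive(Debug, Clone, PartialEq)]
pub struct FoundString {
    pub text: String,
    pub offset: u64,
    pub section: Option<String>,
    pub score: i32,
}

impl CanonicalString {
    /// ASSUMPTION: the first occurrence supplies the location and the
    /// combined score replaces the per-occurrence score.
    /// Returns `None` if the group has no occurrences.
    pub fn to_found_string(&self) -> Option<FoundString> {
        let first = self.occurrences.first()?;
        Some(FoundString {
            text: self.text.clone(),
            offset: first.offset,
            section: first.section.clone(),
            score: self.combined_score,
        })
    }
}
```

Under these assumptions, callers of the existing `extract()` path see one entry per distinct string, while callers that need every location would use the canonical form directly.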

Sequence Diagram

sequenceDiagram
    participant Binary as Binary Input
    participant Extractor as BasicExtractor / StringExtractor
    participant Dedup as Deduplication Module
    participant Scorer as Scoring Engine
    participant Output as FoundString / CanonicalString

    Binary->>Extractor: extract()
    activate Extractor
    Extractor->>Extractor: scan sections -> emit FoundString list

    alt enable_deduplication == true
        Extractor->>Dedup: deduplicate(found_strings, threshold, preserve_all)
        activate Dedup
        Dedup->>Dedup: group by (text, encoding)
        Dedup->>Dedup: merge tags, collect occurrences
        Dedup->>Scorer: calculate_combined_score(group)
        activate Scorer
        Note over Scorer: base + occurrence_bonus\n+ cross_section_bonus\n+ multi_source_bonus\n+ confidence_boost
        Scorer-->>Dedup: combined_score
        deactivate Scorer
        Dedup->>Dedup: sort groups by combined_score DESC
        Dedup-->>Extractor: Vec<CanonicalString>
        deactivate Dedup

        Extractor->>Output: convert CanonicalString -> FoundString (for backward compat)
    else
        Extractor-->>Output: Vec<FoundString> (raw)
    end

    deactivate Extractor
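
The scoring note in the diagram (base + occurrence_bonus + cross_section_bonus + multi_source_bonus + confidence_boost) can be made concrete with a small sketch. Only the names of the five terms and the descending sort come from the diagram. The `Occurrence` type, the rule that the base is the highest per-occurrence score, every weight, and the confidence cutoff are placeholder assumptions chosen for illustration, not the values used in the merged code.

```rust
use std::collections::HashSet;

/// Hypothetical per-occurrence data needed for scoring.
#[derive(Debug, Clone)]
pub struct Occurrence {
    pub section: Option<String>,
    /// Hypothetical label for where the string came from.
    pub source: String,
    pub score: i32,
    /// Confidence as an integer percentage (0..=100).
    pub confidence_pct: u8,
}

// Placeholder weights: assumptions, not values from the PR.
const OCCURRENCE_BONUS_PER_EXTRA: i32 = 5;
const CROSS_SECTION_BONUS: i32 = 10;
const MULTI_SOURCE_BONUS: i32 = 15;
const CONFIDENCE_BOOST: i32 = 10;
const HIGH_CONFIDENCE_PCT: u8 = 90;

pub fn calculate_combined_score(occurrences: &[Occurrence]) -> i32 {
    if occurrences.is_empty() {
        return 0;
    }
    // ASSUMPTION: the base is the best individual score in the group.
    let base = occurrences.iter().map(|o| o.score).max().unwrap_or(0);
    // Each occurrence beyond the first adds a fixed bonus.
    let occurrence_bonus = OCCURRENCE_BONUS_PER_EXTRA * (occurrences.len() as i32 - 1);
    // Bonus when the string shows up in more than one named section.
    let sections: HashSet<&str> =
        occurrences.iter().filter_map(|o| o.section.as_deref()).collect();
    let cross_section_bonus = if sections.len() > 1 { CROSS_SECTION_BONUS } else { 0 };
    // Bonus when more than one kind of source reported the string.
    let sources: HashSet<&str> = occurrences.iter().map(|o| o.source.as_str()).collect();
    let multi_source_bonus = if sources.len() > 1 { MULTI_SOURCE_BONUS } else { 0 };
    // Boost when at least one occurrence is high confidence.
    let high = occurrences.iter().any(|o| o.confidence_pct >= HIGH_CONFIDENCE_PCT);
    let confidence_boost = if high { CONFIDENCE_BOOST } else { 0 };

    base + occurrence_bonus + cross_section_bonus + multi_source_bonus + confidence_boost
}

/// Sort (text, combined_score) pairs by score, highest first.
pub fn rank_by_score(mut scored: Vec<(String, i32)>) -> Vec<(String, i32)> {
    scored.sort_by(|a, b| b.1.cmp(&a.1));
    scored
}
```

With these placeholder weights, a string seen twice, in two sections, from two sources, with one high-confidence hit and a best individual score of 30 would score 30 + 5 + 10 + 15 + 10 = 70.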

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Poem

🐰 I hopped through bytes and merged each ring,

Occurrences gathered, tags take wing,
Scores combined until the best shines through,
One canonical string — tidy and true,
Hooray, I thumped, the pipeline's new!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main feature: implementing string deduplication with metadata preservation, which matches the primary changes in src/extraction/dedup.rs and related extraction module updates. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, covering both the main deduplication system improvements and supporting dependency/workflow updates throughout the PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
@unclesp1d3r unclesp1d3r marked this pull request as ready for review January 4, 2026 21:40
… unused preprocessor alerts

- Removed the `multilingual` setting from the book configuration as it is no longer needed.
- Deleted the unused `[preprocessor.alerts]` section to clean up the configuration file.

These changes streamline the book configuration, improving clarity and maintainability.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@unclesp1d3r unclesp1d3r merged commit e971b47 into main Jan 4, 2026
17 of 19 checks passed
@unclesp1d3r unclesp1d3r deleted the 13-implement-string-deduplication-with-metadata-preservation branch January 4, 2026 21:55
Development

Successfully merging this pull request may close these issues.

Implement String Deduplication with Metadata Preservation

2 participants