Implement String Deduplication with Metadata Preservation #13

@unclesp1d3r

Description


Problem Statement

Currently, the string extraction pipeline generates multiple FoundString entries for identical string text that appears in different locations within a binary. This creates significant noise in the output and makes analysis difficult.

Example scenario:

  • The string "https://example.com" appears in:
    • .rodata section at offset 0x1000
    • .data section at offset 0x3500
    • PE resource section at offset 0x8000

Without deduplication, three separate FoundString entries are emitted, even though they represent the same semantic content.

Goals

Implement intelligent string deduplication that:

  1. Reduces noise by consolidating duplicate string text into canonical entries
  2. Preserves metadata from all occurrences (offsets, sections, sources)
  3. Merges tags intelligently to combine semantic classifications
  4. Calculates combined relevance scores based on occurrence patterns
  5. Maintains traceability so analysts can see all locations where a string appears

Proposed Solution

Implementation Plan

Create src/extraction/dedup.rs with the following components:

1. Canonical String Structure

/// One deduplicated string, canonical over every place it was found.
pub struct CanonicalString {
    pub text: String,
    pub encoding: Encoding,
    /// Every location where this (text, encoding) pair occurs.
    pub occurrences: Vec<StringOccurrence>,
    /// Union of tags across all occurrences, duplicates removed.
    pub merged_tags: Vec<Tag>,
    /// Maximum original score plus occurrence/section/source bonuses.
    pub combined_score: i32,
}

/// Metadata for a single location where the string was observed.
pub struct StringOccurrence {
    pub offset: u64,
    pub rva: Option<u64>,
    pub section: Option<String>,
    pub source: StringSource,
    pub original_tags: Vec<Tag>,
    pub original_score: i32,
}

2. Deduplication Algorithm

pub fn deduplicate(strings: Vec<FoundString>) -> Vec<CanonicalString> {
    // 1. Group strings by (text, encoding) key
    // 2. For each group, create a CanonicalString with:
    //    - All occurrences preserved
    //    - Union of all tags across occurrences
    //    - Combined score per the scoring strategy below
    // 3. Sort by combined_score descending
}
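
A minimal sketch of this algorithm, assuming FoundString exposes text, encoding, offset, rva, section, source, tags, and score fields (the real definition lives in src/types.rs and may differ), and that Encoding derives Eq + Hash + Clone while Tag derives PartialEq + Clone. The combined_score() helper is sketched under Scoring Strategy below:

use std::collections::HashMap;

pub fn deduplicate(strings: Vec<FoundString>) -> Vec<CanonicalString> {
    // Group by (text, encoding) so UTF-8 and UTF-16 copies of the same
    // text remain separate canonical entries.
    let mut groups: HashMap<(String, Encoding), Vec<FoundString>> = HashMap::new();
    for s in strings {
        groups.entry((s.text.clone(), s.encoding.clone())).or_default().push(s);
    }

    let mut canonical: Vec<CanonicalString> = groups
        .into_iter()
        .map(|((text, encoding), group)| {
            // Union of tags across all occurrences, duplicates removed.
            let mut merged_tags: Vec<Tag> = Vec::new();
            for tag in group.iter().flat_map(|s| s.tags.iter()) {
                if !merged_tags.contains(tag) {
                    merged_tags.push(tag.clone());
                }
            }
            let combined_score = combined_score(&group);
            let occurrences: Vec<StringOccurrence> = group
                .into_iter()
                .map(|s| StringOccurrence {
                    offset: s.offset,
                    rva: s.rva,
                    section: s.section,
                    source: s.source,
                    original_tags: s.tags,
                    original_score: s.score,
                })
                .collect();
            CanonicalString { text, encoding, occurrences, merged_tags, combined_score }
        })
        .collect();

    // Highest-scoring canonical strings first.
    canonical.sort_by(|a, b| b.combined_score.cmp(&a.combined_score));
    canonical
}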

3. Scoring Strategy

  • Base score: Take the maximum score from all occurrences
  • Occurrence bonus: +5 points for each additional occurrence (indicates importance)
  • Cross-section bonus: +10 if string appears in multiple section types
  • Multi-source bonus: +15 if string appears from different sources (e.g., both section data and imports)
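
A sketch of the bonus arithmetic, under the same assumptions about FoundString's fields as above:

use std::collections::HashSet;
use std::mem::discriminant;

/// Combined score for a group of occurrences sharing (text, encoding).
fn combined_score(group: &[FoundString]) -> i32 {
    // Base score: the strongest individual occurrence.
    let base = group.iter().map(|s| s.score).max().unwrap_or(0);

    // +5 for each occurrence beyond the first.
    let occurrence_bonus = 5 * (group.len() as i32 - 1).max(0);

    // +10 if the string spans more than one section.
    let sections: HashSet<&str> = group.iter().filter_map(|s| s.section.as_deref()).collect();
    let cross_section_bonus = if sections.len() > 1 { 10 } else { 0 };

    // +15 if the string was found by more than one source kind; comparing
    // enum discriminants avoids requiring Eq + Hash on StringSource.
    let sources: HashSet<_> = group.iter().map(|s| discriminant(&s.source)).collect();
    let multi_source_bonus = if sources.len() > 1 { 15 } else { 0 };

    base + occurrence_bonus + cross_section_bonus + multi_source_bonus
}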

4. Integration Points

Insert the deduplication stage into the extraction pipeline described in the architecture documentation:

String Extraction → **Deduplication** → Classification → Ranking → Output
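
For illustration only, the wiring could look like the sketch below; extract_strings, classify, and rank are hypothetical stand-ins for the existing stages, not real APIs in this repo:

// Hypothetical wiring; only deduplicate() is proposed by this issue.
pub fn run_pipeline(binary: &[u8]) -> Vec<CanonicalString> {
    let raw = extract_strings(binary);    // String Extraction
    let mut strings = deduplicate(raw);   // Deduplication (this issue)
    classify(&mut strings);               // Classification
    rank(&mut strings);                   // Ranking
    strings
}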

Output Format Considerations

The deduplicated output should show:

  • Primary metadata from the "best" occurrence (highest original score)
  • Count of total occurrences
  • Optional verbose mode to list all locations

Example JSONL record:

{
  "text": "https://example.com",
  "encoding": "Utf8",
  "occurrences": 3,
  "locations": [
    {"offset": 4096, "section": ".rodata", "source": "SectionData"},
    {"offset": 13568, "section": ".data", "source": "SectionData"},
    {"offset": 32768, "section": ".rsrc", "source": "ResourceString"}
  ],
  "tags": ["Url"],
  "score": 95
}
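
One way to emit that record, assuming the crate uses serde/serde_json and that Encoding, Tag, and StringSource derive Serialize (all assumptions; the field names mirror the example above):

use serde::Serialize;

#[derive(Serialize)]
struct DedupRecord<'a> {
    text: &'a str,
    encoding: &'a Encoding,
    occurrences: usize,
    locations: Vec<LocationRecord<'a>>,
    tags: &'a [Tag],
    score: i32,
}

#[derive(Serialize)]
struct LocationRecord<'a> {
    offset: u64,
    #[serde(skip_serializing_if = "Option::is_none")]
    section: Option<&'a str>,
    source: &'a StringSource,
}

/// Serialize one CanonicalString as a single JSONL line.
fn emit_jsonl(c: &CanonicalString) -> serde_json::Result<String> {
    serde_json::to_string(&DedupRecord {
        text: &c.text,
        encoding: &c.encoding,
        occurrences: c.occurrences.len(),
        locations: c
            .occurrences
            .iter()
            .map(|o| LocationRecord {
                offset: o.offset,
                section: o.section.as_deref(),
                source: &o.source,
            })
            .collect(),
        tags: &c.merged_tags,
        score: c.combined_score,
    })
}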

Acceptance Criteria

  • src/extraction/dedup.rs module created with deduplication logic
  • deduplicate() function handles identical text with different encodings separately
  • All occurrence metadata is preserved in the canonical structure
  • Tags are merged intelligently (union, with duplicates removed)
  • Combined scoring algorithm implemented with occurrence bonuses
  • Unit tests covering (a sample is sketched after this list):
    • Basic deduplication of identical strings
    • Metadata preservation from multiple occurrences
    • Tag merging logic
    • Score calculation with bonuses
    • Edge cases (empty strings, single occurrence, etc.)
  • Integration test with real binary containing duplicate strings
  • Documentation updated in docs/src/architecture.md
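
A sketch of one such test; the found() helper and the FoundString field names are hypothetical and should be adapted to the real definition in src/types.rs:

#[cfg(test)]
mod tests {
    use super::*;

    // Hypothetical constructor; adjust to the real FoundString shape.
    fn found(text: &str, section: &str, offset: u64, score: i32) -> FoundString {
        FoundString {
            text: text.into(),
            encoding: Encoding::Utf8,
            offset,
            rva: None,
            section: Some(section.into()),
            source: StringSource::SectionData,
            tags: vec![],
            score,
        }
    }

    #[test]
    fn merges_identical_text_and_keeps_all_occurrences() {
        let input = vec![
            found("https://example.com", ".rodata", 0x1000, 80),
            found("https://example.com", ".data", 0x3500, 70),
        ];
        let result = deduplicate(input);
        assert_eq!(result.len(), 1);
        assert_eq!(result[0].occurrences.len(), 2);
        // max(80, 70) + 5 (extra occurrence) + 10 (cross-section) = 95
        assert_eq!(result[0].combined_score, 95);
    }
}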

Technical Considerations

Edge Cases

  • Empty strings: Should they be deduplicated or filtered?
  • Whitespace variations: Are " test" and "test " the same?
  • Encoding normalization: How to handle UTF-8 vs UTF-16 of same text?

Performance

  • Use HashMap<(String, Encoding), Vec<FoundString>> for efficient grouping
  • Consider memory implications for binaries with thousands of duplicate strings
  • Benchmark against large binaries (>100MB) to ensure acceptable performance

Configuration

Consider adding configuration options:

  • --no-dedup: Skip deduplication for raw output
  • --dedup-threshold: Only deduplicate strings appearing N+ times
  • --dedup-mode: [strict|normalized] for handling whitespace
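
If the CLI uses clap's derive API (an assumption), the flags could be declared roughly like this:

use clap::{Parser, ValueEnum};

#[derive(Parser)]
struct DedupOpts {
    /// Skip deduplication and emit raw FoundString output.
    #[arg(long)]
    no_dedup: bool,

    /// Only deduplicate strings that appear at least this many times.
    #[arg(long, default_value_t = 2)]
    dedup_threshold: usize,

    /// How string text is compared when grouping.
    #[arg(long, value_enum, default_value = "strict")]
    dedup_mode: DedupMode,
}

#[derive(Clone, Copy, ValueEnum)]
enum DedupMode {
    /// Byte-for-byte comparison.
    Strict,
    /// Trim surrounding whitespace before comparing.
    Normalized,
}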

Related Files

  • src/types.rs: Contains FoundString structure definition
  • src/extraction/mod.rs: Main extraction module (currently placeholder)
  • docs/src/architecture.md: Architecture documentation to update

Task Context

Requirements: 2.5
Task-ID: stringy-analyzer/string-deduplication
