
persistence: add TextRepository with PII scrubber and helpers#6

Open
ranjithaks22 wants to merge 1 commit into dev from feat/text-repository-pr

Conversation

@ranjithaks22
Collaborator

Summary

Implement TextRepository class for persistence layer with integrated PII scrubbing.

Changes

  • TextRepository class: CRUD helpers for interaction logs, training dataset, config, and model versions.
  • PII scrubber (scrub_text): Regex-based redaction for emails, IPs, phone numbers, and likely proper names before DB insert.
  • DB schema: Creates training_dataset, app_config, model_versions tables; reuses interaction_logs from sqlite_connection.py.
  • Retention policy: Enforces max 10,000 rows in training_dataset by deleting oldest entries.
  • WAL mode: Leverages SQLite WAL for concurrent reads during writes.
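For context, the WAL setup mentioned above typically amounts to two pragmas applied at connection time. A minimal sketch, assuming a file-backed database (the helper name is illustrative; the real code lives in sqlite_connection.py):

```python
import sqlite3

def get_connection(db_path: str) -> sqlite3.Connection:
    """Open a SQLite connection configured for concurrent reads during writes."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    # WAL lets readers proceed while a writer appends to the log;
    # synchronous=NORMAL is the usual durability trade-off paired with WAL.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn
```

Note that WAL only applies to file-backed databases; in-memory databases silently fall back to a different journal mode.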

Testing

Smoke test passed:

  • DB created at storage/user_data.db
  • Text scrubbed (e.g., "Hello Alice, email alice@example.com" → "Hello <REDACTED_NAME>, email <REDACTED_EMAIL>")
  • Row inserted and fetched successfully.

Next Steps

  • Add pytest unit tests for scrubber and repository methods.
  • Wire TextRepository into listener pipeline for Path B (Cold Path) data collection.
  • Consider NER-based scrubber for finer-grained PII redaction.


Copilot AI left a comment


Pull request overview

Adds a new persistence-layer TextRepository that initializes required SQLite tables and provides CRUD helpers, with a regex-based scrub_text PII redaction step intended to run before storing user text.

Changes:

  • Introduces scrub_text with regex redaction for emails, IPs, phone numbers, and “likely names”.
  • Adds TextRepository with helpers for interaction_logs, training_dataset (including 10k-row retention), app_config, and model_versions.
  • Ensures new tables exist on startup and configures SQLite connections for WAL/NORMAL pragmas.


Comment on lines +48 to +53
def __init__(self):
    init_db()
    self.db_path = DB_PATH
    # ensure additional tables exist
    self._ensure_tables()


Copilot AI Feb 19, 2026


TextRepository.__init__ calls init_db() on every instantiation; init_db() currently prints to stdout and re-applies pragmas/table creation. If TextRepository is created frequently (e.g., per request), this can create noisy output and unnecessary work. Consider making initialization an explicit one-time startup step or ensuring init_db() uses structured logging and is only invoked once (e.g., via a module-level singleton/guard).
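The one-time guard suggested here can be sketched with double-checked locking; `init_fn` stands in for the module's `init_db`, and all names are illustrative:

```python
import threading

_init_lock = threading.Lock()
_initialized = False

def ensure_initialized(init_fn) -> None:
    """Invoke init_fn at most once per process, even under concurrent callers."""
    global _initialized
    if _initialized:          # fast path: no lock once initialized
        return
    with _init_lock:
        if not _initialized:  # re-check under the lock (double-checked locking)
            init_fn()
            _initialized = True
```

`TextRepository.__init__` could then call `ensure_initialized(init_db)` cheaply on every instantiation.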

Collaborator


seems like a legit issue

Comment on lines +102 to +111
clean = scrub_text(original_text or "")
conn = self._get_conn()
try:
    cur = conn.execute(
        """
        INSERT INTO interaction_logs (original_text, rephrased_text, mode, latency_ms)
        VALUES (?, ?, ?, ?)
        """,
        (clean, rephrased_text, mode, latency_ms),
    )

Copilot AI Feb 19, 2026


insert_interaction scrubs original_text but persists rephrased_text without any PII redaction. If rephrased_text can contain user-provided content or model echoes, this defeats the “all text passes through a filter before being stored” requirement. Scrub rephrased_text as well (or make it explicit/guaranteed that it is already scrubbed before calling this method).
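A minimal sketch of scrubbing both fields at the persistence boundary; `EMAIL_RE` stands in for the PR's full pattern set, and the helper names are illustrative:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub_text(text: str) -> str:
    """Stand-in for the PR's scrubber: redacts email addresses only."""
    return EMAIL_RE.sub("<REDACTED_EMAIL>", text or "")

def scrub_interaction(original_text: str, rephrased_text: str) -> tuple:
    """Scrub BOTH fields before they reach the INSERT, per the review."""
    return scrub_text(original_text), scrub_text(rephrased_text)
```

The INSERT would then bind the scrubbed pair instead of the raw `rephrased_text`.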

Comment on lines +137 to +141
conn = self._get_conn()
try:
    conn.execute(
        "INSERT OR REPLACE INTO training_dataset (id, clean_text, style_tags, signature_score) VALUES (?, ?, ?, ?)",
        (sid, clean_text, tags_json, signature_score),

Copilot AI Feb 19, 2026


add_training_example writes clean_text directly to training_dataset without calling scrub_text. Despite the parameter name, this method is part of the persistence layer and currently allows unsanitized user text to be stored, which conflicts with the class docstring (“Scrub incoming text for PII before persisting”). Consider scrubbing here (or enforcing/validating that callers only pass scrubbed text, and renaming the parameter/method to make that contract explicit).

Suggested change
- conn = self._get_conn()
- try:
-     conn.execute(
-         "INSERT OR REPLACE INTO training_dataset (id, clean_text, style_tags, signature_score) VALUES (?, ?, ?, ?)",
-         (sid, clean_text, tags_json, signature_score),
+ # Defensive scrub at persistence boundary to honor PII policy.
+ scrubbed_text = scrub_text(clean_text)
+ conn = self._get_conn()
+ try:
+     conn.execute(
+         "INSERT OR REPLACE INTO training_dataset (id, clean_text, style_tags, signature_score) VALUES (?, ?, ?, ?)",
+         (sid, scrubbed_text, tags_json, signature_score),

Comment on lines +15 to +36
# Very small heuristic to catch Proper Names (first+last or 1-3 capitalized words).
NAME_RE = re.compile(r"\b([A-Z][a-z]{1,}\s){0,2}[A-Z][a-z]{1,}\b")


def scrub_text(text: str) -> str:
    """Redact simple PII patterns from `text`.

    This function intentionally keeps things conservative: remove emails,
    IP addresses, phone numbers and redact likely proper names. It is NOT
    a replacement for a full NER-based scrubber but satisfies the
    requirement that all text passes through a filter before being stored.
    """
    if not text:
        return text

    t = text
    t = EMAIL_RE.sub("<REDACTED_EMAIL>", t)
    t = IP_RE.sub("<REDACTED_IP>", t)
    t = PHONE_RE.sub("<REDACTED_PHONE>", t)
    # redact likely names (may over-redact; acceptable as privacy-first)
    t = NAME_RE.sub("<REDACTED_NAME>", t)
    return t

Copilot AI Feb 19, 2026


The NAME_RE heuristic will redact many non-PII capitalized words/phrases (e.g., it can match a single capitalized word like "Hello" or multi-word phrases like "Hello Alice" and redact the whole span). This will cause significant false positives and can degrade stored text quality. Consider tightening the heuristic (e.g., require at least two tokens for a name, add exclusions for common sentence starters, or apply name redaction only in more specific contexts).
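One possible tightening, as a sketch: require at least two consecutive capitalized tokens, then exempt a small allowlist of common sentence starters (the word list here is illustrative, not exhaustive):

```python
import re

# Two consecutive capitalized tokens minimum, so lone sentence-starters
# like "Hello" or "Today" are never matched on their own.
CANDIDATE_RE = re.compile(r"\b([A-Z][a-z]+)((?:\s+[A-Z][a-z]+){1,2})\b")
SENTENCE_STARTERS = {"Hello", "Hi", "Dear", "The", "This", "Please"}

def redact_names(text: str) -> str:
    def repl(m: re.Match) -> str:
        first = m.group(1)
        if first in SENTENCE_STARTERS:
            # keep the starter word, redact only the trailing tokens
            return first + " <REDACTED_NAME>"
        return "<REDACTED_NAME>"
    return CANDIDATE_RE.sub(repl, text)
```

This still over-redacts multi-word capitalized phrases that are not names; it only removes the worst single-word false positives.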

Collaborator


I feel the same; I've left a similar comment above that might help resolve this.

Comment on lines +150 to +152
# delete the oldest `excess` rows
conn.execute(
    "DELETE FROM training_dataset WHERE id IN (SELECT id FROM training_dataset ORDER BY created_at ASC LIMIT ?)",

Copilot AI Feb 19, 2026


Retention deletion orders only by created_at. Since CURRENT_TIMESTAMP is typically second-resolution in SQLite, multiple inserts in the same second can share the same created_at, making “oldest” selection nondeterministic. Add a stable tie-breaker in the subquery ordering (e.g., ORDER BY created_at ASC, id ASC) to ensure deterministic retention behavior.

Suggested change
- # delete the oldest `excess` rows
- conn.execute(
-     "DELETE FROM training_dataset WHERE id IN (SELECT id FROM training_dataset ORDER BY created_at ASC LIMIT ?)",
+ # delete the oldest `excess` rows, deterministically ordered
+ conn.execute(
+     "DELETE FROM training_dataset WHERE id IN (SELECT id FROM training_dataset ORDER BY created_at ASC, id ASC LIMIT ?)",

# Very small heuristic to catch Proper Names (first+last or 1-3 capitalized words).
NAME_RE = re.compile(r"\b([A-Z][a-z]{1,}\s){0,2}[A-Z][a-z]{1,}\b")


Collaborator


This name regex might also redact common capitalised words at the start of a sentence, which would degrade our training data. I think we can remove NAME_RE for now; in the next sprint we can integrate a proper library like Microsoft Presidio instead of hardcoded regexes.

from .sqlite_connection import init_db, DB_PATH


# Simple PII scrubber patterns. This is intentionally conservative and
Collaborator


Please consider moving these regexes into a separate file, even though they are temporary, to keep the code maintainable.



def scrub_text(text: str) -> str:
"""Redact simple PII patterns from `text`.
Collaborator


I see an architectural concern here: the scrubbing logic doesn't belong in the persistence layer. The TextRepository should be "dumb" and simply save what is given to it; the scrubber should be called before the repository sees the data.
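The layering proposed in this comment could look like the following sketch (class and method names are hypothetical, not the PR's actual API):

```python
class DumbRepository:
    """Persistence only: stores exactly what it is given."""
    def __init__(self):
        self.rows = []

    def insert(self, text: str) -> None:
        self.rows.append(text)  # no transformation at this layer

class TextService:
    """Owns the PII policy: scrubs before the repository sees the data."""
    def __init__(self, repo, scrubber):
        self.repo = repo
        self.scrubber = scrubber

    def record(self, raw_text: str) -> None:
        self.repo.insert(self.scrubber(raw_text))
```

The trade-off versus scrubbing inside the repository is that every caller must go through the service; a defensive scrub at the boundary (as Copilot suggests above) guards against callers that bypass it.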

# enforce retention: keep max 10000 rows
cur = conn.execute("SELECT COUNT(*) as c FROM training_dataset")
count = cur.fetchone()["c"]
max_rows = 10000
Collaborator


Instead of a row-count limit, we could consider a TTL (time-to-live): delete records older than a certain period (say, 30 days) rather than counting rows. That would scale better for storage.
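A TTL-based retention pass could be sketched like this (the table and column names follow the PR's schema; the helper itself is hypothetical):

```python
import sqlite3

def enforce_ttl(conn: sqlite3.Connection, days: int = 30) -> int:
    """Delete training rows older than `days` days; returns the number removed.

    Relies on created_at being stored by CURRENT_TIMESTAMP, whose text format
    compares correctly against SQLite's datetime('now', ...) output.
    """
    cur = conn.execute(
        "DELETE FROM training_dataset WHERE created_at < datetime('now', ?)",
        (f"-{days} days",),
    )
    conn.commit()
    return cur.rowcount
```

This could replace the COUNT-and-delete logic above, or run alongside it as a periodic maintenance step.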

Owner

@shknth shknth left a comment


As this is a feature, the PR title should start with 'feat', and the branch name should follow the pattern dev-PLAN-3 if your task identifier is PLAN-3. Please follow this protocol: GitHub is our online documentation, and consistent PR titles and branch names keep it organised. For more information on these rules, check the gitrules.md file.
