persistence: add TextRepository with PII scrubber and helpers #6
ranjithaks22 wants to merge 1 commit into dev from
Conversation
Pull request overview
Adds a new persistence-layer TextRepository that initializes required SQLite tables and provides CRUD helpers, with a regex-based scrub_text PII redaction step intended to run before storing user text.
Changes:
- Introduces `scrub_text` with regex redaction for emails, IPs, phone numbers, and "likely names".
- Adds `TextRepository` with helpers for `interaction_logs`, `training_dataset` (including 10k-row retention), `app_config`, and `model_versions`.
- Ensures new tables exist on startup and configures SQLite connections for WAL/NORMAL pragmas.
```python
def __init__(self):
    init_db()
    self.db_path = DB_PATH
    # ensure additional tables exist
    self._ensure_tables()
```
TextRepository.__init__ calls init_db() on every instantiation; init_db() currently prints to stdout and re-applies pragmas/table creation. If TextRepository is created frequently (e.g., per request), this can create noisy output and unnecessary work. Consider making initialization an explicit one-time startup step or ensuring init_db() uses structured logging and is only invoked once (e.g., via a module-level singleton/guard).
seems like a legit issue
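A minimal sketch of the one-time guard suggested above (the names `ensure_db_initialized` and `_db_initialized` are illustrative, not from the PR):

```python
import threading

_init_lock = threading.Lock()
_db_initialized = False  # module-level guard: init_db() runs at most once per process


def ensure_db_initialized(init_db) -> bool:
    """Run init_db() exactly once; return True only on the first call."""
    global _db_initialized
    with _init_lock:
        if _db_initialized:
            return False
        init_db()
        _db_initialized = True
        return True
```

With this in place, `TextRepository.__init__` would call `ensure_db_initialized(init_db)` and repeated instantiations become cheap no-ops.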
```python
clean = scrub_text(original_text or "")
conn = self._get_conn()
try:
    cur = conn.execute(
        """
        INSERT INTO interaction_logs (original_text, rephrased_text, mode, latency_ms)
        VALUES (?, ?, ?, ?)
        """,
        (clean, rephrased_text, mode, latency_ms),
    )
```
insert_interaction scrubs original_text but persists rephrased_text without any PII redaction. If rephrased_text can contain user-provided content or model echoes, this defeats the “all text passes through a filter before being stored” requirement. Scrub rephrased_text as well (or make it explicit/guaranteed that it is already scrubbed before calling this method).
```python
conn = self._get_conn()
try:
    conn.execute(
        "INSERT OR REPLACE INTO training_dataset (id, clean_text, style_tags, signature_score) VALUES (?, ?, ?, ?)",
        (sid, clean_text, tags_json, signature_score),
```
add_training_example writes clean_text directly to training_dataset without calling scrub_text. Despite the parameter name, this method is part of the persistence layer and currently allows unsanitized user text to be stored, which conflicts with the class docstring (“Scrub incoming text for PII before persisting”). Consider scrubbing here (or enforcing/validating that callers only pass scrubbed text, and renaming the parameter/method to make that contract explicit).
Suggested change:

```diff
+ # Defensive scrub at persistence boundary to honor PII policy.
+ scrubbed_text = scrub_text(clean_text)
  conn = self._get_conn()
  try:
      conn.execute(
          "INSERT OR REPLACE INTO training_dataset (id, clean_text, style_tags, signature_score) VALUES (?, ?, ?, ?)",
-         (sid, clean_text, tags_json, signature_score),
+         (sid, scrubbed_text, tags_json, signature_score),
```
```python
# Very small heuristic to catch Proper Names (first+last or 1-3 capitalized words).
NAME_RE = re.compile(r"\b([A-Z][a-z]{1,}\s){0,2}[A-Z][a-z]{1,}\b")


def scrub_text(text: str) -> str:
    """Redact simple PII patterns from `text`.

    This function intentionally keeps things conservative: remove emails,
    IP addresses, phone numbers and redact likely proper names. It is NOT
    a replacement for a full NER-based scrubber but satisfies the
    requirement that all text passes through a filter before being stored.
    """
    if not text:
        return text

    t = text
    t = EMAIL_RE.sub("<REDACTED_EMAIL>", t)
    t = IP_RE.sub("<REDACTED_IP>", t)
    t = PHONE_RE.sub("<REDACTED_PHONE>", t)
    # redact likely names (may over-redact; acceptable as privacy-first)
    t = NAME_RE.sub("<REDACTED_NAME>", t)
    return t
```
The NAME_RE heuristic will redact many non-PII capitalized words/phrases (e.g., it can match a single capitalized word like "Hello" or multi-word phrases like "Hello Alice" and redact the whole span). This will cause significant false positives and can degrade stored text quality. Consider tightening the heuristic (e.g., require at least two tokens for a name, add exclusions for common sentence starters, or apply name redaction only in more specific contexts).
I feel the same; I've left a similar comment above that might help resolve this.
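The false positive and one possible tightening are easy to demonstrate. `NAME2_RE` below is an illustrative variant (not from the PR) that requires at least two capitalized tokens, so lone sentence-starters like "Hello" survive:

```python
import re

# PR's heuristic: also matches a single capitalized word, e.g. "Hello"
NAME_RE = re.compile(r"\b([A-Z][a-z]{1,}\s){0,2}[A-Z][a-z]{1,}\b")

# Tightened variant (illustrative): require at least TWO capitalized tokens
NAME2_RE = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+){1,2}\b")
```

Note the tightened regex still redacts capitalized two-word sentence openers ("Hello Alice"), so an exclusion list for common sentence starters would still be worthwhile.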
```python
# delete the oldest `excess` rows
conn.execute(
    "DELETE FROM training_dataset WHERE id IN (SELECT id FROM training_dataset ORDER BY created_at ASC LIMIT ?)",
```
Retention deletion orders only by created_at. Since CURRENT_TIMESTAMP is typically second-resolution in SQLite, multiple inserts in the same second can share the same created_at, making “oldest” selection nondeterministic. Add a stable tie-breaker in the subquery ordering (e.g., ORDER BY created_at ASC, id ASC) to ensure deterministic retention behavior.
Suggested change:

```diff
- # delete the oldest `excess` rows
- conn.execute(
-     "DELETE FROM training_dataset WHERE id IN (SELECT id FROM training_dataset ORDER BY created_at ASC LIMIT ?)",
+ # delete the oldest `excess` rows, deterministically ordered
+ conn.execute(
+     "DELETE FROM training_dataset WHERE id IN (SELECT id FROM training_dataset ORDER BY created_at ASC, id ASC LIMIT ?)",
```
```python
# Very small heuristic to catch Proper Names (first+last or 1-3 capitalized words).
NAME_RE = re.compile(r"\b([A-Z][a-z]{1,}\s){0,2}[A-Z][a-z]{1,}\b")
```
This name regex will also redact common capitalised words at the start of a sentence, which would degrade our training data. I think we can remove NAME_RE for now; in the next sprint we can integrate a proper library like Microsoft Presidio instead of a hardcoded regex.
```python
from .sqlite_connection import init_db, DB_PATH


# Simple PII scrubber patterns. This is intentionally conservative and
```
Please consider moving these regexes into a separate file, even though they are temporary, to keep the code maintainable.
```python
def scrub_text(text: str) -> str:
    """Redact simple PII patterns from `text`.
```
I see an architectural concern here: the scrubbing logic doesn't belong in the persistence layer. The TextRepository should be "dumb" and simply save what it is given.
The scrubber should be called before the repository sees the data.
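A minimal sketch of that separation, with a thin service owning the PII filter. The `InteractionService` name and the in-memory stand-in for SQLite are illustrative, not from the PR:

```python
class TextRepository:
    """Dumb persistence: stores exactly what the caller provides."""

    def __init__(self):
        self.rows = []  # in-memory stand-in for SQLite, for illustration

    def insert_interaction(self, original_text: str, rephrased_text: str) -> None:
        # no scrubbing here; sanitization is the caller's responsibility
        self.rows.append((original_text, rephrased_text))


class InteractionService:
    """Applies the PII filter before anything reaches the persistence layer."""

    def __init__(self, repo: TextRepository, scrubber) -> None:
        self.repo = repo
        self.scrub = scrubber

    def log_interaction(self, original: str, rephrased: str) -> None:
        self.repo.insert_interaction(self.scrub(original), self.scrub(rephrased))
```

This keeps the "all text is filtered before storage" guarantee in one place and lets the repository be tested without any redaction logic.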
```python
# enforce retention: keep max 10000 rows
cur = conn.execute("SELECT COUNT(*) as c FROM training_dataset")
count = cur.fetchone()["c"]
max_rows = 10000
```
Instead of a row-count limit, we could consider a TTL (time-to-live): delete records older than a certain period (say, 30 days) rather than counting rows. That would scale better for storage.
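A TTL-based purge could look roughly like this, assuming the table's `created_at` column defaults to `CURRENT_TIMESTAMP` as in the PR; the `purge_expired` helper name and 30-day window are illustrative:

```python
import sqlite3

TTL_DAYS = 30  # illustrative retention window


def purge_expired(conn: sqlite3.Connection, ttl_days: int = TTL_DAYS) -> int:
    """Delete training rows older than ttl_days; return the number removed."""
    cur = conn.execute(
        "DELETE FROM training_dataset WHERE created_at < datetime('now', ?)",
        (f"-{ttl_days} days",),
    )
    return cur.rowcount
```

This also avoids the COUNT-then-DELETE race of the row-cap approach, since the purge is a single statement.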
shknth left a comment
As this is a feature, the PR title should start with 'feat', and the branch name should follow the pattern dev-PLAN-3 if your task identifier is PLAN-3. Please follow this protocol: GitHub is our online documentation, and consistent PR titles and branch names keep it organised. For more information on these rules, check the gitrules.md file.
Summary

Implement `TextRepository` class for the persistence layer with integrated PII scrubbing.

Changes

- PII scrubber (`scrub_text`): regex-based redaction for emails, IPs, phone numbers, and likely proper names before DB insert.
- Creates `training_dataset`, `app_config`, `model_versions` tables; reuses `interaction_logs` from `sqlite_connection.py`.
- Retention cap on `training_dataset` by deleting the oldest entries.

Testing

Smoke test passed: storage/user_data.db

Next Steps