persistence: add TextRepository with PII scrubber and helpers #6
ranjithaks22 wants to merge 1 commit into dev from
Conversation
Pull request overview
Adds a new persistence-layer TextRepository that initializes required SQLite tables and provides CRUD helpers, with a regex-based scrub_text PII redaction step intended to run before storing user text.
Changes:
- Introduces `scrub_text` with regex redaction for emails, IPs, phone numbers, and "likely names".
- Adds `TextRepository` with helpers for `interaction_logs`, `training_dataset` (including 10k-row retention), `app_config`, and `model_versions`.
- Ensures new tables exist on startup and configures SQLite connections for WAL/NORMAL pragmas.
```python
def __init__(self):
    init_db()
    self.db_path = DB_PATH
    # ensure additional tables exist
    self._ensure_tables()
```
TextRepository.__init__ calls init_db() on every instantiation; init_db() currently prints to stdout and re-applies pragmas/table creation. If TextRepository is created frequently (e.g., per request), this can create noisy output and unnecessary work. Consider making initialization an explicit one-time startup step or ensuring init_db() uses structured logging and is only invoked once (e.g., via a module-level singleton/guard).
seems like a legit issue
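A minimal sketch of the one-time guard suggested above (the names `ensure_db_initialized` and `_db_initialized` are illustrative, not from the PR):

```python
import threading

_init_lock = threading.Lock()
_db_initialized = False  # module-level guard: init_db() runs at most once per process


def ensure_db_initialized(init_db) -> bool:
    """Run init_db() exactly once; return True only on the first call."""
    global _db_initialized
    with _init_lock:
        if _db_initialized:
            return False
        init_db()
        _db_initialized = True
        return True
```

With this in place, `TextRepository.__init__` would call `ensure_db_initialized(init_db)` and repeated instantiations become cheap no-ops.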
```python
clean = scrub_text(original_text or "")
conn = self._get_conn()
try:
    cur = conn.execute(
        """
        INSERT INTO interaction_logs (original_text, rephrased_text, mode, latency_ms)
        VALUES (?, ?, ?, ?)
        """,
        (clean, rephrased_text, mode, latency_ms),
    )
```
insert_interaction scrubs original_text but persists rephrased_text without any PII redaction. If rephrased_text can contain user-provided content or model echoes, this defeats the “all text passes through a filter before being stored” requirement. Scrub rephrased_text as well (or make it explicit/guaranteed that it is already scrubbed before calling this method).
```python
conn = self._get_conn()
try:
    conn.execute(
        "INSERT OR REPLACE INTO training_dataset (id, clean_text, style_tags, signature_score) VALUES (?, ?, ?, ?)",
        (sid, clean_text, tags_json, signature_score),
```
add_training_example writes clean_text directly to training_dataset without calling scrub_text. Despite the parameter name, this method is part of the persistence layer and currently allows unsanitized user text to be stored, which conflicts with the class docstring (“Scrub incoming text for PII before persisting”). Consider scrubbing here (or enforcing/validating that callers only pass scrubbed text, and renaming the parameter/method to make that contract explicit).
Suggested change:

```diff
+ # Defensive scrub at persistence boundary to honor PII policy.
+ scrubbed_text = scrub_text(clean_text)
  conn = self._get_conn()
  try:
      conn.execute(
          "INSERT OR REPLACE INTO training_dataset (id, clean_text, style_tags, signature_score) VALUES (?, ?, ?, ?)",
-         (sid, clean_text, tags_json, signature_score),
+         (sid, scrubbed_text, tags_json, signature_score),
```
```python
# Very small heuristic to catch Proper Names (first+last or 1-3 capitalized words).
NAME_RE = re.compile(r"\b([A-Z][a-z]{1,}\s){0,2}[A-Z][a-z]{1,}\b")


def scrub_text(text: str) -> str:
    """Redact simple PII patterns from `text`.

    This function intentionally keeps things conservative: remove emails,
    IP addresses, phone numbers and redact likely proper names. It is NOT
    a replacement for a full NER-based scrubber but satisfies the
    requirement that all text passes through a filter before being stored.
    """
    if not text:
        return text

    t = text
    t = EMAIL_RE.sub("<REDACTED_EMAIL>", t)
    t = IP_RE.sub("<REDACTED_IP>", t)
    t = PHONE_RE.sub("<REDACTED_PHONE>", t)
    # redact likely names (may over-redact; acceptable as privacy-first)
    t = NAME_RE.sub("<REDACTED_NAME>", t)
    return t
```
The NAME_RE heuristic will redact many non-PII capitalized words/phrases (e.g., it can match a single capitalized word like "Hello" or multi-word phrases like "Hello Alice" and redact the whole span). This will cause significant false positives and can degrade stored text quality. Consider tightening the heuristic (e.g., require at least two tokens for a name, add exclusions for common sentence starters, or apply name redaction only in more specific contexts).
I feel the same; I've left a similar comment above that might help resolve this.
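The false positive and one possible tightening are easy to demonstrate. `NAME2_RE` below is an illustrative variant (not from the PR) that requires at least two capitalized tokens, so lone sentence-starters like "Hello" survive:

```python
import re

# PR's heuristic: also matches a single capitalized word, e.g. "Hello"
NAME_RE = re.compile(r"\b([A-Z][a-z]{1,}\s){0,2}[A-Z][a-z]{1,}\b")

# Tightened variant (illustrative): require at least TWO capitalized tokens
NAME2_RE = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+){1,2}\b")
```

Note the tightened regex still redacts capitalized two-word sentence openers ("Hello Alice"), so an exclusion list for common sentence starters would still be worthwhile.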
```python
# delete the oldest `excess` rows
conn.execute(
    "DELETE FROM training_dataset WHERE id IN (SELECT id FROM training_dataset ORDER BY created_at ASC LIMIT ?)",
```
Retention deletion orders only by created_at. Since CURRENT_TIMESTAMP is typically second-resolution in SQLite, multiple inserts in the same second can share the same created_at, making “oldest” selection nondeterministic. Add a stable tie-breaker in the subquery ordering (e.g., ORDER BY created_at ASC, id ASC) to ensure deterministic retention behavior.
Suggested change:

```diff
- # delete the oldest `excess` rows
- conn.execute(
-     "DELETE FROM training_dataset WHERE id IN (SELECT id FROM training_dataset ORDER BY created_at ASC LIMIT ?)",
+ # delete the oldest `excess` rows, deterministically ordered
+ conn.execute(
+     "DELETE FROM training_dataset WHERE id IN (SELECT id FROM training_dataset ORDER BY created_at ASC, id ASC LIMIT ?)",
```
```python
# Very small heuristic to catch Proper Names (first+last or 1-3 capitalized words).
NAME_RE = re.compile(r"\b([A-Z][a-z]{1,}\s){0,2}[A-Z][a-z]{1,}\b")
```
This name regex will also redact common capitalised words at the start of a sentence, which would degrade our training data. I think we can remove NAME_RE for now; in the next sprint we can integrate a proper library like Microsoft Presidio instead of a hardcoded regex.
```python
from .sqlite_connection import init_db, DB_PATH


# Simple PII scrubber patterns. This is intentionally conservative and
```
Please consider moving these regexes into a separate file, even though they are temporary, to keep the code maintainable.
```python
def scrub_text(text: str) -> str:
    """Redact simple PII patterns from `text`.
```
I see an architectural concern here: the scrubbing logic doesn't belong in the persistence layer. The TextRepository should be "dumb" and simply save what it is given.
The scrubber should be called before the repository sees the data.
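A minimal sketch of that separation, with a thin service owning the PII filter. The `InteractionService` name and the in-memory stand-in for SQLite are illustrative, not from the PR:

```python
class TextRepository:
    """Dumb persistence: stores exactly what the caller provides."""

    def __init__(self):
        self.rows = []  # in-memory stand-in for SQLite, for illustration

    def insert_interaction(self, original_text: str, rephrased_text: str) -> None:
        # no scrubbing here; sanitization is the caller's responsibility
        self.rows.append((original_text, rephrased_text))


class InteractionService:
    """Applies the PII filter before anything reaches the persistence layer."""

    def __init__(self, repo: TextRepository, scrubber) -> None:
        self.repo = repo
        self.scrub = scrubber

    def log_interaction(self, original: str, rephrased: str) -> None:
        self.repo.insert_interaction(self.scrub(original), self.scrub(rephrased))
```

This keeps the "all text is filtered before storage" guarantee in one place and lets the repository be tested without any redaction logic.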
```python
# enforce retention: keep max 10000 rows
cur = conn.execute("SELECT COUNT(*) as c FROM training_dataset")
count = cur.fetchone()["c"]
max_rows = 10000
```
Instead of a row-count limit, we could consider a TTL (time-to-live): delete records older than a certain period (say, 30 days) rather than counting rows. That would scale better for storage.
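A TTL-based purge could look roughly like this, assuming the table's `created_at` column defaults to `CURRENT_TIMESTAMP` as in the PR; the `purge_expired` helper name and 30-day window are illustrative:

```python
import sqlite3

TTL_DAYS = 30  # illustrative retention window


def purge_expired(conn: sqlite3.Connection, ttl_days: int = TTL_DAYS) -> int:
    """Delete training rows older than ttl_days; return the number removed."""
    cur = conn.execute(
        "DELETE FROM training_dataset WHERE created_at < datetime('now', ?)",
        (f"-{ttl_days} days",),
    )
    return cur.rowcount
```

This also avoids the COUNT-then-DELETE race of the row-cap approach, since the purge is a single statement.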
shknth left a comment
As this is a feature, the PR title should start with 'feat', and the branch name should follow the pattern dev-PLAN-3 if your task identifier is PLAN-3. Please follow this protocol: GitHub is our online documentation, and consistent PR titles and branch names keep it organised. For more information on these rules, check the gitrules.md file.
Summary

Implement `TextRepository` class for the persistence layer with integrated PII scrubbing.

Changes

- PII scrubber (`scrub_text`): regex-based redaction for emails, IPs, phone numbers, and likely proper names before DB insert.
- Creates `training_dataset`, `app_config`, `model_versions` tables; reuses `interaction_logs` from `sqlite_connection.py`.
- Retention cap on `training_dataset` by deleting the oldest entries.

Testing

Smoke test passed: storage/user_data.db

Next Steps