
Conversation

@tu2463
Contributor

@tu2463 tu2463 commented Jan 31, 2026

Description

This PR adds a scraping pipeline to collect floor plan SVGs from FMSystems.

It adds a Selenium-based web driver that opens FMSystems and programmatically clicks through each floor of the specified buildings. This step ensures that the required cookies are initialized for every floor, which is necessary for successfully retrieving the corresponding SVG data.

Once the cookies are set, a separate script uses the request headers and cookies from your browser to fetch and store the SVG files.

It also adds scripts that upload to and download from S3.

Changes

driver.py: A Selenium web driver that opens FMSystems and clicks through all floors for the selected buildings to initialize cookies.
get_all_svg.py: Iterates over buildings defined in buildings.json, retrieves their floor plan SVGs, and saves them locally.
upload_svg_to_s3.py: Uploads the floorplan_svg folder with all of its contents to S3.
download_svg_from_s3.py: Downloads the floorplan_svg folder from S3.
s3_utils.py: Copied from the CMU Maps data acquisitor repo.

Notes

The following floors cannot be successfully scraped because they do not have a "draw file":

  • Hall of Arts: 4th floor
  • Posner Hall: 1M

The following buildings are not found in FM Systems:

  • forbes-ave-4612
  • sh (possibly Scaife Hall, but the floor ids of the "Alan Magee Scaife Hall of Engineering" on FMS are different from the ones we scraped)
  • sterling-plaza
  • wf (unknown building)

How to Test

  1. Download buildings.json and place it in the fms folder
  2. In driver.py, ensure the building_for_process array contains the labels of all buildings you want to scrape.
  3. Run driver.py.
  4. A Chrome window will open with the FMSystems login page. Log in using your Andrew ID.
  5. Once the FMSystems main page loads, press Enter in the terminal.
  6. Wait for the script to click through all floors (~10 seconds per floor).

After all clicks finish:
  7. Open Chrome DevTools → Network panel.
  8. Manually click any floor.
  9. Locate the getDefaultLayersData request.
  10. Copy its cURL command.
  11. Convert the cURL command using a cURL-to-code converter.
  12. Copy the resulting cookies and headers into get_all_svg.py (see the sketch after this list).
  13. Run get_all_svg.py.
  14. Confirm that the SVG files are saved in the floorplan_svg/ directory.
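For reference, a minimal sketch of what the converter output pasted into get_all_svg.py might look like; every value below is a placeholder, not a real session token:

# Hypothetical shape of cURL-converter output; paste your own browser's values.
cookies = {
    "ASP.NET_SessionId": "<session-id-from-your-browser>",
    ".ASPXAUTH": "<auth-cookie-from-your-browser>",
    "AuthenToken": "<token-from-your-browser>",
}

headers = {
    "accept": "*/*",
    "referer": "https://fmsystems.cmu.edu/FMInteract/DrawingView.aspx",
    "user-agent": "<your-browser-user-agent>",
    "x-requested-with": "XMLHttpRequest",
}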

Summary by CodeRabbit

  • New Features
    • Added end-to-end SVG workflows: bulk download from storage, bulk upload to storage, and a CLI to fetch campus floor-plan SVGs with retries and configurable options.
    • Added browser automation to navigate the FMSystems UI for building/floor extraction.
    • Added storage utilities for uploading/downloading files and JSON, with console progress and summary reporting.

✏️ Tip: You can customize this high-level summary in your review settings.

@vercel

vercel bot commented Jan 31, 2026

@tu2463 is attempting to deploy a commit to the ScottyLabs Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai bot commented Jan 31, 2026

📝 Walkthrough


Adds a new FMS scraping toolset: Selenium UI driver, HTTP SVG fetcher with retry/backoff, MinIO/S3 utility, and scripts to upload/download floor-plan SVGs between local storage and the cmumaps S3 bucket.

Changes

  • S3 utilities & upload/download scripts (apps/dataflow/src/scraper/fms/s3_utils.py, apps/dataflow/src/scraper/fms/upload_svg_to_s3.py, apps/dataflow/src/scraper/fms/download_svg_from_s3.py): New MinIO/S3 client wrapper (env-configured for bucket cmumaps) with JSON and generic file upload/download, listing, and helpers; scripts to upload local SVGs from floorplan_svg and to download .svg objects from S3 prefix floorplan_svg/ to a local folder.
  • SVG fetching (HTTP) and CLI (apps/dataflow/src/scraper/fms/get_all_svg.py): New CLI that reads buildings.json, slugifies building names, builds CMU API requests per floor, fetches SVGs with retry/exponential backoff and configurable SSL/timeouts, and saves SVGs into slugified directories.
  • Selenium driver / UI automation (apps/dataflow/src/scraper/fms/driver.py): New Selenium-based Chrome WebDriver flow that navigates to LANDING_PAGE_URL, waits for manual login, expands Sites/Main Campus and buildings, opens floors (new tabs), handles modal alerts, and exposes helpers (click_building_and_all_floors, js_click, open_sites_and_expand_main_campus, expand_building).

Sequence Diagram(s)

sequenceDiagram
  participant CLI as "get_all_svg CLI"
  participant CMU as "CMU SVG API"
  participant FS as "Local filesystem"
  participant S3 as "MinIO / S3 (cmumaps)"
  CLI->>CMU: request floorplan SVG (with retries/backoff)
  CMU-->>CLI: SVG response or error
  CLI->>FS: save SVG to slugified path
  FS-->>CLI: write confirmation
  CLI->>S3: optional upload via s3_utils.upload_generic_file
  S3-->>CLI: upload success/failure

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇
I hopped through tabs and API streams,
Collected floors and SVG dreams,
Pushed the files where buckets gleam,
Pulled them back to sight and scheme,
A little rabbit in every beam.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main changes: introducing a web driver for Selenium automation and utilities for scraping and uploading FMS floorplan SVGs to S3.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

🤖 Fix all issues with AI agents
In `@apps/dataflow/src/scraper/fms/download_svg_from_s3.py`:
- Around line 3-34: The download_all_svg_files function currently maps S3 keys
directly to local_file_path which allows path traversal via "../"; normalize and
validate the mapped path before creating dirs or writing files by resolving the
destination root (LOCAL_DESTINATION) to an absolute/resolved path and resolving
the candidate path (join LOCAL_DESTINATION with relative_path then resolve),
then ensure the resolved candidate path starts with the resolved destination
root; if it does not, skip the object (or sanitize the name) and increment
fail_count; apply this check where obj.object_name -> relative_path ->
local_file_path is computed and before os.makedirs or any file write.

In `@apps/dataflow/src/scraper/fms/driver.py`:
- Around line 74-93: The code assumes window.open succeeded and immediately uses
driver.window_handles[-1] which can equal the main handle if popups are blocked,
causing driver.close() to kill the session; before switching or closing, capture
the current handles (e.g., home_window = driver.current_window_handle and
orig_handles = driver.window_handles), then after execute_script use a short
WebDriverWait loop (or poll) to ensure a new handle appears
(len(driver.window_handles) > len(orig_handles)); if a new handle never appears,
skip switching/alert handling/closing for this iteration and continue, otherwise
switch to the newly added handle, perform the alert check, close that new
handle, and switch back to home_window (use unique symbols like
driver.execute_script, driver.window_handles, driver.switch_to.window,
driver.close, WebDriverWait/TimeoutException to locate where to update).

In `@apps/dataflow/src/scraper/fms/get_all_svg.py`:
- Around line 28-75: The file currently exposes real auth data in the cookies
and headers dicts; remove the hard‑coded cookies and headers (the cookies and
headers variables) and instead load them at runtime from a local secrets source
(e.g., environment variables or a JSON file read by a helper like
load_secrets()) and fail fast if missing; ensure the secret file name is added
to .gitignore and update any README/dev docs to instruct how to create the local
secrets file or set env vars; after updating remove the sensitive values from
history and coordinate rotation/revocation of the exposed tokens.
- Around line 96-145: Change SSL verification to be enabled by default: in
fetch_svg() set the parameter verify_ssl: bool = True (instead of False). In
main(), replace the CLI flag --verify-ssl with an opt-out --insecure flag
(action="store_true") and update the call that invokes fetch_svg(...) to pass
verify_ssl=not args.insecure so verification is enabled unless --insecure is
provided; ensure references to verify_ssl in requests.get(...,
verify=verify_ssl) remain consistent.
- Around line 115-126: The response is written to out_file as an .svg without
validating that it is actually SVG; update the logic in the function using resp
(the requests.get block) to verify the payload before calling
out_file.write_text: check resp.headers.get("Content-Type") for an SVG MIME type
(e.g., contains "image/svg+xml") or ensure resp.text.lstrip().startswith("<svg")
and only then create the parent dir and write the file, otherwise treat it as an
error/expired-session case (return False or raise) so stale HTML/login pages are
not saved as .svg.

In `@apps/dataflow/src/scraper/fms/s3_utils.py`:
- Around line 1-5: The code that writes a downloaded S3 object to local JSON
must ensure the parent directory exists before calling open/write; update the
function that saves the S3 object (the routine that takes s3_object_name and
writes to local_path / local_filename—e.g., the save/download function around
lines 112-118) to call os.path.dirname(local_path) and, if non-empty, create it
with os.makedirs(..., exist_ok=True) prior to opening the file; keep the
existing write logic and handle the case where s3_object_name has no path
component by skipping directory creation.
- Around line 68-88: The upload_generic_file function currently returns True on
success but returns None on exception; update the except block in
upload_generic_file to explicitly return False (and keep the existing print of
the exception) so the function consistently returns a boolean as documented;
reference the upload_generic_file function and the except Exception as e block
to locate where to add the return False.
- Around line 201-221: The docstring in get_generic_file_from_s3 has misaligned
continuation lines; fix the indentation so each wrapped line aligns to a
multiple-of-2 spaces under the docstring start (e.g., align the "return_data"
continuation line with two-space increments to match the surrounding docstring
block), ensuring the triple-quoted description and Args/Returns sections use
consistent 2-space indentation to satisfy editorconfig.
- Around line 8-17: The code currently constructs a Minio client without
checking s3_endpoint; before calling Minio(...) validate that s3_endpoint (the
S3_ENDPOINT env var) is non-empty and raise or log a clear error (e.g.,
ValueError with a message like "S3_ENDPOINT must be set") so initialization of
client = Minio(...) only occurs when s3_endpoint is present; reference the
variables access_key, secret_key, s3_endpoint and the Minio client creation in
your change.

In `@apps/dataflow/src/scraper/fms/upload_svg_to_s3.py`:
- Around line 17-23: The code calls os.listdir(LOCAL_SVG_FOLDER) which will
raise if LOCAL_SVG_FOLDER doesn't exist; before the loop check that
LOCAL_SVG_FOLDER exists and is a directory (using os.path.exists or
os.path.isdir) and if not, fail fast by raising a clear exception or logging an
error and exiting; update the block surrounding the os.listdir call (the loop
that iterates over building in os.listdir(LOCAL_SVG_FOLDER)) to perform this
pre-check and surface a descriptive message including LOCAL_SVG_FOLDER so
downstream callers know why processing stopped.
🧹 Nitpick comments (1)
apps/dataflow/src/scraper/fms/driver.py (1)

97-129: Pass driver into js_click to avoid hidden global state.
This makes the helper reusable/testable and avoids accidental cross‑driver usage.

♻️ Refactor to make `driver` explicit
-def js_click(el):
-    driver.execute_script("arguments[0].scrollIntoView({block:'center'});", el)
-    driver.execute_script("arguments[0].click();", el)
+def js_click(driver, el):
+    driver.execute_script("arguments[0].scrollIntoView({block:'center'});", el)
+    driver.execute_script("arguments[0].click();", el)
@@
-        js_click(campus_plus[0])
+        js_click(driver, campus_plus[0])
@@
-    js_click(campus_plus)
+    js_click(driver, campus_plus)

Comment on lines 3 to 34
import os
from s3_utils import client, bucket_name

# S3 source path
S3_SOURCE = "floorplan_svg/"
# Local destination folder
LOCAL_DESTINATION = "floorplan_svg_downloaded"


def download_all_svg_files():
    """Download all SVG files from S3 floorplan_svg folder"""
    success_count = 0
    fail_count = 0

    # Create local destination folder if it doesn't exist
    os.makedirs(LOCAL_DESTINATION, exist_ok=True)

    # List all objects in the floorplan_svg folder
    objects = client.list_objects(bucket_name, prefix=S3_SOURCE, recursive=True)

    for obj in objects:
        # Skip if not an SVG file
        if not obj.object_name.endswith(".svg"):
            continue

        # Extract relative path (e.g., "floorplan_svg/building/floor.svg" -> "building/floor.svg")
        relative_path = obj.object_name[len(S3_SOURCE):]
        local_file_path = os.path.join(LOCAL_DESTINATION, relative_path)

        # Create building subfolder if needed
        os.makedirs(os.path.dirname(local_file_path), exist_ok=True)


⚠️ Potential issue | 🟠 Major

Prevent path traversal when mapping S3 keys to local files.
S3 object names can contain ../; without normalization, this can write outside LOCAL_DESTINATION.

🔐 Constrain writes to the destination directory
-import os
+import os
+from pathlib import Path
@@
-    for obj in objects:
+    base_dir = Path(LOCAL_DESTINATION).resolve()
+    for obj in objects:
@@
-        relative_path = obj.object_name[len(S3_SOURCE):]
-        local_file_path = os.path.join(LOCAL_DESTINATION, relative_path)
+        relative_path = obj.object_name[len(S3_SOURCE):].lstrip("/")
+        local_file_path = (base_dir / relative_path).resolve()
+        try:
+            local_file_path.relative_to(base_dir)
+        except ValueError:
+            print(f"[warn] Skipping suspicious key: {obj.object_name}")
+            continue
@@
-        os.makedirs(os.path.dirname(local_file_path), exist_ok=True)
+        local_file_path.parent.mkdir(parents=True, exist_ok=True)
@@
-            client.fget_object(bucket_name, obj.object_name, local_file_path)
+            client.fget_object(bucket_name, obj.object_name, str(local_file_path))
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 13-13: First line should end with a period

Add period

(D400)


[warning] 13-13: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)


[warning] 18-18: os.makedirs() should be replaced by Path.mkdir(parents=True)

(PTH103)


[warning] 30-30: os.path.join() should be replaced by Path with / operator

(PTH118)


[warning] 33-33: os.makedirs() should be replaced by Path.mkdir(parents=True)

(PTH103)


[warning] 33-33: os.path.dirname() should be replaced by Path.parent

(PTH120)

🤖 Prompt for AI Agents
In `@apps/dataflow/src/scraper/fms/download_svg_from_s3.py` around lines 3 - 34,
The download_all_svg_files function currently maps S3 keys directly to
local_file_path which allows path traversal via "../"; normalize and validate
the mapped path before creating dirs or writing files by resolving the
destination root (LOCAL_DESTINATION) to an absolute/resolved path and resolving
the candidate path (join LOCAL_DESTINATION with relative_path then resolve),
then ensure the resolved candidate path starts with the resolved destination
root; if it does not, skip the object (or sanitize the name) and increment
fail_count; apply this check where obj.object_name -> relative_path ->
local_file_path is computed and before os.makedirs or any file write.

Comment on lines 74 to 93
print(f"Opening floor {label} ({idx+1}/{len(floor_links)}) -> {href}")
# driver.get(href) # open url in same tab
# open floor in new tab
driver.execute_script("window.open(arguments[0], '_blank');", href)
driver.switch_to.window(driver.window_handles[-1]) # Note: assumes new tab is last. Could be wrong if popups occur.

# Handle possible modal alert: "Drawing file is not available!"
try:
WebDriverWait(driver, 1).until(EC.alert_is_present())
alert = driver.switch_to.alert
msg = alert.text
alert.accept()
print(f"Skipped floor {label or '(no label)'} due to alert: {msg}")
except TimeoutException:
# No alert -> give the site a moment to “warm” server-side state
time.sleep(0.5)

# close tab and return to main page
driver.close()
driver.switch_to.window(home_window)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n apps/dataflow/src/scraper/fms/driver.py | sed -n '70,100p'

Repository: ScottyLabs/cmumaps

Length of output: 1613


🏁 Script executed:

cat -n apps/dataflow/src/scraper/fms/driver.py | sed -n '50,95p'

Repository: ScottyLabs/cmumaps

Length of output: 2234


🏁 Script executed:

python3 << 'EOF'
# Research on Selenium window handle behavior
info = """
Selenium window.open() behavior:

1. execute_script() with window.open() is a JavaScript call that:
   - Returns immediately (async operation)
   - Can fail silently if blocked by browser security policies
   - Does not raise an exception if opening fails
   - Only adds to window_handles if actually created

2. window_handles list:
   - Contains all open windows/tabs at that moment
   - Does not automatically refresh
   - Must be accessed fresh each time to get updated list
   - window_handles[-1] assumes the last element is the new tab

3. Common failure scenarios:
   - Content Security Policy (CSP) blocking window.open
   - Browser security policies on certain domains
   - Target origin security restrictions
   - WebDriver security sandboxing

4. The vulnerability:
   - If window.open() fails silently, window_handles unchanged
   - window_handles[-1] still references main window
   - driver.close() closes main window instead of new tab
   - Next iteration attempts to interact with closed window (error)

5. The proposed fix correctly:
   - Captures handles_before as set()
   - Waits for new handle to appear
   - Raises TimeoutException if no new handle appears
   - Uses identified new_handles to switch to correct window
   - Skips close when no new window was created
"""
print(info)
EOF

Repository: ScottyLabs/cmumaps

Length of output: 1352


Guard against missing new‑tab handles before closing.
If window.open() fails or is blocked by browser security policies, window_handles[-1] can still reference the main tab and driver.close() will terminate the session. Wait for a new handle to appear and skip the close operation if none materializes.

🔧 Proposed guard for new‑tab handle creation
-        driver.execute_script("window.open(arguments[0], '_blank');", href)
-        driver.switch_to.window(driver.window_handles[-1]) # Note: assumes new tab is last. Could be wrong if popups occur.
+        handles_before = set(driver.window_handles)
+        driver.execute_script("window.open(arguments[0], '_blank');", href)
+        try:
+            WebDriverWait(driver, 5).until(lambda d: len(d.window_handles) > len(handles_before))
+        except TimeoutException:
+            print(f"[warn] No new tab opened for {label or '(no label)'}; skipping.")
+            continue
+        new_handles = [h for h in driver.window_handles if h not in handles_before]
+        driver.switch_to.window(new_handles[-1])
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 74-74: print found

Remove print

(T201)


[error] 75-75: Found commented-out code

Remove commented-out code

(ERA001)


[warning] 86-86: print found

Remove print

(T201)

🤖 Prompt for AI Agents
In `@apps/dataflow/src/scraper/fms/driver.py` around lines 74 - 93, The code
assumes window.open succeeded and immediately uses driver.window_handles[-1]
which can equal the main handle if popups are blocked, causing driver.close() to
kill the session; before switching or closing, capture the current handles
(e.g., home_window = driver.current_window_handle and orig_handles =
driver.window_handles), then after execute_script use a short WebDriverWait loop
(or poll) to ensure a new handle appears (len(driver.window_handles) >
len(orig_handles)); if a new handle never appears, skip switching/alert
handling/closing for this iteration and continue, otherwise switch to the newly
added handle, perform the alert check, close that new handle, and switch back to
home_window (use unique symbols like driver.execute_script,
driver.window_handles, driver.switch_to.window, driver.close,
WebDriverWait/TimeoutException to locate where to update).

Comment on lines 28 to 46
import argparse
import json
from pathlib import Path
import re
import time
from typing import Dict, Any, List, Optional
import requests

ENDPOINT = "https://fmsystems.cmu.edu/FMInteract/tools/getDefaultLayersData.ashx"
DEFAULT_PARAMS = {
    "isRevit": "false",
    "RoomBoundaryLayer": "A-AREA",
    "RoomTagLayer": "A-AREA-IDEN",
}

# Paste your session cookies and headers here (copied from your browser)
cookies = {
    'ASP.NET_SessionId': 'sadei2rcwbhr1m1bdzhmyejm',
    '.ASPXAUTH': '6A464769C85B50B714FB55EED49E05219CB95E65E6BB946700F5FC627B8BC61F8238B4DB27B649F72C89B11E2A861D4EA9DBF4435EBD01C9E376DB64A831D29A5958CFEFCC6F24555591677D837DE81B93A42DEA',
    'AuthenToken': 'e097af9e-1d96-41da-8461-dd6d68fde4e2',
    '.ASPXROLES': '51oAsjfC8TogP4H6dpTFEHZ95_fxHWVjx3PpE2LLMpwm3TMl5xLf40ovYnXyH3WXPzlseJC7PFMJ_zghHwLxaOekFUKlMK7lIPsZ8Ke-g0bAPWE6I8cuwI2LmbaGDEtrZgCcT988NnrSZ-bGVM226apgdQ4VRfQspcZtyZFfRJ3gLTdwT3tkT9li_f1tP5JztF5xbMAYoiBpBpJ9e_OXDwBxRYD58ScEKavkfqbjA03ivo6WonCfIlMYE2QsOBKEqgPBPciUiyoY4ywp_f4DlgP-TW4WIISKQ9626T7EAqA3eKfWw-hYjkH6zDvnMeFw4vLv_fwkmOScoHqZliHdKmbpiKFQ1BMnoMKk92bR72QjRwAZUhqqlLkqAhkMMJL03Z4rIMH3GZ4l_eTNdZfEpvsZwBM2o9Uczhn55JNxe6AVL7e60vcojVEO9IyWeqjSx6hrfeMFCj1tK9psYyOFaki9FTW4RQvo4MxCP8Y9WBekJ2ZP0',
    'NavCookies_ANONYMOUS': '1',
    'NavCookies_Theme_ANONYMOUS': 'Default',
    'IsShowNavigation': 'true',
    'previousUrl': 'https://fmsystems.cmu.edu/FMInteract/ShowDrawingView.aspx?file_code=L0&bldgcode%20floorcode%20optn=116%20%20%20%20%20%20%201%20%20%20&bldgcode=116%20%20%20%20%20%20%20&floorcode=1%20%20%20&act_code=Q',
    'State': 'unpinNav',
    'scrollPosition': '1826.5',
    'MenuState': 'Action=SITE&act_code=Q&act_code=Q&act_code=Q&act_code=Q&act_code=Q&act_code=Q&act_code=Q&act_code=Q&act_code=Q&act_code=Q&act_code=Q&sitecode=MAIN*Action=BUILDING&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&BLDGCODE=016*Action=BUILDING&BLDGCODE=009*Action=BUILDING&BLDGCODE=002*Action=BUILDING&BLDGCODE=080*Action=BUILDING&BLDGCODE=190*Action=BUILDING&BLDGCODE=195*Action=BUILDING&BLDGCODE=280*Action=BUILDING&BLDGCODE=025*Action=BUILDING&BLDGCODE=141*Action=BUILDING&BLDGCODE=012*',
    'R_SplitterHeight': '494',
    '__AntiXsrfToken': 'f185e23462ed4be7bf12656cf127454c',
}

headers = {
    'accept': '*/*',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'priority': 'u=1, i',
    'referer': 'https://fmsystems.cmu.edu/FMInteract/DrawingView.aspx?file_code=L0&bldgcode%20floorcode%20optn=116%20%20%20%20%20%20%201%20%20%20&bldgcode=116%20%20%20%20%20%20%20&floorcode=1%20%20%20&act_code=Q&maxColunms=4',
    'sec-ch-ua': '"Not(A:Brand";v="8", "Chromium";v="144", "Google Chrome";v="144"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"macOS"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/144.0.0.0 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
    # 'cookie': 'ASP.NET_SessionId=sadei2rcwbhr1m1bdzhmyejm; .ASPXAUTH=6A464769C85B50B714FB55EED49E05219CB95E65E6BB946700F5FC627B8BC61F8238B4DB27B649F72C89B11E2A861D4EA9DBF4435EBD01C9E376DB64A831D29A5958CFEFCC6F24555591677D837DE81B93A42DEA; AuthenToken=e097af9e-1d96-41da-8461-dd6d68fde4e2; .ASPXROLES=51oAsjfC8TogP4H6dpTFEHZ95_fxHWVjx3PpE2LLMpwm3TMl5xLf40ovYnXyH3WXPzlseJC7PFMJ_zghHwLxaOekFUKlMK7lIPsZ8Ke-g0bAPWE6I8cuwI2LmbaGDEtrZgCcT988NnrSZ-bGVM226apgdQ4VRfQspcZtyZFfRJ3gLTdwT3tkT9li_f1tP5JztF5xbMAYoiBpBpJ9e_OXDwBxRYD58ScEKavkfqbjA03ivo6WonCfIlMYE2QsOBKEqgPBPciUiyoY4ywp_f4DlgP-TW4WIISKQ9626T7EAqA3eKfWw-hYjkH6zDvnMeFw4vLv_fwkmOScoHqZliHdKmbpiKFQ1BMnoMKk92bR72QjRwAZUhqqlLkqAhkMMJL03Z4rIMH3GZ4l_eTNdZfEpvsZwBM2o9Uczhn55JNxe6AVL7e60vcojVEO9IyWeqjSx6hrfeMFCj1tK9psYyOFaki9FTW4RQvo4MxCP8Y9WBekJ2ZP0; NavCookies_ANONYMOUS=1; NavCookies_Theme_ANONYMOUS=Default; IsShowNavigation=true; previousUrl=https://fmsystems.cmu.edu/FMInteract/ShowDrawingView.aspx?file_code=L0&bldgcode%20floorcode%20optn=116%20%20%20%20%20%20%201%20%20%20&bldgcode=116%20%20%20%20%20%20%20&floorcode=1%20%20%20&act_code=Q; State=unpinNav; scrollPosition=1826.5; MenuState=Action=SITE&act_code=Q&act_code=Q&act_code=Q&act_code=Q&act_code=Q&act_code=Q&act_code=Q&act_code=Q&act_code=Q&act_code=Q&act_code=Q&sitecode=MAIN*Action=BUILDING&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&sitecode=MAIN&BLDGCODE=016*Action=BUILDING&BLDGCODE=009*Action=BUILDING&BLDGCODE=002*Action=BUILDING&BLDGCODE=080*Action=BUILDING&BLDGCODE=190*Action=BUILDING&BLDGCODE=195*Action=BUILDING&BLDGCODE=280*Action=BUILDING&BLDGCODE=025*Action=BUILDING&BLDGCODE=141*Action=BUILDING&BLDGCODE=012*; R_SplitterHeight=494; __AntiXsrfToken=f185e23462ed4be7bf12656cf127454c',
}

⚠️ Potential issue | 🔴 Critical

Remove hard‑coded session cookies/headers from source control.
These are live auth tokens and must not be committed. Load them from a local file/env at runtime, add the files to .gitignore, and rotate/revoke the exposed tokens.

🔒 Example: load secrets from local JSON files
-import argparse
-import json
-from pathlib import Path
-import re
-import time
-from typing import Dict, Any, List, Optional
-import requests
+import argparse
+import json
+import os
+from pathlib import Path
+import re
+import time
+from typing import Dict, Any, List, Optional
+import requests
@@
-# Paste your session cookies and headers here (copied from your browser)
-cookies = {
-    'ASP.NET_SessionId': '...',
-    # ...
-}
-
-headers = {
-    'accept': '*/*',
-    # ...
-}
+def _load_json(path: str) -> Dict[str, str]:
+    with open(path, "r", encoding="utf-8") as f:
+        return json.load(f)
+
+COOKIES_PATH = os.getenv("FMS_COOKIES_PATH", "cookies.json")
+HEADERS_PATH = os.getenv("FMS_HEADERS_PATH", "headers.json")
+cookies = _load_json(COOKIES_PATH)
+headers = _load_json(HEADERS_PATH)
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 33-33: typing.Dict is deprecated, use dict instead

(UP035)


[warning] 33-33: typing.List is deprecated, use list instead

(UP035)


[error] 74-74: Found commented-out code

Remove commented-out code

(ERA001)

🤖 Prompt for AI Agents
In `@apps/dataflow/src/scraper/fms/get_all_svg.py` around lines 28 - 75, The file
currently exposes real auth data in the cookies and headers dicts; remove the
hard‑coded cookies and headers (the cookies and headers variables) and instead
load them at runtime from a local secrets source (e.g., environment variables or
a JSON file read by a helper like load_secrets()) and fail fast if missing;
ensure the secret file name is added to .gitignore and update any README/dev
docs to instruct how to create the local secrets file or set env vars; after
updating remove the sensitive values from history and coordinate
rotation/revocation of the exposed tokens.

Comment on lines 96 to 145
def fetch_svg(
    floor_id: str,
    svg_filename_param: str,
    out_file: Path,
    cookies: Dict[str, str],
    headers: Dict[str, str],
    timeout: float = 30.0,
    retries: int = 3,
    backoff: float = 1.5,
    verify_ssl: bool = False,
) -> bool:
    params = dict(DEFAULT_PARAMS)
    params["floorId"] = floor_id
    params["svgFile"] = svg_filename_param
    print(params)

    last_exc = None
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(
                ENDPOINT,
                params=params,
                cookies=cookies,
                headers=headers,
                timeout=timeout,
                verify=verify_ssl,
            )
            if resp.status_code == 200 and resp.text.strip():
                out_file.parent.mkdir(parents=True, exist_ok=True)
                out_file.write_text(resp.text, encoding="utf-8")
                return True
            else:
                print(f"[warn] HTTP {resp.status_code} for floorId={floor_id}; attempt {attempt}/{retries}")
        except Exception as e:
            last_exc = e
            print(f"[warn] Exception on attempt {attempt}/{retries} for floorId={floor_id}: {e}")
        time.sleep(backoff ** (attempt - 1))
    if last_exc:
        print(f"[error] Failed for floorId={floor_id}: {last_exc}")
    return False


def main():
    ap = argparse.ArgumentParser(description="Fetch floor-plan SVGs for buildings/floorid.")
    ap.add_argument("--buildings", default="buildings.json", help="Path to buildings.json (list of {building, floorid})")
    ap.add_argument("--out", default="floorplan_svg", help="Output root directory")
    ap.add_argument("--delay", type=float, default=0.0, help="Seconds to sleep between requests (politeness)")
    ap.add_argument("--timeout", type=float, default=30.0, help="HTTP timeout per request in seconds")
    ap.add_argument("--retries", type=int, default=3, help="HTTP retries per request")
    ap.add_argument("--backoff", type=float, default=1.5, help="Exponential backoff base between retries")
    ap.add_argument("--verify-ssl", action="store_true", help="Verify SSL certificates (default: off)")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check if the file exists and its size
wc -l apps/dataflow/src/scraper/fms/get_all_svg.py

Repository: ScottyLabs/cmumaps

Length of output: 110


🏁 Script executed:

# Read the relevant sections of the file to verify current state
head -150 apps/dataflow/src/scraper/fms/get_all_svg.py | tail -60

Repository: ScottyLabs/cmumaps

Length of output: 2592


🏁 Script executed:

# Check lines 218-228 mentioned in the review
sed -n '210,235p' apps/dataflow/src/scraper/fms/get_all_svg.py

Repository: ScottyLabs/cmumaps

Length of output: 1265


🏁 Script executed:

# Check the full file for any comments about SSL verification
head -100 apps/dataflow/src/scraper/fms/get_all_svg.py

Repository: ScottyLabs/cmumaps

Length of output: 6519


🏁 Script executed:

# Search for any environment or configuration that might explain the SSL verification choice
grep -n "verify" apps/dataflow/src/scraper/fms/get_all_svg.py

Repository: ScottyLabs/cmumaps

Length of output: 290


🏁 Script executed:

# Check if there are any README or comments in related files
fd -t f "(README|README.md)" apps/dataflow/src/scraper/fms/ 2>/dev/null || echo "No README found"

Repository: ScottyLabs/cmumaps

Length of output: 44


Enable SSL verification by default (opt-out).

SSL verification is currently disabled by default for requests to the CMU FMS Systems endpoint. Change the default to enable verification and provide an explicit opt-out flag instead:

  • Change verify_ssl: bool = False to verify_ssl: bool = True in the fetch_svg() function signature
  • Replace --verify-ssl flag with --insecure flag (using action="store_true")
  • Update the fetch_svg() call in main() to pass verify_ssl=not args.insecure

This applies to lines 105, 145, and 227.
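A condensed, self-contained sketch of that change; the real fetch_svg keeps its retry loop and extra parameters, only the flag wiring differs:

import argparse

import requests

ENDPOINT = "https://fmsystems.cmu.edu/FMInteract/tools/getDefaultLayersData.ashx"


def fetch_svg(url: str, verify_ssl: bool = True) -> bool:
    """Single-request stand-in for the real fetch_svg; verification defaults to on."""
    resp = requests.get(url, timeout=30.0, verify=verify_ssl)
    return resp.status_code == 200


def main() -> None:
    ap = argparse.ArgumentParser(description="Fetch floor-plan SVGs.")
    # Opt-out flag replaces --verify-ssl: verification stays on unless disabled.
    ap.add_argument("--insecure", action="store_true",
                    help="Disable SSL certificate verification")
    args = ap.parse_args()
    fetch_svg(ENDPOINT, verify_ssl=not args.insecure)


if __name__ == "__main__":
    main()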

🧰 Tools
🪛 Ruff (0.14.14)

[warning] 96-96: Too many arguments in function definition (9 > 5)

(PLR0913)


[warning] 96-96: Missing docstring in public function

(D103)


[error] 105-105: Boolean-typed positional argument in function definition

(FBT001)


[error] 105-105: Boolean default positional argument in function definition

(FBT002)


[warning] 110-110: print found

Remove print

(T201)


[warning] 123-123: Magic value used in comparison, consider replacing 200 with a constant variable

(PLR2004)


[warning] 127-127: Unnecessary else after return statement

Remove unnecessary else

(RET505)


[warning] 128-128: print found

Remove print

(T201)


[warning] 129-129: Do not catch blind exception: Exception

(BLE001)


[warning] 131-131: print found

Remove print

(T201)


[warning] 134-134: print found

Remove print

(T201)


[warning] 137-137: Missing docstring in public function

(D103)

🤖 Prompt for AI Agents
In `@apps/dataflow/src/scraper/fms/get_all_svg.py` around lines 96 - 145, Change
SSL verification to be enabled by default: in fetch_svg() set the parameter
verify_ssl: bool = True (instead of False). In main(), replace the CLI flag
--verify-ssl with an opt-out --insecure flag (action="store_true") and update
the call that invokes fetch_svg(...) to pass verify_ssl=not args.insecure so
verification is enabled unless --insecure is provided; ensure references to
verify_ssl in requests.get(..., verify=verify_ssl) remain consistent.

Comment on lines 115 to 126
            resp = requests.get(
                ENDPOINT,
                params=params,
                cookies=cookies,
                headers=headers,
                timeout=timeout,
                verify=verify_ssl,
            )
            if resp.status_code == 200 and resp.text.strip():
                out_file.parent.mkdir(parents=True, exist_ok=True)
                out_file.write_text(resp.text, encoding="utf-8")
                return True

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd apps/dataflow/src/scraper/fms && wc -l get_all_svg.py

Repository: ScottyLabs/cmumaps

Length of output: 80


🏁 Script executed:

cd apps/dataflow/src/scraper/fms && head -150 get_all_svg.py | tail -50

Repository: ScottyLabs/cmumaps

Length of output: 2228


🏁 Script executed:

cd apps/dataflow/src/scraper/fms && cat -n get_all_svg.py | head -140

Repository: ScottyLabs/cmumaps

Length of output: 9048


🏁 Script executed:

find apps/dataflow -type f -name "*test*.py" | head -20

Repository: ScottyLabs/cmumaps

Length of output: 44


🏁 Script executed:

grep -r "get_all_svg\|fetch_svg" apps/dataflow --include="*.py" | head -20

Repository: ScottyLabs/cmumaps

Length of output: 199


🏁 Script executed:

grep -r "write_text\|\.svg" apps/dataflow --include="*.py" -B 2 -A 2 | head -40

Repository: ScottyLabs/cmumaps

Length of output: 2750


🏁 Script executed:

grep -r "Content-Type\|<svg\|svg\+" apps/dataflow --include="*.py" | head -20

Repository: ScottyLabs/cmumaps

Length of output: 1991


🏁 Script executed:

find apps/dataflow -type f -name "*.py" -exec grep -l "parse\|xml\|svg\|lxml\|ElementTree" {} \;

Repository: ScottyLabs/cmumaps

Length of output: 334


🏁 Script executed:

cat -n apps/dataflow/src/scraper/fms/get_all_svg.py | sed -n '155,241p'

Repository: ScottyLabs/cmumaps

Length of output: 4089


🏁 Script executed:

cat -n apps/dataflow/src/scraper/fms/get_all_svg.py | sed -n '36,42p'

Repository: ScottyLabs/cmumaps

Length of output: 308


🏁 Script executed:

grep -r "getDefaultLayersData\|fmsystems.cmu.edu" apps/dataflow --include="*.py" -B 2 -A 2

Repository: ScottyLabs/cmumaps

Length of output: 4375


🏁 Script executed:

cat -n apps/dataflow/src/scraper/fms/get_all_svg.py | sed -n '1,42p'

Repository: ScottyLabs/cmumaps

Length of output: 1340


🏁 Script executed:

grep -r "svg\|SVG\|Content-Type\|image" apps/dataflow/src/scraper/fms/ --include="*.py" -i | head -30

Repository: ScottyLabs/cmumaps

Length of output: 3055


🏁 Script executed:

cat -n apps/dataflow/src/scraper/fms/get_all_svg.py | sed -n '96,135p'

Repository: ScottyLabs/cmumaps

Length of output: 1690


Add SVG validation before writing response to disk.

The endpoint could return a 200 status with an HTML/login payload if the session expires (cookies are manually extracted and can become stale). Currently, this would be written as a .svg file, causing downstream parsing failures. Check the Content-Type header or validate that the response starts with <svg.

Suggested validation approach
-            if resp.status_code == 200 and resp.text.strip():
-                out_file.parent.mkdir(parents=True, exist_ok=True)
-                out_file.write_text(resp.text, encoding="utf-8")
-                return True
+            if resp.status_code == 200 and resp.text.strip():
+                ct = resp.headers.get("Content-Type", "")
+                if "image/svg+xml" not in ct and not resp.text.lstrip().startswith("<svg"):
+                    print(f"[warn] Non‑SVG response for floorId={floor_id}; attempt {attempt}/{retries}")
+                else:
+                    out_file.parent.mkdir(parents=True, exist_ok=True)
+                    out_file.write_text(resp.text, encoding="utf-8")
+                    return True
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 123-123: Magic value used in comparison, consider replacing 200 with a constant variable

(PLR2004)

🤖 Prompt for AI Agents
In `@apps/dataflow/src/scraper/fms/get_all_svg.py` around lines 115 - 126, The
response is written to out_file as an .svg without validating that it is
actually SVG; update the logic in the function using resp (the requests.get
block) to verify the payload before calling out_file.write_text: check
resp.headers.get("Content-Type") for an SVG MIME type (e.g., contains
"image/svg+xml") or ensure resp.text.lstrip().startswith("<svg") and only then
create the parent dir and write the file, otherwise treat it as an
error/expired-session case (return False or raise) so stale HTML/login pages are
not saved as .svg.

Comment on lines 1 to 5
from minio import Minio
from dotenv import load_dotenv
import os
import json


⚠️ Potential issue | 🟠 Major

Create parent directories before writing local JSON.
If s3_object_name includes a path (e.g., foo/bar.json), the write will fail unless the parent directory exists.

📁 Ensure the local directory exists
 from dotenv import load_dotenv
 import os
 import json
+from pathlib import Path
@@
-    local_path = local_file_path if local_file_path else s3_object_name
+    local_path = local_file_path if local_file_path else s3_object_name
+    Path(local_path).parent.mkdir(parents=True, exist_ok=True)

Also applies to: 112-118

🧰 Tools
🪛 Ruff (0.14.14)

[warning] 1-1: Missing docstring in public module

(D100)

🤖 Prompt for AI Agents
In `@apps/dataflow/src/scraper/fms/s3_utils.py` around lines 1 - 5, The code that
writes a downloaded S3 object to local JSON must ensure the parent directory
exists before calling open/write; update the function that saves the S3 object
(the routine that takes s3_object_name and writes to local_path /
local_filename—e.g., the save/download function around lines 112-118) to call
os.path.dirname(local_path) and, if non-empty, create it with os.makedirs(...,
exist_ok=True) prior to opening the file; keep the existing write logic and
handle the case where s3_object_name has no path component by skipping directory
creation.

Comment on lines +8 to +21
access_key = os.getenv("S3_ACCESS_KEY")
secret_key = os.getenv("S3_SECRET_KEY")
s3_endpoint = os.getenv("S3_ENDPOINT")

# Create client with access and secret key.
client = Minio(
s3_endpoint,
access_key=access_key,
secret_key=secret_key,
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "s3_utils.py" | head -5

Repository: ScottyLabs/cmumaps

Length of output: 105


🏁 Script executed:

cat -n apps/dataflow/src/scraper/fms/s3_utils.py | head -50

Repository: ScottyLabs/cmumaps

Length of output: 1692


🏁 Script executed:

# Check if there are other places where Minio client is instantiated
rg "Minio\(" --type py -B 2 -A 5

Repository: ScottyLabs/cmumaps

Length of output: 1041


🌐 Web query:

MinIO Python client None values initialization requirements

💡 Result:

For the official MinIO Python SDK (minio / minio-py), the constructor explicitly allows several arguments to be None:

Minio(endpoint, access_key=None, secret_key=None, session_token=None,
      secure=True, region=None, http_client=None, credentials=None)

[1]

What can be None (and when)

  • endpoint: cannot be None (it’s the only truly required argument) [1]
  • access_key, secret_key, session_token: may be None (they’re optional in the API) [1]
    • If you want authenticated access, docs/examples treat access_key + secret_key as required inputs [2]
    • If you omit them (leave as None), you’re effectively doing anonymous / unsigned requests (only works where the server/bucket/policy allows it)
  • region, http_client, credentials: may be None [1]
    • If you use a credentials= provider, you typically don’t pass access_key/secret_key (leave them None) [1]

What should not be None

  • secure: should be a boolean (True/False), not None [1]

Sources: MinIO Python Client API constructor signature [1] and the official quickstart/README stating endpoint, access_key, and secret_key are required to connect in the standard authenticated setup [2].

[1] (min.io)
[2] (github.com)



🏁 Script executed:

cat -n apps/dataflow/src/clients/_s3_client.py | head -80

Repository: ScottyLabs/cmumaps

Length of output: 2818


🏁 Script executed:

# Get more context around the validation in _s3_client.py
sed -n '1,50p' apps/dataflow/src/clients/_s3_client.py

Repository: ScottyLabs/cmumaps

Length of output: 1397


Validate S3_ENDPOINT before initializing the Minio client.

The client initialization will fail with an opaque error if S3_ENDPOINT is None. The _s3_client.py module already demonstrates this pattern with explicit endpoint validation. While MinIO allows access_key and secret_key to be None (for anonymous access), the endpoint parameter is required.

🧯 Add endpoint validation
 access_key = os.getenv("S3_ACCESS_KEY")
 secret_key = os.getenv("S3_SECRET_KEY")
 s3_endpoint = os.getenv("S3_ENDPOINT")
+
+if not s3_endpoint:
+    raise ValueError("S3_ENDPOINT environment variable must be set")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
access_key = os.getenv("S3_ACCESS_KEY")
secret_key = os.getenv("S3_SECRET_KEY")
s3_endpoint = os.getenv("S3_ENDPOINT")
# Create client with access and secret key.
client = Minio(
s3_endpoint,
access_key=access_key,
secret_key=secret_key,
)
access_key = os.getenv("S3_ACCESS_KEY")
secret_key = os.getenv("S3_SECRET_KEY")
s3_endpoint = os.getenv("S3_ENDPOINT")
if not s3_endpoint:
raise ValueError("S3_ENDPOINT environment variable must be set")
# Create client with access and secret key.
client = Minio(
s3_endpoint,
access_key=access_key,
secret_key=secret_key,
)
🤖 Prompt for AI Agents
In `@apps/dataflow/src/scraper/fms/s3_utils.py` around lines 8 - 17, The code
currently constructs a Minio client without checking s3_endpoint; before calling
Minio(...) validate that s3_endpoint (the S3_ENDPOINT env var) is non-empty and
raise or log a clear error (e.g., ValueError with a message like "S3_ENDPOINT
must be set") so initialization of client = Minio(...) only occurs when
s3_endpoint is present; reference the variables access_key, secret_key,
s3_endpoint and the Minio client creation in your change.

Comment on lines 17 to 22
    # Iterate through building folders
    for building in os.listdir(LOCAL_SVG_FOLDER):
        building_path = os.path.join(LOCAL_SVG_FOLDER, building)

        # Skip if not a directory
        if not os.path.isdir(building_path):
            continue

⚠️ Potential issue | 🟡 Minor

Fail fast when the local SVG folder is missing.
os.listdir(LOCAL_SVG_FOLDER) will raise if the folder isn’t present; a clear error helps.

🧯 Add an explicit existence check
 def upload_all_svg_files():
     """Upload all SVG files from floorplan_svg folder to S3, preserving folder structure"""
     success_count = 0
     fail_count = 0
+
+    if not os.path.isdir(LOCAL_SVG_FOLDER):
+        raise FileNotFoundError(f"Local SVG folder not found: {LOCAL_SVG_FOLDER}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    # Iterate through building folders
    for building in os.listdir(LOCAL_SVG_FOLDER):
        building_path = os.path.join(LOCAL_SVG_FOLDER, building)

        # Skip if not a directory
        if not os.path.isdir(building_path):
            continue

    if not os.path.isdir(LOCAL_SVG_FOLDER):
        raise FileNotFoundError(f"Local SVG folder not found: {LOCAL_SVG_FOLDER}")

    # Iterate through building folders
    for building in os.listdir(LOCAL_SVG_FOLDER):
        building_path = os.path.join(LOCAL_SVG_FOLDER, building)

        # Skip if not a directory
        if not os.path.isdir(building_path):
            continue
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 18-18: Use pathlib.Path.iterdir() instead.

(PTH208)


[warning] 19-19: os.path.join() should be replaced by Path with / operator

(PTH118)


[warning] 22-22: os.path.isdir() should be replaced by Path.is_dir()

(PTH112)

🤖 Prompt for AI Agents
In `@apps/dataflow/src/scraper/fms/upload_svg_to_s3.py` around lines 17 - 23, The
code calls os.listdir(LOCAL_SVG_FOLDER) which will raise if LOCAL_SVG_FOLDER
doesn't exist; before the loop check that LOCAL_SVG_FOLDER exists and is a
directory (using os.path.exists or os.path.isdir) and if not, fail fast by
raising a clear exception or logging an error and exiting; update the block
surrounding the os.listdir call (the loop that iterates over building in
os.listdir(LOCAL_SVG_FOLDER)) to perform this pre-check and surface a
descriptive message including LOCAL_SVG_FOLDER so downstream callers know why
processing stopped.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@apps/dataflow/src/scraper/fms/driver.py`:
- Around line 58-60: Remove the unused "# noqa: E501" comment fragments from the
long XPath string literals in driver.py (the expressions that include
f"//a[contains(normalize-space(.), '{building_label}')]" and the similar
multi-line XPath later around lines 77-79); simply delete the " # noqa: E501"
text so the string concatenation remains identical but without the unused noqa
comments that trigger Ruff CI failures.

In `@apps/dataflow/src/scraper/fms/get_all_svg.py`:
- Around line 1-20: Several lines in get_all_svg.py exceed the E501 length
limit; wrap long string literals and f-strings (especially in the cookies and
headers dicts and any request/logging calls) using implicit string literal
concatenation or parentheses and split long f-strings across concatenated pieces
so each line stays under the limit. Locate the cookies and headers dicts and any
long f-strings used in request URLs or log messages (e.g., where you build
request headers, URLs, or logging statements around functions that fetch SVGs)
and break them into multiple shorter literals or use parentheses to format the
string across lines while preserving the same runtime value. Ensure indentation
and f-expression braces remain correct when splitting and run flake8 locally to
confirm E501 is resolved.

Comment on lines 58 to 60
f"//a[contains(normalize-space(.), '{building_label}')]" # noqa: E501
"/ancestor::div[1]/following-sibling::ul"
"//div[contains(@class,'fmi-nav-floorplan')]/a[contains(@class,'rtIn')]",

⚠️ Potential issue | 🟡 Minor

Remove unused # noqa: E501 to fix CI failures.

Ruff reports these noqa are unused.

🧹 Cleanup
-            f"//a[contains(normalize-space(.), '{building_label}')]"  # noqa: E501
+            f"//a[contains(normalize-space(.), '{building_label}')]"
@@
-            f"//a[contains(normalize-space(.), '{building_label}')]"  # noqa: E501
+            f"//a[contains(normalize-space(.), '{building_label}')]"

Also applies to: 77-79

🧰 Tools
🪛 GitHub Actions: Check

[error] 58-58: RUF100: Unused noqa directive (unused: E501)

🪛 Ruff (0.14.14)

[warning] 58-58: Unused noqa directive (unused: E501)

Remove unused noqa directive

(RUF100)

🤖 Prompt for AI Agents
In `@apps/dataflow/src/scraper/fms/driver.py` around lines 58 - 60, Remove the
unused "# noqa: E501" comment fragments from the long XPath string literals in
driver.py (the expressions that include f"//a[contains(normalize-space(.),
'{building_label}')]" and the similar multi-line XPath later around lines
77-79); simply delete the " # noqa: E501" text so the string concatenation
remains identical but without the unused noqa comments that trigger Ruff CI
failures.

Comment on lines 1 to 21
"""Pipeline: Fetch CMU floor-plan SVGs for a set of buildings.

1. Run driver.py to activate the session for each floor and get cookies/headers.

2. Put a buildings.json file in the current directory with the following format:
Input JSON format (buildings.json):
[
{
"building": "CFA",
"floorid": {
"1": "...",
"2": "...",
...
}
}
]

3. Update the `cookies` and `headers` dicts in this script with the values copied from your browser session.

4. Run this script to fetch all SVGs

⚠️ Potential issue | 🟡 Minor

Wrap long lines to fix E501 CI failures.

CI flags these lines; please wrap strings/f‑strings to keep within the line length limit.

🧰 Example fix (apply similarly to the other E501 lines)
-3. Update the `cookies` and `headers` dicts in this script with the values copied from your browser session.
+3. Update the `cookies` and `headers` dicts in this script with the values
+   copied from your browser session.
@@
-            print(  # noqa: T201
-                f"[warn] Exception on attempt {attempt}/{retries} for floorId={floor_id}: {e}",
-            )
+            print(  # noqa: T201
+                f"[warn] Exception on attempt {attempt}/{retries} "
+                f"for floorId={floor_id}: {e}",
+            )
@@
-            print(  # noqa: T201
-                f"[fetch] {building_name} | floor {floor_num} -> {out_file.name} (id={floor_id})",
-            )
+            print(  # noqa: T201
+                f"[fetch] {building_name} | floor {floor_num} -> "
+                f"{out_file.name} (id={floor_id})",
+            )
@@
-                print(  # noqa: T201
-                    f"[fail] Could not fetch floor {floor_num} (id={floor_id}) for {building_name}",
-                )
+                print(  # noqa: T201
+                    f"[fail] Could not fetch floor {floor_num} (id={floor_id}) "
+                    f"for {building_name}",
+                )

Also applies to: 106-114, 240-244, 262-263

🧰 Tools
🪛 GitHub Actions: Check

[error] 18-18: E501 Line too long (96 > 88)

🤖 Prompt for AI Agents
In `@apps/dataflow/src/scraper/fms/get_all_svg.py` around lines 1 - 20, Several
lines in get_all_svg.py exceed the E501 length limit; wrap long string literals
and f-strings (especially in the cookies and headers dicts and any
request/logging calls) using implicit string literal concatenation or
parentheses and split long f-strings across concatenated pieces so each line
stays under the limit. Locate the cookies and headers dicts and any long
f-strings used in request URLs or log messages (e.g., where you build request
headers, URLs, or logging statements around functions that fetch SVGs) and break
them into multiple shorter literals or use parentheses to format the string
across lines while preserving the same runtime value. Ensure indentation and
f-expression braces remain correct when splitting and run flake8 locally to
confirm E501 is resolved.

Comment on lines +5 to +9
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.ui import WebDriverWait

Bug: The script driver.py imports selenium but it is not listed as a project dependency, which will cause an ImportError when the script is run.
Severity: MEDIUM

Suggested Fix

Add selenium to the list of dependencies in the pyproject.toml file to ensure it is installed along with other project requirements. This will allow the driver.py script to be executed successfully.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: apps/dataflow/src/scraper/fms/driver.py#L5-L9

Potential issue: The script `driver.py` imports the `selenium` package at lines 5-9, but
`selenium` is not declared as a dependency in the `pyproject.toml` file. The script is
documented as a standalone utility to be run manually by users. When a user follows the
instructions and attempts to execute the script, it will immediately fail with an
`ImportError` because the required package is not installed. This prevents the script
from being used as intended.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@apps/dataflow/src/scraper/fms/download_svg_from_s3.py`:
- Line 5: The import of s3_utils is failing because it’s treated as an absolute
module; update the import in download_svg_from_s3.py to use a relative import
from the same directory (import the symbols bucket_name and client from the
local s3_utils module) — i.e., change the current absolute import of s3_utils to
a relative import for bucket_name and client used in this file (or alternatively
make this directory a proper package with __init__.py or modify sys.path when
running standalone so s3_utils can be resolved).

In `@apps/dataflow/src/scraper/fms/driver.py`:
- Around line 112-115: The function js_click currently lacks a parameter type
which triggers no-untyped-def; add a type annotation for the el parameter (e.g.
use typing.Any or selenium.webdriver.remote.webelement.WebElement) and import
the chosen symbol (Any from typing or WebElement from
selenium.webdriver.remote.webelement) at the top of the module; leave the
function body as-is using driver.execute_script and ensure the signature becomes
def js_click(el: Any) -> None: or def js_click(el: WebElement) -> None: so CI
stops flagging the untyped definition.
- Around line 5-9: The project is missing the selenium dependency required by
the imports (selenium, webdriver, TimeoutException, WebDriverWait,
expected_conditions/ec) used in apps/dataflow/src/scraper/fms/driver.py; add a
compatible selenium entry to the project's dependency list in pyproject.toml
(e.g., under [tool.poetry.dependencies] or the project's dependency section) and
pin a suitable version so the scrape-fms command can import selenium
successfully.

In `@apps/dataflow/src/scraper/fms/s3_utils.py`:
- Around line 217-227: The docstring for get_generic_file_from_s3 incorrectly
mentions a non-existent return_data parameter; update the docstring for the
get_generic_file_from_s3 function to only list the actual parameter
s3_object_name and remove any references to return_data, and correct the Returns
section to accurately describe what the function returns (e.g., the raw S3
response or parsed object as Any | None) so the documentation matches the
function signature and behavior.

In `@apps/dataflow/src/scraper/fms/upload_svg_to_s3.py`:
- Line 5: The import in upload_svg_to_s3.py is causing CI failures; change the
bare module import to the same relative import style used in
download_svg_from_s3.py so the package resolver can find s3_utils. Replace "from
s3_utils import upload_generic_file" with the appropriate relative import (e.g.,
"from .s3_utils import upload_generic_file" or "from ..s3_utils import
upload_generic_file" to match the package layout) so the upload_generic_file
symbol is resolved at runtime.
🧹 Nitpick comments (2)
apps/dataflow/src/scraper/fms/driver.py (1)

17-23: Consider moving driver instantiation inside a main block.

The WebDriver is instantiated at module import time, which means importing this module will launch Chrome. This could cause issues if the module is imported for testing or by other scripts. However, this is acceptable for a standalone utility script.
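A minimal sketch of that refactor, assuming the click-through logic moves into a main() function (the names below are illustrative, not the PR's actual code):

from selenium import webdriver


def build_driver() -> webdriver.Chrome:
    """Create the Chrome driver on demand instead of at import time."""
    return webdriver.Chrome(options=webdriver.ChromeOptions())


def main() -> None:
    driver = build_driver()  # Chrome launches here, not on import
    try:
        ...  # click through the selected buildings and floors
    finally:
        driver.quit()


if __name__ == "__main__":
    main()  # importing the module no longer opens a browser

With that guard, a test suite can import helpers from the module without triggering a browser session.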

apps/dataflow/src/scraper/fms/s3_utils.py (1)

56-61: Note: upload_folder only uploads top-level files, not subdirectories.

The function skips directories (lines 60-61), so nested folder structures won't be uploaded. If recursive upload is needed, consider walking the tree with os.walk() or Path.rglob() (sketched below), or document the limitation.
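If recursion were wanted, a helper along these lines could replace the flat listing (iter_files_recursive is a hypothetical name, not part of the PR's s3_utils):

from pathlib import Path


def iter_files_recursive(folder: str) -> list[tuple[str, str]]:
    """Return (local_path, s3_key) pairs for every file under folder, at any depth."""
    root = Path(folder)
    return [
        (str(path), path.relative_to(root).as_posix())  # keep subdirs in the S3 key
        for path in sorted(root.rglob("*"))
        if path.is_file()
    ]

Using the path relative to the root as the object key preserves the folder structure in the bucket.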


from pathlib import Path

from s3_utils import bucket_name, client

⚠️ Potential issue | 🟠 Major

Fix the import to resolve the CI failure.

The pipeline reports import-not-found for s3_utils. Since s3_utils.py is in the same directory, use a relative import or ensure the package structure supports absolute imports.

🔧 Proposed fix using relative import
-from s3_utils import bucket_name, client
+from .s3_utils import bucket_name, client

Alternatively, if this script is intended to run standalone, add the directory to sys.path or restructure as a proper package with __init__.py.
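A minimal sketch of the standalone route, assuming download_svg_from_s3.py keeps living next to s3_utils.py (the path shim is illustrative, not part of the PR):

import sys
from pathlib import Path

# Make sibling modules importable when the file is executed directly,
# e.g. `python download_svg_from_s3.py` from anywhere.
sys.path.insert(0, str(Path(__file__).resolve().parent))

from s3_utils import bucket_name, client  # noqa: E402

The relative-import fix above is simpler if the directory becomes a proper package.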

🧰 Tools
🪛 GitHub Actions: Check

[error] 5-5: Cannot find implementation or library stub for module named "s3_utils" [import-not-found]

🤖 Prompt for AI Agents
In `@apps/dataflow/src/scraper/fms/download_svg_from_s3.py` at line 5, The import
of s3_utils is failing because it’s treated as an absolute module; update the
import in download_svg_from_s3.py to use a relative import from the same
directory (import the symbols bucket_name and client from the local s3_utils
module) — i.e., change the current absolute import of s3_utils to a relative
import for bucket_name and client used in this file (or alternatively make this
directory a proper package with __init__.py or modify sys.path when running
standalone so s3_utils can be resolved).

Comment on lines +5 to +9
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.ui import WebDriverWait

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if selenium is listed in any dependency file
rg -l "selenium" --glob "pyproject.toml" --glob "requirements*.txt" --glob "setup.py"

Repository: ScottyLabs/cmumaps

Length of output: 44


🏁 Script executed:

#!/bin/bash
# Also check for common dependency file patterns
fd -e "toml|txt|py" -x grep -l "selenium" {} \; 2>/dev/null | head -20

Repository: ScottyLabs/cmumaps

Length of output: 44


🏁 Script executed:

#!/bin/bash
# Check the driver.py file to see the imports and other flagged issues
head -120 apps/dataflow/src/scraper/fms/driver.py | cat -n

Repository: ScottyLabs/cmumaps

Length of output: 5455


🏁 Script executed:

#!/bin/bash
# Search more comprehensively for dependency files and selenium mentions
find . -type f \( -name "setup.cfg" -o -name "tox.ini" -o -name "environment.yml" -o -name "Pipfile" -o -name "poetry.lock" -o -name "requirements-*.txt" \) | head -20

Repository: ScottyLabs/cmumaps

Length of output: 44


🏁 Script executed:

#!/bin/bash
# Check pyproject.toml content if it exists
fd "pyproject.toml" -x cat {} \;

Repository: ScottyLabs/cmumaps

Length of output: 1613


🏁 Script executed:

#!/bin/bash
# Search for any CI configuration files that might indicate the import-not-found error
fd -e "yml" -e "yaml" . -path "*/.github/workflows/*" | head -5

Repository: ScottyLabs/cmumaps

Length of output: 230


🏁 Script executed:

#!/bin/bash
# Check if there's a requirements file or similar in the dataflow directory
find apps/dataflow -type f \( -name "requirements*" -o -name "pyproject.toml" -o -name "setup.py" \) 2>/dev/null

Repository: ScottyLabs/cmumaps

Length of output: 90


🏁 Script executed:

#!/bin/bash
# Look for GitHub Actions workflows
find .github/workflows -type f -name "*.yml" -o -name "*.yaml" 2>/dev/null | head -10

Repository: ScottyLabs/cmumaps

Length of output: 89


🏁 Script executed:

#!/bin/bash
# Search for any reference to selenium in the codebase to understand usage
rg "selenium" --type python

Repository: ScottyLabs/cmumaps

Length of output: 92


🏁 Script executed:

#!/bin/bash
# Check if there are any other missing imports or issues by looking at the full driver.py
wc -l apps/dataflow/src/scraper/fms/driver.py

Repository: ScottyLabs/cmumaps

Length of output: 105


🏁 Script executed:

#!/bin/bash
# Check the CI workflow
cat .github/workflows/check.yml

Repository: ScottyLabs/cmumaps

Length of output: 2094


🏁 Script executed:

#!/bin/bash
# Search for selenium without specifying type
rg "selenium" apps/dataflow/

Repository: ScottyLabs/cmumaps

Length of output: 513


🏁 Script executed:

#!/bin/bash
# Read the rest of driver.py
tail -110 apps/dataflow/src/scraper/fms/driver.py | cat -n

Repository: ScottyLabs/cmumaps

Length of output: 4375


Add selenium to project dependencies.

The selenium package is imported in apps/dataflow/src/scraper/fms/driver.py (lines 5-9) but is not declared in apps/dataflow/pyproject.toml. This causes import failures when running the scrape-fms command. Add selenium to the project dependencies in pyproject.toml.

🧰 Tools
🪛 GitHub Actions: Check

[error] 5-5: Cannot find implementation or library stub for module named "selenium" [import-not-found]


[error] 5-5: Cannot find implementation or library stub for module named "selenium.common.exceptions" [import-not-found]


[error] 5-5: Cannot find implementation or library stub for module named "selenium.webdriver.common.by" [import-not-found]


[error] 5-5: Cannot find implementation or library stub for module named "selenium.webdriver.support" [import-not-found]


[error] 5-5: Cannot find implementation or library stub for module named "selenium.webdriver.support.ui" [import-not-found]

🤖 Prompt for AI Agents
In `@apps/dataflow/src/scraper/fms/driver.py` around lines 5 - 9, The project is
missing the selenium dependency required by the imports (selenium, webdriver,
TimeoutException, WebDriverWait, expected_conditions/ec) used in
apps/dataflow/src/scraper/fms/driver.py; add a compatible selenium entry to the
project's dependency list in pyproject.toml (e.g., under
[tool.poetry.dependencies] or the project's dependency section) and pin a
suitable version so the scrape-fms command can import selenium successfully.

Comment on lines +112 to +115
def js_click(el) -> None:  # noqa: ANN001
    """Click an element using JavaScript."""
    driver.execute_script("arguments[0].scrollIntoView({block:'center'});", el)
    driver.execute_script("arguments[0].click();", el)

⚠️ Potential issue | 🟡 Minor

Add a type annotation for the el parameter to fix the CI failure.

The pipeline reports no-untyped-def for this function. Add a type hint for the el parameter.

🔧 Proposed fix
-def js_click(el) -> None:  # noqa: ANN001
+def js_click(el: Any) -> None:
     """Click an element using JavaScript."""
     driver.execute_script("arguments[0].scrollIntoView({block:'center'});", el)
     driver.execute_script("arguments[0].click();", el)

You'll need to import Any from typing, or use WebElement from selenium.webdriver.remote.webelement for a more precise type.
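With the more precise type, the function would read as follows (a sketch; driver is the module-level WebDriver already defined in the script):

from selenium.webdriver.remote.webelement import WebElement


def js_click(el: WebElement) -> None:
    """Scroll an element into view, then click it via JavaScript."""
    driver.execute_script("arguments[0].scrollIntoView({block:'center'});", el)
    driver.execute_script("arguments[0].click();", el)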

🧰 Tools
🪛 GitHub Actions: Check

[error] 112-112: Function is missing a type annotation for one or more arguments [no-untyped-def]

🤖 Prompt for AI Agents
In `@apps/dataflow/src/scraper/fms/driver.py` around lines 112 - 115, The function
js_click currently lacks a parameter type which triggers no-untyped-def; add a
type annotation for the el parameter (e.g. use typing.Any or
selenium.webdriver.remote.webelement.WebElement) and import the chosen symbol
(Any from typing or WebElement from selenium.webdriver.remote.webelement) at the
top of the module; leave the function body as-is using driver.execute_script and
ensure the signature becomes def js_click(el: Any) -> None: or def js_click(el:
WebElement) -> None: so CI stops flagging the untyped definition.

Comment on lines +217 to +227
def get_generic_file_from_s3(s3_object_name: str) -> Any | None:  # noqa: ANN401
    """Get generic data from S3 bucket.

    Args:
        s3_object_name (str): The S3 object name/path
        return_data (bool): If True, return the data as Python object
                            If False, return the raw response object

    Returns:
        response object

⚠️ Potential issue | 🟡 Minor

Fix docstring: remove reference to non-existent return_data parameter.

The docstring documents a return_data parameter that doesn't exist in this function's signature; it appears to have been copied from get_json_from_s3.

📝 Proposed fix
 def get_generic_file_from_s3(s3_object_name: str) -> Any | None:  # noqa: ANN401
     """Get generic data from S3 bucket.

     Args:
         s3_object_name (str): The S3 object name/path
-        return_data (bool): If True, return the data as Python object
-                            If False, return the raw response object

     Returns:
-        response object
+        Response object, or None if retrieval fails.

     """
🤖 Prompt for AI Agents
In `@apps/dataflow/src/scraper/fms/s3_utils.py` around lines 217 - 227, The
docstring for get_generic_file_from_s3 incorrectly mentions a non-existent
return_data parameter; update the docstring for the get_generic_file_from_s3
function to only list the actual parameter s3_object_name and remove any
references to return_data, and correct the Returns section to accurately
describe what the function returns (e.g., the raw S3 response or parsed object
as Any | None) so the documentation matches the function signature and behavior.


from pathlib import Path

from s3_utils import upload_generic_file

⚠️ Potential issue | 🟠 Major

Fix the import to resolve the CI failure.

Same issue as download_svg_from_s3.py: use a relative import or ensure a proper package structure.

🔧 Proposed fix
-from s3_utils import upload_generic_file
+from .s3_utils import upload_generic_file
🧰 Tools
🪛 GitHub Actions: Check

[error] 5-5: Cannot find implementation or library stub for module named "s3_utils" [import-not-found]

🤖 Prompt for AI Agents
In `@apps/dataflow/src/scraper/fms/upload_svg_to_s3.py` at line 5, The import in
upload_svg_to_s3.py is causing CI failures; change the bare module import to the
same relative import style used in download_svg_from_s3.py so the package
resolver can find s3_utils. Replace "from s3_utils import upload_generic_file"
with the appropriate relative import (e.g., "from .s3_utils import
upload_generic_file" or "from ..s3_utils import upload_generic_file" to match
the package layout) so the upload_generic_file symbol is resolved at runtime.
