refactor: separate statistic computation #411

tristan-f-r · 2025-10-10T06:33:29Z

We also make graph statistics lazy. Laziness isn't used in summary.py yet, but I assume we'll add more computationally expensive graph statistics as SPRAS develops, and those can take a long time to compute on our larger graphs, so this also splits statistic generation into separate rules.

Most importantly, this allows us to re-use statistics by consuming specific statistics as input files.
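To make the laziness idea concrete, here is a minimal sketch of on-demand, cached statistics. It is not the code in this PR: the class name `LazyGraphStats`, the statistic names, and the plain edge-list input are all assumptions made for illustration. It uses only the standard library and treats edges as undirected when counting connected components.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Edge = Tuple[str, str]


class LazyGraphStats:
    """Hypothetical helper: compute a statistic on first request, then cache it."""

    def __init__(self, edges: Iterable[Edge]) -> None:
        self.edges: List[Edge] = [tuple(edge) for edge in edges]
        self._cache: Dict[str, object] = {}
        # Records which statistics were actually computed, in order.
        self.computed: List[str] = []
        self._registry: Dict[str, Callable[[], object]] = {
            "nodes": self._nodes,
            "number_of_nodes": lambda: len(self.get("nodes")),
            "number_of_edges": lambda: len(self.edges),
            "connected_components": self._connected_components,
        }

    def get(self, name: str) -> object:
        """Return the statistic, computing it (and its dependencies) only once."""
        if name not in self._cache:
            self._cache[name] = self._registry[name]()
            self.computed.append(name)
        return self._cache[name]

    def _nodes(self) -> frozenset:
        return frozenset(node for edge in self.edges for node in edge)

    def _connected_components(self) -> int:
        # Union-find over the edges, ignoring edge direction.
        parent = {node: node for node in self.get("nodes")}

        def find(node: str) -> str:
            while parent[node] != node:
                parent[node] = parent[parent[node]]  # path halving
                node = parent[node]
            return node

        for u, v in self.edges:
            parent[find(u)] = find(v)
        return len({find(node) for node in parent})
```

Requesting only `number_of_edges` never touches the component computation. A more expensive statistic pays its cost only when a downstream consumer asks for it.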

Depends on feat!: SPRAS revision #320 for the summary statistics test (integration testing, rather than unit testing, is now required because Snakemake does the heavy lifting in the workflow).

This adds the unique spras_revision to every single parameter combination (before hashing) and to the dataset label, to provide OSDF support on the level of deterministic algorithms.
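As a rough illustration of folding a revision into a parameter hash, the sketch below adds the revision to a copy of the parameters before serializing and hashing them. It is not SPRAS's actual hashing code: the function name `hash_params`, the `_spras_rev` key, the SHA-1 digest, and the default length are all assumptions.

```python
import hashlib
import json


def hash_params(params: dict, spras_revision: str, length: int = 7) -> str:
    """Hypothetical helper: hash a parameter combination together with a revision."""
    # Copy so the caller's dictionary is not mutated.
    keyed = dict(params)
    # "_spras_rev" is an illustrative key name, not one confirmed by the source.
    keyed["_spras_rev"] = spras_revision
    # sort_keys makes the serialization independent of insertion order.
    serialized = json.dumps(keyed, sort_keys=True)
    return hashlib.sha1(serialized.encode("utf-8")).hexdigest()[:length]
```

Because the revision is part of the hashed content, the same parameters under a different revision map to a different identifier, so cached outputs from an older revision would not be mistaken for current ones.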

we also make it lazy

read-the-docs-community · 2025-10-10T06:34:25Z

Documentation build overview

📚 spras | 🛠️ Build #31218776 | 📁 Comparing 339d915 against latest (18f2cf8)

🔍 Preview build

Show files changed (4 files in total): 📝 4 modified | ➕ 0 added | ➖ 0 deleted

File	Status
genindex.html	📝 modified
fordevs/spras.analysis.html	📝 modified
fordevs/spras.config.html	📝 modified
fordevs/spras.html	📝 modified

agitter · 2025-11-07T22:47:27Z

Before I can review the implementation of the change, I need to better understand what problem we are trying to solve with the change. Where will laziness be needed in the future?

we can reuse the code for graph heuristic pruning

Do we envision calling graph statistic computation twice per graph? After we compute these statistics on a graph once, shouldn't that be sufficient for an entire pass of a workflow?

tristan-f-r · 2025-11-07T23:53:18Z

I was going to ask @ntalluri about this, since I wasn't quite sure if we will have expensive graph heuristics or not.

> Do we envision calling graph statistic computation twice per graph? After we compute these statistics on a graph once, shouldn't that be sufficient for an entire pass of a workflow?

I did decouple this from analysis: summary: enabled: true, and I imagined it like this. I didn't think about that, though: would it make sense to have graph summary statistics always enabled the moment any heuristics are enabled?

agitter · 2025-11-08T04:25:01Z

There could be more than one way to design this sensibly. One would be that if heuristics are enabled in the config file, that automatically generates the graph summary table. This produces more output than requested, which is slightly undesirable.

Another could be to move the heuristic calculations inside each --parameters> subdirectory, which may be where you are headed. If that is written as a file for that one pathway, it could be consumed for heuristics (or used for heuristics and then written to disk). Later, if the graph summary table is requested, it would grab the precomputed statistics from those files in the subdirectories.
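The second design could be sketched roughly as follows: each pathway subdirectory gets its own statistics file, and the summary table is assembled later from whichever files already exist. The file name `statistics.json`, the JSON format, and both function names are assumptions for illustration, not the layout this PR uses.

```python
import json
from pathlib import Path
from typing import Dict, List

STATS_FILENAME = "statistics.json"  # hypothetical file name


def write_pathway_stats(pathway_dir: Path, stats: Dict[str, object]) -> Path:
    """Write the statistics for one pathway inside its own subdirectory."""
    pathway_dir.mkdir(parents=True, exist_ok=True)
    out = pathway_dir / STATS_FILENAME
    out.write_text(json.dumps(stats, sort_keys=True))
    return out


def build_summary_table(output_dir: Path) -> List[Dict[str, object]]:
    """Collect precomputed per-pathway statistics into one row per pathway."""
    rows = []
    # Only subdirectories that already contain a statistics file contribute a row.
    for stats_file in sorted(output_dir.glob(f"*/{STATS_FILENAME}")):
        row: Dict[str, object] = {"pathway": stats_file.parent.name}
        row.update(json.loads(stats_file.read_text()))
        rows.append(row)
    return rows
```

With this split, a heuristic step can read a single pathway's file without requesting the whole summary table, and the table costs nothing extra to build once the per-pathway files exist.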

tristan-f-r · 2025-11-08T08:06:01Z

I'll mark this as a draft for now and design something in line with your second proposal.


tristan-f-r and others added 26 commits July 9, 2025 13:50

feat: spras_revision

b0327a2

This adds the unique spras_revision to every single parameter combination (before hashing) and to the dataset label, to provide OSDF support on the level of deterministic algorithms.

style: fmt

8cec738

test: summary

5683392

docs(test_summary): mention preprocessing motivation

af90ce0

test(analysis/summary): use input from /input instead

6141874

docs(test/analysis): mention dual integration testing

440a2d4

test(analysis/summary): use test/analysis provided gold standard

d9e852b

style: fmt

abb0eb9

chore: don't repeat docs inside analysis configs

60185fc

feat: get working with cytoscape

e6bd6a0

style: fmt

f9a3081

test: remove nondet from analysis

77fc3b4

fix: get input pathways at runtime

0592850

Merge branch 'umain' into hash

0b6413d

fix: rm run

1817157

Merge branch 'main' into hash

c077d91

fix: correct for pydantic

50f2195

fix: attach spras revision inside gs_values

d3a088b

chore: drop re import

8e3b898

Merge branch 'main' into hash

1ada504

fix: correct tests

34a40ad

Merge branch 'main' into hash

5d2c6d0

Merge branch 'main' into hash

ef15781

fix: correct Snakefile

8d5019b

fix: use correct gs variable

9949572

refactor: separate statistic computation

6ec4f62

we also make it lazy

tristan-f-r added the tuning (Workflow-spanning algorithm tuning) and refactor (Changes that don't actually improve anything except for code quality.) labels Oct 10, 2025

fix: correct tuple assumption

9987189

fix: make undirected for determining number of connected components

c675ece

tristan-f-r marked this pull request as draft November 8, 2025 08:06

ntalluri removed the P-medium (medium priority; this is needed for some external service or another PR) label Nov 19, 2025

tristan-f-r added 4 commits January 9, 2026 18:47

Merge branch 'main' into hash

eec09f2

test: fix files

a8d71bd

Merge branch 'main' into lazy-stats

3c81d05

feat: snakemake-based summary generation

1ca730e

tristan-f-r added the P-high (This is a blocker for many PRs/issues/features) label Jan 13, 2026

tristan-f-r marked this pull request as ready for review January 13, 2026 20:13

tristan-f-r added 3 commits January 13, 2026 12:19

fix(Snakefile): use parse_output for edgelist parsing

d67186d

fix: parse edgelist with rank, embed header skip inside from_edgelist

fd483c3

this had incorrect behavior ?

style: fmt

fd5046f

tristan-f-r added the blocked-by-other-pr and P-medium (medium priority; this is needed for some external service or another PR) labels and removed the P-high (This is a blocker for many PRs/issues/features) label Jan 13, 2026

tristan-f-r added 11 commits January 13, 2026 13:17

chore: mention statistics_files param

79cf748

apply suggestions

e12fc75

clean, fix: strip project_directory

977bf5a

fix: correct equality on not SPRAS pyproject.toml

8500bcb

chore: grammar

112db39

chore: move attach_spras_revision out of Snakefile

c7262ed

Merge branch 'main' into hash

f69a0f3

fix: properly resolve merge conflict

72e30bf

fix: undo mistaken merge conflict

c71b652

whoops! accidentally feature-regressed

chore: drop unnecessary self.datasets initialization

6b941e0

Merge branch 'hash' into lazy-stats

339d915
