This repository was archived by the owner on Dec 31, 2025. It is now read-only.
Merged
95 commits
cbd99ab
remove excel section
poldrack Nov 27, 2025
2dbdf74
Merge branch 'text/datamgmt-Nov3' into text/workflows-Nov26
poldrack Nov 27, 2025
5887fa5
initial section on desiderata
poldrack Nov 28, 2025
18e0e9a
initial add
poldrack Nov 29, 2025
9b5f867
add summary function
poldrack Nov 29, 2025
67a0b25
add summary function
poldrack Nov 29, 2025
bb65968
add deps
poldrack Nov 29, 2025
3416557
initial add
poldrack Nov 29, 2025
25ae638
working version
poldrack Nov 30, 2025
ac742f6
Add words
poldrack Nov 30, 2025
aaf1818
ruff/blue fixes
poldrack Nov 30, 2025
561c3b0
fix suffix handling
poldrack Nov 30, 2025
7973dd2
fix suffix handling
poldrack Nov 30, 2025
5cf3f15
working version
poldrack Nov 30, 2025
9024ea1
minor cleanup, format with blue
poldrack Nov 30, 2025
936ea6c
initial workflow development
poldrack Dec 4, 2025
4120f1a
merge cleanup
poldrack Dec 15, 2025
5189508
Merge branch 'main' into text/workflows-Nov26
poldrack Dec 15, 2025
eb12af0
cleanup
poldrack Dec 15, 2025
1147e5f
update build cmds
poldrack Dec 15, 2025
7ad989b
Merge branch 'main' into text/workflows-Nov26
poldrack Dec 15, 2025
b9f4b22
add deps
poldrack Dec 16, 2025
05842ce
Merge branch 'main' into text/workflows-Nov26
poldrack Dec 16, 2025
3894653
initial add
poldrack Dec 16, 2025
7fa3aca
Merge branch 'main' into text/workflows-Nov26
poldrack Dec 16, 2025
98e22ac
remove 'one of us'
poldrack Dec 16, 2025
0abbf6d
add link check
poldrack Dec 17, 2025
693bd66
full data prep:
poldrack Dec 17, 2025
962cf8a
Merge branch 'main' into text/workflows-Nov26
poldrack Dec 17, 2025
3beb12d
add a few topics
poldrack Dec 19, 2025
bc85549
add deps for scrna-seq example:
poldrack Dec 19, 2025
c0434ca
add deps for scrna-seq example:
poldrack Dec 19, 2025
87ac49d
initial add
poldrack Dec 19, 2025
0541df3
initial add
poldrack Dec 20, 2025
91d26f8
initial add
poldrack Dec 20, 2025
925f1f0
formatted and ruffed, about to remove plt.show commands which halt ex…
poldrack Dec 20, 2025
1267aba
add coding prefernces
poldrack Dec 21, 2025
55bfa7f
intermediate progress
poldrack Dec 21, 2025
10b4c99
add deps
poldrack Dec 21, 2025
84138d3
add deps
poldrack Dec 21, 2025
943e3c1
final version of monolithic workflow
poldrack Dec 21, 2025
678983e
initial add
poldrack Dec 21, 2025
9b1daeb
cleanup
poldrack Dec 21, 2025
5ad1df9
Add stateless workflow with checkpointing and execution logging
poldrack Dec 21, 2025
e7d52c4
Fix checkpoint logging to correctly handle interrupts during save
poldrack Dec 21, 2025
8df37e5
Optimize checkpoint storage and add selective checkpointing
poldrack Dec 22, 2025
319ccc4
Fix: Restore layers['counts'] for HVG selection, delete after step 4
poldrack Dec 22, 2025
20f2e71
Add steps 9-11 to default checkpoint steps
poldrack Dec 22, 2025
70ad51b
Add step 8 (differential expression) to default checkpoint steps
poldrack Dec 22, 2025
c68aa0f
finished checkpointed workflow section
poldrack Dec 22, 2025
ad9381a
Add Prefect-based workflow with parallel per-cell-type analysis
poldrack Dec 22, 2025
def34e7
Remove parallelization from Prefect workflow to reduce memory usage
poldrack Dec 22, 2025
13e7028
Pin numba<0.63 to fix pynndescent compatibility issue
poldrack Dec 22, 2025
6fe3ec6
Update uv.lock for numba version constraint
poldrack Dec 22, 2025
2154feb
Add execution logging to Prefect workflow
poldrack Dec 22, 2025
532a113
Disable numba JIT to fix Prefect/pynndescent compatibility
poldrack Dec 22, 2025
8b3a383
Pre-warm numba before Prefect import to fix JIT compatibility
poldrack Dec 22, 2025
6f8e524
Set NUMBA_CAPTURED_ERRORS=old_style to fix pynndescent print issue
poldrack Dec 22, 2025
07d5a24
set log_prints=False to fix numba issue
poldrack Dec 22, 2025
70764cd
Add Snakemake workflow for scRNA-seq analysis
poldrack Dec 23, 2025
f2052d2
Fix feature_name KeyError in Snakemake pseudobulk step
poldrack Dec 23, 2025
e7106a7
Handle missing feature_name column in pseudobulk step
poldrack Dec 23, 2025
b81dfb6
Use step 2 (filtered) checkpoint for var_to_feature mapping
poldrack Dec 23, 2025
32f3428
Add thread specifications to compute-intensive Snakemake rules
poldrack Dec 23, 2025
2ba2d78
Create output directories in Snakemake scripts
poldrack Dec 23, 2025
a531c98
initial add
poldrack Dec 23, 2025
263bdde
intermediate progress, full draft of workflow engine intro
poldrack Dec 23, 2025
af0ce0b
clean up immutability section
poldrack Dec 23, 2025
67c95e6
initial add
poldrack Dec 23, 2025
09b3ff9
Add comprehensive workflow overview documentation
poldrack Dec 23, 2025
9461cda
update deps
poldrack Dec 23, 2025
4e455a2
Extract Prefect workflow params to config file, update output folders
poldrack Dec 23, 2025
bb87992
Add workflow documentation with usage examples
poldrack Dec 23, 2025
4872623
clean up discussion of engines
poldrack Dec 23, 2025
63cccd4
Add Snakemake report generation for workflow results
poldrack Dec 23, 2025
65d6706
Add simple workflow example for Prefect, Snakemake, and Make
poldrack Dec 23, 2025
2a13fd2
add reporting cmd
poldrack Dec 23, 2025
78ce76f
fix merge conflict
poldrack Dec 23, 2025
c37360e
fix paths
poldrack Dec 23, 2025
daebf73
Fix missing doublet UMAP visualization in QC step
poldrack Dec 23, 2025
a07cecb
Fix report caption paths in Snakemake rule files
poldrack Dec 23, 2025
de06d21
Fix Snakemake scripts to use named output references
poldrack Dec 23, 2025
7dcc0a1
file naming
poldrack Dec 24, 2025
dd97412
simplify workflow
poldrack Dec 24, 2025
066be67
clean up figure
poldrack Dec 24, 2025
584e0c1
add return values
poldrack Dec 24, 2025
7e87b67
first draft of simple workflows
poldrack Dec 24, 2025
89a5ab2
initial add
poldrack Dec 24, 2025
40f8edd
initial add
poldrack Dec 24, 2025
08ce603
remove .snakemake before running
poldrack Dec 24, 2025
bfc5de0
add expanduser to deal with relative paths
poldrack Dec 24, 2025
0aee557
add log dir and onstart
poldrack Dec 24, 2025
71f286f
use DATADIR var instead of hard coding
poldrack Dec 24, 2025
3206f8d
add .snakemake
poldrack Dec 24, 2025
f7e942c
close to a first draft. merging for now, will return to finish later
poldrack Dec 24, 2025
4 changes: 4 additions & 0 deletions .gitignore
@@ -5,6 +5,10 @@ __pycache__
.hypothesis
.env
._*
.snakemake

# Workflow output directories
**/simple_workflow/*/output/

data
exports
4 changes: 0 additions & 4 deletions .gitmodules
@@ -1,4 +0,0 @@
[submodule "my_datalad_repo"]
path = my_datalad_repo
url = ./my_datalad_repo
datalad-id = 74807713-a6cf-4418-9dfc-e490a881645b
93 changes: 93 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,93 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is an open-source book on building better code for science using AI, authored by Russell Poldrack. The rendered book is published at https://poldrack.github.io/BetterCodeBetterScience/.

## Build Commands

```bash
# Install dependencies (uses uv package manager)
uv pip install -r pyproject.toml
uv pip install -e .

# Build book as HTML and serve locally
myst build --html
npx serve _build/html

# Build PDF (requires LaTeX)
jupyter-book build book/ --builder pdflatex

# Clean build artifacts
rm -rf book/_build
```

## Testing

```bash
# Run all tests
pytest

# Run tests with coverage
pytest --cov=src/BetterCodeBetterScience --cov-report term-missing

# Run specific test modules
pytest tests/textmining/
pytest tests/property_based_testing/
pytest tests/narps/

# Run tests with specific markers
pytest -m unit
pytest -m integration
```

Test markers defined in pyproject.toml: `unit` and `integration`.

## Linting and Code Quality

```bash
# Spell checking (configured in pyproject.toml)
codespell

# Python linting and formatting
ruff check .
ruff format .

# Pre-commit hooks (runs codespell)
pre-commit run --all-files
```

## Project Structure

- `book/` - MyST markdown chapters (configured in myst.yml)
- `src/BetterCodeBetterScience/` - Example Python code referenced in book chapters
- `tests/` - Test examples demonstrating testing concepts from the book
- `data/` - Data files for examples
- `scripts/` - Utility scripts
- `_build/` - Build output (gitignored)

## Key Configuration Files

- `myst.yml` - MyST book configuration (table of contents, exports, site settings)
- `pyproject.toml` - Python dependencies, pytest config, codespell settings
- `.pre-commit-config.yaml` - Pre-commit hooks (codespell)

## Contribution Guidelines

- New text should be authored by a human (AI may be used to check/improve text)
- Code examples should follow PEP8
- Avoid introducing new dependencies when possible
- Custom words for codespell are in `project-words.txt`

## Coding guidelines

## Notes for Development

- Think about the problem before generating code.
- Write code that is clean and modular. Prefer shorter functions/methods over longer ones.
- Prefer reliance on widely used packages (such as numpy, pandas, and scikit-learn); avoid unknown packages from Github.
- Do not include *any* code in `__init__.py` files.
- Use pytest for testing.
- Use functions rather than classes for tests. Use pytest fixtures to share resources between tests.
10 changes: 7 additions & 3 deletions Makefile
@@ -1,12 +1,16 @@
clean:
- rm -rf book/_build

build: clean
uv run jupyter-book build book/
build-html: clean
myst build --html
npx serve _build/html

pdf:
build-pdf:
jupyter-book build book/ --builder pdflatex

check-links:
check-links

pipinstall:
uv pip install -r pyproject.toml
uv pip install -e .
9 changes: 9 additions & 0 deletions book/extras.md
@@ -329,3 +329,12 @@ Found 422 publications containing Memory in the title
One very nice feature of the document store is that not all records have to have the same keys; this provides a great deal of flexibility at data ingestion. However, too much heterogeneity between documents can make the database hard to work with. One benefit of homogeneity in the document structure is that it allows indexing, which can greatly increase the speed of queries in large document stores. For example, if we know that we will often want to search by the `year` field, then we can add an index for this field:

*MORE HERE*
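
As a rough illustration of why an index speeds up queries, here is a minimal sketch in plain Python. It is not the document store's own API: the `publications` records and the `build_index` and `find_by_field` helpers are hypothetical names invented for this example. The index maps each `year` value to the positions of the matching documents, so a query can go straight to them rather than scanning every record.

```python
from collections import defaultdict

# Hypothetical records; documents in a document store need not share all keys.
publications = [
    {"title": "Memory and aging", "year": 2019},
    {"title": "Working memory models", "year": 2021},
    {"title": "Attention and control", "year": 2021},
    {"title": "Undated preprint"},  # no "year" key
]


def build_index(documents, field):
    """Map each value of `field` to the positions of the documents that have it."""
    index = defaultdict(list)
    for position, document in enumerate(documents):
        if field in document:
            index[document[field]].append(position)
    return index


def find_by_field(documents, index, value):
    """Return matching documents via the index instead of a full scan."""
    return [documents[position] for position in index.get(value, [])]


year_index = build_index(publications, "year")
matches = find_by_field(publications, year_index, 2021)
print([doc["title"] for doc in matches])
# prints ['Working memory models', 'Attention and control']
```

Documents that lack the indexed field are simply absent from the index, which is one reason heavy heterogeneity between documents makes indexed queries less useful.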


### NARPS

The example comes from a paper that we published in 2020 {cite:p}`Botvinik-Nezer:2020aa`, which involved analysis of data from a large study called the Neuroimaging Analysis Replication and Prediction Study (hereafter *NARPS* for short). The goal of this study was to identify how the results of data analysis varied between different research groups when given the same data. A relatively large neuroimaging dataset was collected and distributed to groups of researchers, who were asked to test a set of nine hypotheses about brain activity in relation to a monetary gambling task that the participants performed during MRI scanning. Seventy teams submitted results, which included their answers to the 9 yes/no hypotheses along with a detailed description of their analysis workflow and a number of outputs from intermediate stages of the analysis. The main finding was that there was a striking amount of variability in the results between teams, even though the raw data were identical.

The workflow that I will use here starts with the results that the teams submitted, and ends with preprocessed data that are ready for further statistical analysis. I wrote much of the original analysis code for the project, which can be found [here](https://github.com/poldrack/narps). This code was written at the point when I was just becoming interested in software engineering practices for science, and while it represents a first step in that direction, it has *a lot* of problems. In particular, it uses the problematic *God object* anti-pattern that I mentioned in an earlier chapter. For the purposes of this chapter I have first rewritten the analysis into a monolithic mega-script, which I will then incrementally refactor into a well-structured workflow. I chose this example because it is relatively complex yet runs quickly on any modern laptop.


Binary file added book/images/simple-DAG.png
Binary file added book/images/snakemake-DAG.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
4 changes: 2 additions & 2 deletions book/software_engineering.md
@@ -69,7 +69,7 @@ User stories are also useful for thinking through the potential impact of new fe
Perhaps the most common example of violations of YAGNI comes about in the development of visualization tools.
In this example, the developer might decide to create a visualizer to show how the original dataset is being converted into the new format, with interactive features that would allow the user to view features of individual files.
The question that you should always ask yourself is: What user stories would this feature address? If it's difficult to come up with stories that make clear how the feature would help solve particular problems for users, then the feature is probably not needed. "If you build it, they will come" might work in baseball, but it rarely works in scientific software.
This is the reason that one of us (RP) regularly tells his trainees to post a note in their workspace with one simple mantra: "MVP".
This is the reason that I regularly tell my trainees to post a note in their workspace with one simple mantra: "MVP".


## Refactoring code
@@ -954,7 +954,7 @@ def get_subject_label(file):
return None
```

When one of us asked the question "Should there ever be a file path that doesn't include a subject label?", the answer was "No", meaning that this code allows what amounts to an error to occur without announcing its presence.
When I asked the question "Should there ever be a file path that doesn't include a subject label?", the answer was "No", meaning that this code allows what amounts to an error to occur without announcing its presence.
When we looked at the place where this function was used in the code, there was no check for whether the output was `None`, meaning that such an error would go unnoticed until it caused an error later when `subject_label` was assumed to be a string.
Also note that the docstring for this function is misleading, as it states that a message will be printed if the return value is `None`, but no message is actually printed.
In general, printing a message is a poor way to signal the potential presence of a problem, particularly if the code has a large amount of text output in which the message might be lost.
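
One way to make the failure loud is to raise an exception instead of returning `None`. The following is a minimal sketch and not the original function: the name `get_subject_label_strict` and the `sub-<label>` pattern are assumptions made for illustration, since the full body of `get_subject_label` is not shown here.

```python
import re

# Assumed label format ("sub-" followed by letters or digits); adjust to the real data.
SUBJECT_PATTERN = re.compile(r"sub-[A-Za-z0-9]+")


def get_subject_label_strict(file):
    """Return the subject label found in a file path.

    Raises:
        ValueError: if the path contains no subject label.
    """
    match = SUBJECT_PATTERN.search(str(file))
    if match is None:
        # Fail immediately rather than passing None downstream.
        raise ValueError(f"no subject label found in path: {file}")
    return match.group(0)


print(get_subject_label_strict("data/sub-01/func/sub-01_bold.nii.gz"))
# prints sub-01

try:
    get_subject_label_strict("data/derivatives/group_mask.nii.gz")
except ValueError as error:
    print(f"caught: {error}")
```

With this design, a path without a subject label stops the program at the point where the problem occurs, and the docstring accurately describes what the caller can expect.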