poldrack · poldrack · Dec 26, 2025 · Nov 27, 2025 · Nov 27, 2025 · Nov 28, 2025
diff --git a/.gitignore b/.gitignore
@@ -5,6 +5,10 @@ __pycache__
 .hypothesis
 .env
 ._*
+.snakemake
+
+# Workflow output directories
+**/simple_workflow/*/output/
 
 data
 exports

diff --git a/.gitmodules b/.gitmodules
@@ -1,4 +0,0 @@
-[submodule "my_datalad_repo"]
-	path = my_datalad_repo
-	url = ./my_datalad_repo
-	datalad-id = 74807713-a6cf-4418-9dfc-e490a881645b

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,93 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+This is an open-source book on building better code for science using AI, authored by Russell Poldrack. The rendered book is published at https://poldrack.github.io/BetterCodeBetterScience/.
+
+## Build Commands
+
+```bash
+# Install dependencies (uses uv package manager)
+uv pip install -r pyproject.toml
+uv pip install -e .
+
+# Build book as HTML and serve locally
+myst build --html
+npx serve _build/html
+
+# Build PDF (requires LaTeX)
+jupyter-book build book/ --builder pdflatex
+
+# Clean build artifacts
+rm -rf book/_build
+```
+
+## Testing
+
+```bash
+# Run all tests
+pytest
+
+# Run tests with coverage
+pytest --cov=src/BetterCodeBetterScience --cov-report term-missing
+
+# Run specific test modules
+pytest tests/textmining/
+pytest tests/property_based_testing/
+pytest tests/narps/
+
+# Run tests with specific markers
+pytest -m unit
+pytest -m integration
+```
+
+Test markers defined in pyproject.toml: `unit` and `integration`.
+
+## Linting and Code Quality
+
+```bash
+# Spell checking (configured in pyproject.toml)
+codespell
+
+# Python linting and formatting
+ruff check .
+ruff format .
+
+# Pre-commit hooks (runs codespell)
+pre-commit run --all-files
+```
+
+## Project Structure
+
+- `book/` - MyST markdown chapters (configured in myst.yml)
+- `src/BetterCodeBetterScience/` - Example Python code referenced in book chapters
+- `tests/` - Test examples demonstrating testing concepts from the book
+- `data/` - Data files for examples
+- `scripts/` - Utility scripts
+- `_build/` - Build output (gitignored)
+
+## Key Configuration Files
+
+- `myst.yml` - MyST book configuration (table of contents, exports, site settings)
+- `pyproject.toml` - Python dependencies, pytest config, codespell settings
+- `.pre-commit-config.yaml` - Pre-commit hooks (codespell)
+
+## Contribution Guidelines
+
+- New text should be authored by a human (AI may be used to check/improve text)
+- Code examples should follow PEP8
+- Avoid introducing new dependencies when possible
+- Custom words for codespell are in `project-words.txt`
+
+## Coding guidelines
+
+## Notes for Development
+
+- Think about the problem before generating code.
+- Write code that is clean and modular. Prefer shorter functions/methods over longer ones.
+- Prefer reliance on widely used packages (such as numpy, pandas, and scikit-learn); avoid unknown packages from GitHub.
+- Do not include *any* code in `__init__.py` files.
+- Use pytest for testing.
+- Use functions rather than classes for tests. Use pytest fixtures to share resources between tests.
diff --git a/Makefile b/Makefile
@@ -1,12 +1,16 @@
 clean:
 	- rm -rf book/_build
 
-build: clean
-	uv run jupyter-book build book/
+build-html: clean
+	myst build --html
+	npx serve _build/html
 
-pdf:
+build-pdf:
 	jupyter-book build book/ --builder pdflatex
 
+check-links:
+	check-links	
+
 pipinstall:
 	uv pip install -r pyproject.toml
 	uv pip install -e .

diff --git a/book/extras.md b/book/extras.md
@@ -329,3 +329,12 @@ Found 422 publications containing Memory in the title
 One very nice feature of the document store is that not all records have to have the same keys; this provides a great deal of flexibility at data ingestion.  However, too much heterogeneity between documents can make the database hard to work with.  One benefit of homogeneity in the document structure is that it allows indexing, which can greatly increase the speed of queries in large document stores.  For example, if we know that we will often want to search by the `year` field, then we can add an index for this field:
 
 *MORE HERE*
+
+
+### NARPS
+
+The example comes from a paper that we published in 2020 {cite:p}`Botvinik-Nezer:2020aa`, which involved analysis of data from a large study called the Neuroimaging Analysis Replication and Prediction Study (hereafter *NARPS* for short).  The goal of this study was to identify how the results of data analysis varied between different research groups when given the same data.  A relatively large neuroimaging dataset was collected and distributed to groups of researchers, who were asked to test a set of nine hypotheses about brain activity in relation to a monetary gambling task that the participants performed during MRI scanning.  Seventy teams submitted results, which included their answers to the nine yes/no hypotheses along with a detailed description of their analysis workflow and a number of outputs from intermediate stages of the analysis. The main finding was that there was a striking amount of variability in the results between teams, even though the raw data were identical.
+
+The workflow that I will use here starts with the results that the teams submitted, and ends with preprocessed data that are ready for further statistical analysis.  I wrote much of the original analysis code for the project, which can be found [here](https://github.com/poldrack/narps).  This code was written at the point when I was just becoming interested in software engineering practices for science, and while it represents a first step in that direction, it has *a lot* of problems. In particular, it uses the problematic *God object* anti-pattern that I mentioned in an earlier chapter.  For the purposes of this chapter I have first rewritten the analysis into a monolithic mega-script, which I will then incrementally refactor into a well-structured workflow.  I chose this example because it is relatively complex yet runs quickly on any modern laptop.  
+
+
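
Reviewer note on the indexing paragraph in this hunk (the one ending "we can add an index for this field:"): the idea can be illustrated without any database. The sketch below is plain Python and is not the API of the document store used in the book; the `records` list and the `build_index` helper are hypothetical names introduced only for illustration.

```python
from collections import defaultdict


def build_index(records, field):
    """Map each value of `field` to the positions of the records that contain it."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        # Records may lack the field, since documents need not share keys.
        if field in record:
            index[record[field]].append(position)
    return dict(index)


# Hypothetical records with heterogeneous keys.
records = [
    {"title": "Memory and attention", "year": 2019},
    {"title": "Working memory", "year": 2020},
    {"title": "Untitled preprint"},
    {"title": "Memory consolidation", "year": 2019},
]

year_index = build_index(records, "year")
# A lookup by year touches only the matching records instead of scanning all of them.
matches_2019 = [records[i]["title"] for i in year_index.get(2019, [])]
```

The record without a `year` key is simply absent from the index, which is one reason heavy heterogeneity between documents makes indexed queries less useful.
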
diff --git a/book/images/simple-DAG.png b/book/images/simple-DAG.png
diff --git a/book/images/snakemake-DAG.png b/book/images/snakemake-DAG.png
diff --git a/book/software_engineering.md b/book/software_engineering.md
@@ -69,7 +69,7 @@ User stories are also useful for thinking through the potential impact of new fe
 Perhaps the most common example of violations of YAGNI comes about in the development of visualization tools.
 In this example, the developer might decide to create a visualizer to show how the original dataset is being converted into the new format, with interactive features that would allow the user to view features of individual files.
 The question that you should always ask yourself is: What user stories would this feature address? If it's difficult to come up with stories that make clear how the feature would help solve particular problems for users, then the feature is probably not needed. "If you build it, they will come" might work in baseball, but it rarely works in scientific software.
-This is the reason that one of us (RP) regularly tells his trainees to post a note in their workspace with one simple mantra: "MVP".
+This is the reason that I regularly tell my trainees to post a note in their workspace with one simple mantra: "MVP".
 
 
 ## Refactoring code
@@ -954,7 +954,7 @@ def get_subject_label(file):
         return None
 ```
 
-When one of us asked the question "Should there ever be a file path that doesn't include a subject label?", the answer was "No", meaning that this code allows what amounts to an error to occur without announcing its presence.
+When I asked the question "Should there ever be a file path that doesn't include a subject label?", the answer was "No", meaning that this code allows what amounts to an error to occur without announcing its presence.
 When we looked at the place where this function was used in the code, there was no check for whether the output was `None`, meaning that such an error would go unnoticed until it caused an error later when `subject_label` was assumed to be a string.
 Also note that the docstring for this function is misleading, as it states that a message will be printed if the return value is `None`, but no message is actually printed.
 In general, printing a message is a poor way to signal the potential presence of a problem, particularly if the code has a large amount of text output in which the message might be lost.
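
Reviewer note on the `get_subject_label` passage above: one way to make the failure announce itself is to raise an exception instead of returning `None`. The following is a minimal sketch, not the code from the book. The full function body is not shown in this hunk, so the `sub-<label>` naming pattern and the `SUBJECT_PATTERN` regular expression are assumptions made for illustration.

```python
import re

# Assumption for illustration: subject labels appear in paths as
# "sub-<alphanumeric>". The pattern used in the book may differ.
SUBJECT_PATTERN = re.compile(r"sub-([a-zA-Z0-9]+)")


def get_subject_label(file):
    """Return the subject label found in a file path.

    Raises:
        ValueError: if the path contains no subject label.
    """
    match = SUBJECT_PATTERN.search(str(file))
    if match is None:
        # Fail immediately rather than returning None and letting
        # the problem surface later in unrelated code.
        raise ValueError(f"no subject label found in path: {file}")
    return match.group(1)
```

With this version, a call such as `get_subject_label("data/anat/T1w.nii.gz")` stops the program at the point where the bad path is encountered, so the caller no longer needs to remember to check for `None`.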