dataset-prep

This Python package provides utilities for preparing datasets for publication, building on the Frictionless Data framework and its corresponding Python package.

This package is currently in alpha status. It provides a script that generates field-level information from a Frictionless datapackage file, for inclusion in a dataset readme (plain text) or an accompanying data dictionary (CSV). The script assumes you have already created a datapackage that describes your dataset.

Basic Usage

Install the Python package using your preferred method (pip or uv):

pip install dataset-prep

Run the dataset-readme-info script with a path to your datapackage file. The data files referenced in the datapackage must be present at the path specified.
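Before running the script, it can help to confirm that every data file named in the datapackage is actually on disk. The following is a minimal sketch using only the Python standard library; it is not part of dataset-prep. The function name `missing_resource_files` is hypothetical, and the sketch assumes that resource paths are relative to the directory containing the datapackage file.

```python
import json
from pathlib import Path


def missing_resource_files(datapackage_path):
    """Return resource paths listed in the descriptor that are not on disk.

    Assumption: local paths are resolved relative to the directory that
    contains the datapackage file.
    """
    datapackage_path = Path(datapackage_path)
    base_dir = datapackage_path.parent
    descriptor = json.loads(datapackage_path.read_text(encoding="utf-8"))

    missing = []
    for resource in descriptor.get("resources", []):
        # Resources with inline data have no "path" and are skipped
        paths = resource.get("path", [])
        # "path" may be a single string or a list of strings
        if isinstance(paths, str):
            paths = [paths]
        for path in paths:
            # Remote resources cannot be checked on the local filesystem
            if path.startswith(("http://", "https://")):
                continue
            if not (base_dir / path).is_file():
                missing.append(path)
    return missing
```

Calling `missing_resource_files("my-dataset/datapackage.json")` returns an empty list when all local files are present. This only checks that files exist; it does not compare their contents against the schema.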

Note

We highly recommend running frictionless validate on your datapackage to ensure your dataset and your datapackage agree on the structure of your data!

To generate a plain-text list of fields with the descriptions in the datapackage file:

dataset-readme-info my-dataset/datapackage.json

The script will output text content to the console, which can be copied and pasted into the readme for your dataset.
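To illustrate the kind of transformation involved, here is a rough sketch of turning a datapackage descriptor into a plain-text field list. It is not the script's actual implementation, and the real output format may differ. The function name `field_list_text` and the sample descriptor are hypothetical, and the sketch assumes each resource has an inline schema with a list of fields.

```python
def field_list_text(descriptor):
    """Format each resource's fields as plain-text lines for a readme."""
    lines = []
    for resource in descriptor.get("resources", []):
        schema = resource.get("schema")
        # Assumption: schemas are inline objects, not references to other files
        if not isinstance(schema, dict):
            continue
        lines.append(resource.get("name", "(unnamed resource)"))
        for field in schema.get("fields", []):
            description = field.get("description", "")
            # Fields without a description are listed by name only
            lines.append(f"- {field['name']}: {description}".rstrip())
        # Blank line between resources
        lines.append("")
    return "\n".join(lines).strip()


# Hypothetical descriptor, for illustration only
example = {
    "resources": [
        {
            "name": "members",
            "schema": {
                "fields": [
                    {"name": "uri", "type": "string", "description": "Unique identifier"},
                    {"name": "name", "type": "string"},
                ]
            },
        }
    ]
}

print(field_list_text(example))
# members
# - uri: Unique identifier
# - name:
```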

To generate a CSV data dictionary with field information (description, type, name) for each resource described in the datapackage file, specify the path where the file should be generated:

dataset-readme-info my-dataset/datapackage.json --data-dictionary my-dataset/datadictionary.csv
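As a sketch of what a data dictionary of this kind contains, the following standard-library code writes one CSV row per field for every resource in a datapackage. It is an illustration under stated assumptions, not the code behind `--data-dictionary`. The function name `write_data_dictionary`, the column names, and the column order are all hypothetical, so the script's real output may be laid out differently.

```python
import csv
import json
from pathlib import Path

# Hypothetical column layout; the real script may use different headers
COLUMNS = ["resource", "name", "type", "description"]


def write_data_dictionary(datapackage_path, output_path):
    """Write one CSV row per field in each resource; return the row count."""
    descriptor = json.loads(Path(datapackage_path).read_text(encoding="utf-8"))
    row_count = 0
    # newline="" lets the csv module control line endings
    with open(output_path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=COLUMNS)
        writer.writeheader()
        for resource in descriptor.get("resources", []):
            schema = resource.get("schema")
            # Assumption: schemas are inline objects, not file references
            if not isinstance(schema, dict):
                continue
            for field in schema.get("fields", []):
                writer.writerow(
                    {
                        "resource": resource.get("name", ""),
                        "name": field.get("name", ""),
                        "type": field.get("type", ""),
                        "description": field.get("description", ""),
                    }
                )
                row_count += 1
    return row_count
```

Keeping the resource name in its own column means a single CSV can describe several data files, which matches the "for each resource" behavior described above.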

Use the -h or --help option for script usage.

Examples

The dataset-readme-info script is generalized from one that was used to help prepare datasets from the Shakespeare and Company Project for publication.

Version 2.0 of the data, published in 2025, includes a CSV data dictionary:

Koeser, Rebecca Sutton & Kotin, Joshua. (2025). Shakespeare and Company Project Datasets [Data set]. Version 2. Princeton University. https://doi.org/10.34770/kf6c-b079

Version 1.2 of the data, published in 2022, includes field details in the README:

Kotin, Joshua, Koeser, Rebecca Sutton, et al. (2022). Shakespeare and Company Project Dataset: Lending Library Members, Books, Events [Data set]. Version 1.2. Princeton University. https://doi.org/10.34770/dtqa-2981

License

This project is licensed under the Apache 2.0 License.
