
🧠 PRISM to DataLad Conversion Tool


A robust, production-ready script for converting BIDS-formatted MRI datasets into DataLad superdatasets with subject-level subdatasets. This tool ensures data integrity through comprehensive validation and verification processes while optimizing storage efficiency using git-annex.

✨ Features

  • 🔍 BIDS Validation - Automatic validation using the official BIDS validator with custom configuration support
  • 📂 Hierarchical DataLad Structure - Creates superdatasets with subject-level subdatasets
  • 🗂️ Git-Annex Storage Optimization - Eliminates file duplication using git-annex
  • 🔐 Comprehensive Data Integrity - SHA-256 checksum verification of every file
  • 🧹 GZIP Header Cleaning - Automatic removal of problematic GZIP metadata for BIDS compliance (enabled by default)
  • ⚡ Fasttrack Mode - Skip checksum validation for faster conversions
  • 🚀 Performance Optimizations - Parallel hash calculation and progress monitoring
  • 🔍 Real-time Transparency - All processes run in foreground with live status updates
  • 🧪 Dry Run Mode - Preview operations without making changes
  • 💾 Smart Backup System - Optional backup creation for destination (not needed for source)
  • 📊 Detailed Logging - Comprehensive logs with timestamps and progress tracking
  • 🛡️ Robust Error Handling - Cross-platform compatibility and dependency checking
  • 📊 Progress Tracking - Real-time progress bars for all file operations
  • 🔒 Production Safety - Atomic operations, lock files, and comprehensive error recovery
  • 🌐 System Validation - Network, filesystem, and dependency checking
  • 💻 Resource Monitoring - Disk space, memory, and CPU monitoring
  • 🔄 Atomic Operations - Fail-safe copying with rollback capabilities
  • 🧹 Smart .DS_Store Cleanup - Automatic exclusion of macOS system files
  • 🌐 SSH Remote Support - Copy from/to SSH remote servers seamlessly
  • 🔄 Incremental Updates - Add new subjects to existing DataLad datasets

🏭 Production-Ready Features

Enterprise-Grade Safety

  • Atomic Operations - All-or-nothing conversions with automatic rollback
  • Lock File Management - Prevents concurrent execution conflicts (see the sketch after this list)
  • Comprehensive Pre-flight Checks - Validates all requirements before starting
  • Graceful Error Recovery - Automatic cleanup on interruption or failure
  • Resource Monitoring - Real-time tracking of disk space, memory, and CPU usage
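
The lock-file idea can be sketched in a few lines of Python (purely illustrative; the script implements its locking in Bash, and the lock path and function names here are hypothetical):

import os
import sys

LOCK = "/tmp/prism2datalad.lock"  # hypothetical lock path, for illustration only

def acquire_lock():
    """Create the lock file atomically; fail if another run already holds it."""
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file exists
        fd = os.open(LOCK, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        sys.exit("Another conversion appears to be running; aborting.")
    os.write(fd, str(os.getpid()).encode())  # record the owner's PID
    os.close(fd)

def release_lock():
    """Remove the lock file on normal exit or during cleanup."""
    os.remove(LOCK)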

Data Integrity Assurance

  • SHA-256 Verification - Every file is checksummed before and after conversion (see the sketch after this list)
  • Git-Annex Integrity Checks - Verifies storage optimization is working correctly
  • Progress Monitoring - Real-time progress tracking with detailed logging
  • Checkpoint System - Resume capability with detailed progress tracking
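
The checksum step can be illustrated with a short Python sketch (illustrative only; the shipped script does this in Bash with sha256sum or shasum): hash every file under the source tree in streaming chunks and compare against the copied tree.

import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MB chunks so large images never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_copy(source: Path, dest: Path) -> list:
    """Return relative paths whose checksums differ between the two trees."""
    mismatches = []
    for src_file in source.rglob("*"):
        if not src_file.is_file():
            continue
        rel = src_file.relative_to(source)
        dst_file = dest / rel
        if not dst_file.is_file() or sha256_of(src_file) != sha256_of(dst_file):
            mismatches.append(rel)
    return mismatches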

🚀 Quick Start

# Basic conversion with git-annex optimization
./prism2datalad.sh -s /path/to/bids_rawdata -d /path/to/datalad_destination

# Fast conversion without checksum validation (recommended for large datasets)
./prism2datalad.sh --fasttrack -s /path/to/bids_rawdata -d /path/to/datalad_destination

# With all safety features enabled
./prism2datalad.sh --backup --parallel-hash -s /path/to/bids_rawdata -d /path/to/datalad_destination

# Preview what would be done (dry run)
./prism2datalad.sh --dry-run -s /path/to/bids_rawdata -d /path/to/datalad_destination

# Skip GZIP header cleaning if needed (enabled by default)
./prism2datalad.sh --no-gzheader-check -s /path/to/bids_rawdata -d /path/to/datalad_destination

💾 Storage Efficiency

Git-Annex Integration

This tool automatically configures your DataLad dataset for optimal storage:

  • No File Duplication - Files are stored only once in git-annex, not in both working directory and .git
  • Symlink Structure - Working directory contains symlinks to git-annex content
  • On-Demand Access - Use datalad get to retrieve files when needed
  • Space Optimization - Significantly reduces storage requirements

Before and After

Traditional approach:

dataset/
├── sub-01_T1w.nii.gz         # 500MB file
└── .git/annex/objects/       # 500MB duplicate
    └── [hash]/sub-01_T1w.nii.gz

Our optimized approach:

dataset/
├── sub-01_T1w.nii.gz -> .git/annex/objects/[hash]  # symlink
└── .git/annex/objects/       # 500MB (only copy)
    └── [hash]/sub-01_T1w.nii.gz

📋 Prerequisites

Required Dependencies

The script automatically checks for these dependencies:

  • DataLad - Data management system
  • Deno - JavaScript runtime for BIDS validator
  • rsync - File synchronization utility
  • find, awk - Standard Unix utilities
  • SHA tools - Either sha256sum (Linux) or shasum (macOS)

Installation Commands

Ubuntu/Debian

# Install DataLad
sudo apt-get update
sudo apt-get install datalad

# Install Deno
curl -fsSL https://deno.land/x/install/install.sh | sh

# Other tools are usually pre-installed

macOS

# Install DataLad
brew install datalad

# Install Deno
curl -fsSL https://deno.land/x/install/install.sh | sh

# shasum is pre-installed on macOS

📁 Repository Structure

datalad/
├── prism2datalad.sh           # Main conversion script
├── gz_header_cleaner.py      # GZIP header cleaning tool (standalone)
├── compress_sourcedata.sh    # Helper script for data compression
├── README.md                 # This file
└── test/                     # Test datasets
    └── 107/                  # Example BIDS dataset

Main Scripts

  • prism2datalad.sh - Main conversion script with full DataLad integration
  • gz_header_cleaner.py - Standalone tool for cleaning GZIP headers (also integrated in main script)
  • compress_sourcedata.sh - Helper script for compressing BIDS datasets

🔧 Usage

./prism2datalad.sh [OPTIONS] -s SOURCE_DIR -d DESTINATION_DIR

Command Line Options

Option | Description | Default
-h | Show help message and exit | -
-s SOURCE_DIR | Required. Source directory containing BIDS data | -
-d DEST_DIR | Required. Destination directory for DataLad datasets | -
-c CONFIG_FILE | BIDS validator configuration file (JSON format) | -
--skip_bids_validation | Skip BIDS format validation | false
--dry-run | Show what would be done without executing | false
--backup | Create backup of destination before overwriting | false
--parallel-hash | Use parallel processing for hash calculation | false
--force-empty | Require the destination directory to be empty; abort otherwise | false
--fasttrack | Speed up conversion by skipping checksum validation | false
--quick-hash | Sample-based hash validation (faster, less thorough) | false
--skip-hash-validation | Skip hash validation entirely (fastest, least safe) | false
--no-gzheader-check | Skip GZIP header cleaning (headers may cause BIDS warnings) | false
--non-interactive | Run without interactive prompts (for remote/automated use) | false
--update | Reuse existing DataLad dataset and ingest new/changed subjects | false
--cleanup DATASET_PATH | Safely remove a DataLad dataset with proper cleanup | -

📁 Directory Structure Requirements

Your source directory should follow BIDS structure:

your-study/
└── rawdata/           ← Point -s here (can be any name)
    ├── dataset_description.json
    ├── participants.tsv
    ├── sub-01/
    │   ├── anat/
    │   └── func/
    ├── sub-02/
    │   ├── anat/
    │   └── func/
    └── ...

Note: The source directory name doesn't have to be "rawdata" - it can be any name (e.g., "bids_data", "data", etc.). The script will preserve the original directory name in the DataLad structure.

📖 Examples

Basic Conversion

./prism2datalad.sh -s /data/my-study/rawdata -d /storage/datalad

Result: Creates /storage/datalad/my-study/rawdata/ with DataLad structure

BIDS Validator Configuration

./prism2datalad.sh -s /data/my-study/rawdata -d /storage/datalad -c /path/to/bids_config.json

Result: Use custom BIDS validator configuration to ignore specific warnings/errors

Example config file:

{
  "ignore": [
    {"code": "JSON_KEY_RECOMMENDED", "location": "/T1w.json"}
  ],
  "warning": [],
  "error": [{"code": "NO_AUTHORS"}]
}

Different Source Directory Names

./prism2datalad.sh -s /data/my-study/bids_data -d /storage/datalad

Result: Creates /storage/datalad/my-study/bids_data/ with DataLad structure

Safe Conversion with Backup

./prism2datalad.sh --backup -s /data/my-study/rawdata -d /storage/datalad

Result: Creates backup before conversion if destination exists

SSH Remote Access

# Copy from SSH remote source to local destination
./prism2datalad.sh -s user@server:/path/to/bids_data -d /local/destination

# Copy from local source to SSH remote destination  
./prism2datalad.sh -s /local/bids_data -d user@server:/path/to/destination

# Use quick hash validation for SSH (recommended for better performance)
./prism2datalad.sh --quick-hash -s user@server:/path/to/bids_data -d /local/destination

Result: Seamlessly copy from/to remote servers using rsync over SSH

Fast Conversion with Fasttrack Mode

./prism2datalad.sh --fasttrack -s /data/my-study/rawdata -d /storage/datalad

Result: Skips checksum validation for significantly faster processing (recommended for large datasets)

Fast Conversion with Parallel Processing

./prism2datalad.sh --parallel-hash -s /data/my-study/rawdata -d /storage/datalad

Result: Uses parallel hash calculation for faster verification

Ultimate Speed Conversion

./prism2datalad.sh --fasttrack --parallel-hash -s /data/my-study/rawdata -d /storage/datalad

Result: Combines fasttrack mode with parallel processing for maximum speed

Incremental Update (Add Newly Acquired Subjects)

./prism2datalad.sh --update -s /data/my-study/rawdata -d /storage/datalad

Result: Reuses the existing DataLad dataset and only ingests new or changed subjects from the source BIDS directory.

Preview Mode (Recommended First Run)

./prism2datalad.sh --dry-run -s /data/my-study/rawdata -d /storage/datalad

Result: Shows what would be done without making changes

Skip Validation (For Pre-validated Datasets)

./prism2datalad.sh --skip_bids_validation -s /data/my-study/rawdata -d /storage/datalad

Result: Skips BIDS validation step

Safe Mode (Force Empty Directory)

./prism2datalad.sh --force-empty -s /data/my-study/rawdata -d /storage/datalad

Result: Aborts if destination directory is not empty (safest option)

Skip GZIP Header Cleaning

./prism2datalad.sh --no-gzheader-check -s /data/my-study/rawdata -d /storage/datalad

Result: Disables automatic GZIP header cleaning (may cause BIDS validator warnings)

Full-Featured Conversion

./prism2datalad.sh --backup --parallel-hash -s /data/my-study/rawdata -d /storage/datalad

Result: Maximum safety with backup and parallel processing

Real-Time Progress Monitoring

All operations now provide live status updates:

📁 Counting files to copy (excluding .DS_Store files)...
Found 4,494 files to copy (excluding system files)
📁 Copying files from /source to /destination...
⚡ Fasttrack mode: Skipping checksum validation for speed
🚀 Starting file copy operation...
[====================] 100% (4,494/4,494 files)
✅ File copy completed, checking final disk space...
💾 Available disk space: 142GB

📂 Creating sub-datasets for each subject...
📊 Found 24 subjects to process
📁 [1/24] Creating sub-dataset for subject: sub-01
✅ [1/24] Sub-dataset created: sub-01
⚙️ [1/24] Configuring git-annex settings for sub-dataset: sub-01
...

🧹 GZIP Header Cleaner

Overview

The PRISM to DataLad conversion tool includes automatic GZIP header cleaning (enabled by default) to ensure BIDS compliance. Some GZIP files contain metadata in their headers (timestamps, original filenames) that can cause BIDS validator warnings. This tool removes these problematic fields while preserving the compressed content.

What It Does

The GZIP header cleaner (see the sketch after this list):

  • ✅ Removes MTIME (modification timestamp) - sets to 0
  • ✅ Removes FNAME (original filename) - eliminates the field
  • ✅ Removes FHCRC (header CRC) - removed when other fields change
  • ✅ Preserves FEXTRA and FCOMMENT fields (kept unchanged)
  • ✅ Maintains file content integrity - only header metadata is modified
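
Conceptually, detection needs only the first ten bytes of each file. A minimal Python sketch of the check (illustrative; per the Technical Details below, the tool's actual detection pass is implemented in Bash):

import struct

def needs_cleaning(path):
    """Return True if a .gz file carries a nonzero MTIME or an FNAME field."""
    with open(path, "rb") as fh:
        header = fh.read(10)
    if len(header) < 10 or header[:2] != b"\x1f\x8b":
        return False  # not a GZIP file
    mtime = struct.unpack("<I", header[4:8])[0]  # bytes 4-7: little-endian timestamp
    has_fname = bool(header[3] & 0x08)           # byte 3: flags; 0x08 = FNAME bit
    return mtime != 0 or has_fname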

Integration with Main Script

GZIP header cleaning is enabled by default during conversion:

# Default behavior - headers are automatically cleaned
./prism2datalad.sh -s /path/to/bids_data -d /path/to/destination

# Disable header cleaning if needed
./prism2datalad.sh --no-gzheader-check -s /path/to/bids_data -d /path/to/destination

The cleaning happens after files are copied but before DataLad operations, ensuring:

  • Files are validated for integrity first
  • Headers are cleaned for BIDS compliance
  • DataLad stores the corrected versions

Standalone Python Tool

The gz_header_cleaner.py script can also be used independently:

# Clean all .gz files in a directory
python3 gz_header_cleaner.py /path/to/bids/dataset

# Dry run to see what would be cleaned
python3 gz_header_cleaner.py --dry-run /path/to/bids/dataset

# Verbose output with detailed progress
python3 gz_header_cleaner.py --verbose /path/to/bids/dataset

# Quiet mode - only show summary
python3 gz_header_cleaner.py --quiet /path/to/bids/dataset

Usage Examples

Check what would be cleaned (dry run):

$ python3 gz_header_cleaner.py --dry-run --verbose /path/to/dataset
🔍 Found 150 .gz files to check...
✅ sub-01_T1w.nii.gz already has clean header
🧪 Would clean sub-01_bold.nii.gz (mtime=1759775498, filename)
🧪 Would clean sub-02_T1w.nii.gz (mtime=1759775500)
✅ sub-02_bold.nii.gz already has clean header
...
🧪 Dry run complete: 150 files checked, 45 would be cleaned, 0 errors

Clean headers:

$ python3 gz_header_cleaner.py --verbose /path/to/dataset
🔍 Found 150 .gz files to check...
✅ Cleaned sub-01_bold.nii.gz (mtime=1759775498, filename)
✅ Cleaned sub-02_T1w.nii.gz (mtime=1759775500)
✅ GZIP header cleaning complete: 45/150 files cleaned

Quiet operation:

$ python3 gz_header_cleaner.py --quiet /path/to/dataset
✅ GZIP header cleaning complete: 45/150 files cleaned

Why GZIP Header Cleaning Matters

Problem: Some GZIP compression tools (like standard gzip) include metadata in file headers:

  • MTIME: Timestamp when the file was compressed (changes on every compression)
  • FNAME: Original filename (may differ from current filename)

Impact on BIDS:

  • BIDS validator may report warnings about inconsistent metadata
  • File comparisons show differences even if content is identical
  • Version control systems detect changes unnecessarily

Solution: Clean headers (demonstrated in the snippet after this list) ensure:

  • ✅ BIDS compliance without warnings
  • ✅ Consistent file hashes for same content
  • ✅ Clean version control history
  • ✅ Reproducible datasets
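
The effect is easy to reproduce with Python's standard gzip module, which exposes the MTIME field directly (a self-contained demonstration, independent of this repository's tools; the mtime argument requires Python 3.8+):

import gzip

data = b"identical imaging payload"
a = gzip.compress(data, mtime=1)  # compressed at one timestamp
b = gzip.compress(data, mtime=2)  # same content, different timestamp
c = gzip.compress(data, mtime=0)  # cleaned-style header, twice
d = gzip.compress(data, mtime=0)

print(a == b)  # False: headers differ although the content is identical
print(c == d)  # True:  zeroed MTIME makes the bytes reproducible
print(gzip.decompress(a) == gzip.decompress(c))  # True: content unchanged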

Technical Details

The tool uses a hybrid approach:

  1. Detection (Bash): Fast header scanning to identify problematic files
  2. Cleaning (Python): Safe binary manipulation to remove metadata
  3. Verification: Content integrity is preserved, only headers are modified

Header Structure:

Bytes 0-9:   Fixed header (magic, compression method, flags, mtime, etc.)
Bytes 10+:   Optional fields (EXTRA, NAME, COMMENT, CRC, compressed data)

Cleaning Process (sketched in Python below):

  1. Parse GZIP header structure
  2. Clear MTIME field (set to 0)
  3. Remove FNAME field if present
  4. Remove FHCRC field if present
  5. Preserve FEXTRA and FCOMMENT fields
  6. Copy compressed data unchanged
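
Those steps translate into a compact Python sketch (a simplified illustration; the shipped gz_header_cleaner.py wraps the same idea in validation and error handling):

def clean_gzip_header(raw: bytes) -> bytes:
    """Zero MTIME, drop FNAME/FHCRC, keep FEXTRA/FCOMMENT and the data."""
    FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10  # RFC 1952 flag bits
    flags = raw[3]
    pos = 10                                   # end of the fixed header
    extra = b""
    if flags & FEXTRA:                         # keep FEXTRA unchanged
        xlen = int.from_bytes(raw[pos:pos + 2], "little")
        extra = raw[pos:pos + 2 + xlen]
        pos += 2 + xlen
    if flags & FNAME:                          # skip the zero-terminated filename
        pos = raw.index(b"\x00", pos) + 1
    comment = b""
    if flags & FCOMMENT:                       # keep FCOMMENT unchanged
        end = raw.index(b"\x00", pos) + 1
        comment = raw[pos:end]
        pos = end
    if flags & FHCRC:                          # drop the 2-byte header CRC
        pos += 2
    new_flags = flags & ~(FNAME | FHCRC)
    fixed = raw[:3] + bytes([new_flags]) + b"\x00\x00\x00\x00" + raw[8:10]
    return fixed + extra + comment + raw[pos:]  # compressed data is untouched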

Command-Line Options

Option | Description
directory | Path to directory containing .gz files (required)
--dry-run | Show what would be cleaned without making changes
--verbose, -v | Show detailed progress information
--quiet, -q | Suppress all output except errors
--help, -h | Show help message

Verification

After cleaning, verify the content is preserved:

# Original file
$ gunzip -c original.nii.gz | md5
abc123def456...

# After cleaning
$ python3 gz_header_cleaner.py /path/to/dataset
$ gunzip -c original.nii.gz | md5
abc123def456...  # Same hash - content preserved!

# But headers are different
$ md5 original.nii.gz
Before: xyz789abc123...
After:  def456ghi789...  # Different - only headers changed


📊 Output Structure

Given the command:

./prism2datalad.sh -s /data/study-name/rawdata -d /storage/datalad

The resulting DataLad structure will be:

/storage/datalad/study-name/rawdata/     ← DataLad superdataset
├── .datalad/                            ← DataLad metadata
├── .git/                                ← Git repository
│   └── annex/                           ← Git-annex content storage
│       └── objects/                     ← Actual file content
├── dataset_description.json             ← BIDS metadata (regular file)
├── participants.tsv                     ← BIDS participants file (regular file)
├── sub-01/                              ← DataLad subdataset
│   ├── .datalad/                        ← Subdataset metadata
│   ├── anat/
│   │   └── sub-01_T1w.nii.gz           ← Symlink to git-annex
│   └── func/
│       └── sub-01_task-rest_bold.nii.gz ← Symlink to git-annex
├── sub-02/                              ← DataLad subdataset
│   ├── .datalad/
│   └── ...
└── conversion_YYYYMMDD_HHMMSS.log       ← Conversion log

Important: The script preserves the original source directory name. If your source is /data/study/bids_data, the output will be /storage/datalad/study/bids_data/.

🗂️ Working with Git-Annex Files

Accessing Files

After conversion, large files are stored as symlinks. To access them:

# Get a specific file
datalad get sub-01/anat/sub-01_T1w.nii.gz

# Get all files in a directory
datalad get sub-01/func/

# Get all files in the dataset
datalad get -r .

Freeing Space

After you're done with files, you can free up space:

# Drop a specific file
datalad drop sub-01/anat/sub-01_T1w.nii.gz

# Drop all files in a directory
datalad drop sub-01/func/

# Drop all files in the dataset
datalad drop -r .

Checking File Status

# Check which files are present locally
datalad status

# Check file availability
git annex whereis sub-01/anat/sub-01_T1w.nii.gz

# List all files (both present and absent)
git annex list

🔍 Data Integrity Verification

Comprehensive File Checking

The script performs thorough integrity verification:

  1. Pre-conversion verification - Checks all files are copied correctly
  2. SHA-256 checksum validation - Every file is verified with checksums
  3. Git-annex integrity check - Verifies git-annex storage is working
  4. Progress monitoring - Real-time progress during verification

Verification Process

🔍 Performing comprehensive integrity validation...
This may take a while depending on dataset size...
Verifying 150 BIDS files...
[====================] 100% (150/150)
✅ All 150 files passed integrity verification
🔍 Verifying git-annex storage integrity...
✅ Git-annex tracking confirmed for: sub-01_T1w.nii.gz
✅ Git-annex storage verification passed

📝 Logging and Monitoring

Log File

All operations are logged to conversion_YYYYMMDD_HHMMSS.log with timestamps:

2025-07-17 14:30:15 - INFO - 🚀 Running BIDS Validator...
2025-07-17 14:30:20 - INFO - ✅ BIDS validation completed successfully!
2025-07-17 14:30:21 - INFO - 📂 Creating DataLad superdataset...
2025-07-17 14:30:25 - INFO - 📁 Creating sub-dataset for subject: sub-01
2025-07-17 14:30:30 - INFO - 📁 Copying files from source to destination...
2025-07-17 14:30:45 - INFO - 🔍 Performing comprehensive integrity validation...
2025-07-17 14:30:50 - INFO - ✅ All 150 files passed integrity verification
2025-07-17 14:30:55 - INFO - 🗂️ Configuring git-annex for optimized storage...
2025-07-17 14:31:00 - INFO - ✅ File content dropped successfully
2025-07-17 14:31:05 - INFO - ✅ DataLad conversion completed successfully!

Progress Monitoring

  • Real-time progress bars for file operations
  • File counting and processing status
  • Hash verification progress (with parallel option)
  • Git-annex configuration progress

Terminal Output

The script provides color-coded output:

  • 🔵 Blue - Headers and important information
  • ✅ Green - Success messages
  • ⚠️ Yellow - Warnings
  • ❌ Red - Errors

🛡️ Error Handling and Safety

Automatic Checks

  • ✅ Dependency verification before execution
  • ✅ Source directory validation (BIDS structure check)
  • ✅ Path resolution and accessibility
  • ✅ Destination directory collision detection
  • ✅ DataLad structure detection in destination
  • ✅ Empty directory enforcement (with --force-empty)

Safety Features

  • 💾 Backup creation - Optional automatic backups with --backup
  • 🧪 Dry run mode - Preview without changes with --dry-run
  • 🔒 Force empty mode - Require empty destination with --force-empty
  • 🔍 Interactive confirmations - For destructive operations
  • 📊 Comprehensive logging - Full audit trail
  • 🚨 DataLad conflict detection - Warns about existing DataLad datasets

Why Backup is Not Needed for Source

Important: You do NOT need to back up your source BIDS data because:

  • Read-only operation - Source files are never modified
  • Source remains untouched - Original BIDS dataset is completely preserved
  • Creates new dataset - Conversion creates a new DataLad dataset elsewhere
  • Git-annex integrity - Files are checksummed and tracked by git-annex

When Backup is Useful

The --backup option is for the destination directory only:

  • 🔄 Re-running conversion - Backup existing DataLad dataset before overwriting
  • 🛡️ Safety net - Preserve previous conversion attempts
  • 📁 Destination conflicts - Handle existing destination directories safely

Automatic Safety Checks

  • Dependency verification - Checks all required tools are installed
  • Source validation - Verifies BIDS structure and accessibility
  • Path resolution - Ensures all paths are valid and accessible
  • Destination collision - Detects and handles existing destinations
  • File integrity - Comprehensive SHA-256 verification
  • Git-annex validation - Ensures storage optimization works correctly

Data Integrity

  • 🔐 SHA-256 verification - Every file is hash-verified before and after git-annex conversion
  • 📝 Operation logging - Complete operation history
  • 🔄 Atomic operations - Rollback on failure
  • 🗂️ Git-annex integrity - Verifies git-annex storage is working correctly

🚀 Performance Features

Speed Optimization Options

  • Fasttrack Mode (--fasttrack) - Skip checksum validation for 2-3x faster processing
  • Parallel Processing (--parallel-hash) - Use multiple CPU cores for hash calculation (see the sketch after this list)
  • Combined Mode - Use both options together for maximum speed
  • Optimized rsync - Efficient file copying with progress tracking
  • Smart .DS_Store Exclusion - Automatic filtering of macOS system files
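
Spreading the hashing across cores can look like the following Python sketch (one plausible illustration of the idea behind --parallel-hash; the script itself orchestrates parallelism in Bash):

from concurrent.futures import ProcessPoolExecutor
from hashlib import sha256
from pathlib import Path

def digest(path: Path) -> str:
    """Chunked SHA-256 so files are streamed, not loaded whole."""
    h = sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_tree(root: Path) -> dict:
    """Hash every file under root with one worker process per CPU core."""
    files = [p for p in root.rglob("*") if p.is_file()]
    # Call from under `if __name__ == "__main__":` on spawn-based platforms
    with ProcessPoolExecutor() as pool:
        return dict(zip(files, pool.map(digest, files)))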

Performance Comparison

Mode | Speed | Data Integrity | Use Case
Standard | Baseline | Full SHA-256 validation | Production, critical data
Fasttrack | 2-3x faster | File size/timestamp validation | Large datasets, trusted sources
Parallel | 1.5-2x faster | Full SHA-256 validation | CPU-bound operations
Combined | 3-4x faster | File size/timestamp validation | Large datasets, time-critical

The size/timestamp check behind the fasttrack modes is sketched below.
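
An approximation of that lighter check in a few lines of Python (an illustration of the principle, not the script's exact logic, which lives in prism2datalad.sh):

import os

def quick_match(src: str, dst: str) -> bool:
    """Cheap rsync-style check: compare size and mtime instead of content."""
    s, d = os.stat(src), os.stat(dst)
    return s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime)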

Real-time Monitoring

All processes now run transparently in the foreground with live status updates:

  • File counting and progress - See exactly what's being processed
  • Disk space monitoring - Real-time available space tracking
  • Subject-by-subject progress - Track sub-dataset creation ([1/24], [2/24], etc.)
  • DataLad operation visibility - See every git-annex and DataLad command
  • Error detection - Immediate feedback when issues occur

Resource Management

  • Memory efficient - Streams large files without loading into memory
  • Disk space aware - Checks available space before operations
  • Process monitoring - Graceful handling of interruptions
  • Git-annex optimization - Eliminates storage duplication

🐛 Troubleshooting

Common Issues

Missing Dependencies

❌ Missing required dependencies: deno datalad

Solution: Install missing dependencies using your package manager

Permission Issues

❌ Failed to create destination directory: /path/to/dest

Solution: Check write permissions or run with appropriate privileges

BIDS Validation Failure

❌ BIDS validation failed!

Solution: Fix the BIDS structure, or use --skip_bids_validation if the reported errors are false positives

Hash Mismatch

❌ Hash mismatch for file: /path/to/file

Solution: Check source file integrity and re-run conversion

Git-Annex Issues

❌ Git-annex tracking missing for: filename

Solution: Check git-annex installation and repository integrity

Git-Annex File Access Issues

If you can't access files after conversion:

# Check if files are present
datalad status

# Get missing files
datalad get <filename>

# Check git-annex status
git annex whereis <filename>

Debug Mode

For detailed debugging, check the log file:

tail -f conversion_$(date +%Y%m%d)_*.log

🔄 Migration from Previous Versions

If you're upgrading from an earlier version:

  1. New git-annex optimization - Files are now stored efficiently
  2. Enhanced integrity checking - More thorough file verification
  3. Improved error handling - Better error messages and recovery
  4. Enhanced logging - More detailed operation logs
  5. Performance improvements - Faster execution with parallel options

Post-Conversion File Access

After conversion, remember to use datalad get to access files:

# Old way (files were always present)
cat sub-01/anat/sub-01_T1w.nii.gz

# New way (get files first)
datalad get sub-01/anat/sub-01_T1w.nii.gz
cat sub-01/anat/sub-01_T1w.nii.gz

🆕 Recent Updates

Version 2.2 Changes (Enhanced Transparency, Speed & BIDS Compliance)

  • 🧹 GZIP Header Cleaning: Automatic removal of problematic GZIP metadata (MTIME, FNAME) for BIDS compliance (enabled by default)
  • 🔧 BIDS Validator Configuration: Support for custom configuration files with -c option to ignore specific warnings/errors
  • 🌐 SSH Remote Support: Seamless copying from/to SSH remote servers using rsync
  • ⚡ Hash Validation Options: Three-tier system (full/quick/skip) for performance tuning on large datasets
  • ⚡ Fasttrack Mode: New --fasttrack option skips checksum validation for 2-3x faster conversions
  • 🔍 Real-time Transparency: All processes now run in foreground with live status updates
  • 📊 Enhanced Progress Tracking: Subject-by-subject progress indicators and detailed status messages
  • 🧹 Improved .DS_Store Handling: Comprehensive cleanup and exclusion of macOS system files
  • 💾 Disk Space Monitoring: Real-time available space tracking throughout conversion
  • 🛠️ Logging Improvements: Added missing log_warning and log_success functions
  • 🔄 Process Visibility: See every DataLad and git-annex operation as it happens
  • ⏱️ Time Estimation: Better progress indication with detailed operation descriptions
  • 🚀 Performance Analysis: Speed comparison table and optimization recommendations
  • 🔧 Bug Fixes: Fixed critical script abort issues and improved error handling

Version 2.1 Changes (Production-Ready)

  • 🗂️ Git-Annex Storage Optimization: Eliminates file duplication using git-annex
  • 🔐 Comprehensive Integrity Verification: SHA-256 checksum validation of every file
  • 🔒 Production Safety: Added atomic operations, lock files, and comprehensive error recovery
  • 🌐 System Validation: Network connectivity, filesystem compatibility, and dependency checking
  • 📋 Checkpoint System: Resume capability with detailed progress tracking and recovery
  • 💻 Resource Monitoring: Real-time disk space, memory, and CPU monitoring
  • 🔄 Atomic Operations: Fail-safe copying with automatic rollback on failures
  • ⚡ Enhanced Performance: Improved parallel processing and progress estimation
  • 🛡️ Advanced Error Handling: Comprehensive pre-flight checks and validation
  • 📊 Detailed Reporting: Enhanced logging with duration tracking and system information

Version 2.0 Changes

  • Fixed rawdata assumption: Script now preserves original source directory name instead of hardcoding "rawdata"
  • Timestamped log files: Log files now include timestamp in filename (e.g., conversion_20250710_194530.log)
  • Missing function fix: Added missing safe_datalad and dry_run_check functions
  • Enhanced safety checks: Added --force-empty option and DataLad structure detection
  • Improved destination handling: Better validation and backup options for non-empty directories
  • Interactive safety prompts: Multiple options when DataLad datasets are detected in destination
  • Improved error handling: Better error messages and debugging information
  • Enhanced documentation: Updated examples and usage instructions

Breaking Changes

  • Output path structure: The destination path now uses the actual source directory name instead of always using "rawdata"
  • Log file names: Log files now have timestamps in the filename
  • File access: Large files are now stored as git-annex symlinks, requiring datalad get to access
  • Process transparency: All operations now run in foreground - no background processes
  • Fasttrack mode: New speed optimization changes verification behavior

🤝 Contributing

We welcome contributions! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

Development Setup

git clone https://github.com/MRI-Lab-Graz/datalad.git
cd datalad
chmod +x prism2datalad.sh
chmod +x gz_header_cleaner.py

Testing

# Test the main conversion script
./prism2datalad.sh --dry-run -s test/107 -d /tmp/test_output

# Test the GZIP header cleaner
python3 gz_header_cleaner.py --dry-run --verbose test/107

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • BIDS Standard - Brain Imaging Data Structure
  • DataLad - Data management and publication platform
  • Git-Annex - Managing files with git, without checking their contents in
  • BIDS Validator - Official BIDS validation tool

📞 Support


Made with ❤️ by the MRI Lab Graz
