A set of USDM and other utilities
- Excel to USDM JSON Converter (from_excel.py)
- USDM to M11 Inclusion/Exclusion Criteria Converter (to_m11.py)
- Visit-Based Timeline Visualization (to_pj.py)
- D3 Timeline Visualization (to_timeline.py)
- OCR Text Extraction from Images (to_text.py)
- Image Renamer that renames image files based on their metadata timestamps (rename_images.py)
A Python program that converts USDM (Unified Study Definitions Model) JSON files into HTML documents displaying inclusion and exclusion criteria in M11 ICH format.
This tool reads USDM JSON study definition files and generates a formatted HTML page containing the inclusion and exclusion criteria sections (5.2 and 5.3) in a format suitable for M11 regulatory documentation. The program intelligently resolves USDM references and tags to create human-readable criteria statements.
- USDM4 Compliance: Built using the USDM4 library for proper study definition handling
- Reference Resolution: Automatically resolves usdm:ref and usdm:tag references within criteria text
- M11 Formatting: Outputs criteria in standard M11 section format (5.2 for Inclusion, 5.3 for Exclusion)
- Times New Roman Styling: Uses Times New Roman font at 12pt for regulatory document compliance
- Bootstrap UI: Clean, professional styling with Bootstrap 5 (Zephyr theme)
- Error Handling: Comprehensive error logging and reporting
- Recursive Translation: Handles nested references within criteria text
pip install beautifulsoup4
pip install yattag
pip install usdm4
pip install simple-error-log
- BeautifulSoup4: HTML parsing and manipulation
- yattag: HTML document generation
- usdm4: USDM data model and builder
- simple-error-log: Error logging and reporting
Basic usage:
python3 to_m11.py <usdm_json_file>
# Generate HTML from a USDM JSON file
python3 to_m11.py study_definition.json
# Process a file in a specific directory
python3 to_m11.py /path/to/phuse_eu/M11_USDM.json
# The output file will be created in the same directory with "_criterion" suffix
# Input: M11_USDM.json
# Output: M11_USDM_criterion.html
The program expects a USDM JSON file containing:
{
"id": "...",
"identifier": "1",
"category": {
"code": "C25532", // For inclusion criteria
// OR
"code": "C25370" // For exclusion criteria
},
"criterionItemId": "..."
}
{
"id": "...",
"text": "Patient must be ≥18 years of age",
"dictionaryId": "..." // Optional, for parameter references
}
The program handles special USDM markup within criterion text:
- <usdm:ref id="..." attribute="..." />: References to other USDM objects
- <usdm:tag name="..." />: Dictionary parameter references
- <u>...</u>: Underlined text
- <i>...</i>: Italic text
The program generates a self-contained HTML file with:
<!DOCTYPE html>
<html>
<head>
<!-- Bootstrap 5 CSS (Zephyr theme) -->
<!-- Bootstrap Icons -->
<!-- Custom styling for M11 format -->
</head>
<body>
<div class="container-fluid">
<h2>5.2 Inclusion Criteria</h2>
<p>To be eligible to participate in this trial...</p>
<table>
<!-- Numbered inclusion criteria -->
</table>
<h2>5.3 Exclusion Criteria</h2>
<p>An individual who meets any of the following criteria...</p>
<table>
<!-- Numbered exclusion criteria -->
</table>
</div>
</body>
</html>
- Font: Times New Roman, 12pt (body text), 14pt (headings)
- Layout: Full-width container with padding
- Theme: Bootstrap 5 Zephyr theme with gradient header
- Responsive: Mobile-friendly design
Main class handling the conversion process:
class IE:
def __init__(self, file_path: str, errors: Errors)
def to_html(self) -> str
def _ie_data(self) -> tuple[list, list]
def _generate_html(self, inclusion: str, exclusion: str)
def _ie_table(self, doc, criteria: list)
def _translate_references(self, instance: dict, text: str) -> str
def _translate_references_recurse(self, instance: dict, text: str) -> str
def _resolve_usdm_ref(self, instance, ref) -> str
def _resolve_usdm_tag(self, instance, ref) -> str
- Load USDM Data: Parse JSON file using USDM4 Builder
- Extract Criteria: Filter EligibilityCriterion objects by category code
- Resolve References: Process usdm:ref and usdm:tag elements
- Generate HTML: Create formatted HTML document with Bootstrap styling
- Save Output: Write prettified HTML to file
- C25532: Inclusion Criteria (from CDISC CT)
- C25370: Exclusion Criteria (from CDISC CT)
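The category split can be sketched as follows. This is a minimal illustration using plain dictionaries whose layout mirrors the JSON shown earlier; the real program resolves these objects through the USDM4 builder, and the helper name here is hypothetical:

```python
# Hypothetical sketch: partition EligibilityCriterion records by CDISC
# category code. Dict shapes mirror the JSON structure shown above.
INCLUSION_CODE = "C25532"
EXCLUSION_CODE = "C25370"

def split_criteria(criteria):
    """Return (inclusion, exclusion) lists, selected by category code."""
    inclusion = [c for c in criteria if c["category"]["code"] == INCLUSION_CODE]
    exclusion = [c for c in criteria if c["category"]["code"] == EXCLUSION_CODE]
    return inclusion, exclusion

criteria = [
    {"identifier": "1", "category": {"code": "C25532"}},
    {"identifier": "2", "category": {"code": "C25370"}},
]
inclusion, exclusion = split_criteria(criteria)
```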
The program recursively resolves nested references:
- Step 1: Wrap underline and italic tags around references
- Step 2: Resolve usdm:ref by looking up the referenced object and attribute
- Step 3: Resolve usdm:tag by looking up the dictionary parameter map
- Step 4: Recursively process resolved text for nested references
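The resolution loop can be sketched roughly as below. The regex and the flat lookup table are illustrative assumptions; the real tool resolves references against the full USDM object graph and also handles usdm:tag markers:

```python
import re

# Hedged sketch of recursive reference resolution: repeatedly replace
# <usdm:ref .../> markers with the referenced attribute value until
# none remain. A resolved value may itself contain a reference, which
# the next loop iteration picks up (this is the "recursive" step).
REF_RE = re.compile(r"<usdm:ref id='(?P<id>[^']+)' attribute='(?P<attr>[^']+)'/>")

def resolve(text, objects):
    while True:
        m = REF_RE.search(text)
        if m is None:
            return text
        value = objects[m.group("id")][m.group("attr")]
        text = text[:m.start()] + str(value) + text[m.end():]

# Hypothetical lookup table for the example criterion shown later
objects = {"age_param": {"value": "18"}}
resolved = resolve(
    "Patient must be ≥<usdm:ref id='age_param' attribute='value'/> years of age",
    objects,
)
```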
The program uses the simple-error-log library for comprehensive error tracking:
- Exception Handling: All exceptions are caught and logged
- Warning Capture: BeautifulSoup warnings are captured and logged
- Debug Information: Detailed debug messages for troubleshooting
- Error Summary: Final error report printed to console
Errors: {
"errors": [],
"warnings": [
"Warning raised within Soup package..."
],
"debug": []
}
- USDM4 Dependency: Requires properly formatted USDM4 JSON files
- Category Codes: Only processes criteria with standard CDISC category codes
- Reference Types: Limited to usdm:ref and usdm:tag reference types
- Single File: Processes one file at a time (no batch processing)
Given a USDM JSON file with the following criteria:
{
"identifier": "1",
"text": "Patient must be ≥<usdm:ref id='age_param' attribute='value'/> years of age"
}
The program generates:
<table>
<tr>
<td style="vertical-align: top;">1:</td>
<td style="vertical-align: top;">Patient must be ≥18 years of age</td>
</tr>
</table>
This program is part of a larger USDM utility suite:
USDM JSON → to_m11.py → M11 Criteria HTML
↓
(Also available:)
to_pj.py → Visit Timeline HTML
to_visit.py → Visit Details HTML
to_timeline.py → Study Timeline HTML
Issue: "Failed generating HTML page"
- Solution: Verify USDM JSON file is well-formed and contains valid USDM4 data
Issue: Missing dictionary reference
- Solution: Ensure dictionaryId is set and dictionary contains required parameterMaps
Issue: Unresolved references
- Solution: Check that all referenced IDs exist in the USDM data
Issue: Import errors
- Solution: Install all required dependencies:
pip install beautifulsoup4 yattag usdm4 simple-error-log
The program automatically generates output filenames:
Input: study_definition.json
Output: study_definition_criterion.html
Input: /path/to/M11_USDM.json
Output: /path/to/M11_USDM_criterion.html
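The naming rule can be sketched in a few lines of standard-library Python (a sketch only; the helper name is hypothetical, and to_m11.py may implement this differently):

```python
import os

# Sketch of the "_criterion" output-naming rule: keep the directory,
# drop the extension, and append "_criterion.html".
def criterion_output_path(input_path):
    root, _ext = os.path.splitext(input_path)
    return root + "_criterion.html"

out = criterion_output_path("/path/to/M11_USDM.json")
```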
- Validate Input: Ensure USDM JSON is valid before processing
- Check Output: Review generated HTML for proper reference resolution
- Monitor Errors: Check error log output for warnings or issues
- Test References: Verify all usdm:ref and usdm:tag elements resolve correctly
- Browser Testing: Open generated HTML in multiple browsers to verify rendering
- to_pj.py: Visit-based timeline visualization
- to_visit.py: Visit details and activities
- to_timeline.py: Study timeline visualization
- phuse_eu/: Example USDM files and generated outputs
See LICENSE file for details.
For issues or questions, please refer to the project repository or contact the development team.
- Current: Initial release with USDM4 support
- Inclusion/Exclusion criteria extraction
- Reference resolution (usdm:ref and usdm:tag)
- M11 formatting with Times New Roman styling
- Bootstrap 5 UI with Zephyr theme
A Python program that generates interactive HTML pages with Bootstrap 5 cards and D3.js vertical serpentine timeline visualizations for clinical trial data.
This tool creates a detailed view of clinical trial visits, displaying each visit as a separate Bootstrap card. Each card contains a vertical serpentine timeline of the activities performed during that visit. The visualization is interactive with tooltips showing detailed information about procedures and notes.
- Bootstrap 5 Cards: Each visit displayed in a professional, styled card
- Vertical Serpentine Layout: Activities flow vertically down and up alternating columns (like reading a newspaper)
- Interactive Tooltips: Hover over activity nodes to see:
- Activity title
- Associated procedures
- Notes and details
- Responsive Design: Modern gradient headers and professional styling
- Visit Information Panel: Displays visit type, duration, timing, and notes
- D3.js Powered: Uses D3.js v7 for smooth, interactive visualizations
The vertical serpentine layout arranges activities in columns, alternating direction:
Column 1:      Column 2:      Column 3:
↓              ↑              ↓
Activity 1     Activity 8     Activity 9
↓              ↑              ↓
Activity 2     Activity 7     Activity 10
↓              ↑              ↓
Activity 3     Activity 6     Activity 11
↓              ↑              ↓
Activity 4  →  Activity 5     ...
This creates a continuous vertical path through all activities, making it easy to follow the sequence while efficiently using space.
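The position logic can be sketched as below. This is an illustrative Python rendering of the rule (the actual layout is computed in the generated page's JavaScript); the default spacing values echo the config object documented further on:

```python
# Sketch of the vertical serpentine layout: activities fill columns
# top to bottom, and every other column is reversed so the path snakes
# continuously through all activities.
def serpentine_positions(n, nodes_per_col=6, col_width=120, node_spacing=70):
    positions = []
    for i in range(n):
        col, row = divmod(i, nodes_per_col)
        if col % 2 == 1:  # odd columns flow bottom-to-top
            row = nodes_per_col - 1 - row
        positions.append((col * col_width, row * node_spacing))
    return positions

pts = serpentine_positions(8)
```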
The input data should be a JSON object with a visits array:
{
"visits": [
{
"title": "Visit 1",
"type": "In Person",
"timing": "Day 1",
"duration": "P1D",
"notes": "Optional notes about the visit",
"activities": [
{
"title": "Activity name",
"procedures": ["Procedure 1", "Procedure 2"],
"notes": "Optional activity notes"
}
]
}
]
}
Generate an HTML file from JSON data:
python3 to_pj.py <input_json_file> [output_html_file]
Examples:
# Generate with default output filename (visit_timeline.html)
python3 to_pj.py pj/pj_p1.json
# Specify custom output filename
python3 to_pj.py pj/pj_p1.json my_visits.html
# Test with example data
python3 to_pj.py pj/pj_p2.json clinical_visits.html
The script creates a self-contained HTML file that can be opened directly in any web browser.
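The argument handling above (required input, optional output defaulting to visit_timeline.html) can be sketched as follows; the helper name is hypothetical and to_pj.py's actual parsing may differ in detail:

```python
# Sketch of the CLI contract: one required input file, one optional
# output filename. In the real script this would be called with
# sys.argv.
def parse_args(argv):
    if len(argv) < 2:
        raise SystemExit("usage: to_pj.py <input_json_file> [output_html_file]")
    input_file = argv[1]
    output_file = argv[2] if len(argv) > 2 else "visit_timeline.html"
    return input_file, output_file

inp, out = parse_args(["to_pj.py", "pj/pj_p1.json"])
```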
The generate_html() function can be used within your Python application:
from to_pj import load_json_data, generate_html
# Load your data
data = load_json_data('pj/pj_p1.json')
# Generate HTML visualization
generate_html(data, 'output.html')
Or integrate with Flask/Django:
from flask import Flask, render_template_string
import json
from to_pj import generate_html
app = Flask(__name__)
@app.route('/visits')
def show_visits():
with open('data.json', 'r') as f:
data = json.load(f)
# Generate and serve the HTML
html_content = generate_html(data, 'temp_visits.html')
with open('temp_visits.html', 'r') as f:
return f.read()
You can customize the visualization by modifying the config object in the JavaScript:
const config = {
nodeRadius: 6, // Size of activity circles
colWidth: 120, // Horizontal spacing between columns
nodeSpacing: 70, // Vertical spacing between nodes
nodesPerCol: 6, // Number of activities per column
margin: { top: 20, right: 30, bottom: 20, left: 30 }
};
Modify the CSS to customize:
Colors:
- Visit card headers: Change gradient colors in the .card-header background
- Activity nodes: Modify the .activity-circle fill and stroke colors
- Timeline path: Adjust the .timeline-path stroke color
Layout:
- Card spacing: Modify the .visit-card margin-bottom
- Card shadows: Adjust box-shadow properties
- Timeline container height: Change the .timeline-container min-height
Tooltips:
- Background: Modify the .tooltip background-color
- Border and shadow: Adjust border and box-shadow properties
.
├── to_pj.py # Main program
├── README_TO_PJ.md # This documentation
├── pj/
│ ├── data_v2.json # Data schema
│ ├── pj_p1.json # Example data file 1
│ ├── pj_p2.json # Example data file 2
│ └── pj_p3.json, etc. # Additional example files
├── pj_p1_visits_vertical.html # Generated example output
└── pj_p2_visits.html # Generated example output
- D3.js v7: Loaded from CDN (https://d3js.org/d3.v7.min.js)
- Bootstrap 5: Loaded from CDN (https://cdn.jsdelivr.net/npm/bootstrap@5.3.2/)
- Python 3.6+: For running the generator script
No installation required - all libraries are loaded from CDNs.
The visualization requires a modern web browser with:
- ES6 JavaScript support
- SVG rendering capabilities
- CSS3 support (gradients, transitions)
- D3.js v7 compatibility
Tested on:
- Chrome 90+
- Firefox 88+
- Safari 14+
- Edge 90+
- Header: Gradient background with visit title and badge showing visit number
- Info Panel: Displays visit metadata (type, duration, timing, notes)
- Timeline Section: Interactive D3.js visualization of activities
- Hover Effects: Cards lift and enhance shadow on hover
- Vertical Serpentine Path: Connects all activities in a continuous flow
- Activity Nodes: Interactive circles that respond to hover events
- Text Labels: Activity titles with automatic text wrapping
- No Activities State: Friendly message when no activities are recorded
- Rich Information: Shows activity title, procedures, and notes
- Smart Positioning: Appears next to cursor, automatically positioned
- Styled Content: Formatted with headers, bullet points, and sections
Example visualizations have been generated:
- pj_p1_visits_vertical.html - 20 visits from pj/pj_p1.json
- pj_p2_visits.html - 12 visits from pj/pj_p2.json
Open these files in a web browser to see the visualization in action.
Issue: Cards don't appear
- Check browser console for JavaScript errors
- Verify data format matches expected schema
- Ensure D3.js and Bootstrap CDNs are accessible
Issue: Vertical layout looks cramped
- Adjust colWidth to increase horizontal spacing
- Modify nodeSpacing to increase vertical spacing
- Change nodesPerCol to show fewer activities per column
Issue: Tooltips don't show
- Ensure you're hovering over activity circles (not just near them)
- Check that tooltip CSS isn't being overridden by other styles
- Verify JavaScript console for errors
Issue: Text labels overlap
- Increase colWidth in the config
- Modify the font size in the .activity-label CSS
- Adjust the text wrapping logic in the label creation code
| Feature | to_pj.py | to_timeline.py |
|---|---|---|
| Layout | Vertical serpentine per visit | Horizontal serpentine for all visits |
| Organization | Bootstrap cards (one per visit) | Single timeline page |
| Focus | Detailed activity view per visit | Overview of all visits |
| Best for | Activity-level analysis | Study timeline overview |
See LICENSE file for details.
For issues or questions, please refer to the project repository.
A Python program that generates interactive HTML pages with D3.js visualizations displaying USDM (Unified Study Definitions Model) study timelines in a horizontal flowchart format.
This tool reads USDM JSON study definition files and creates sophisticated timeline visualizations showing the flow of scheduled activities, decisions, timing relationships, and conditional branches. The visualization uses D3.js to create interactive flowchart-style diagrams that clearly represent the structure and logic of clinical trial schedules.
- Horizontal Flowchart Layout: Activities arranged in a left-to-right timeline
- Multiple Node Types:
- Activity nodes (circles): Scheduled activities
- Decision nodes (diamonds): Branching points with conditions
- Entry/Exit nodes (rounded rectangles): Timeline start and end points
- Timing nodes (circles with icon/text): Timing relationships between activities
- Anchor nodes (ship's anchor icon): Fixed reference points for timing
- Conditional Branches: Visual representation of decision logic with labeled paths above the main timeline
- Orphan Node Support: Handles off-timeline activities connected via conditional links
- Timing Relationships: Visual "from/to" connections showing relative timing between activities
- Interactive Tooltips: Hover over nodes to see detailed information
- Multiple Timelines: Generates separate visualizations for each timeline in the study
- Entry Conditions: Displays entry criteria for each timeline
- Represent scheduled activities in the study
- Gray filled circles with black border
- Display activity label with automatic text wrapping
- Tooltip shows: title, type, description
- Represent branching points in the timeline
- Diamond-shaped with gray fill and black border
- Can have multiple conditional outgoing links
- Tooltip shows: title, type, description
- Mark the beginning and end of a timeline
- Light gray rounded rectangles
- Entry appears at the start, Exit at the end
- Height is half of regular nodes for visual distinction
- Show timing relationships between activities
- White circles with blue border
- Display timing value and window information
- Connected to "from" and "to" activity nodes with blue arrows
- Can display as anchor icon for Fixed Reference timing
- Represent fixed timing references (CDISC code C201358)
- Display as ship's anchor icon instead of text
- Blue colored to match timing theme
- Used for absolute timing references in the study
Entry → Activity 1 → Decision → Activity 2 → Exit
↓
(Condition A)
↓
Activity 3 → Activity 4
↓
(rejoin main timeline)
Conditional branches appear above the main timeline with labeled paths showing which condition leads where.
Timing relationships appear below the main timeline, showing relative timing between activities.
pip install usdm4
pip install simple-error-log
- usdm4: USDM data model and builder for parsing study definitions
- simple-error-log: Error logging and reporting
- D3.js v7: Loaded from CDN in generated HTML
Basic usage:
python3 to_timeline.py <usdm_json_file>
# Generate timeline from USDM JSON file
python3 to_timeline.py test_data/NCT12345678.json
# Process file in specific directory
python3 to_timeline.py phuse_eu/M11_USDM.json
# Output file created with "_timeline" suffix
# Input: M11_USDM.json
# Output: M11_USDM_timeline.html
The program automatically creates the output file in the same directory as the input, with _timeline appended to the filename.
The program expects a USDM4-compliant JSON file containing:
{
"study": {
"versions": [{
"studyDesigns": [{
"scheduleTimelines": [...]
}]
}]
}
}
Each timeline contains:
- id: Unique identifier
- label: Timeline name/description
- entryCondition: Condition required to enter this timeline
- entryId: Reference to the first scheduled instance
- timings: Array of timing relationships
- Scheduled instances: Activities and decisions in sequence
ScheduledActivityInstance:
{
"id": "...",
"instanceType": "ScheduledActivityInstance",
"label": "Screening Visit",
"description": "Initial screening procedures",
"defaultConditionId": "next_instance_id",
"timelineExitId": "exit_id"
}
ScheduledDecisionInstance:
{
"id": "...",
"instanceType": "ScheduledDecisionInstance",
"label": "Eligibility Decision",
"description": "Determine patient eligibility",
"defaultConditionId": "default_path_id",
"conditionAssignments": [
{
"condition": "Eligible",
"conditionTargetId": "target_instance_id"
}
]
}
Generates a self-contained HTML file with:
<!DOCTYPE html>
<html>
<head>
<title>USDM Timeline Visualization</title>
<script src="https://d3js.org/d3.v7.min.js"></script>
<!-- Embedded CSS styling -->
</head>
<body>
<h1>USDM Timeline Visualization</h1>
<div id="timelines">
<!-- Timeline containers created dynamically -->
</div>
<div class="tooltip" id="tooltip"></div>
<!-- Embedded JavaScript with D3 visualization -->
</body>
</html>
- Timeline Container: White card with rounded corners and shadow
- Timeline Title: Bold heading with timeline label
- Entry Condition: Italic text showing entry requirements
- Main Timeline: Horizontal flow of nodes and links
- Conditional Links: Orthogonal paths above main timeline
- Timing Links: Blue connections below main timeline
- Orphan Nodes: Additional nodes connected via conditions
- Interactive Tooltips: Information on hover
Main class handling timeline processing and HTML generation:
class Timeline:
def __init__(self, file_path: str, errors: Errors)
def to_html(self) -> str
def _generate_html(self, study_design: StudyDesign) -> str
def _process_timeline(self, timeline) -> dict
def _format_timelines_data(self, timelines_data) -> str
def _get_cross_reference(self, id) -> dict
- Load USDM Data: Parse JSON using USDM4 Builder
- Extract Study Design: Get first study version and design
- Process Each Timeline:
- Build main timeline chain (entry → activities → exit)
- Identify conditional branches and targets
- Collect orphan nodes (off-timeline activities)
- Process timing relationships
- Calculate node positions
- Generate HTML: Embed data and D3.js visualization code
- Save Output: Write HTML to file with _timeline suffix
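The "build main timeline chain" step can be sketched as below. The instance fields mirror the ScheduledActivityInstance JSON shown earlier; the helper name is an assumption, and the real Timeline class resolves instances through its cross-reference lookup:

```python
# Sketch: starting from the timeline's entryId, follow each instance's
# defaultConditionId to build the main entry-to-exit chain. Instances
# without a defaultConditionId end the chain (they point at the exit).
def main_chain(instances, entry_id):
    by_id = {inst["id"]: inst for inst in instances}
    chain, current = [], entry_id
    while current in by_id:
        inst = by_id[current]
        chain.append(inst["label"])
        current = inst.get("defaultConditionId")
    return chain

instances = [
    {"id": "a1", "label": "Screening", "defaultConditionId": "a2"},
    {"id": "a2", "label": "Treatment", "defaultConditionId": "a3"},
    {"id": "a3", "label": "Follow-up", "timelineExitId": "exit1"},
]
chain = main_chain(instances, "a1")
```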
Horizontal Positioning:
- Nodes spaced evenly left-to-right
- Fixed horizontal spacing (250px default)
- Entry/exit nodes have half height
Vertical Organization:
- Orphan nodes: Top row (Y=20)
- Conditional links: Above main timeline with staggered heights
- Main timeline: Center horizontal line
- Timing nodes: Below main timeline
Dynamic Spacing:
- SVG height calculated based on number of orphan nodes and conditional links
- Automatic margin adjustments to prevent clipping
- Horizontal scrolling enabled for wide timelines
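The horizontal positioning rule can be sketched numerically (the width formula is an assumption for illustration; the generated page computes SVG dimensions in JavaScript):

```python
# Sketch of even left-to-right node placement with fixed margins,
# using the default constants documented in the customization section.
HORIZONTAL_SPACING = 250
MARGIN_LEFT = 50
MARGIN_RIGHT = 50

def node_x(index):
    # x coordinate of the index-th node on the main timeline
    return MARGIN_LEFT + index * HORIZONTAL_SPACING

def svg_width(node_count):
    # Total width; wide timelines rely on horizontal scrolling
    return MARGIN_LEFT + (node_count - 1) * HORIZONTAL_SPACING + MARGIN_RIGHT
```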
Shapes:
- Activities: <circle> elements
- Decisions: <path> elements (diamond shape)
- Entry/Exit: <rect> elements with rounded corners
- Timing: <circle> elements or custom anchor SVG paths
Links:
- Main flow: Gray dashed lines with black arrowheads
- Timing: Blue solid lines with blue arrowheads labeled "from" and "to"
- Conditional: Black solid orthogonal lines with condition labels
Interaction:
- Hover events on all nodes
- Tooltip positioning relative to cursor
- Node cursor changes to pointer on hover
The D3 visualization behavior can be customized by modifying constants in the JavaScript:
const activityNodeRadius = 40; // Radius of activity circles
const nodeWidth = 80; // Width for rectangular nodes
const nodeHeight = 80; // Height for rectangular nodes
const timingNodeRadius = 40; // Radius of timing circles
const horizontalSpacing = 250; // Space between nodes horizontally
const verticalSpacing = 150; // Space for timing nodes below
const marginLeft = 50; // Left page margin
const marginTop = 50; // Top page margin (plus dynamic adjustments)
const marginRight = 50; // Right page margin
const marginBottom = 200;          // Bottom page margin
Comprehensive error tracking using simple-error-log:
Timeline Processing Errors:
Failed processing timeline {timeline.label}
- Missing required instances
- Invalid cross-references
- Malformed condition assignments
HTML Generation Errors:
Failed generating HTML page
- Invalid study design structure
- Missing timeline data
- JSON serialization issues
Input file: test_data/NCT12345678.json
Output path: test_data
Output file: test_data/NCT12345678_timeline.html
Errors: {...}
If errors occur, they are printed to the console with details; otherwise, a success message is shown.
- Activity nodes: Gray (#C0C0C0) fill, black stroke
- Decision nodes: Gray fill, black stroke
- Entry/Exit nodes: Light gray (#D3D3D3) fill, black stroke
- Timing nodes: White fill, blue (#003366) stroke
- Anchor icons: Blue (#003366)
- Links: Gray (#999) main flow, blue (#003366) timing
- Conditional links: Black (#000000)
- Background: White (#FFFFFF) cards on light gray (#F5F5F5) page
- Timeline title: 24px bold, dark gray (#333)
- Entry condition: 14px italic, medium gray (#666)
- Node labels: 12px, dark gray (#333), auto-wrapped
- Link labels: 10px, medium gray (#666)
- Conditional labels: 10px bold, black
- Timing labels: 10px bold, blue
- Card container: White background, rounded corners (8px), shadow
- Card padding: 20px all around
- Card margin: 40px bottom spacing between timelines
- Horizontal scrolling: Enabled for wide timelines
- USDM4 Dependency: Requires valid USDM4 JSON structure
- Single Study Design: Processes first study design only
- Cross-Reference Integrity: Assumes all referenced IDs exist
- Performance: Very complex timelines (100+ nodes) may render slowly
- Browser Dependency: Requires modern browser with SVG and ES6 support
Entry → Screening → Baseline → Treatment → Follow-up → Exit
Entry → Screening → Eligibility Decision
↓ (Eligible)
Treatment → Follow-up → Exit
↓ (Not Eligible)
Screen Failure → Exit
Entry → Visit 1 → Visit 2 → Visit 3 → Exit
↓ ↑
(2 weeks from Visit 1 to Visit 2)
Entry → Activity 1 → Decision A
↓ (main path)
Activity 2 → Exit
↓ (condition X)
Orphan 1 → Orphan 2
↓
(rejoins at Activity 2)
| Feature | to_timeline.py | to_pj.py | to_visit.py |
|---|---|---|---|
| Data Source | USDM JSON | Custom JSON | USDM JSON |
| Layout | Horizontal flowchart | Vertical serpentine cards | Table format |
| Focus | Timeline structure & logic | Visit activities | Visit details |
| Interaction | Tooltips only | Tooltips | Filterable table |
| Best For | Understanding flow | Activity sequences | Detailed visit info |
| Complexity | High (decisions, timings) | Medium (activities) | Low (data display) |
Input: study_definition.json
Output: study_definition_timeline.html
Input: test_data/NCT12345678.json
Output: test_data/NCT12345678_timeline.html
Issue: "Failed generating HTML page"
- Solution: Verify USDM JSON has proper study structure with versions and studyDesigns
- Check that all cross-references (IDs) are valid
Issue: Missing nodes in visualization
- Solution: Verify defaultConditionId creates a valid chain from entry to exit
- Check that entryId references a valid scheduled instance
Issue: Conditional links don't appear
- Solution: Verify the conditionAssignments array is populated in decision nodes
- Ensure conditionTargetId references valid instances
Issue: Timing node missing anchor icon
- Solution: Check that the timing has type.code == "C201358" for Fixed Reference
- Verify the timing type object is properly formed
Issue: Timeline too wide
- Solution: Reduce the horizontalSpacing constant in the JavaScript
- Consider breaking into multiple timelines if appropriate
Issue: Overlapping conditional links
- Solution: Algorithm automatically staggers heights; may need manual adjustment for extreme cases
- Check that conditional assignments are in reasonable order
- Check Console: Open browser developer tools to see JavaScript errors
- Inspect Data: The timelinesData variable in the generated HTML shows the processed data
- Validate JSON: Use the USDM4 validator before processing
- Enable Error Log: Check console output for detailed error messages
- Simplify Test: Start with simple linear timeline before adding complexity
- Organize Timelines: Use separate timelines for different study phases
- Label Clearly: Use descriptive labels for activities and decisions
- Condition Naming: Use clear, distinct names for conditional branches
- Timing Types: Use appropriate CDISC codes for timing relationships
- Cross-Reference Integrity: Ensure all IDs are unique and properly referenced
- Testing: Validate output in multiple browsers
- Complexity Management: Keep timelines reasonably sized (< 50 nodes ideal)
USDM JSON → to_timeline.py → Study Timeline HTML (flowchart view)
→ to_visit.py → Visit Table HTML (detailed view)
→ to_pj.py → Visit Cards HTML (activity view)
→ to_m11.py → Criteria HTML (regulatory view)
Each tool provides a different perspective on the study data:
- to_timeline.py: High-level flow and structure
- to_visit.py: Detailed visit information
- to_pj.py: Activity-focused visualization
- to_m11.py: Regulatory document format
Tested and verified on:
- Chrome 90+
- Firefox 88+
- Safari 14+
- Edge 90+
Requirements:
- SVG rendering
- ES6 JavaScript (arrow functions, const/let, template literals)
- D3.js v7 compatibility
- CSS3 (flexbox, gradients, transforms)
| Nodes | Rendering | User Experience |
|---|---|---|
| < 20 | Instant | Excellent |
| 20-50 | < 1s | Good |
| 50-100 | 1-2s | Acceptable |
| > 100 | > 2s | May need optimization |
- Break Large Timelines: Split into logical phases
- Reduce Orphan Chains: Simplify conditional branches
- Limit Timing Nodes: Show only critical timing relationships
- Simplify Labels: Keep node labels concise
- Test Performance: Verify rendering time with real data
See LICENSE file for details.
For issues or questions, please refer to the project repository or contact the development team.
- to_timeline.py: Main program file
- test_data/: Example USDM JSON files
- phuse_eu/: Additional example files and outputs
- to_visit.py: Complementary visit details tool
- to_pj.py: Activity timeline visualization tool
- to_m11.py: Regulatory criteria extraction tool
A lightweight Python program that extracts text from images using Tesseract OCR (Optical Character Recognition) technology. This utility converts images containing text into plain text files.
This tool uses the Tesseract OCR engine through the pytesseract Python wrapper to extract text from image files. It's designed to be simple and straightforward - provide an image file and get a text file with the extracted content. Useful for digitizing scanned documents, extracting text from screenshots, or processing image-based data.
- Simple Command-Line Interface: Single command execution with minimal arguments
- Automatic File Naming: Output text file automatically named based on input filename
- Tesseract OCR: Leverages powerful open-source OCR engine
- Multiple Format Support: Works with common image formats (PNG, JPEG, TIFF, BMP, etc.)
- Preserves Directory Structure: Outputs text file to same directory as input image
- Status Reporting: Displays input and output filenames for confirmation
Tesseract OCR Engine:
The Tesseract OCR engine must be installed on your system:
macOS (using Homebrew):
brew install tesseract
Linux (Ubuntu/Debian):
sudo apt-get install tesseract-ocr
Linux (Fedora/RHEL):
sudo yum install tesseract
Windows:
- Download installer from: https://github.com/UB-Mannheim/tesseract/wiki
- Follow installation wizard
- Add Tesseract to system PATH
pip install Pillow
pip install pytesseract
- Pillow (PIL): Python Imaging Library for image file handling
- pytesseract: Python wrapper for Tesseract OCR engine
- Tesseract: The actual OCR engine (system-level dependency)
Basic usage:
python3 to_text.py <image_file>
# Extract text from a screenshot
python3 to_text.py screenshot.png
# Process a scanned document
python3 to_text.py /path/to/scanned_document.jpg
# Extract text from image in current directory
python3 to_text.py meeting_notes.jpeg
# The output will be automatically created with .txt extension
# Input: screenshot.png
# Output: screenshot.txt
When you run the program, it displays the file paths:
Input filename: /path/to/image.png
Output filename: /path/to/image.txt
The program supports all image formats that PIL/Pillow can open:
- PNG (.png) - Recommended for screenshots and digital images
- JPEG (.jpg, .jpeg) - Common photograph format
- TIFF (.tif, .tiff) - Often used for scanned documents
- BMP (.bmp) - Windows bitmap format
- GIF (.gif) - Graphics interchange format
- WebP (.webp) - Modern web image format
For best OCR results:
- Resolution: Minimum 300 DPI for scanned documents
- Contrast: High contrast between text and background
- Clarity: Clear, non-blurry text
- Orientation: Text should be right-side up and not rotated
- Language: Tesseract defaults to English (additional languages can be configured)
The program creates a plain text file containing the extracted text:
Input: document.png
Output: document.txt
Input: /path/to/scan.jpg
Output: /path/to/scan.txt
The text file contains:
- All text extracted from the image
- Original line breaks and spacing (as detected by OCR)
- No formatting (plain text only)
- UTF-8 encoding
Input Image (screenshot.png):
Meeting Notes
Date: November 26, 2025
Attendees: John, Sarah, Mike
Output Text (screenshot.txt):
Meeting Notes
Date: November 26, 2025
Attendees: John, Sarah, Mike
def save_text(file_path, data):
    # Save extracted text to a UTF-8 encoded file
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(data)
if __name__ == "__main__":
# 1. Parse command-line arguments
# 2. Determine input/output file paths
# 3. Open image using PIL
# 4. Extract text using pytesseract
# 5. Save text to output file
- Parse Arguments: Extract filename from command line
- Path Processing:
- Split path into directory and filename
- Extract root filename and extension
- Construct output filename with .txt extension
- Image Loading: Use PIL to open and load image
- OCR Processing: Call Tesseract via pytesseract to extract text
- Save Output: Write extracted text to .txt file
The program intelligently handles file paths:
- Preserves original directory structure
- Works with relative and absolute paths
- Uses current working directory if no path specified
- Automatically creates appropriate .txt filename
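The path handling described above can be sketched with the standard library (the helper name is illustrative, not necessarily the script's actual function):

```python
import os

def output_path_for(image_path: str) -> str:
    """Derive the .txt output path next to the input image."""
    directory, filename = os.path.split(image_path)
    root, _ext = os.path.splitext(filename)
    return os.path.join(directory, root + ".txt")

# Directory structure is preserved; only the extension changes.
print(output_path_for("/path/to/scan.jpg"))  # /path/to/scan.txt
print(output_path_for("document.png"))       # document.txt
```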
To extract text in languages other than English, you can modify the code:
# English (default)
extracted_text = pytesseract.image_to_string(image)
# Spanish
extracted_text = pytesseract.image_to_string(image, lang='spa')
# French
extracted_text = pytesseract.image_to_string(image, lang='fra')
# German
extracted_text = pytesseract.image_to_string(image, lang='deu')
# Multiple languages
extracted_text = pytesseract.image_to_string(image, lang='eng+fra')
Note: Additional language packs must be installed for Tesseract.
If Tesseract is not in your system PATH on Windows, specify the path:
# Add this line before calling image_to_string
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
- OCR Accuracy:
  - Not 100% accurate, especially with poor image quality
  - May misread similar characters (e.g., '0' vs 'O', '1' vs 'l')
- Formatting:
  - Does not preserve complex formatting (tables, columns, etc.)
  - No font, color, or style information retained
- Language Support:
  - Defaults to English only
  - Additional languages require language pack installation
- Handwriting:
  - Not optimized for handwritten text
  - Best results with printed or digital text
- Single Image:
  - Processes one image at a time
  - No batch processing capability
- Simple Output:
  - Plain text only (no PDF, Word, or formatted output)
Issue: "pytesseract.pytesseract.TesseractNotFoundError"
- Cause: Tesseract OCR is not installed or not in PATH
- Solution:
# macOS
brew install tesseract
# Verify installation
tesseract --version
Issue: Poor OCR accuracy
- Solution:
- Improve image resolution (use 300+ DPI)
- Increase image contrast
- Ensure text is not rotated or skewed
- Clean up image (remove noise, enhance clarity)
- Try image preprocessing before OCR
Issue: "No such file or directory"
- Cause: Image file path is incorrect
- Solution:
- Verify file exists
- Use absolute path or ensure correct relative path
- Check filename spelling and extension
Issue: No text extracted (empty output file)
- Cause: Image doesn't contain recognizable text
- Solution:
- Check image actually contains text
- Verify image is not corrupted
- Try preprocessing image (brightness, contrast, rotation)
- Check if text language matches Tesseract configuration
Issue: ImportError for PIL or pytesseract
- Solution:
pip install Pillow pytesseract
For better results, preprocess images before OCR:
from PIL import Image, ImageEnhance
# Increase contrast
image = Image.open('input.png')
enhancer = ImageEnhance.Contrast(image)
image = enhancer.enhance(2.0)
# Convert to grayscale
image = image.convert('L')
# Increase sharpness
enhancer = ImageEnhance.Sharpness(image)
image = enhancer.enhance(2.0)
# Then perform OCR
extracted_text = pytesseract.image_to_string(image)
# Scan paper document to image, then extract text
python3 to_text.py scanned_contract.pdf.png
# Extract text from application screenshot
python3 to_text.py app_error_message.png
# Extract text from form images
python3 to_text.py filled_form.jpg
# Convert old image-based documents to searchable text
python3 to_text.py archive_1985_report.tiff
| Feature | to_text.py | Adobe Acrobat OCR | Google Cloud Vision | ABBYY FineReader |
|---|---|---|---|---|
| Cost | Free | Paid | Paid (after quota) | Paid |
| Offline | Yes | Yes | No | Yes |
| Simplicity | Very Simple | Complex | Complex | Complex |
| Accuracy | Good | Excellent | Excellent | Excellent |
| Format Support | Images | PDF, Images | Images | PDF, Images, Docs |
| Batch Processing | No | Yes | Yes | Yes |
- Use High-Quality Images: Start with clear, high-resolution images (300+ DPI)
- Preprocess When Needed: Adjust contrast, brightness, or orientation before OCR
- Verify Output: Always review extracted text for accuracy
- Batch Multiple Files: For multiple images, create a shell script to process them
- Choose Right Format: PNG is generally better than JPEG for OCR
- Language Configuration: Configure correct language if text is not in English
- Test First: Try with a small sample before processing large batches
For processing multiple images, create a shell script:
#!/bin/bash
# batch_ocr.sh
for img in *.png; do
echo "Processing $img..."
python3 to_text.py "$img"
done
echo "Batch processing complete!"
Usage:
chmod +x batch_ocr.sh
./batch_ocr.sh
This tool can be integrated into larger workflows:
Image Source → to_text.py → Text File → Further Processing
↓
(Text Analysis)
(Data Extraction)
(Translation)
(Keyword Search)
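As a minimal example of one downstream step (keyword search), the extracted plain text can be filtered with a few lines of Python; the function name and sample text are illustrative:

```python
def find_keyword_lines(text: str, keyword: str) -> list:
    """Case-insensitive keyword search over OCR-extracted text."""
    return [line for line in text.splitlines()
            if keyword.lower() in line.lower()]

# Sample text as it might appear in a to_text.py output file.
sample = "Meeting Notes\nDate: November 26, 2025\nAttendees: John, Sarah, Mike"
print(find_keyword_lines(sample, "date"))  # ['Date: November 26, 2025']
```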
- to_m11.py: For processing structured clinical trial documents
- to_timeline.py: For visualizing trial timelines
- to_pj.py: For visit-based visualizations
- Tesseract OCR: https://github.com/tesseract-ocr/tesseract
- Pytesseract Documentation: https://pypi.org/project/pytesseract/
- Pillow Documentation: https://pillow.readthedocs.io/
- OCR Best Practices: https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
See LICENSE file for details.
For issues or questions, please refer to the project repository or contact the development team.
- Current: Initial release
- Basic OCR text extraction from images
- Support for common image formats
- Simple command-line interface
- Automatic output file naming
A Python program that converts USDM-conformant Excel workbooks into USDM JSON format. This utility bridges the gap between Excel-based study definitions and the USDM (Unified Study Definitions Model) standard, enabling transformation of tabular data into structured JSON documents.
This tool reads specially formatted Excel workbooks containing clinical study definitions and transforms them into valid USDM JSON files. It leverages the USDM database package to parse Excel sheets, validate data structure, and generate conformant JSON output. The program also produces a CSV file containing any validation errors or warnings encountered during the conversion process.
- Excel to JSON Transformation: Converts USDM-conformant Excel workbooks to USDM JSON
- USDM Compliance: Uses official USDM database package for proper data handling
- Error Tracking: Generates detailed CSV error report with sheet, row, and column references
- Version Information: Displays USDM package version and supported model version
- Automatic Output Naming: Creates output files based on input filename
- Validation: Built-in validation during conversion process
- Simple Interface: Single command execution with minimal configuration
pip install usdm-db
pip install usdm-info
- usdm_db (USDMDb): Core USDM database package for Excel parsing and JSON generation
- usdm_info: Provides package and model version information
- Python 3.7+: Required for modern Python features
Basic usage:
python3 from_excel.py <excel_file>
# Convert Excel workbook to USDM JSON
python3 from_excel.py study_definition.xlsx
# Process file in specific directory
python3 from_excel.py phuse_eu/Excel.xlsx
# The program creates two output files:
# Input: Excel.xlsx
# Output: Excel.json (USDM JSON data)
# Output: Excel.csv (validation errors/warnings)
When you run the program, it displays:
Test Utility, using USDM Python Package v1.2.3 supporting USDM version v3.0.0
Output path is: /path/to/directory
JSON: /path/to/directory/Excel.json
CSV: /path/to/directory/Excel.csv
The Excel workbook must be USDM-conformant with specific sheet structures. The exact sheet requirements depend on the USDM database package specifications, but typically include:
Common Sheets:
- Study: Basic study information
- StudyVersion: Version details
- StudyDesign: Design elements
- ScheduleTimeline: Timeline definitions
- Activities: Activity definitions
- Visits: Visit schedules
- Eligibility: Inclusion/exclusion criteria
- And other USDM entities
- File Extension: Must be .xlsx (Excel 2007+)
- Sheet Names: Must match USDM entity names exactly
- Column Headers: Must match USDM attribute names
- Data Types: Must conform to USDM specifications
- References: Must use valid cross-references between entities
Sheet: Study
-----------------------------------------
| id | name | type |
-----------------------------------------
| study_1 | Phase III Trial | ... |
Sheet: StudyDesign
-----------------------------------------
| id | studyId | name |
-----------------------------------------
| design_1 | study_1 | Primary Design |
The program generates a complete USDM JSON file:
{
  "study": {
    "id": "study_1",
    "name": "Phase III Clinical Trial",
    "versions": [
      {
        "id": "version_1",
        "studyDesigns": [
          {
            "id": "design_1",
            "scheduleTimelines": [...]
          }
        ]
      }
    ]
  }
}
Characteristics:
- Properly indented (2 spaces)
- Valid JSON structure
- Complete USDM schema compliance
- UTF-8 encoding
- All cross-references resolved
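Producing output with these characteristics takes only the standard library; a simplified sketch of the JSON-saving step (the signature is simplified relative to the script's actual helper):

```python
import json

def save_as_json_file(raw_json: str, path: str) -> None:
    """Re-serialize raw JSON with 2-space indentation, UTF-8 encoded."""
    data = json.loads(raw_json)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
```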
The program generates a CSV file containing validation issues:
sheet,row,column,message,level
Study,2,name,"Required field missing",ERROR
Activities,5,description,"Value exceeds maximum length",WARNING
CSV Columns:
- sheet: Excel sheet name where issue occurred
- row: Row number (1-indexed)
- column: Column name/identifier
- message: Description of the issue
- level: Severity level (ERROR, WARNING, INFO)
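Because the report is plain CSV with these columns, it can be consumed programmatically; a small stdlib sketch (the report contents below are illustrative):

```python
import csv
import io

# Example error report in the format described above (illustrative data).
REPORT = """\
sheet,row,column,message,level
Study,2,name,"Required field missing",ERROR
Activities,5,description,"Value exceeds maximum length",WARNING
"""

def errors_only(report_text: str) -> list:
    """Return only the ERROR-level rows from a validation report."""
    rows = csv.DictReader(io.StringIO(report_text))
    return [r for r in rows if r["level"] == "ERROR"]

for error in errors_only(REPORT):
    print(f"{error['sheet']}, row {error['row']}: {error['message']}")
```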
Input: study_definition.xlsx
Output: study_definition.json
Output: study_definition.csv
Input: phuse_eu/Excel.xlsx
Output: phuse_eu/Excel.json
Output: phuse_eu/Excel.csv
def save_as_csv_file(errors, output_path, filename):
    # Saves validation errors to CSV file
def save_as_json_file(raw_json, output_path, filename):
    # Saves formatted JSON to file
if __name__ == "__main__":
    # 1. Parse command-line arguments
    # 2. Display version information
    # 3. Load Excel using USDMDb
    # 4. Convert to JSON
    # 5. Save JSON and error files
- Parse Arguments: Extract Excel filename from command line
- Path Processing:
- Split path into directory and filename components
- Remove .xlsx extension for output naming
- Determine output directory
- Version Display: Show USDM package and model versions
- Excel Loading: Initialize USDMDb and load Excel workbook
- Conversion: Transform Excel data to USDM JSON structure
- Error Collection: Gather validation errors during conversion
- Output Generation:
- Save formatted JSON file
- Save error report CSV file
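The path-processing step above amounts to swapping the .xlsx extension; a pathlib sketch (the helper name is illustrative):

```python
from pathlib import Path

def output_paths(excel_path: str) -> tuple:
    """Derive the JSON and CSV output paths from the input .xlsx path."""
    p = Path(excel_path)
    return (str(p.with_suffix(".json")), str(p.with_suffix(".csv")))

print(output_paths("phuse_eu/Excel.xlsx"))
# ('phuse_eu/Excel.json', 'phuse_eu/Excel.csv')
```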
The program relies on the USDMDb class from the usdm_db package:
usdm = USDMDb()
errors = usdm.from_excel(full_filename) # Returns list of errors
raw_json = usdm.to_json()  # Returns JSON string
Key Methods:
- from_excel(): Parses Excel and returns validation errors
- to_json(): Generates USDM JSON representation
The program tracks various types of issues:
ERROR Level:
- Required fields missing
- Invalid data types
- Constraint violations
- Invalid cross-references
WARNING Level:
- Optional fields missing
- Data quality concerns
- Deprecated attributes
- Format inconsistencies
INFO Level:
- Informational messages
- Processing notes
- Conversion details
sheet,row,column,message,level
Study,1,id,"Duplicate ID found",ERROR
StudyDesign,3,studyId,"Referenced study not found",ERROR
Activities,10,duration,"Invalid ISO 8601 duration format",WARNING
Visits,5,notes,"Field is empty",INFO
- Excel Format: Only supports .xlsx files (Excel 2007+), not .xls
- Schema Compliance: Excel structure must strictly conform to USDM schema
- Single File: Processes one workbook at a time (no batch processing)
- Memory: Large Excel files may consume significant memory
- Validation Only: Does not fix errors automatically, only reports them
- Package Dependency: Functionality depends on usdm_db package capabilities
- Version Compatibility: Excel structure must match supported USDM model version
Issue: "No module named 'usdm_db'"
- Cause: USDM database package not installed
- Solution:
pip install usdm-db usdm-info
Issue: Many validation errors in CSV
- Cause: Excel structure doesn't match USDM schema
- Solution:
- Review CSV error report carefully
- Check sheet names match USDM entities
- Verify column headers match USDM attributes
- Ensure data types are correct
- Validate cross-references are complete
Issue: Empty JSON output
- Cause: Critical errors prevented JSON generation
- Solution:
- Review error CSV file
- Fix all ERROR-level issues
- Rerun conversion
Issue: "File not found" error
- Cause: Incorrect file path or missing .xlsx extension
- Solution:
- Verify file exists and path is correct
- Ensure filename includes .xlsx extension
- Use absolute path if needed
Issue: Invalid JSON structure
- Cause: Conversion errors or data inconsistencies
- Solution:
- Check error CSV for structural issues
- Validate Excel data completeness
- Ensure all required entities are present
- Check Versions: Ensure USDM package version matches your Excel structure
- Review CSV First: Always examine error CSV before using JSON
- Incremental Testing: Start with minimal Excel structure and add complexity
- Validate References: Ensure all ID cross-references are valid
- Check Data Types: Verify dates, durations, and numbers are properly formatted
- Start with Example: Use provided example workbooks as templates
- Add Data Incrementally: Build up study definition gradually
- Validate Frequently: Run conversion after each major change
- Document Structure: Keep track of entity relationships
- Review Errors: Address validation errors as they appear
The program is designed to work with example files in the repository:
- phuse_eu/Excel.xlsx: Comprehensive example workbook
- test_data/: Additional test cases
# Convert protocol definition from Excel to USDM JSON
python3 from_excel.py protocol_definition.xlsx
# Migrate legacy Excel studies to USDM format
python3 from_excel.py legacy_study.xlsx
# Prepare USDM JSON for regulatory submission
python3 from_excel.py submission_study.xlsx
# Generate USDM JSON for integration with clinical systems
python3 from_excel.py integrated_study.xlsx
This tool serves as the starting point for the USDM utility workflow:
Excel Workbook → from_excel.py → USDM JSON → Other Tools
↓
to_m11.py → Criteria HTML
↓
to_timeline.py → Timeline HTML
↓
to_visit.py → Visit Tables
↓
to_pj.py → Visit Cards
Workflow Benefits:
- Single source Excel workbook
- Multiple output formats
- Consistent study definition
- Automated generation of visualizations
| Feature | from_excel.py | Manual JSON | Direct Database |
|---|---|---|---|
| Ease of Use | High | Low | Medium |
| Error Prevention | High (validation) | Low | Medium |
| Collaboration | Excellent (Excel) | Poor | Good |
| Version Control | Good | Excellent | Good |
| Learning Curve | Low | High | Medium |
| Data Entry Speed | Fast | Slow | Medium |
- Validate Early: Run conversion frequently during Excel development
- Address Errors: Fix ERROR-level issues immediately
- Review Warnings: Consider WARNING-level issues for data quality
- Backup Files: Keep copies of Excel files before major changes
- Document Structure: Maintain documentation of your Excel structure
- Use Templates: Start with validated example workbooks
- Test Iterations: Test conversion with small changes incrementally
- Version Management: Track both Excel and generated JSON versions
For processing multiple Excel files:
#!/bin/bash
# batch_convert.sh
for excel in *.xlsx; do
echo "Converting $excel..."
python3 from_excel.py "$excel"
echo "---"
done
echo "Batch conversion complete!"
Usage:
chmod +x batch_convert.sh
./batch_convert.sh
Using from_excel.py functionality in Python scripts:
from usdm_db import USDMDb
import json
# Load Excel file
usdm = USDMDb()
errors = usdm.from_excel('study.xlsx')
# Check for critical errors
critical_errors = [e for e in errors if e['level'] == 'ERROR']
if critical_errors:
print(f"Found {len(critical_errors)} critical errors!")
for error in critical_errors:
print(f" {error['sheet']}, row {error['row']}: {error['message']}")
else:
# Generate JSON
raw_json = usdm.to_json()
json_obj = json.loads(raw_json)
# Process JSON as needed
print(f"Study: {json_obj['study']['name']}")
The program displays version information at runtime:
Test Utility, using USDM Python Package v1.2.3 supporting USDM version v3.0.0
Important Notes:
- Excel structure must match the supported USDM model version
- Package updates may change supported features
- Always check compatibility before conversion
- Refer to usdm_db package documentation for version details
- to_m11.py: Converts USDM JSON to M11 criteria HTML
- to_timeline.py: Generates timeline visualizations from USDM JSON
- to_visit.py: Creates visit detail tables from USDM JSON
- to_pj.py: Produces visit activity card visualizations
- USDM Specification: https://www.cdisc.org/standards/usdm
- CDISC Standards: https://www.cdisc.org/
- Python Package Index: https://pypi.org/ (search for usdm-db)
- Excel USDM Templates: Check repository examples in phuse_eu/
See LICENSE file for details.
For issues or questions, please refer to the project repository or contact the development team.
- Current: Initial release
- Excel to USDM JSON conversion
- Validation error reporting (CSV format)
- Support for USDM-conformant Excel workbooks
- Version information display
- Automatic output file naming
A Python utility that renames image files based on their metadata timestamps.
This script renames JPG and PNG image files in a directory to a standardized format: YYYY-MM-DD_<index>.<ext>, where <index> is the chronological order of images taken on the same day.
- Reads EXIF metadata from images (DateTimeOriginal or DateTime fields)
- Falls back to file modification time if no EXIF data is available
- Groups images by date and assigns chronological indices
- Supports both JPG and PNG formats (case-insensitive)
- Includes dry-run mode to preview changes before applying
- Handles duplicate filenames and edge cases gracefully
- Python 3.6+
- Pillow library
Install Pillow if not already installed:
pip install Pillow
Rename all images in a directory:
python rename_images.py /path/to/images
Preview what changes would be made without actually renaming files:
python rename_images.py /path/to/images --dry-run
Rename images in the current directory:
python rename_images.py .
Before:
vacation/
├── IMG_1234.jpg (taken 2024-07-15 10:30:00)
├── IMG_1235.jpg (taken 2024-07-15 14:20:00)
├── IMG_1236.jpg (taken 2024-07-16 09:15:00)
└── photo.png (taken 2024-07-16 16:45:00)
After:
vacation/
├── 2024-07-15_1.jpg
├── 2024-07-15_2.jpg
├── 2024-07-16_1.jpg
└── 2024-07-16_2.png
- Scans Directory: Finds all .jpg, .jpeg, and .png files (case-insensitive)
- Extracts Timestamps: Reads EXIF metadata (DateTimeOriginal or DateTime)
- Fallback: Uses file modification time if no EXIF data exists
- Groups by Date: Organizes images by the date they were taken
- Assigns Indices: Numbers images chronologically within each day
- Renames Files: Updates filenames to YYYY-MM-DD_<index>.<ext> format
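The group-and-index logic described above can be sketched as a pure function (names and structure are illustrative, not the script's actual implementation):

```python
from collections import defaultdict
from datetime import datetime
import os

def plan_renames(files_with_times: list) -> dict:
    """Map (filename, datetime) pairs to YYYY-MM-DD_<index>.<ext> names,
    indexed chronologically within each day."""
    by_date = defaultdict(list)
    for name, taken in files_with_times:
        by_date[taken.date()].append((taken, name))
    plan = {}
    for day, items in by_date.items():
        # Sort by timestamp so indices follow the order photos were taken.
        for index, (_, name) in enumerate(sorted(items), start=1):
            ext = os.path.splitext(name)[1].lower()
            plan[name] = f"{day:%Y-%m-%d}_{index}{ext}"
    return plan
```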
- No EXIF Data: Falls back to file modification timestamp
- Duplicate Names: Skips renaming if target filename already exists
- Already Correct: Skips files that already have the correct name
- Missing Timestamps: Warns and skips files with no available timestamp
- Mixed Case Extensions: Handles .jpg, .JPG, .jpeg, .JPEG, .png, .PNG
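One detail worth noting when parsing EXIF data: the DateTimeOriginal and DateTime fields store timestamps as strings with colons in the date part, so they must be parsed explicitly:

```python
from datetime import datetime

def parse_exif_datetime(value: str) -> datetime:
    """EXIF stores timestamps as 'YYYY:MM:DD HH:MM:SS'."""
    return datetime.strptime(value, "%Y:%m:%d %H:%M:%S")

print(parse_exif_datetime("2024:07:15 10:30:00"))  # 2024-07-15 10:30:00
```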
usage: rename_images.py [-h] [--dry-run] directory
Rename image files based on metadata timestamps
positional arguments:
directory Directory containing image files to rename
optional arguments:
-h, --help show this help message and exit
--dry-run Show what would be renamed without actually renaming files
The script provides detailed feedback:
- Shows total number of images found
- Reports each rename operation or reason for skipping
- Displays final summary of renamed and skipped files
- Warns about any errors or issues encountered
- The script only processes files in the specified directory (non-recursive)
- Original file extensions are preserved
- The script will not overwrite existing files
- Always use --dry-run first to preview changes
- EXIF data is preferred over file modification time for accuracy
- Use --dry-run to preview changes before committing
- The script will never overwrite an existing file
- Files are renamed, not copied, so no disk space is wasted
- Original files are not modified, only renamed
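The no-overwrite guarantee comes down to checking the target path before renaming; a minimal sketch (the helper name is illustrative):

```python
import os

def safe_rename(src: str, dst: str) -> bool:
    """Rename src to dst only if dst does not already exist."""
    if os.path.exists(dst):
        return False  # never overwrite an existing file
    os.rename(src, dst)
    return True
```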