@jgoedeke (Contributor) commented Dec 15, 2025

This PR introduces memory-mapped file handling and a streaming API to enable efficient processing of large IMC files without loading the entire dataset into memory.

Key Changes:

  • Core Library: Replaced std::vector file buffering with a MemoryMappedFile implementation to significantly reduce memory usage.
  • Python Bindings:
    • Added iter_channel_numpy() to the imctermite class. This generator yields data in chunks as NumPy arrays, supporting both "scaled" (physical units) and "raw" (native integer) modes.
    • Added type stubs (.pyi) and package data for improved IDE support.
  • Performance: Implemented efficient memcpy transfer from C++ buffers to NumPy arrays.
  • Fixes: Improved text encoding/codepage handling for Windows.
  • Documentation: Added usage_numpy_chunks.py and usage_timerange.py to demonstrate new functionality.
  • CI: Added Windows target to CI tests.
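
As a rough illustration of why memory mapping helps here, the sketch below reads a small slice of values out of a larger binary file without first copying the whole file into a buffer. It uses Python's standard `mmap` module rather than the C++ `MemoryMappedFile` class from this PR, and the helper names, the 4-byte header, and the little-endian int16 layout are all assumptions made for the example, not the IMC format.

```python
import mmap
import os
import struct
import tempfile


def read_slice_mmap(path, byte_offset, n_values, fmt="<h"):
    """Decode n_values fixed-width values starting at byte_offset.

    The file is mapped read-only and only the requested byte range is
    copied out; the rest of the file is never read into a Python buffer.
    (Hypothetical helper, not part of the imctermite API.)
    """
    size = struct.calcsize(fmt)
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            data = mm[byte_offset:byte_offset + n_values * size]
    return [item[0] for item in struct.iter_unpack(fmt, data)]


def demo_read_slice():
    """Write a toy file (assumed layout: 4-byte header + int16 samples)."""
    values = list(range(-500, 500))
    header = b"HEAD"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(header + struct.pack("<1000h", *values))
        path = tmp.name
    try:
        # Skip the header and the first 10 samples, then read 5 samples.
        return read_slice_mmap(path, len(header) + 10 * 2, 5)
    finally:
        os.remove(path)


print(demo_read_slice())
```

The operating system pages in only the parts of the file that are actually touched, which is the property the core library change relies on for large files.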

Performance Improvements:

Benchmarks comparing the new memory-mapped implementation against the baseline show significant improvements in both memory efficiency and load times:

  • Memory Usage: Reduced memory footprint by ~50% for large files.
    • 100MB file: Overhead reduced from ~204MB to ~101MB.
    • Small file: Overhead reduced from ~1.3MB to ~92KB.
  • Loading Speed: File loading is approximately 5x faster for large files.
    • 100MB file: Load time decreased from ~1.25s to ~0.24s.
    • Small file: Load time decreased from ~2.6ms to ~0.2ms.
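
For reference, the ratios implied by the figures above can be recomputed directly. This is plain arithmetic on the reported numbers, not a new benchmark; the helper names are made up for the example and 1 MB is taken as 1000 KB.

```python
def reduction_pct(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


def speedup(before, after):
    """How many times faster `after` is than `before`."""
    return before / after


# Memory overhead (values as reported above)
mem_large = reduction_pct(204, 101)   # 100MB file, in MB: about 50%
mem_small = reduction_pct(1300, 92)   # small file, in KB: about 93%

# Load time (values as reported above)
load_large = speedup(1.25, 0.24)      # 100MB file, in seconds: about 5x
load_small = speedup(2.6, 0.2)        # small file, in ms: about 13x

print(f"memory: {mem_large:.1f}% (large), {mem_small:.1f}% (small)")
print(f"load:   {load_large:.1f}x (large), {load_small:.1f}x (small)")
```

The headline figures (~50% and ~5x) describe the large-file case; for the small file the relative gains are larger.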

NOTE: This PR is based on #37

Closes #33
Closes #9

- Add pytest configuration for test management
- Implement test suite for CLI functionality and Python module
- Update README with testing instructions and badge
- Fix Dockerfile
- Create .dockerignore to exclude unnecessary files from Docker builds
- Add GitHub Actions workflows for testing
- Clean up makefile to include test commands
@jgoedeke force-pushed the numpy-streaming branch 3 times, most recently from 86df1f2 to d416f25 December 16, 2025 20:17
jgoedeke and others added 5 commits December 16, 2025 21:29
- Modified `component_group` and `channel` constructors to accept raw buffer pointers instead of vectors.
- Enhanced `load_all_data` and `init_metadata` methods for better data initialization and loading.
- Implemented `read_chunk` method in `channel` to facilitate chunked data reading with support for raw and scaled modes.
- Updated `convert_data_to_type` and `convert_chunk_to_double` functions to handle raw data more efficiently.
- Removed redundant `imc_result.hpp` file to streamline the codebase.
- Adjusted Python bindings in `imctermite.pyx` to manage C++ instance memory correctly.
- Update GitHub Actions workflow to support testing on multiple OS
- Refactor memory mapping in imc_buffer.hpp for Windows compatibility
- Improve makefile to handle .pyd files for Python builds
- Add comprehensive tests for streaming and chunking functionality in test_streaming.py
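
To make the chunked reading with raw and scaled modes concrete, here is a minimal stdlib-only sketch of the idea. It is not the `read_chunk` implementation or the `iter_channel_numpy()` signature from this PR: the function name, its parameters, and the linear scaling `factor * raw + offset` are assumptions for illustration, and it yields `array` objects instead of NumPy arrays.

```python
from array import array


def iter_chunks(raw_bytes, typecode="h", chunk_rows=4, mode="scaled",
                factor=1.0, offset=0.0):
    """Yield successive chunks decoded from a bytes-like buffer.

    mode="raw"    -> native integers, as stored
    mode="scaled" -> doubles, assuming value = factor * raw + offset
    (Hypothetical helper; names and scaling rule are assumptions.)
    """
    if mode not in ("raw", "scaled"):
        raise ValueError("mode must be 'raw' or 'scaled'")
    item = array(typecode).itemsize
    view = memoryview(raw_bytes)
    usable = len(view) - len(view) % item   # ignore a trailing partial value
    step = chunk_rows * item
    for start in range(0, usable, step):
        chunk = array(typecode)
        chunk.frombytes(view[start:min(start + step, usable)])
        if mode == "raw":
            yield chunk
        else:
            yield array("d", (factor * v + offset for v in chunk))


# Ten int16 samples in native byte order, read four at a time.
raw = array("h", range(10)).tobytes()
raw_chunks = [list(c) for c in iter_chunks(raw, mode="raw")]
scaled_chunks = [list(c) for c in iter_chunks(raw, factor=0.5, offset=1.0)]
print(raw_chunks)
print(scaled_chunks[0])
```

Because each chunk is decoded only when the consumer asks for it, peak memory is bounded by the chunk size rather than by the channel length.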
…ficiency (RecordEvolution#9)

Reduces memory usage by 90% for large datasets while maintaining comparable processing speed.
@jgoedeke changed the title from "Add memory mapping and NumPy streaming support" to "Memory-mapped file handling and streaming API" Dec 16, 2025
@jgoedeke marked this pull request as ready for review December 16, 2025 21:53
@jgoedeke marked this pull request as draft December 17, 2025 08:29
@jgoedeke (Contributor, Author) commented

I just realized that the new numpy dependency was not handled correctly, which makes this PR a major change. I took the opportunity to modernize the Python packaging infrastructure and CI workflows so that they follow current best practices and support Python 3.10-3.13.

Changes

Packaging

  • Migrated package metadata from setup.cfg to pyproject.toml (PEP 621)
  • Replaced deprecated setup.py commands with python -m build
  • Added numpy as explicit dependency (>=1.26.0)
  • Version bumped to 3.0.0
  • Simplified setup.py to minimal Cython extension builder

CI/CD

  • Updated GitHub Actions from v2 → v4
  • Replaced ubuntu-24.04 with ubuntu-latest for better maintainability
  • Removed obsolete cibuildwheel==2.1.2 version pinning
  • Configured cibuildwheel via pyproject.toml with proper platform targeting
  • Streamlined dependency installation in test workflows
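
For orientation, a `pyproject.toml` along the lines described in the Packaging and CI/CD lists might look roughly like the fragment below. This is a sketch, not the file from this PR: the distribution name, the build backend, the `requires-python` bound, and the test path are assumptions, and only the version, the numpy pin, and the Python range are taken from the description above.

```toml
# Hypothetical sketch; anything not stated in the PR text is an assumption.
[build-system]
requires = ["setuptools", "wheel", "Cython", "numpy>=1.26.0"]  # backend choice assumed
build-backend = "setuptools.build_meta"

[project]
name = "imctermite"              # assumed distribution name
version = "3.0.0"
requires-python = ">=3.10"       # inferred from the stated 3.10-3.13 support
dependencies = ["numpy>=1.26.0"]

[tool.cibuildwheel]
build = "cp310-* cp311-* cp312-* cp313-*"
test-requires = "pytest"
test-command = "pytest {project}/tests"   # test directory path assumed
```

With metadata in `[project]` (PEP 621), `python -m build` can produce the sdist and wheel without invoking `setup.py` commands directly; `setup.py` then only needs to declare the Cython extension.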

Documentation

  • Added Python version support badge to README
  • Enhanced test documentation with pip editable install instructions
  • Added note about numpy dependency requirement

Build System

  • Updated makefiles to use python3 consistently
  • Simplified build targets using modern pip commands

Testing

All existing tests pass. CI workflows validated on Ubuntu and Windows with Python 3.10-3.13.

@jgoedeke marked this pull request as ready for review December 17, 2025 08:56
- Migrate from setup.cfg to pyproject.toml with PEP 517/621 compliance
- Update to Python build tools (replace setup.py commands with python -m build)
- Upgrade all GitHub Actions to latest versions (@v4, ubuntu-latest)
- Remove outdated cibuildwheel version pinning
- Add numpy as explicit build and runtime dependency
- Bump package version to 3.0.0
- Improve test documentation with development install guidance
- Add Python version badge to README
- Standardize python3 usage across makefiles
Development

Successfully merging this pull request may close these issues.

Loading single channels optimize file export

1 participant