A Safety-by-Design framework for entrainer selection. Uses Multi-Objective Bayesian Optimization (MOBO) to simultaneously maximize distillation efficiency and inherent safety.

Khanna-Aman/Entrainer_Selection_Algorithm

🧪 Safety-by-Design Framework for Ethanol-Water Separation Entrainer Selection

Python 3.11+ | License: MIT | Code style: black

An Active Learning Framework for Simultaneous Optimization of Thermodynamic Efficiency and Inherent Safety in Extractive Distillation Entrainers

🎯 Project Overview

This research framework implements a "Safety-by-Design" approach to entrainer selection for ethanol-water separation. Unlike traditional methods that optimize efficiency first and apply safety as a retroactive constraint, this framework treats safety and efficiency as simultaneous objectives within a Multi-Objective Bayesian Optimization (MOBO) loop.

The Problem

In industrial ethanol-water separation via extractive distillation:

  • Traditional approach: Maximize efficiency first → Apply safety constraints later
  • Result: Selection of hazardous solvents (e.g., benzene - a known carcinogen)
  • Consequence: Expensive containment and mitigation strategies

Our Solution

A five-phase computational pipeline that:

  1. Maps the chemical space to identify promising molecular "hot spots"
  2. Selects candidates using three parallel AI/algorithmic engines
  3. Expands the search via graph-based molecular similarity traversal
  4. Optimizes simultaneously for safety and efficiency using MOBO + qEHVI
  5. Validates rigorously through process simulation
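Treating safety and efficiency as simultaneous objectives means no single "best" candidate exists; the output is a set of non-dominated trade-offs. The following is a minimal sketch of that idea in plain Python, not the framework's implementation. The `Candidate` type, the score scales, and the sample values are assumptions made up for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str          # hypothetical label, for illustration only
    efficiency: float  # assumed normalized score, higher is better
    safety: float      # assumed normalized score, higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both objectives and strictly better on one."""
    return (
        a.efficiency >= b.efficiency
        and a.safety >= b.safety
        and (a.efficiency > b.efficiency or a.safety > b.safety)
    )


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Return the candidates that no other candidate dominates (input order kept)."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up scores, not real solvent data.
pool = [
    Candidate("A", efficiency=0.90, safety=0.20),
    Candidate("B", efficiency=0.75, safety=0.70),
    Candidate("C", efficiency=0.60, safety=0.60),  # dominated by B
    Candidate("D", efficiency=0.40, safety=0.95),
]
front = pareto_front(pool)
print([c.name for c in front])  # prints ['A', 'B', 'D']
```

An efficiency-first workflow would pick candidate A and then have to mitigate its low safety score; the Pareto view keeps B and D visible as alternatives.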

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                    ENTRAINER SELECTION FRAMEWORK                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Phase I: Domain Mapping          Phase II: Multi-Vector Selection      │
│  ┌─────────────────────┐         ┌─────────────────────────────────────┐│
│  │ Literature Survey   │         │  Engine A    Engine B   Engine C    ││
│  │ Database Scoping    │────────▶│  Graph-RAG   TRIZ      RDKit        ││
│  │ Cluster Definition  │         │  (AI)       (Heuristic)(Algorithmic)││
│  │ 100K+ → 500 clusters│         │     └────────┬─────────┘            ││
│  └─────────────────────┘         └──────────────┼──────────────────────┘│
│                                                 │                       │
│                                                 ▼                       │
│  Phase III: Deep Traversal       Phase IV: Bayesian Optimization        │
│  ┌─────────────────────┐         ┌─────────────────────────────────┐    │
│  │ Neo4j Graph DB      │         │  Gaussian Process Surrogate     │    │
│  │ Similarity Expansion│────────▶│  qEHVI Acquisition Function     │    │
│  │ 75-150 → 150-300    │         │  Pareto Frontier Identification │    │
│  └─────────────────────┘         └──────────────┬──────────────────┘    │
│                                                 │                       │
│                                                 ▼                       │
│                           Phase V: Simulation & Validation              │
│                           ┌─────────────────────────────────┐           │
│                           │  DWSIM Process Simulation       │           │
│                           │  Final Top 10 Ranking           │           │
│                           │  Pareto-Optimal Library Output  │           │
│                           └─────────────────────────────────┘           │
└─────────────────────────────────────────────────────────────────────────┘
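The Phase III "Similarity Expansion" step grows the candidate pool by following similarity links outward from already-selected molecules. A rough, dependency-free sketch of that kind of traversal is shown here. It assumes fingerprints are represented as sets of "on" bit indices and uses Tanimoto similarity with a hypothetical threshold; the actual pipeline stores the graph in Neo4j, and none of the names below come from the repository.

```python
def tanimoto(a: frozenset[int], b: frozenset[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def expand(
    seeds: set[str],
    library: dict[str, frozenset[int]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the project config
    max_hops: int = 2,       # assumed traversal depth
) -> set[str]:
    """Breadth-first expansion: each hop adds molecules similar to the previous hop."""
    selected = set(seeds)
    frontier = set(seeds)
    for _ in range(max_hops):
        next_frontier = {
            name
            for name, fp in library.items()
            if name not in selected
            and any(tanimoto(fp, library[s]) >= threshold for s in frontier)
        }
        if not next_frontier:
            break
        selected |= next_frontier
        frontier = next_frontier
    return selected


# Toy fingerprints (bit indices), invented for illustration.
library = {
    "m1": frozenset({1, 2, 3, 4}),
    "m2": frozenset({1, 2, 3, 5}),  # similarity 0.6 to m1
    "m3": frozenset({1, 2, 5, 6}),  # similarity 0.6 to m2, about 0.33 to m1
    "m4": frozenset({7, 8, 9}),     # unrelated to the others
}
print(sorted(expand({"m1"}, library)))  # prints ['m1', 'm2', 'm3']
```

Note that m3 is reached only through m2, which is the point of a multi-hop traversal: it surfaces molecules that a single similarity query against the seeds would miss.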

🔬 Research Hypotheses

| ID | Hypothesis | Validation Phase | Success Metric |
| --- | --- | --- | --- |
| H1 | Pareto frontier exhibits convex structure with identifiable knee points | Phase IV | ≥1 knee point identified |
| H2 | qEHVI achieves equivalent hypervolume with ≤30% computational budget | Phase IV | HV_30% ≥ 0.95 × HV_100% |
| H3 | Consensus safety scoring reduces uncertainty by ≥25% | Phase II | σ_reduction ≥ 25% |
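The H2 success metric compares the hypervolume reached at 30% of the budget against the full-budget hypervolume. As a hedged illustration of how such a check could look for two maximized objectives, the sketch below computes a 2D hypervolume with a simple sweep. The reference point, the sample fronts, and the function names are assumptions; the project itself lists BoTorch for the optimization loop.

```python
def hypervolume_2d(
    points: list[tuple[float, float]],
    ref: tuple[float, float] = (0.0, 0.0),
) -> float:
    """Area dominated by `points` relative to `ref`, with both objectives maximized."""
    # Only points strictly better than the reference point contribute.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest x down; a point adds area only if it raises the best y so far.
    pts.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def meets_h2(hv_partial: float, hv_full: float, ratio: float = 0.95) -> bool:
    """H2 criterion from the table: HV_30% >= 0.95 x HV_100%."""
    return hv_partial >= ratio * hv_full


# Invented (efficiency, safety) fronts, for illustration only.
front_full = [(0.90, 0.20), (0.75, 0.70), (0.40, 0.95)]
front_partial = [(0.90, 0.20), (0.75, 0.68), (0.40, 0.95)]

hv_full = hypervolume_2d(front_full)
hv_partial = hypervolume_2d(front_partial)
print(round(hv_partial / hv_full, 3), meets_h2(hv_partial, hv_full))
```

The result depends on the choice of reference point, so any comparison between budgets should fix the same `ref` for both fronts.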

📁 Project Structure

entrainer-selection/
├── src/
│   ├── core/              # Shared infrastructure (config, logging, models)
│   ├── phase1/            # Domain Mapping & Cluster Definition
│   ├── phase2/
│   │   ├── engine_a/      # Graph-RAG with Gemini
│   │   ├── engine_b/      # TRIZ Multi-Agent System
│   │   └── engine_c/      # Cheminformatics & Diversity
│   ├── phase3/            # Graph Traversal & Expansion
│   ├── phase4/            # MOBO & Active Learning
│   └── phase5/            # DWSIM Simulation & Validation
├── config/
│   ├── infra_config.yaml  # Database, API, logging settings
│   └── science_config.yaml # SMARTS patterns, thresholds, thermodynamics
├── data/
│   ├── raw/               # API query results
│   ├── processed/         # Cleaned datasets
│   └── results/           # Output from each phase
├── notebooks/             # Jupyter notebooks for exploration
├── tests/                 # Unit and integration tests
├── docs/                  # Extended documentation
│   └── phases/            # Detailed phase documentation
├── backlog/               # Development task tracking
└── scripts/               # Utility scripts

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Neo4j Community Edition (for Graph-RAG)
  • DWSIM (for Phase V simulation - Windows only)

Installation

# Clone the repository
git clone https://github.com/Khanna-Aman/Entrainer_Selection_Algorithm.git
cd Entrainer_Selection_Algorithm

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -e ".[dev]"

# Set up configuration
cp config/infra_config.example.yaml config/infra_config.yaml
# Edit config files with your API keys and database credentials

Environment Variables

# Required API Keys
export GOOGLE_API_KEY="your-gemini-api-key"
export PUBCHEM_API_KEY="optional-for-higher-rate-limits"

# Database Configuration
export NEO4J_URI="bolt://localhost:7687"
export NEO4J_USER="neo4j"
export NEO4J_PASSWORD="your-password"
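On the Python side, these variables could be read once at startup and validated before any phase runs. The loader below is a sketch under assumptions: the `Settings` class and `load_settings` function are hypothetical names, the defaults simply mirror the example values above, and treating `NEO4J_PASSWORD` as required is a guess rather than documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    google_api_key: str
    pubchem_api_key: Optional[str]  # optional, only raises PubChem rate limits
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ); fail fast on missing keys."""
    env = os.environ if env is None else env
    # Assumption: these two have no sensible default and must be provided.
    missing = [k for k in ("GOOGLE_API_KEY", "NEO4J_PASSWORD") if not env.get(k)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        google_api_key=env["GOOGLE_API_KEY"],
        pubchem_api_key=env.get("PUBCHEM_API_KEY") or None,
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        neo4j_user=env.get("NEO4J_USER", "neo4j"),
        neo4j_password=env["NEO4J_PASSWORD"],
    )


# Passing a dict keeps the example independent of the real environment.
settings = load_settings({"GOOGLE_API_KEY": "demo-key", "NEO4J_PASSWORD": "demo-pass"})
print(settings.neo4j_uri)  # prints bolt://localhost:7687
```

Accepting an explicit mapping also makes the loader easy to exercise in unit tests without touching `os.environ`.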

📖 Documentation

| Document | Description |
| --- | --- |
| TECH_STACK.md | Complete technology stack and dependencies |
| ARCHITECTURE.md | System architecture and data flow |
| Phase Documentation | Detailed documentation for each phase |
| CONTRIBUTING.md | Contribution guidelines |
| API Reference | Module and function documentation |

🧬 Key Technologies

| Category | Technologies |
| --- | --- |
| Core Language | Python 3.11+ |
| Cheminformatics | RDKit, PubChemPy |
| Machine Learning | BoTorch, GPyTorch, PyTorch |
| Graph Database | Neo4j + ChromaDB |
| LLM Integration | Google Gemini API |
| Process Simulation | DWSIM (COM automation) |
| Thermodynamics | thermo (UNIFAC) |

📊 Expected Outcomes

  1. Pareto-Optimal Library: High-dimensional dataset identifying "Knee Points" - optimal safety/efficiency trade-offs
  2. Quantifiable Metrics: 20% reduction in inherent risk with <8% efficiency penalty
  3. Reproducible Workflow: Dockerized pipeline adaptable to other separation problems
  4. Benchmark Comparison: Validated against ethylene glycol (industry standard) and benzene (historical negative control)
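The "Knee Points" in outcome 1 are the places on the Pareto front where giving up a little of one objective buys a lot of the other. One common heuristic, shown here only as a sketch and not necessarily the method this project will use, picks the front point farthest from the straight line joining the two extreme points. The function name and the sample front are made up, and the objectives are assumed to be normalized to comparable scales.

```python
import math


def knee_point(front: list[tuple[float, float]]) -> tuple[float, float]:
    """Return the interior front point with the largest distance to the extreme-to-extreme line."""
    if len(front) < 3:
        raise ValueError("need at least three points to locate a knee")
    pts = sorted(front)  # order by the first objective
    (x1, y1), (x2, y2) = pts[0], pts[-1]
    length = math.hypot(x2 - x1, y2 - y1)
    if length == 0.0:
        raise ValueError("extreme points coincide")

    def distance(p: tuple[float, float]) -> float:
        # Perpendicular distance from p to the line through (x1, y1) and (x2, y2).
        x0, y0 = p
        return abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1) / length

    return max(pts[1:-1], key=distance)


# Invented (efficiency, safety) front, for illustration only.
front = [(0.40, 0.95), (0.60, 0.85), (0.75, 0.70), (0.90, 0.20)]
print(knee_point(front))  # prints (0.75, 0.7)
```

In this toy front, moving past (0.75, 0.70) toward higher efficiency costs a large drop in safety, which is why that point is selected.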

🀝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

📜 License

This project is licensed under the MIT License - see LICENSE for details.

📚 References

  • Altshuller, G. (1999). The Innovation Algorithm: TRIZ
  • Laroche, L. et al. (1991). "Homogeneous Azeotropic Distillation" [DOI: 10.1021/ie00020a013]
  • Perry's Chemical Engineers' Handbook, 9th Edition
  • BoTorch Multi-Objective Optimization: botorch.org

📧 Contact

For questions or collaboration inquiries, please open an issue or contact the maintainers.


Status: 🚧 Active Development | Current Phase: Infrastructure Setup
