An Active Learning Framework for Simultaneous Optimization of Thermodynamic Efficiency and Inherent Safety in Extractive Distillation Entrainers
This research framework implements a "Safety-by-Design" approach to entrainer selection for ethanol-water separation. Unlike traditional methods that optimize efficiency first and apply safety as a retroactive constraint, this framework treats safety and efficiency as simultaneous objectives within a Multi-Objective Bayesian Optimization (MOBO) loop.
In industrial ethanol-water separation via extractive distillation:
- Traditional approach: Maximize efficiency first → Apply safety constraints later
- Result: Selection of hazardous solvents (e.g., benzene - a known carcinogen)
- Consequence: Expensive containment and mitigation strategies
The framework is a five-phase computational pipeline that:
- Maps the chemical space to identify promising molecular "hot spots"
- Selects candidates using three parallel AI/algorithmic engines
- Expands the search via graph-based molecular similarity traversal
- Optimizes simultaneously for safety and efficiency using MOBO + qEHVI
- Validates rigorously through process simulation
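
The idea of treating safety and efficiency as simultaneous objectives can be illustrated with a non-dominated (Pareto) filter. This is a minimal sketch, not the project's implementation: the `Candidate` fields, the score scales, and the example values are hypothetical, and both objectives are assumed to be scaled so that higher is better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str          # hypothetical label
    efficiency: float  # assumed: higher is better
    safety: float      # assumed: higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.efficiency >= b.efficiency and a.safety >= b.safety
    strictly_better = a.efficiency > b.efficiency or a.safety > b.safety
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Return the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up scores for illustration only.
pool = [
    Candidate("A", efficiency=0.95, safety=0.20),
    Candidate("B", efficiency=0.80, safety=0.70),
    Candidate("C", efficiency=0.60, safety=0.90),
    Candidate("D", efficiency=0.55, safety=0.65),  # dominated by B
]
front = pareto_front(pool)
print([c.name for c in front])
```

An efficiency-first ranking would pick only candidate A here, whereas the filter keeps every trade-off that is not strictly beaten on both objectives.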
┌───────────────────────────────────────────────────────────────────────────┐
│                       ENTRAINER SELECTION FRAMEWORK                       │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│  Phase I: Domain Mapping           Phase II: Multi-Vector Selection       │
│  ┌───────────────────────┐         ┌────────────────────────────────────┐ │
│  │ Literature Survey     │         │ Engine A  Engine B    Engine C     │ │
│  │ Database Scoping      │────────▶│ Graph-RAG TRIZ        RDKit        │ │
│  │ Cluster Definition    │         │ (AI)      (Heuristic) (Algorithmic)│ │
│  │ 100K+ → 500 clusters  │         │     └──────────┬────────────┘      │ │
│  └───────────────────────┘         └────────────────┼───────────────────┘ │
│                                                     │                     │
│                                                     ▼                     │
│  Phase III: Deep Traversal         Phase IV: Bayesian Optimization        │
│  ┌───────────────────────┐         ┌────────────────────────────────────┐ │
│  │ Neo4j Graph DB        │         │ Gaussian Process Surrogate         │ │
│  │ Similarity Expansion  │────────▶│ qEHVI Acquisition Function         │ │
│  │ 75-150 → 150-300      │         │ Pareto Frontier Identification     │ │
│  └───────────────────────┘         └────────────────┬───────────────────┘ │
│                                                     │                     │
│                                                     ▼                     │
│                                    Phase V: Simulation & Validation       │
│                                    ┌─────────────────────────────────┐    │
│                                    │ DWSIM Process Simulation        │    │
│                                    │ Final Top 10 Ranking            │    │
│                                    │ Pareto-Optimal Library Output   │    │
│                                    └─────────────────────────────────┘    │
└───────────────────────────────────────────────────────────────────────────┘
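
The similarity expansion in Phase III can be pictured as a threshold search around seed molecules. The sketch below is an assumption-laden illustration, not the project's code: fingerprints are modeled as plain sets of integer bit indices (in practice they would presumably come from a cheminformatics toolkit), the graph database is replaced by an in-memory dictionary, and the names `expand`, `seeds`, `library`, and the 0.5 threshold are hypothetical.

```python
def tanimoto(fp_a: frozenset[int], fp_b: frozenset[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def expand(seeds: dict[str, frozenset[int]],
           library: dict[str, frozenset[int]],
           threshold: float = 0.5) -> dict[str, frozenset[int]]:
    """One-hop expansion: add library entries similar enough to any seed."""
    expanded = dict(seeds)
    for name, fp in library.items():
        if name in expanded:
            continue
        if any(tanimoto(fp, seed_fp) >= threshold for seed_fp in seeds.values()):
            expanded[name] = fp
    return expanded


# Toy fingerprints (hypothetical bit indices, not real molecules).
seeds = {"seed_1": frozenset({1, 2, 3, 4})}
library = {
    "near": frozenset({1, 2, 3, 5}),  # 3 shared bits out of 5 distinct bits
    "far": frozenset({7, 8, 9}),      # no shared bits
}
result = expand(seeds, library)
print(sorted(result))
```

Repeating the call with the expanded set as the new seeds would give a multi-hop traversal; the threshold controls how quickly the candidate pool grows.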
| ID | Hypothesis | Validation Phase | Success Metric |
|---|---|---|---|
| H1 | Pareto frontier exhibits convex structure with identifiable knee points | Phase IV | ≥1 knee point identified |
| H2 | qEHVI achieves equivalent hypervolume with ≤30% computational budget | Phase IV | HV_30% ≥ 0.95 × HV_100% |
| H3 | Consensus safety scoring reduces uncertainty by ≥25% | Phase II | σ_reduction ≥ 25% |
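
The Phase IV success metrics for H1 and H2 can be made concrete with a small two-objective example. This is a sketch under stated assumptions: both objectives are maximized, the reference point and the example fronts are made up, and the knee is taken as the front point farthest from the chord joining the two extreme points, which is one common heuristic rather than a definition given by the project.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref` (both objectives maximized)."""
    # Keep only points that strictly improve on the reference in both objectives.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest x to the smallest; a point adds a strip
    # only where its y exceeds the best y seen so far.
    pts.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def meets_h2(hv_reduced, hv_full, ratio=0.95):
    """H2 criterion: reduced-budget hypervolume is at least `ratio` of the full-budget one."""
    return hv_reduced >= ratio * hv_full


def knee_point(front):
    """Interior front point farthest from the chord between the extremes (assumed heuristic)."""
    pts = sorted(front)
    if len(pts) < 3:
        return None
    (x1, y1), (x2, y2) = pts[0], pts[-1]

    def distance(p):
        # Unnormalized point-to-line distance; the constant denominator
        # does not change which point is farthest.
        return abs((y2 - y1) * p[0] - (x2 - x1) * p[1] + x2 * y1 - y2 * x1)

    return max(pts[1:-1], key=distance)


# Made-up (efficiency, safety) fronts for illustration only.
full_budget_front = [(1.0, 3.0), (2.5, 2.5), (3.0, 1.0)]
reduced_budget_front = [(1.0, 2.75), (2.5, 2.5), (3.0, 1.0)]
ref = (0.0, 0.0)

hv_full = hypervolume_2d(full_budget_front, ref)
hv_reduced = hypervolume_2d(reduced_budget_front, ref)
print(hv_full, hv_reduced, meets_h2(hv_reduced, hv_full))
print(knee_point(full_budget_front))
```

With these toy fronts the reduced-budget run recovers 7.0 of 7.25 hypervolume units, which clears the 95% bar, and the knee falls at (2.5, 2.5).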
entrainer-selection/
├── src/
│   ├── core/                # Shared infrastructure (config, logging, models)
│   ├── phase1/              # Domain Mapping & Cluster Definition
│   ├── phase2/
│   │   ├── engine_a/        # Graph-RAG with Gemini
│   │   ├── engine_b/        # TRIZ Multi-Agent System
│   │   └── engine_c/        # Cheminformatics & Diversity
│   ├── phase3/              # Graph Traversal & Expansion
│   ├── phase4/              # MOBO & Active Learning
│   └── phase5/              # DWSIM Simulation & Validation
├── config/
│   ├── infra_config.yaml    # Database, API, logging settings
│   └── science_config.yaml  # SMARTS patterns, thresholds, thermodynamics
├── data/
│   ├── raw/                 # API query results
│   ├── processed/           # Cleaned datasets
│   └── results/             # Output from each phase
├── notebooks/               # Jupyter notebooks for exploration
├── tests/                   # Unit and integration tests
├── docs/                    # Extended documentation
│   └── phases/              # Detailed phase documentation
├── backlog/                 # Development task tracking
└── scripts/                 # Utility scripts
- Python 3.11+
- Neo4j Community Edition (for Graph-RAG)
- DWSIM (for Phase V simulation; Windows only, because the pipeline drives it through COM automation)
# Clone the repository
git clone https://github.com/yourusername/entrainer-selection.git
cd entrainer-selection
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -e ".[dev]"
# Set up configuration
cp config/infra_config.example.yaml config/infra_config.yaml
# Edit config files with your API keys and database credentials

# Required API Keys
export GOOGLE_API_KEY="your-gemini-api-key"
export PUBCHEM_API_KEY="optional-for-higher-rate-limits"
# Database Configuration
export NEO4J_URI="bolt://localhost:7687"
export NEO4J_USER="neo4j"
export NEO4J_PASSWORD="your-password"

| Document | Description |
|---|---|
| TECH_STACK.md | Complete technology stack and dependencies |
| ARCHITECTURE.md | System architecture and data flow |
| Phase Documentation | Detailed documentation for each phase |
| CONTRIBUTING.md | Contribution guidelines |
| API Reference | Module and function documentation |
| Category | Technologies |
|---|---|
| Core Language | Python 3.11+ |
| Cheminformatics | RDKit, PubChemPy |
| Machine Learning | BoTorch, GPyTorch, PyTorch |
| Graph & Vector Databases | Neo4j (graph), ChromaDB (vector store) |
| LLM Integration | Google Gemini API |
| Process Simulation | DWSIM (COM automation) |
| Thermodynamics | thermo (UNIFAC) |
- Pareto-Optimal Library: High-dimensional dataset identifying "Knee Points" - optimal safety/efficiency trade-offs
- Quantifiable Metrics: 20% reduction in inherent risk with <8% efficiency penalty
- Reproducible Workflow: Dockerized pipeline adaptable to other separation problems
- Benchmark Comparison: Validated against ethylene glycol (industry standard) and benzene (historical negative control)
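
The risk and efficiency figures above are relative changes against a baseline, which can be checked with simple arithmetic. The sketch below assumes that risk and efficiency are each summarized as a single positive score and that the baseline is a benchmark entrainer; the function names and the example numbers are hypothetical.

```python
def relative_change(baseline: float, value: float) -> float:
    """Signed fractional change of `value` relative to `baseline`."""
    return (value - baseline) / baseline


def meets_targets(base_risk: float, base_eff: float,
                  cand_risk: float, cand_eff: float,
                  min_risk_reduction: float = 0.20,
                  max_eff_penalty: float = 0.08) -> bool:
    """Check a candidate against the 20% risk-reduction and <8% efficiency-penalty targets."""
    risk_reduction = -relative_change(base_risk, cand_risk)
    eff_penalty = -relative_change(base_eff, cand_eff)
    return risk_reduction >= min_risk_reduction and eff_penalty < max_eff_penalty


# Made-up scores: risk falls from 10.0 to 7.5 (25% lower),
# efficiency falls from 100.0 to 95.0 (5% penalty).
print(meets_targets(10.0, 100.0, 7.5, 95.0))
# Same risk improvement, but a 10% efficiency penalty misses the target.
print(meets_targets(10.0, 100.0, 7.5, 90.0))
```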
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
This project is licensed under the MIT License - see LICENSE for details.
- Altshuller, G. (1999). The Innovation Algorithm: TRIZ
- Laroche, L. et al. (1991). "Homogeneous Azeotropic Distillation" [DOI: 10.1021/ie00020a013]
- Perry's Chemical Engineers' Handbook, 9th Edition
- BoTorch Multi-Objective Optimization: botorch.org
For questions or collaboration inquiries, please open an issue or contact the maintainers.