TrustScoreEval is a comprehensive, high-performance, and modular platform for AI evaluation, hallucination detection, data quality, and trustworthy AI lifecycle management. It integrates:
- LLM Management (dynamic loading, fine-tuning, evaluation, deployment)
- Data Engineering (ETL, trust scoring, validation, analytics)
- Security (auth, monitoring, secrets, OAuth/SAML)
- Research Platform (experiments, use cases, analysis, project management)
- Unified WebUI for all managers and workflows
- Super-fast async API for all components
## Prerequisites
- Python 3.8+
- pip (Python package manager)
- Git
- [Optional] Node.js (for advanced dashboard features)
## Installation

```bash
git clone https://github.com/dataaispark-spec/TrustScoreEval.git
cd TrustScoreEval
python -m venv .venv
.venv\Scripts\activate        # On Windows
source .venv/bin/activate     # On Linux/Mac
pip install -r requirements.txt
```
## 🌐 Launch the Unified WebUI
The WebUI provides a single interface for LLM, Data, Security, and Research management.
```bash
streamlit run launch_workflow_webui.py
```
- Open http://localhost:8501 in your browser.
- All managers and dashboards are available from the sidebar.
## Start the API Server
The API server exposes all async endpoints for programmatic access.

```bash
python superfast_production_server.py
```
- The server runs at http://localhost:8003
- Health check: http://localhost:8003/health
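
For a quick programmatic check that the server is up, here is a minimal sketch using only the Python standard library. It assumes the server is running on the default port shown above and that the health endpoint returns JSON (the exact response body isn't documented here):

```python
import json
import urllib.request

# Assumes the API server was started with `python superfast_production_server.py`
# and is listening on the default port 8003.
with urllib.request.urlopen("http://localhost:8003/health", timeout=5) as resp:
    health = json.loads(resp.read().decode("utf-8"))  # assumes a JSON body

print(health)
```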
## Typical Workflow
- Start the API server (`python superfast_production_server.py`)
- Start the WebUI (`streamlit run launch_workflow_webui.py`)
- Upload or create datasets (Data Manager)
- Run ETL, trust scoring, and validation (Data Manager)
- Manage and deploy LLMs (LLM Manager)
- Configure security, users, and secrets (Security Manager)
- Create research use cases and experiments (Research Platform)
- Analyze results, generate reports, and monitor system (WebUI dashboards)
## Components
- LLM Manager: CRUD, fine-tune, evaluate, deploy, batch ops (`/llm/*`)
- Data Manager: Dataset CRUD, trust scoring, ETL, batch ops (`/data/*`)
- Security Manager: Auth, user management, monitoring, secrets, OAuth/SAML (`/security/*`)
- Research Platform: Use cases, experiments, analysis, project management (`/research/*`)
- Unified WebUI: All managers, dashboards, and workflows in one place
  - Launch: `streamlit run launch_workflow_webui.py`
  - Use the sidebar to:
    - Upload data, run trust scoring, view analytics
    - Manage LLMs (load, fine-tune, deploy)
    - Configure security, users, secrets
    - Create and run research experiments
## Key API Endpoints
- Health: `GET /health`
- List datasets: `GET /data/datasets/list`
- Create dataset: `POST /data/datasets/create`
- Trust score: `POST /data/datasets/{dataset_id}/trust-score`
- LLM health: `GET /llm/health`
- Security health: `GET /security/health`
- Research health: `GET /research/health`
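
As a rough illustration of calling these endpoints from Python, here is a hedged sketch using the `requests` package (install it separately if it isn't already available). The paths come from the list above, but the request and response field names (`name`, `records`, `dataset_id`) are illustrative placeholders, not the documented schema:

```python
import requests

BASE_URL = "http://localhost:8003"

# Create a dataset. The payload fields are placeholders; consult the API
# source or docs for the actual schema.
create_resp = requests.post(
    f"{BASE_URL}/data/datasets/create",
    json={"name": "demo_dataset", "records": [{"text": "example row"}]},
    timeout=30,
)
create_resp.raise_for_status()
dataset_id = create_resp.json().get("dataset_id")  # field name assumed

# List datasets to confirm the new one is registered.
print(requests.get(f"{BASE_URL}/data/datasets/list", timeout=30).json())

# Request a trust score for the new dataset.
score_resp = requests.post(f"{BASE_URL}/data/datasets/{dataset_id}/trust-score", timeout=60)
print(score_resp.json())
```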
## Troubleshooting
- Missing dependencies? Run `pip install -r requirements.txt` again.
- Port in use? Change the port in the server script or stop the conflicting process.
- Module not found? Ensure your virtual environment is activated.
- Windows line endings warning? Safe to ignore, or run `git config --global core.autocrlf true`.
- API 404? Make sure the server is running and the endpoint path is correct.
- WebUI not loading? Check for errors in the terminal and ensure Streamlit is installed.
## Further Documentation
- See `DEEP_INTEGRATION_ANALYSIS.md` for cross-component workflows
- See `COMPLETE_WORKFLOW_SUMMARY.md` for end-to-end diagnostic flows
- See `README_UNIFIED_WEBUI.md` for WebUI usage
## Trust-Based Evaluation vs. Label Error Detection
This section breaks down the fundamental differences between label error detection (as in tools like CleanLab) and a trust-based approach, and explains why the trust-based approach is more comprehensive and valuable.
**Label Error Detection (e.g., CleanLab):**
- Focus: Data Quality → Model Performance
- Approach: Find and fix problems in training data
- Scope: Limited to labeled-dataset issues
- Outcome: Better training data

**Trust-Based System:**
- Focus: Holistic System Reliability → Real-World Performance
- Approach: Evaluate comprehensive trustworthiness
- Scope: End-to-end system behavior, including deployment
- Outcome: Confidence in system behavior
**Label Error Detection Limitations:**

```python
# CleanLab approach - focused on training data
# (illustrative pseudocode; find_label_errors / remove_label_issues stand in
#  for a label-error-detection library)
def cleanlab_approach(training_data, labels, pred_probs):
    # Only addresses:
    # 1. Mislabeling in training data
    # 2. Data quality issues
    # 3. Confidence in training predictions
    label_issues = find_label_errors(labels, pred_probs)
    cleaned_data = remove_label_issues(training_data, label_issues)
    return cleaned_data  # Better training data, but...
    # What about deployment behavior? Real-world performance?
    # These are NOT addressed by label error detection alone
```

**Trust-Based Approach:**
```python
# OpenTrustEval approach - comprehensive trust evaluation
# (illustrative pseudocode; the evaluate_* helpers stand in for the platform's evaluators)
def trust_based_approach(model, training_data, labels, test_data, production_data):
    trust_assessment = {
        # Training Data Quality (includes label error detection)
        'data_quality': evaluate_data_quality(training_data, labels),
        # Model Reliability
        'reliability': evaluate_reliability(model, test_data),
        # Consistency Across Inputs
        'consistency': evaluate_consistency(model, various_inputs),
        # Fairness and Bias
        'fairness': evaluate_fairness(model, diverse_test_cases),
        # Robustness to Adversarial Attacks
        'robustness': evaluate_robustness(model, adversarial_examples),
        # Explainability and Transparency
        'explainability': evaluate_explainability(model, inputs),
        # Production Behavior
        'deployment_trust': evaluate_production_behavior(model, production_data)
    }
    return comprehensive_trust_score(trust_assessment)
```
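
The aggregation step, `comprehensive_trust_score`, is left abstract above, and the platform's actual scoring logic isn't shown here. One plausible sketch is a weighted mean over the per-dimension scores, with weights chosen for the deployment context (the weights below are purely illustrative):

```python
def comprehensive_trust_score(trust_assessment, weights=None):
    """Combine per-dimension scores (each in [0, 1]) into a single trust score.

    Illustrative only: the aggregation actually used by the platform may differ.
    """
    default_weights = {
        'data_quality': 1.0, 'reliability': 1.5, 'consistency': 1.0,
        'fairness': 1.5, 'robustness': 1.0, 'explainability': 0.5,
        'deployment_trust': 1.5,
    }
    weights = weights or default_weights
    total = sum(weights[dim] for dim in trust_assessment)
    return sum(weights[dim] * score for dim, score in trust_assessment.items()) / total
```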
**The Fundamental Problem:**

```python
# Scenario: Perfect training data, poor real-world trust
class ExampleScenario:
    def demonstrate_limitation(self):
        # Training data is perfect (no label errors)
        training_data_quality = 0.99  # CleanLab would be happy

        # But model has issues:
        reliability_score = 0.6  # Unreliable predictions
        consistency_score = 0.5  # Inconsistent responses
        fairness_score = 0.4     # Biased decisions
        robustness_score = 0.3   # Fragile to input changes

        # Label error detection says: "Data is clean!"
        # Trust system says: "Don't deploy this - it's not trustworthy!"
        return {
            'cleanlab_assessment': 'Data quality excellent',
            'trust_assessment': 'System not ready for deployment'
        }
```

**Label Error Detection Cannot Address:**
```python
# Issues that arise over time and context
def temporal_trust_challenges():
    return {
        # Time-based issues (CleanLab can't detect):
        'concept_drift': 'Model performance degrades as world changes',
        'data_drift': 'Input distribution shifts in production',
        'model_degradation': 'Performance naturally degrades over time',

        # Context-based issues:
        'domain_adaptation': 'Works in training domain but fails in deployment domain',
        'edge_cases': 'Handles common cases but fails on edge cases',
        'user_trust': 'Users lose confidence due to inconsistent behavior'
    }
```
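
Data drift of the kind listed above is typically caught by comparing production inputs against a training-time reference distribution. As a minimal, hedged sketch (not part of TrustScoreEval's documented API; assumes NumPy and SciPy are installed), a two-sample Kolmogorov-Smirnov test on a single numeric feature:

```python
import numpy as np
from scipy import stats

def feature_drift_detected(reference: np.ndarray, production: np.ndarray,
                           alpha: float = 0.01) -> bool:
    """Flag drift on one numeric feature via a two-sample KS test.

    Illustrative only; real monitoring would cover many features, categorical
    drift measures, and multiple-testing corrections.
    """
    result = stats.ks_2samp(reference, production)
    return result.pvalue < alpha  # small p-value -> distributions likely differ

# Synthetic example: the production distribution has shifted.
rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod_feature = rng.normal(loc=0.4, scale=1.2, size=5_000)
print(feature_drift_detected(train_feature, prod_feature))  # True
```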
**Trust systems evaluate:**

```python
def comprehensive_risk_assessment():
    return {
        # Pre-deployment risks (partially covered by CleanLab)
        'training_data_risks': ['label_errors', 'bias', 'completeness'],

        # Model behavior risks (NOT covered by CleanLab)
        'behavioral_risks': [
            'overconfidence',             # Model too confident in wrong answers
            'inconsistency',              # Different responses to similar inputs
            'adversarial_vulnerability',  # Security risks
            'bias_amplification'          # Fairness issues in deployment
        ],

        # Deployment risks (NOT covered by CleanLab)
        'deployment_risks': [
            'production_drift',        # Performance degradation over time
            'user_acceptance',         # Human trust and adoption
            'regulatory_compliance',   # Legal and ethical requirements
            'business_impact'          # Real-world consequences of failures
        ]
    }
```

**Beyond Data Quality:**
```python
def decision_making_support():
    # CleanLab helps answer: "Is my training data good?"
    cleanlab_question = "Should I retrain with cleaned data?"

    # Trust systems help answer broader questions:
    trust_questions = [
        "Should I deploy this model to production?",
        "Can I trust this model's decisions in critical situations?",
        "How will this model perform with real users?",
        "What are the risks of deploying this system?",
        "How can I improve overall system trustworthiness?"
    ]

    return {
        'cleanlab_scope': cleanlab_question,
        'trust_scope': trust_questions
    }
```

**Evolution Over Time:**
```python
def evolution_comparison():
    return {
        'label_error_detection': {
            'phase': 'Training/pre-deployment',
            'frequency': 'One-time or periodic retraining',
            'scope': 'Static training dataset',
            'outcome': 'Better training data'
        },
        'trust_based_system': {
            'phase': 'End-to-end lifecycle (training → deployment → monitoring)',
            'frequency': 'Continuous monitoring',
            'scope': 'Dynamic system behavior in real-world conditions',
            'outcome': 'Confidence in system reliability and safety'
        }
    }
```

**Example: Medical Diagnosis Model**

```python
# CleanLab approach:
medical_model_cleanlab = {
    'training_data_quality': 0.98,  # Very clean data
    'recommendation': 'Ready for deployment'
}

# Trust-based approach:
medical_model_trust = {
    'training_data_quality': 0.98,  # Same clean data
    'reliability_score': 0.7,       # Sometimes confident when wrong
    'consistency_score': 0.6,       # Different diagnoses for similar symptoms
    'robustness_score': 0.5,        # Fragile to slight input variations
    'fairness_score': 0.8,          # Good but not perfect
    'explainability_score': 0.4,    # Poor explanations for decisions
    'overall_trust': 0.6,           # NOT ready for deployment!
    'recommendation': 'Needs significant improvement before deployment'
}
```

**Example: Autonomous Vehicle Perception**

```python
# CleanLab approach:
av_perception_cleanlab = {
    'training_data_quality': 0.95,  # Good object detection labels
    'recommendation': 'Good data quality'
}

# Trust-based approach:
av_perception_trust = {
    'training_data_quality': 0.95,     # Same good data
    'reliability_in_rain': 0.3,        # Terrible in rain conditions
    'consistency_at_night': 0.4,       # Inconsistent night performance
    'robustness_to_adversarial': 0.2,  # Vulnerable to simple attacks
    'edge_case_handling': 0.3,         # Fails on unusual scenarios
    'safety_trust': 0.3,               # DANGEROUS for deployment!
    'recommendation': 'Absolutely not ready - safety risks too high'
}
```

**Decision-Making Granularity:**

Label error detection yields a binary signal:

```text
Data Quality: Good/Bad → Retrain/Don't Retrain
```

A trust-based system yields a multi-dimensional picture and a per-context decision (a sketch of this policy follows the matrix):

```text
Trust Dimensions:
├── Reliability: 0.7 (Moderate confidence)
├── Consistency: 0.6 (Some variability acceptable)
├── Fairness: 0.9 (Excellent)
├── Robustness: 0.4 (Needs improvement)
├── Explainability: 0.8 (Good)
└── Overall Trust: 0.6 (Improvement needed)

Decision Matrix:
├── Critical Applications: DON'T DEPLOY
├── Low-Stakes Applications: DEPLOY with monitoring
└── Research Applications: DEPLOY with caveats
```
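
The decision matrix can be read as a simple policy: the more critical the application, the higher the trust bar. A hedged sketch of that policy (the thresholds are illustrative, not TrustScoreEval's actual defaults):

```python
def deployment_decision(overall_trust: float, application: str) -> str:
    """Map an overall trust score to a deployment recommendation.

    Thresholds are illustrative; tune them per domain and risk appetite.
    """
    required = {
        'critical': 0.9,    # e.g. medical diagnosis, autonomous driving
        'low_stakes': 0.6,  # e.g. internal tooling
        'research': 0.4,    # e.g. offline experiments
    }[application]
    if overall_trust < required:
        return "DON'T DEPLOY"
    return "DEPLOY with caveats" if application == 'research' else "DEPLOY with monitoring"

print(deployment_decision(0.6, 'critical'))    # DON'T DEPLOY
print(deployment_decision(0.6, 'low_stakes'))  # DEPLOY with monitoring
print(deployment_decision(0.6, 'research'))    # DEPLOY with caveats
```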
**Perfect training data ≠ trustworthy system.**
A trust-based system recognizes that:
- Data quality is necessary but not sufficient for trustworthy AI
- Model behavior in deployment matters more than training data quality
- Human trust and acceptance are crucial for real-world success
- Continuous monitoring and improvement are essential for long-term success
Trust-based systems are superior because they:
- Provide comprehensive assessment beyond just data quality
- Support better decision-making for real-world deployment
- Consider end-to-end system behavior rather than isolated components
- Enable continuous improvement throughout the AI lifecycle
- Address human factors like user trust and acceptance
- Prepare for real-world complexity rather than controlled environments
While label error detection is valuable (and should be part of any comprehensive approach), it's only one piece of the much larger trust puzzle. A trust-based system provides the holistic view needed to build truly reliable, safe, and successful AI systems.
TrustScoreEval is developed and maintained by Kumar Sivarajan and contributors. For issues, feature requests, or contributions, please open an issue or pull request on GitHub.