Current Situation/Problem and Proposal
Currently, llm_script generates responses by invoking various prompts according to its own logic. However, because there’s no evaluation framework to systematically compare and analyze model-to-model performance differences, it’s hard to pinpoint clear areas for improvement.
🎯 Objectives
- Leverage external LLMs (GPT-4, Claude, Llama-2, etc.) as evaluators
- Automate the process of having those evaluators review and score the responses produced by our LLM
- Collect and visualize quality metrics (accuracy, consistency, fluency, etc.) to drive a concrete improvement roadmap
🔍 Proposal
- API module extension
  - Add a separate call class for each evaluator LLM under `script/llm` (a sketch follows this list)
- Common interface
  - Define `evaluate(prompt, response) → score`
- Evaluation script
  - Implement a script that iterates through prompts, collects scores, and outputs results (sketched below)
- Metrics design
  - Quantitative: consistency, factuality, response latency (aggregation sketched below)
  - Reporting: periodically publish evaluation reports on the itdoc.kr blog.
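
Below is a minimal sketch of the common interface and one evaluator call class. It assumes the code under `script/llm` is TypeScript; the file names, type names, and prompt wording (`evaluator.ts`, `gptEvaluator.ts`, `Evaluator`, `EvaluationScore`, `GptEvaluator`) are illustrative placeholders, not settled design. Claude or Llama-2 evaluators would be additional classes implementing the same interface.

```typescript
// script/llm/evaluator.ts (hypothetical path)
// Common interface that every evaluator LLM call class implements.
export interface EvaluationScore {
  consistency: number; // 0–10, per the quantitative metrics above
  factuality: number;  // 0–10
  rationale: string;   // short explanation returned by the evaluator
}

export interface Evaluator {
  /** Ask the evaluator LLM to grade `response` as an answer to `prompt`. */
  evaluate(prompt: string, response: string): Promise<EvaluationScore>;
}

// script/llm/gptEvaluator.ts (hypothetical) — one concrete call class.
export class GptEvaluator implements Evaluator {
  constructor(private apiKey: string, private model: string = "gpt-4") {}

  async evaluate(prompt: string, response: string): Promise<EvaluationScore> {
    // Plain fetch against the OpenAI chat completions endpoint; an official SDK would work too.
    const res = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify({
        model: this.model,
        messages: [
          {
            role: "system",
            content:
              'Grade the assistant response to the given prompt. Reply only with JSON: ' +
              '{"consistency": 0-10, "factuality": 0-10, "rationale": "..."}',
          },
          { role: "user", content: `Prompt:\n${prompt}\n\nResponse:\n${response}` },
        ],
      }),
    });
    const data = await res.json();
    // Assumes the evaluator followed the requested JSON format.
    return JSON.parse(data.choices[0].message.content) as EvaluationScore;
  }
}
```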
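
The evaluation script could then iterate over a prompt set, generate a response via the existing llm_script logic, time it, and fan the response out to every registered evaluator. This is a sketch under the same assumptions: `generate` is a placeholder for however llm_script is actually invoked, and the output file name is arbitrary.

```typescript
// script/llm/runEvaluation.ts (hypothetical)
import { writeFileSync } from "node:fs";
import { Evaluator, EvaluationScore } from "./evaluator";

export interface EvaluationRecord {
  prompt: string;
  response: string;
  latencyMs: number;                       // response latency of our LLM
  scores: Record<string, EvaluationScore>; // keyed by evaluator name
}

export async function runEvaluation(
  prompts: string[],
  generate: (prompt: string) => Promise<string>, // stand-in for the llm_script call
  evaluators: Record<string, Evaluator>,
): Promise<EvaluationRecord[]> {
  const records: EvaluationRecord[] = [];

  for (const prompt of prompts) {
    const start = Date.now();
    const response = await generate(prompt);
    const latencyMs = Date.now() - start;

    // Collect a score from every evaluator LLM for the same response.
    const scores: Record<string, EvaluationScore> = {};
    for (const [name, evaluator] of Object.entries(evaluators)) {
      scores[name] = await evaluator.evaluate(prompt, response);
    }
    records.push({ prompt, response, latencyMs, scores });
  }

  // Persist raw results so reports can be built from them later.
  writeFileSync("evaluation-results.json", JSON.stringify(records, null, 2));
  return records;
}

// Example wiring (hypothetical):
//   runEvaluation(prompts, callLlmScript, { "gpt-4": new GptEvaluator(apiKey) });
```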
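
For the quantitative metrics, a small aggregation step over the collected records could produce the per-run averages (consistency, factuality, response latency) that a periodic report would track. The shapes below follow the hypothetical sketches above.

```typescript
// script/llm/aggregate.ts (hypothetical)
import { EvaluationRecord } from "./runEvaluation";

export interface MetricSummary {
  meanConsistency: number;
  meanFactuality: number;
  meanLatencyMs: number;
}

// Average each quantitative metric across all prompts and evaluators.
export function summarize(records: EvaluationRecord[]): MetricSummary {
  let consistency = 0;
  let factuality = 0;
  let latency = 0;
  let scoreCount = 0;

  for (const record of records) {
    latency += record.latencyMs;
    for (const score of Object.values(record.scores)) {
      consistency += score.consistency;
      factuality += score.factuality;
      scoreCount += 1;
    }
  }

  return {
    meanConsistency: consistency / scoreCount,
    meanFactuality: factuality / scoreCount,
    meanLatencyMs: latency / records.length,
  };
}
```

These summaries are the kind of numbers the periodic reports on the itdoc.kr blog could chart over time.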