
[Feature]: Automate llm_script performance verification by building a separate LLM-call-based evaluation system #211

@wnghdcjfe

Description


Current Situation/Problem and Proposal

Currently, llm_script generates responses by invoking various prompts according to its own logic. However, because there’s no evaluation framework to systematically compare and analyze model-to-model performance differences, it’s hard to pinpoint clear areas for improvement.

🎯 Objectives

Leverage external LLMs (GPT-4, Claude, Llama-2, etc.) as evaluators

Automate the process of having those evaluators review and score the responses produced by our LLM

Collect and visualize quality metrics (accuracy, consistency, fluency, etc.) to drive a concrete improvement roadmap

🔍 Proposal

API module extension

Add a separate call class for each evaluator LLM under script/llm

Common interface

Define evaluate(prompt, response) → score
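
As a rough sketch of what this common interface could look like, assuming the evaluator call classes live under script/llm and are written in TypeScript (the names LLMEvaluator, EvaluationScore, and GPT4Evaluator are illustrative, not an existing API; the GPT-4 example uses the OpenAI Node SDK):

```ts
// script/llm/evaluator.ts (hypothetical layout)
import OpenAI from "openai";

/** Score returned by an evaluator LLM for one prompt/response pair. */
export interface EvaluationScore {
    accuracy: number;     // 0-10, factual correctness as judged by the evaluator
    consistency: number;  // 0-10, internal consistency of the response
    fluency: number;      // 0-10, readability / naturalness
    latencyMs?: number;   // measured by the evaluation script, not the evaluator
}

/** Common interface every evaluator LLM call class implements. */
export interface LLMEvaluator {
    readonly name: string;
    evaluate(prompt: string, response: string): Promise<EvaluationScore>;
}

/** Example evaluator backed by GPT-4 via the OpenAI SDK. */
export class GPT4Evaluator implements LLMEvaluator {
    readonly name = "gpt-4";
    private client = new OpenAI(); // reads OPENAI_API_KEY from the environment

    async evaluate(prompt: string, response: string): Promise<EvaluationScore> {
        const completion = await this.client.chat.completions.create({
            model: "gpt-4",
            messages: [
                {
                    role: "system",
                    content:
                        "Rate the assistant response for accuracy, consistency and fluency " +
                        'on a 0-10 scale. Reply with JSON: {"accuracy":n,"consistency":n,"fluency":n}',
                },
                { role: "user", content: `Prompt:\n${prompt}\n\nResponse:\n${response}` },
            ],
        });
        return JSON.parse(completion.choices[0].message.content ?? "{}") as EvaluationScore;
    }
}
```

Claude, Llama-2, and other evaluators would implement the same interface, so the evaluation script only depends on evaluate(prompt, response).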

Evaluation script

Implement a script that iterates through prompts, collects scores, and outputs results
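
A minimal sketch of how that script could drive the evaluators above, assuming a prompts.json input and a scores.json output (the file names and the runLlmScript placeholder are assumptions; it would need to be wired to the real llm_script entry point):

```ts
// script/llm/run-evaluation.ts (hypothetical)
import { readFileSync, writeFileSync } from "node:fs";
import { GPT4Evaluator, type EvaluationScore, type LLMEvaluator } from "./evaluator";

// Placeholder: connect this to however llm_script currently produces a response.
async function runLlmScript(prompt: string): Promise<string> {
    throw new Error("wire runLlmScript to the existing llm_script entry point");
}

async function main(): Promise<void> {
    const prompts: string[] = JSON.parse(readFileSync("prompts.json", "utf8"));
    const evaluators: LLMEvaluator[] = [new GPT4Evaluator()];
    const results: Array<EvaluationScore & { prompt: string; evaluator: string }> = [];

    for (const prompt of prompts) {
        const started = Date.now();
        const response = await runLlmScript(prompt);
        const latencyMs = Date.now() - started;

        // Collect a score from every evaluator for the same response.
        for (const evaluator of evaluators) {
            const score = await evaluator.evaluate(prompt, response);
            results.push({ prompt, evaluator: evaluator.name, ...score, latencyMs });
        }
    }

    writeFileSync("scores.json", JSON.stringify(results, null, 2));
    console.log(`Wrote ${results.length} evaluation records to scores.json`);
}

main().catch((err) => {
    console.error(err);
    process.exit(1);
});
```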

Metrics design

Quantitative: consistency, factuality, response latency

Reporting: periodically publish evaluation reports on the itdoc.kr blog.
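
To turn the raw scores into the quantitative metrics listed above, a small aggregation step could average each dimension per evaluator; something along these lines (the record shape mirrors the script sketch above and is an assumption):

```ts
// script/llm/aggregate.ts (hypothetical): mean score per evaluator and metric
interface EvaluationRecord {
    evaluator: string;
    accuracy: number;
    consistency: number;
    fluency: number;
    latencyMs: number;
}

export function aggregate(records: EvaluationRecord[]) {
    // Group records by evaluator name.
    const byEvaluator = new Map<string, EvaluationRecord[]>();
    for (const r of records) {
        const bucket = byEvaluator.get(r.evaluator) ?? [];
        bucket.push(r);
        byEvaluator.set(r.evaluator, bucket);
    }

    const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

    // One summary row per evaluator, ready for a report or a chart.
    return [...byEvaluator.entries()].map(([evaluator, rs]) => ({
        evaluator,
        accuracy: mean(rs.map((r) => r.accuracy)),
        consistency: mean(rs.map((r) => r.consistency)),
        fluency: mean(rs.map((r) => r.fluency)),
        latencyMs: mean(rs.map((r) => r.latencyMs)),
    }));
}
```

The summary rows could then feed the periodic reports on the itdoc.kr blog.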
