Current Situation/Problem and Proposal
Currently, llm_script generates responses by invoking various prompts according to its own logic. However, because there’s no evaluation framework to systematically compare and analyze model-to-model performance differences, it’s hard to pinpoint clear areas for improvement.
🎯 Objectives
- Leverage external LLMs (GPT-4, Claude, Llama-2, etc.) as evaluators
- Automate the process of having those evaluators review and score the responses produced by our LLM
- Collect and visualize quality metrics (accuracy, consistency, fluency, etc.) to drive a concrete improvement roadmap
🔍 Proposal
- API module extension
  - Add a separate call class for each evaluator LLM under `script/llm` (a sketch follows this list)
- Common interface
  - Define `evaluate(prompt, response) → score`
- Evaluation script
  - Implement a script that iterates through prompts, collects scores, and outputs results (sketched below)
- Metrics design
  - Quantitative: consistency, factuality, response latency (aggregation sketched below)
  - Reporting: periodically publish evaluation reports on the itdoc.kr blog.
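
Below is a minimal sketch of the common interface and one evaluator call class. It assumes the code under `script/llm` is TypeScript; the file names, type names, and prompt wording (`evaluator.ts`, `gptEvaluator.ts`, `Evaluator`, `EvaluationScore`, `GptEvaluator`) are illustrative placeholders, not settled design. Claude or Llama-2 evaluators would be additional classes implementing the same interface.

```typescript
// script/llm/evaluator.ts (hypothetical path)
// Common interface that every evaluator LLM call class implements.
export interface EvaluationScore {
  consistency: number; // 0–10, per the quantitative metrics above
  factuality: number;  // 0–10
  rationale: string;   // short explanation returned by the evaluator
}

export interface Evaluator {
  /** Ask the evaluator LLM to grade `response` as an answer to `prompt`. */
  evaluate(prompt: string, response: string): Promise<EvaluationScore>;
}

// script/llm/gptEvaluator.ts (hypothetical) — one concrete call class.
export class GptEvaluator implements Evaluator {
  constructor(private apiKey: string, private model: string = "gpt-4") {}

  async evaluate(prompt: string, response: string): Promise<EvaluationScore> {
    // Plain fetch against the OpenAI chat completions endpoint; an official SDK would work too.
    const res = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify({
        model: this.model,
        messages: [
          {
            role: "system",
            content:
              'Grade the assistant response to the given prompt. Reply only with JSON: ' +
              '{"consistency": 0-10, "factuality": 0-10, "rationale": "..."}',
          },
          { role: "user", content: `Prompt:\n${prompt}\n\nResponse:\n${response}` },
        ],
      }),
    });
    const data = await res.json();
    // Assumes the evaluator followed the requested JSON format.
    return JSON.parse(data.choices[0].message.content) as EvaluationScore;
  }
}
```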
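
The evaluation script could then iterate over a prompt set, generate a response via the existing llm_script logic, time it, and fan the response out to every registered evaluator. This is a sketch under the same assumptions: `generate` is a placeholder for however llm_script is actually invoked, and the output file name is arbitrary.

```typescript
// script/llm/runEvaluation.ts (hypothetical)
import { writeFileSync } from "node:fs";
import { Evaluator, EvaluationScore } from "./evaluator";

export interface EvaluationRecord {
  prompt: string;
  response: string;
  latencyMs: number;                       // response latency of our LLM
  scores: Record<string, EvaluationScore>; // keyed by evaluator name
}

export async function runEvaluation(
  prompts: string[],
  generate: (prompt: string) => Promise<string>, // stand-in for the llm_script call
  evaluators: Record<string, Evaluator>,
): Promise<EvaluationRecord[]> {
  const records: EvaluationRecord[] = [];

  for (const prompt of prompts) {
    const start = Date.now();
    const response = await generate(prompt);
    const latencyMs = Date.now() - start;

    // Collect a score from every evaluator LLM for the same response.
    const scores: Record<string, EvaluationScore> = {};
    for (const [name, evaluator] of Object.entries(evaluators)) {
      scores[name] = await evaluator.evaluate(prompt, response);
    }
    records.push({ prompt, response, latencyMs, scores });
  }

  // Persist raw results so reports can be built from them later.
  writeFileSync("evaluation-results.json", JSON.stringify(records, null, 2));
  return records;
}

// Example wiring (hypothetical):
//   runEvaluation(prompts, callLlmScript, { "gpt-4": new GptEvaluator(apiKey) });
```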
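
For the quantitative metrics, a small aggregation step over the collected records could produce the per-run averages (consistency, factuality, response latency) that a periodic report would track. The shapes below follow the hypothetical sketches above.

```typescript
// script/llm/aggregate.ts (hypothetical)
import { EvaluationRecord } from "./runEvaluation";

export interface MetricSummary {
  meanConsistency: number;
  meanFactuality: number;
  meanLatencyMs: number;
}

// Average each quantitative metric across all prompts and evaluators.
export function summarize(records: EvaluationRecord[]): MetricSummary {
  let consistency = 0;
  let factuality = 0;
  let latency = 0;
  let scoreCount = 0;

  for (const record of records) {
    latency += record.latencyMs;
    for (const score of Object.values(record.scores)) {
      consistency += score.consistency;
      factuality += score.factuality;
      scoreCount += 1;
    }
  }

  return {
    meanConsistency: consistency / scoreCount,
    meanFactuality: factuality / scoreCount,
    meanLatencyMs: latency / records.length,
  };
}
```

These summaries are the kind of numbers the periodic reports on the itdoc.kr blog could chart over time.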