Refactor evaluation logic

Currently, evaluation is performed by multiple scripts that share part of their logic (e.g., loading and processing the dialogues).

Suggestions:
- Create a class for metrics
- Have a general script that can compute the metrics of interest
- Move dialogue annotation to DialogueKit utilities
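
A rough sketch of how the first two suggestions could fit together, to make the idea concrete. All names here (`Metric`, `AvgTurns`, `evaluate`, and the `Dialogue` type alias) are hypothetical and are not taken from the existing scripts or from DialogueKit. Dialogues are assumed to be already loaded as lists of utterance dicts.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

# Assumption: a dialogue is a list of utterance dicts. The real
# representation used by the evaluation scripts may differ.
Dialogue = List[Dict[str, Any]]


class Metric(ABC):
    """Hypothetical base class that every metric would implement."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def compute(self, dialogues: List[Dialogue]) -> float:
        """Returns the metric value over a collection of dialogues."""


class AvgTurns(Metric):
    """Illustrative metric: average number of utterances per dialogue."""

    def __init__(self) -> None:
        super().__init__("avg_turns")

    def compute(self, dialogues: List[Dialogue]) -> float:
        if not dialogues:
            return 0.0
        return sum(len(d) for d in dialogues) / len(dialogues)


def evaluate(
    dialogues: List[Dialogue], metrics: List[Metric]
) -> Dict[str, float]:
    """Computes each requested metric over the same loaded dialogues."""
    return {metric.name: metric.compute(dialogues) for metric in metrics}
```

With this shape, the general script would load and process the dialogues once and pass them to `evaluate` together with the metrics of interest, so the shared logic lives in one place instead of being repeated per script.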