
Conversation

@rivinduw commented on Oct 7, 2024

This PR introduces support for evaluations in the Arcee Python SDK.
Added a start_evaluation function to arcee/api.py:

  • Allows users to initiate various types of evaluation jobs, including LLM-as-a-judge and lm-eval-harness benchmarks.

Usage Example for testing

import os

# Point the SDK at the dev environment (keys redacted)
os.environ['ARCEE_API_URL'] = 'https://arcee-dev.dev.arcee.ai/api'
os.environ['ARCEE_ORG'] = 'rivinduorg'
os.environ['ARCEE_API_KEY'] = ''

openai_api_key = ''

import arcee

evaluation_params = {
    'evaluations_name': 'evals_test_oct7',
    'eval_type': 'llm_as_a_judge',
    'qa_set_name': 'mmlu_20q',
    'judge_model': {
        'model_name': 'gpt-4o',
        'custom_prompt': 'Evaluate which response better adheres to factual accuracy, clarity, and relevance.',
        'base_url': 'https://api.openai.com/v1',
        'api_key': openai_api_key,
    },
    'deployment_model': {
        'model_name': 'gpt-4o-mini',
        'base_url': 'https://api.openai.com/v1',
        'api_key': openai_api_key,
    },
    'reference_model': {
        'model_name': 'gpt-3.5-turbo-0125',
        'base_url': 'https://api.openai.com/v1',
        'api_key': openai_api_key,
    },
}

# Start the evaluation job, then check its status by ID
result = arcee.start_evaluation(**evaluation_params)
eval_status = arcee.get_evaluation_status(result['evaluations_id'])
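
For longer-running jobs, the status call can be polled until the evaluation reaches a terminal state. A minimal sketch, assuming the status response is a dict with a 'status' field and that 'completed'/'failed' are the terminal values (those names are assumptions, not confirmed by this PR):

import time

# Poll until the evaluation finishes.
# NOTE: the 'status' key and the 'completed'/'failed' values are assumptions;
# check the actual response payload for the real field names.
while True:
    eval_status = arcee.get_evaluation_status(result['evaluations_id'])
    print(eval_status)
    if eval_status.get('status') in ('completed', 'failed'):
        break
    time.sleep(30)  # wait before polling again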

@rivinduw changed the title from "Evaluations api" to "Add Evaluation Support to Arcee Python SDK" on Oct 7, 2024
@Jacobsolawetz (Contributor) commented on Oct 7, 2024

Noticed that an evaluation with different params but the same name resolves to the same ID; this should error.

@rivinduw (Author) commented on Oct 7, 2024

Noticed that an evaluation with different params but the same name resolves to the same ID; this should error.

Yup, the params get overwritten currently, so we don't end up with two evaluations that have the same name but different IDs.
Should we error here or in the platform? I think start pretraining might have the same behavior.

@rivinduw (Author) commented on Oct 8, 2024

I have a local branch of the platform that raises an error when evaluations have duplicate names, but I'm thinking we should be consistent across all the other services too.

Currently the corpus uploader (https://github.com/arcee-ai/arcee-platform/blob/eaec257eca5e1061813babd70006983200b7d57e/backend/app/api/v2/services/corpus.py#L171) has the same logic of updating the existing record with the new params.

Pretraining (https://github.com/arcee-ai/arcee-platform/blob/eaec257eca5e1061813babd70006983200b7d57e/backend/app/api/v2/services/pretraining.py#L65), deployment, etc. seem to either assume the existing params have not changed, or look up each field in Supabase separately and throw an "X with this name does not exist" error.

Any thoughts on the best consistent way to handle repeated start_x calls, @mryave @nason?
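
For discussion, a minimal sketch of one consistent policy: error when the name already exists with different params, and treat the call as a no-op when the params are identical. This is illustrative only, not arcee-platform code; the in-memory dict stands in for the Supabase lookup:

# Stand-in for the per-org table keyed by job name.
_existing_jobs: dict[str, dict] = {}

def start_job_or_error(name: str, params: dict) -> str:
    """Hypothetical shared policy for repeated start_x calls."""
    stored = _existing_jobs.get(name)
    if stored is None:
        _existing_jobs[name] = params
        return f"created '{name}'"
    if stored != params:
        # Same name, different params: refuse instead of silently overwriting.
        raise ValueError(
            f"'{name}' already exists with different params; "
            "use a new name or delete the existing job first"
        )
    # Same name, same params: treat the call as idempotent.
    return f"'{name}' already exists with identical params (no-op)"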
