diff --git a/docs/API_REFERENCE.md b/docs/API_REFERENCE.md
index 387d91e..eedb9da 100644
--- a/docs/API_REFERENCE.md
+++ b/docs/API_REFERENCE.md
@@ -9,6 +9,8 @@ Complete method reference for the Langfuse Ruby SDK.
 - [Prompt Management](#prompt-management)
 - [Tracing & Observability](#tracing--observability)
 - [Scoring](#scoring)
+- [Datasets](#datasets)
+- [Experiments](#experiments)
 - [Attribute Propagation](#attribute-propagation)
 - [Types](#types)
 - [Exceptions](#exceptions)
@@ -41,7 +43,7 @@ Block receives a configuration object with these properties:
 | `cache_backend` | Symbol | No | `:memory` | `:memory` or `:rails` |
 | `cache_lock_timeout` | Integer | No | `10` | Lock timeout (seconds) |
 | `cache_stale_while_revalidate` | Boolean | No | `false` | Enable stale-while-revalidate |
-| `cache_stale_ttl` | Integer | No | `60` when SWR is enabled | Stale TTL (seconds) |
+| `cache_stale_ttl` | Integer | No | `0` | Stale TTL (seconds) |
 | `cache_refresh_threads` | Integer | No | `5` | Background refresh threads |
 | `batch_size` | Integer | No | `50` | Score batch size |
 | `flush_interval` | Integer | No | `10` | Score flush interval (seconds) |
@@ -218,15 +220,15 @@ List all prompts in the project.
 **Signature:**
 
 ```ruby
-list_prompts(page: 1, limit: 50)
+list_prompts(page: nil, limit: nil)
 ```
 
 **Parameters:**
 
 | Parameter | Type | Required | Default | Description |
 | --------- | ------- | -------- | ------- | ---------------- |
-| `page` | Integer | No | `1` | Page number |
-| `limit` | Integer | No | `50` | Results per page |
+| `page` | Integer | No | - | Page number |
+| `limit` | Integer | No | - | Results per page |
 
 **Returns:** Array of prompt hashes
 
@@ -520,7 +522,8 @@ Create a score for a trace or observation.
 **Signature:**
 
 ```ruby
-create_score(name:, value:, trace_id: nil, observation_id: nil, comment: nil, metadata: nil, data_type: :numeric)
+create_score(name:, value:, trace_id: nil, observation_id: nil, comment: nil, metadata: nil,
+             data_type: :numeric, dataset_run_id: nil, config_id: nil)
 ```
 
 **Parameters:**
@@ -534,6 +537,8 @@ create_score(name:, value:, trace_id: nil, observation_id: nil, comment: nil, me
 | `comment` | String | No | Score comment |
 | `metadata` | Hash | No | Additional metadata |
 | `data_type` | Symbol | No | `:numeric`, `:boolean`, or `:categorical` |
+| `dataset_run_id` | String | No | Dataset run ID to associate with |
+| `config_id` | String | No | Score config ID |
 
 **Note:** Must provide at least one of `trace_id` or `observation_id`.
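+
+**Example** (a sketch of the new dataset-run association; the receiver follows the
+`client` convention used elsewhere in this reference, and the run/config IDs are
+placeholders for values obtained from `create_dataset_run_item` and your score config):
+
+```ruby
+client.create_score(
+  name: "accuracy",
+  value: 1.0,
+  trace_id: "abc123",
+  dataset_run_id: "run-uuid",      # placeholder run ID
+  config_id: "score-config-uuid"   # placeholder score config ID
+)
+```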
@@ -586,15 +591,9 @@ Immediately flush all queued scores to API.
 **Signature:**
 
 ```ruby
-flush_scores(timeout: 30)
+flush_scores
 ```
 
-**Parameters:**
-
-| Parameter | Type | Required | Default | Description |
-| --------- | ------- | -------- | ------- | ----------------------- |
-| `timeout` | Integer | No | `30` | Flush timeout (seconds) |
-
 **Example:**
 
 ```ruby
@@ -615,6 +614,247 @@ Langfuse.flush_scores
 See [SCORING.md](SCORING.md) for complete guide.
 
+## Datasets
+
+### `Client#create_dataset`
+
+Create a new dataset.
+
+**Signature:**
+
+```ruby
+create_dataset(name:, description: nil, metadata: nil)
+```
+
+**Parameters:**
+
+| Parameter | Type | Required | Description |
+| ------------- | ------ | -------- | -------------------------- |
+| `name` | String | Yes | Dataset name |
+| `description` | String | No | Human-readable description |
+| `metadata` | Hash | No | Arbitrary key-value pairs |
+
+**Returns:** `DatasetClient`
+
+**Example:**
+
+```ruby
+dataset = client.create_dataset(
+  name: "qa-eval",
+  description: "QA evaluation set",
+  metadata: { domain: "support" }
+)
+```
+
+### `Client#get_dataset`
+
+Fetch a dataset by name.
+
+**Signature:**
+
+```ruby
+get_dataset(name) # => DatasetClient
+```
+
+**Parameters:**
+
+| Parameter | Type | Required | Description |
+| --------- | ------ | -------- | -------------------------------------------------------- |
+| `name` | String | Yes | Dataset name (supports folder paths like "eval/qa-set") |
+
+**Returns:** `DatasetClient`
+
+**Raises:** `NotFoundError` if the dataset doesn't exist
+
+### `Client#list_datasets`
+
+List all datasets in the project.
+
+**Signature:**
+
+```ruby
+list_datasets(page: nil, limit: nil)
+```
+
+**Parameters:**
+
+| Parameter | Type | Required | Description |
+| --------- | ------- | -------- | ---------------- |
+| `page` | Integer | No | Page number |
+| `limit` | Integer | No | Results per page |
+
+**Returns:** `Array` of dataset metadata
+
+### `Client#create_dataset_item`
+
+Create a new dataset item.
+
+**Signature:**
+
+```ruby
+create_dataset_item(dataset_name:, input: nil, expected_output: nil,
+                    metadata: nil, id: nil, source_trace_id: nil,
+                    source_observation_id: nil, status: nil)
+```
+
+**Parameters:**
+
+| Parameter | Type | Required | Description |
+| ----------------------- | ------ | -------- | ---------------------------------------- |
+| `dataset_name` | String | Yes | Parent dataset name |
+| `input` | Object | No | Input data |
+| `expected_output` | Object | No | Expected output for evaluation |
+| `metadata` | Hash | No | Arbitrary metadata |
+| `id` | String | No | Explicit ID (enables upsert) |
+| `source_trace_id` | String | No | Link to source trace |
+| `source_observation_id` | String | No | Link to source observation |
+| `status` | Symbol | No | `:active` or `:archived` |
+
+**Returns:** `DatasetItemClient`
+
+**Example:**
+
+```ruby
+item = client.create_dataset_item(
+  dataset_name: "qa-eval",
+  input: { question: "What is Ruby?" },
+  expected_output: { answer: "A programming language" }
+)
+```
+
+### `Client#get_dataset_item`
+
+Fetch a dataset item by ID.
+
+**Signature:**
+
+```ruby
+get_dataset_item(id) # => DatasetItemClient
+```
+
+**Raises:** `NotFoundError` if the item doesn't exist
+
+### `Client#list_dataset_items`
+
+List items in a dataset. Auto-paginates when `page` is nil.
+
+**Signature:**
+
+```ruby
+list_dataset_items(dataset_name:, page: nil, limit: nil,
+                   source_trace_id: nil, source_observation_id: nil)
+```
+
+**Parameters:**
+
+| Parameter | Type | Required | Description |
+| ----------------------- | ------- | -------- | ---------------------------------------- |
+| `dataset_name` | String | Yes | Dataset name |
+| `page` | Integer | No | Page number (nil = fetch all pages) |
+| `limit` | Integer | No | Results per page |
+| `source_trace_id` | String | No | Filter by source trace |
+| `source_observation_id` | String | No | Filter by source observation |
+
+**Returns:** `Array`
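+
+**Example** (a sketch; the dataset name and trace ID are placeholders):
+
+```ruby
+# nil page => fetch every page
+all_items = client.list_dataset_items(dataset_name: "qa-eval")
+
+# only items created from a specific trace
+from_trace = client.list_dataset_items(
+  dataset_name: "qa-eval",
+  source_trace_id: "trace-abc"
+)
+```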
+
+### `Client#delete_dataset_item`
+
+Delete a dataset item by ID. Idempotent (404 treated as success).
+
+**Signature:**
+
+```ruby
+delete_dataset_item(id) # => nil
+```
+
+### `Client#create_dataset_run_item`
+
+Link a trace to a dataset item within a named run.
+
+**Signature:**
+
+```ruby
+create_dataset_run_item(dataset_item_id:, run_name:, trace_id: nil,
+                        observation_id: nil, metadata: nil, run_description: nil)
+```
+
+**Parameters:**
+
+| Parameter | Type | Required | Description |
+| ----------------- | ------ | -------- | ------------------- |
+| `dataset_item_id` | String | Yes | Dataset item ID |
+| `run_name` | String | Yes | Run name |
+| `trace_id` | String | No | Trace ID |
+| `observation_id` | String | No | Observation ID |
+| `metadata` | Hash | No | Optional metadata |
+| `run_description` | String | No | Run description |
+
+**Returns:** `Hash` (created dataset run item data)
+
+See [DATASETS.md](DATASETS.md) for complete guide.
+
+## Experiments
+
+### `Client#run_experiment`
+
+Run an experiment against a named dataset or local data.
+
+**Signature:**
+
+```ruby
+run_experiment(name:, task:, data: nil, dataset_name: nil, description: nil,
+               evaluators: [], run_evaluators: [], metadata: nil, run_name: nil)
+```
+
+**Parameters:**
+
+| Parameter | Type | Required | Description |
+| ---------------- | ------------- | -------- | ----------------------------------------------- |
+| `name` | String | Yes | Experiment name |
+| `task` | Proc | Yes | Callable receiving item, returning output |
+| `dataset_name` | String | No* | Dataset to run against |
+| `data` | Array | No* | Local data items (hashes or DatasetItemClients) |
+| `description` | String | No | Run description |
+| `evaluators` | Array\<Proc\> | No | Item-level evaluators |
+| `run_evaluators` | Array\<Proc\> | No | Run-level evaluators |
+| `metadata` | Hash | No | Metadata attached to each trace |
+| `run_name` | String | No | Explicit run name (default: "name - timestamp") |
+
+\* Provide exactly one of `dataset_name` or `data`.
+
+**Returns:** `ExperimentResult`
+
+**Raises:** `ArgumentError` if both or neither of `data`/`dataset_name` provided
+
+**Example:**
+
+```ruby
+result = client.run_experiment(
+  name: "qa-v1",
+  dataset_name: "qa-eval",
+  task: ->(item) { my_llm_call(item.input) },
+  evaluators: [my_evaluator],
+  metadata: { model: "gpt-4o" }
+)
+```
+
+### `DatasetClient#run_experiment`
+
+Run an experiment against this dataset's items.
+
+**Signature:**
+
+```ruby
+dataset.run_experiment(name:, task:, description: nil, evaluators: [],
+                       run_evaluators: [], metadata: nil, run_name: nil)
+```
+
+Same parameters as `Client#run_experiment` minus `dataset_name` and `data`.
+
+**Returns:** `ExperimentResult`
+
+See [EXPERIMENTS.md](EXPERIMENTS.md) for complete guide.
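+
+**Example** (a sketch of the dataset-scoped variant; `my_llm_call` stands in for
+your own task):
+
+```ruby
+dataset = client.get_dataset("qa-eval")
+
+result = dataset.run_experiment(
+  name: "qa-v2",
+  task: ->(item) { my_llm_call(item.input) },
+  metadata: { model: "gpt-4o" }
+)
+
+puts result.format
+```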
+
 ## Attribute Propagation
 
 ### `Langfuse.propagate_attributes`
@@ -769,19 +1009,58 @@ See [ERROR_HANDLING.md](ERROR_HANDLING.md) for complete guide.
 
 ### `Client#trace_url`
 
-Generate Langfuse UI URL for a trace.
+Generate a project-scoped Langfuse UI URL for a trace.
 
 **Signature:**
 
 ```ruby
-trace_url(trace_id) # => String
+trace_url(trace_id) # => String | nil
 ```
 
 **Example:**
 
 ```ruby
 url = client.trace_url("abc123")
-# => "https://cloud.langfuse.com/traces/abc123"
+# => "https://cloud.langfuse.com/project/{project_id}/traces/abc123"
+```
+
+Returns `nil` if the project ID cannot be fetched.
+
+### `Client#dataset_url`
+
+Generate a project-scoped Langfuse UI URL for a dataset.
+
+**Signature:**
+
+```ruby
+dataset_url(dataset_id) # => String | nil
+```
+
+**Example:**
+
+```ruby
+url = client.dataset_url("dataset-uuid")
+# => "https://cloud.langfuse.com/project/{project_id}/datasets/dataset-uuid"
+```
+
+### `Client#dataset_run_url`
+
+Generate a project-scoped Langfuse UI URL for a dataset run.
+
+**Signature:**
+
+```ruby
+dataset_run_url(dataset_id:, dataset_run_id:) # => String | nil
+```
+
+**Example:**
+
+```ruby
+url = client.dataset_run_url(
+  dataset_id: "dataset-uuid",
+  dataset_run_id: "run-uuid"
+)
+# => "https://cloud.langfuse.com/project/{project_id}/datasets/dataset-uuid/runs/run-uuid"
 ```
 
 ### `Langfuse.shutdown`
@@ -830,4 +1109,6 @@ Langfuse.force_flush(timeout: 10)
 - [PROMPTS.md](PROMPTS.md) - Prompt management
 - [TRACING.md](TRACING.md) - Tracing patterns
 - [SCORING.md](SCORING.md) - Scoring guide
+- [DATASETS.md](DATASETS.md) - Dataset management
+- [EXPERIMENTS.md](EXPERIMENTS.md) - Experiment runner
 - [ERROR_HANDLING.md](ERROR_HANDLING.md) - Exception handling
diff --git a/docs/CONFIGURATION.md b/docs/CONFIGURATION.md
index 027950c..5091ae7 100644
--- a/docs/CONFIGURATION.md
+++ b/docs/CONFIGURATION.md
@@ -138,7 +138,7 @@ When enabled, serves stale cached data immediately while refreshing in the backg
 - ✅ Works with `:memory` backend
 - ✅ Works with `:rails` backend
-- Automatically sets `cache_stale_ttl` to `cache_ttl` if not customized
+- Set `cache_stale_ttl` to control how long stale data is served (e.g., same as `cache_ttl`)
 
 See [CACHING.md](CACHING.md#stale-while-revalidate-swr) for detailed usage.
@@ -166,8 +166,7 @@ config.cache_stale_ttl = :indefinite # Never expire (normalized to 1000 years i
 - `2x cache_ttl`: More tolerance for API slowdowns
 - `:indefinite`: Maximum performance, eventual consistency, high availability
 
-**Auto-configuration:**
-When `cache_stale_while_revalidate = true` and `cache_stale_ttl` is not set (still `0`), it automatically defaults to `cache_ttl`.
+**Important:** When enabling SWR, you should also set `cache_stale_ttl` to a positive value (e.g., same as `cache_ttl`); otherwise stale data expires immediately after the TTL.
 
 See [CACHING.md](CACHING.md#stale-while-revalidate-swr) for examples.
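+
+A minimal sketch of enabling SWR with the options above (the values are illustrative):
+
+```ruby
+Langfuse.configure do |config|
+  config.cache_ttl = 300                       # fresh for 5 minutes
+  config.cache_stale_while_revalidate = true
+  config.cache_stale_ttl = 300                 # then serve stale for up to 5 more
+end
+```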
@@ -256,13 +255,19 @@ config.job_queue = :langfuse # Placeholder - no effect currently
 
 ## Environment Variables
 
-The SDK does not automatically read environment variables. You must explicitly pass them in configuration:
+The SDK automatically reads these environment variables as defaults when no explicit value is configured:
+
+- `LANGFUSE_PUBLIC_KEY` — public API key
+- `LANGFUSE_SECRET_KEY` — secret API key
+- `LANGFUSE_BASE_URL` — API endpoint (defaults to `https://cloud.langfuse.com`)
+
+Explicit configuration always takes precedence:
 
 ```ruby
 Langfuse.configure do |config|
-  config.public_key = ENV['LANGFUSE_PUBLIC_KEY']
-  config.secret_key = ENV['LANGFUSE_SECRET_KEY']
-  config.base_url = ENV['LANGFUSE_BASE_URL'] || 'https://cloud.langfuse.com'
+  config.public_key = ENV['LANGFUSE_PUBLIC_KEY'] # redundant, already auto-read
+  config.secret_key = ENV['LANGFUSE_SECRET_KEY'] # redundant, already auto-read
+  config.base_url = "https://custom.langfuse.com" # overrides env var
 end
 ```
diff --git a/docs/DATASETS.md b/docs/DATASETS.md
new file mode 100644
index 0000000..12dc05d
--- /dev/null
+++ b/docs/DATASETS.md
@@ -0,0 +1,200 @@
+# Datasets
+
+Curated test sets for evaluating LLM pipelines. Datasets let you define input/expected-output pairs and systematically compare runs.
+
+## Overview
+
+A **dataset** is a named collection of **items**, each containing an input and optionally an expected output. You can create dataset items manually or from production traces, then run experiments against the dataset to measure how your LLM pipeline performs.
+
+Key objects:
+
+- `DatasetClient` — wraps a dataset with its items and metadata
+- `DatasetItemClient` — wraps a single item (input, expected output, status)
+
+## Creating Datasets
+
+```ruby
+client = Langfuse.client
+
+dataset = client.create_dataset(
+  name: "qa-eval",
+  description: "Question-answering evaluation set",
+  metadata: { domain: "support", version: 1 }
+)
+```
+
+| Parameter | Type | Required | Description |
+| ------------- | ------ | -------- | -------------------------- |
+| `name` | String | Yes | Dataset name |
+| `description` | String | No | Human-readable description |
+| `metadata` | Hash | No | Arbitrary key-value pairs |
+
+**Returns:** `DatasetClient`
+
+## Fetching Datasets
+
+```ruby
+dataset = client.get_dataset("qa-eval")
+
+dataset.name        # => "qa-eval"
+dataset.description # => "Question-answering evaluation set"
+dataset.metadata    # => { "domain" => "support", "version" => 1 }
+dataset.id          # => "clx..."
+dataset.url         # => "https://cloud.langfuse.com/project/{pid}/datasets/clx..."
+dataset.created_at  # => Time
+dataset.updated_at  # => Time
+```
+
+Folder paths are supported: `client.get_dataset("evaluation/qa-dataset")`.
+
+## Listing Datasets
+
+```ruby
+# First page (default)
+datasets = client.list_datasets
+
+# With pagination
+datasets = client.list_datasets(page: 2, limit: 10)
+```
+
+| Parameter | Type | Required | Default | Description |
+| --------- | ------- | -------- | ------- | ---------------- |
+| `page` | Integer | No | - | Page number |
+| `limit` | Integer | No | - | Results per page |
+
+**Returns:** `Array` of dataset metadata
+
+## Creating Items
+
+```ruby
+item = client.create_dataset_item(
+  dataset_name: "qa-eval",
+  input: { question: "What is Ruby?" },
+  expected_output: { answer: "A programming language" },
+  metadata: { difficulty: "easy" }
+)
+```
+
+| Parameter | Type | Required | Description |
+| ----------------------- | ------ | -------- | ------------------------------------------ |
+| `dataset_name` | String | Yes | Parent dataset name |
+| `input` | Object | No | Input data |
+| `expected_output` | Object | No | Expected output for evaluation |
+| `metadata` | Hash | No | Arbitrary metadata |
+| `id` | String | No | Explicit ID (enables upsert behavior) |
+| `source_trace_id` | String | No | Trace that produced this item |
+| `source_observation_id` | String | No | Observation that produced this item |
+| `status` | Symbol | No | `:active` or `:archived` |
+
+**Returns:** `DatasetItemClient`
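+
+**Example** (a sketch of promoting a production trace to a dataset item; the IDs
+are placeholders):
+
+```ruby
+item = client.create_dataset_item(
+  dataset_name: "qa-eval",
+  input: { question: "What is Ruby?" },
+  source_trace_id: "trace-abc",       # placeholder trace ID
+  source_observation_id: "obs-123",   # placeholder observation ID
+  status: :active
+)
+```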
+
+### DatasetItemClient Properties
+
+| Property | Type | Description |
+| ----------------------- | ----------- | ------------------------------ |
+| `id` | String | Unique identifier |
+| `dataset_id` | String | Parent dataset ID |
+| `input` | Object | Input data |
+| `expected_output` | Object | Expected output |
+| `metadata` | Hash | Key-value metadata |
+| `source_trace_id` | String, nil | Linked source trace |
+| `source_observation_id` | String, nil | Linked source observation |
+| `status` | String | `"ACTIVE"` or `"ARCHIVED"` |
+| `created_at` | Time, nil | Creation timestamp |
+| `updated_at` | Time, nil | Last updated timestamp |
+
+Convenience methods:
+
+```ruby
+item.active?   # => true
+item.archived? # => false
+```
+
+## Fetching Items
+
+```ruby
+# By ID
+item = client.get_dataset_item("item-uuid-123")
+
+# List all items (auto-paginates)
+items = client.list_dataset_items(dataset_name: "qa-eval")
+
+# Single page
+items = client.list_dataset_items(dataset_name: "qa-eval", page: 1, limit: 20)
+
+# Filter by source
+items = client.list_dataset_items(
+  dataset_name: "qa-eval",
+  source_trace_id: "trace-abc"
+)
+```
+
+| Parameter | Type | Required | Description |
+| ----------------------- | ------- | -------- | ---------------------------------------- |
+| `dataset_name` | String | Yes | Dataset name |
+| `page` | Integer | No | Page number (nil = fetch all pages) |
+| `limit` | Integer | No | Results per page |
+| `source_trace_id` | String | No | Filter by source trace |
+| `source_observation_id` | String | No | Filter by source observation |
+
+**Returns:** `Array`
+
+You can also access items through the dataset directly:
+
+```ruby
+dataset = client.get_dataset("qa-eval")
+dataset.items # => Array (lazy-loaded)
+```
+
+## Deleting Items
+
+```ruby
+client.delete_dataset_item("item-uuid-123")
+```
+
+Idempotent — 404 is treated as success.
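+
+For instance, a sketch of pruning archived items by combining the calls above:
+
+```ruby
+# Delete every archived item in the dataset ("qa-eval" is a placeholder).
+client.list_dataset_items(dataset_name: "qa-eval").each do |item|
+  client.delete_dataset_item(item.id) if item.archived?
+end
+```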
+
+## Linking Items to Traces
+
+### Manual Linking
+
+Link a dataset item to a trace after running your pipeline:
+
+```ruby
+item.link(
+  trace_id: "abc123",
+  run_name: "qa-v2",
+  observation_id: "obs456",       # optional
+  metadata: { model: "gpt-4o" },  # optional
+  run_description: "GPT-4o run"   # optional
+)
+```
+
+### Traced Execution with `item.run`
+
+Execute a block within a traced context that automatically links to the dataset item:
+
+```ruby
+item = client.get_dataset_item("item-uuid-123")
+
+output = item.run(run_name: "qa-v2") do |span|
+  # span is a traced observation — update it as needed
+  result = my_llm_call(item.input)
+  span.update(output: result)
+  result
+end
+```
+
+| Parameter | Type | Required | Description |
+| ----------------- | ------ | -------- | ---------------------------- |
+| `run_name` | String | Yes | Run name for grouping |
+| `run_description` | String | No | Run description |
+| `run_metadata` | Hash | No | Metadata for the trace |
+
+The block receives a traced span. On completion (or error), the trace is flushed and the item is linked automatically.
+
+## See Also
+
+- [EXPERIMENTS.md](EXPERIMENTS.md) - Run systematic evaluations against datasets
+- [SCORING.md](SCORING.md) - Score traces and observations
+- [API_REFERENCE.md](API_REFERENCE.md) - Complete method reference
diff --git a/docs/EXPERIMENTS.md b/docs/EXPERIMENTS.md
new file mode 100644
index 0000000..10d9c81
--- /dev/null
+++ b/docs/EXPERIMENTS.md
@@ -0,0 +1,259 @@
+# Experiments
+
+Systematic evaluation of tasks against datasets. Run your LLM pipeline over a set of inputs, score the outputs, and compare runs in the Langfuse UI.
+
+## Quick Start
+
+```ruby
+client = Langfuse.client
+
+result = client.run_experiment(
+  name: "qa-v1",
+  dataset_name: "qa-eval",
+  task: ->(item) { my_llm_call(item.input) }
+)
+
+puts result.format
+puts "#{result.successes.size} passed, #{result.failures.size} failed"
+puts result.dataset_run_url # => link to Langfuse UI
+```
+
+## Entry Points
+
+### From Client (fetches dataset automatically)
+
+```ruby
+result = client.run_experiment(
+  name: "qa-v1",
+  dataset_name: "qa-eval",
+  task: ->(item) { my_llm_call(item.input) },
+  evaluators: [accuracy_evaluator],
+  metadata: { model: "gpt-4o" }
+)
+```
+
+### From DatasetClient (uses existing items)
+
+```ruby
+dataset = client.get_dataset("qa-eval")
+
+result = dataset.run_experiment(
+  name: "qa-v1",
+  task: ->(item) { my_llm_call(item.input) },
+  evaluators: [accuracy_evaluator]
+)
+```
+
+### Local Data Mode
+
+Run experiments without a server-side dataset:
+
+```ruby
+result = client.run_experiment(
+  name: "qa-local",
+  data: [
+    { input: "What is Ruby?", expected_output: "A programming language" },
+    { input: "What is Python?", expected_output: "A programming language" }
+  ],
+  task: ->(item) { my_llm_call(item.input) }
+)
+```
+
+Each hash is wrapped into an `ExperimentItem` struct with `input`, `expected_output`, and `metadata` fields. Both symbol and string keys are accepted.
+
+## Parameters
+
+### `Client#run_experiment`
+
+| Parameter | Type | Required | Description |
+| ---------------- | ------------- | -------- | ----------------------------------------------- |
+| `name` | String | Yes | Experiment name |
+| `task` | Proc | Yes | Callable receiving item, returning output |
+| `dataset_name` | String | No* | Dataset to run against |
+| `data` | Array | No* | Local data items (hashes or DatasetItemClients) |
+| `description` | String | No | Run description |
+| `evaluators` | Array\<Proc\> | No | Item-level evaluators |
+| `run_evaluators` | Array\<Proc\> | No | Run-level evaluators |
+| `metadata` | Hash | No | Metadata attached to each trace |
+| `run_name` | String | No | Explicit run name (default: "name - timestamp") |
+
+\* Provide exactly one of `dataset_name` or `data`.
+
+### `DatasetClient#run_experiment`
+
+Same parameters minus `dataset_name` and `data` (items come from the dataset).
+
+## Writing Evaluators
+
+### Item-Level Evaluators
+
+An evaluator is any callable (Proc, lambda, method) that receives keyword arguments and returns an `Evaluation`, an Array of them, or a Hash:
+
+```ruby
+accuracy = ->(input:, output:, expected_output:, item:, **) {
+  score = output.to_s.downcase.include?(expected_output.to_s.downcase) ? 1.0 : 0.0
+  Langfuse::Evaluation.new(name: "accuracy", value: score)
+}
+
+result = client.run_experiment(
+  name: "qa-v1",
+  dataset_name: "qa-eval",
+  task: ->(item) { my_llm_call(item.input) },
+  evaluators: [accuracy]
+)
+```
+
+**Evaluator keyword arguments:**
+
+| Keyword | Type | Description |
+| ----------------- | ---------------------------------- | --------------------------- |
+| `input` | Object | The item's input |
+| `output` | Object | The task's return value |
+| `expected_output` | Object | The item's expected output |
+| `item` | DatasetItemClient / ExperimentItem | The original item |
+| `metadata` | Hash (optional) | Item metadata (only passed if evaluator accepts it) |
+
+**Return types** (see the sketch below):
+
+- `Evaluation` — single score
+- `Array` — multiple scores
+- `Hash` — converted to `Evaluation` (keys: `name`, `value`, `comment`, `data_type`)
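+
+A sketch of the `Hash` and multi-score forms (the names and thresholds are illustrative):
+
+```ruby
+# Hash form - converted to an Evaluation internally
+brevity = ->(output:, **) {
+  { name: "brevity", value: output.to_s.length < 200 ? 1.0 : 0.0 }
+}
+
+# Array form - several scores from one evaluator
+checks = ->(output:, **) {
+  [
+    Langfuse::Evaluation.new(name: "non_empty", value: !output.to_s.empty?, data_type: :boolean),
+    Langfuse::Evaluation.new(name: "output_length", value: output.to_s.length)
+  ]
+}
+```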
+
+### Evaluation Value Types
+
+```ruby
+# Numeric (default)
+Langfuse::Evaluation.new(name: "relevance", value: 0.85)
+
+# Boolean
+Langfuse::Evaluation.new(name: "is_correct", value: true, data_type: :boolean)
+
+# Categorical
+Langfuse::Evaluation.new(name: "quality_tier", value: "high", data_type: :categorical)
+
+# With comment and metadata
+Langfuse::Evaluation.new(
+  name: "relevance",
+  value: 0.85,
+  comment: "Mostly relevant, minor tangent",
+  metadata: { model: "gpt-4o" }
+)
+```
+
+### Run-Level Evaluators
+
+Run-level evaluators receive all item results at once, for aggregate metrics:
+
+```ruby
+avg_length = ->(item_results:) {
+  lengths = item_results.select(&:success?).map { |r| r.output.to_s.length }
+  avg = lengths.sum.to_f / lengths.size
+  Langfuse::Evaluation.new(name: "avg_output_length", value: avg)
+}
+
+result = client.run_experiment(
+  name: "qa-v1",
+  dataset_name: "qa-eval",
+  task: ->(item) { my_llm_call(item.input) },
+  run_evaluators: [avg_length]
+)
+```
+
+Run-level evaluators receive `item_results:` — an `Array<ItemResult>`.
+
+## Result Objects
+
+### ExperimentResult
+
+Returned by `run_experiment`.
+
+| Property | Type | Description |
+| ----------------- | ------------------- | ------------------------------------------ |
+| `name` | String | Experiment name |
+| `run_name` | String, nil | Auto-generated run name (name + timestamp) |
+| `description` | String, nil | Run description |
+| `item_results` | Array\<ItemResult\> | All per-item results |
+| `run_evaluations` | Array\<Evaluation\> | Run-level evaluation results |
+| `dataset_run_id` | String, nil | Dataset run ID from the server |
+| `dataset_run_url` | String, nil | URL to the run in Langfuse UI |
+
+**Methods:**
+
+```ruby
+result.successes # => Array (no errors)
+result.failures  # => Array (had errors)
+result.format    # => summary string
+result.format(include_item_results: true) # => detailed per-item report
+```
+
+### ItemResult
+
+One per item processed.
+
+| Property | Type | Description |
+| ---------------- | ---------------------------------- | -------------------------- |
+| `item` | DatasetItemClient / ExperimentItem | Original input item |
+| `output` | Object, nil | Task output (nil on error) |
+| `trace_id` | String, nil | Trace ID |
+| `observation_id` | String, nil | Observation/span ID |
+| `evaluations` | Array\<Evaluation\> | Item-level scores |
+| `error` | StandardError, nil | Error if task failed |
+
+**Methods:**
+
+```ruby
+result.success? # => true if no error
+result.failed?  # => true if error present
+```
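+
+A sketch of post-run inspection built from the accessors above:
+
+```ruby
+# Log each failed item, then print the detailed report.
+result.failures.each do |item_result|
+  warn "failed: #{item_result.error.message} (trace: #{item_result.trace_id})"
+end
+
+puts result.format(include_item_results: true)
+```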
+
+## End-to-End Example
+
+```ruby
+client = Langfuse.client
+
+# 1. Create a dataset
+dataset = client.create_dataset(name: "support-qa")
+
+# 2. Add items
+[
+  { q: "How do I reset my password?", a: "Go to Settings > Security > Reset Password" },
+  { q: "What are your hours?", a: "Monday-Friday, 9am-5pm EST" }
+].each do |pair|
+  client.create_dataset_item(
+    dataset_name: "support-qa",
+    input: { question: pair[:q] },
+    expected_output: { answer: pair[:a] }
+  )
+end
+
+# 3. Define evaluators
+accuracy = ->(input:, output:, expected_output:, item:, **) {
+  match = output.to_s.downcase.include?(expected_output[:answer].to_s.downcase)
+  Langfuse::Evaluation.new(name: "contains_answer", value: match, data_type: :boolean)
+}
+
+pass_rate = ->(item_results:) {
+  passed = item_results.count(&:success?)
+  Langfuse::Evaluation.new(name: "pass_rate", value: passed.to_f / item_results.size)
+}
+
+# 4. Run experiment
+result = client.run_experiment(
+  name: "support-bot-v1",
+  dataset_name: "support-qa",
+  task: ->(item) { generate_support_response(item.input[:question]) },
+  evaluators: [accuracy],
+  run_evaluators: [pass_rate],
+  metadata: { model: "gpt-4o", temperature: 0.3 }
+)
+
+# 5. Inspect results
+puts result.format(include_item_results: true)
+puts "URL: #{result.dataset_run_url}"
+```
+
+## See Also
+
+- [DATASETS.md](DATASETS.md) - Dataset CRUD operations
+- [SCORING.md](SCORING.md) - Scoring guide
+- [API_REFERENCE.md](API_REFERENCE.md) - Complete method reference
diff --git a/docs/GETTING_STARTED.md b/docs/GETTING_STARTED.md
index 1fd8ef6..c10d9e2 100644
--- a/docs/GETTING_STARTED.md
+++ b/docs/GETTING_STARTED.md
@@ -307,6 +307,8 @@ See [ERROR_HANDLING.md](ERROR_HANDLING.md) for complete error reference.
 - **[PROMPTS.md](PROMPTS.md)** - Chat prompts, versioning, Mustache templating
 - **[TRACING.md](TRACING.md)** - Nested observations, RAG patterns, OpenTelemetry
 - **[SCORING.md](SCORING.md)** - Add quality scores to traces
+- **[DATASETS.md](DATASETS.md)** - Create and manage evaluation datasets
+- **[EXPERIMENTS.md](EXPERIMENTS.md)** - Run systematic evaluations with the experiment runner
 - **[CACHING.md](CACHING.md)** - Optimize performance with caching
 - **[RAILS.md](RAILS.md)** - Rails-specific patterns and testing
 - **[CONFIGURATION.md](CONFIGURATION.md)** - All configuration options