3 changes: 3 additions & 0 deletions sample-apps/databricks/.CI_BYPASS
@@ -0,0 +1,3 @@
# Databricks is a cloud-hosted SaaS service that cannot be spun up locally.
# This runbook provides manual setup instructions for existing Databricks workspaces.
# Configuration validation tests can still run in CI.
260 changes: 260 additions & 0 deletions sample-apps/databricks/README.md
@@ -0,0 +1,260 @@
# Databricks Monitoring Runbook

This runbook guides you through setting up monitoring for Databricks using Grafana Alloy. Unlike automated sample apps, this requires manual configuration in your existing Databricks workspace.

## Prerequisites

Before you begin, ensure you have the following:

- A Databricks workspace with Unity Catalog enabled
- Administrative access to create Service Principals
- A SQL Warehouse (serverless is recommended for cost efficiency)
- Grafana Alloy installed on a host that can reach Databricks APIs
- Grafana Cloud credentials (or any Prometheus-compatible endpoint)

## Quick Start

To get started with this runbook, follow these steps:

1. **Clone the repository**:
   ```sh
   git clone https://github.com/grafana/integration-sample-apps.git
   cd integration-sample-apps/sample-apps/databricks
   ```
1. **Configure Databricks** (follow the Databricks Configuration section below)
1. **Configure Alloy**:
   - Copy `configs/alloy-simple.alloy` to your Alloy config directory
   - Update it with your Databricks credentials and workspace details
   - Restart the Alloy service (see the command sketch after this list)
1. **Verify metrics**:
   - Query `databricks_up` in your Prometheus instance
   - Check Alloy logs for successful scrapes
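For a host where Alloy runs as a systemd service, steps 3 and 4 might look like the sketch below. The config path `/etc/alloy/config.alloy` and the unit name `alloy` are assumptions based on the default Linux package install; adjust them for your environment.

```bash
# Copy the simple config into place (path assumes the default Linux package layout)
sudo cp configs/alloy-simple.alloy /etc/alloy/config.alloy

# Restart and confirm the service is healthy
sudo systemctl restart alloy
sudo systemctl status alloy --no-pager
```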

## Databricks Configuration

### Step 1: Get your workspace hostname

1. Copy your workspace hostname from the browser address bar (the URL without `https://`), for example, `dbc-abc123-def456.cloud.databricks.com`.

### Step 2: Create or configure SQL Warehouse

1. Go to **SQL Warehouses** in the sidebar.
1. Either select an existing warehouse or click **Create SQL warehouse**:
- **Size**: 2X-Small (minimum size to reduce costs)
- **Auto stop**: After 10 minutes of inactivity
- **Scaling**: Min 1, Max 1 cluster
1. Click **Create**, then go to the **Connection Details** tab.
1. Copy the **HTTP path**, for example, `/sql/1.0/warehouses/abc123def456`.

### Step 3: Create a Service Principal

1. Click your workspace name (top-right) and select **Manage Account**.
1. Go to **User Management** > **Service Principals** tab > **Add service principal**.
1. Enter a name, for example, `grafana-cloud-integration`.
1. Go to **Credentials & secrets** tab > **OAuth secrets** > **Generate secret**.
1. Select the maximum lifetime (730 days) and click **Generate**.
1. Copy the **Client ID** and **Client Secret**. You will need both for the Alloy configuration.

### Step 4: Assign the Service Principal to your workspace

1. Go to **Workspaces** in the sidebar and select your workspace.
1. Go to the **Permissions** tab and click **Add permissions**.
1. Search for the Service Principal and assign it the **Admin** permission.

### Step 5: Grant SQL permissions to the Service Principal

As a metastore admin or a user with the MANAGE privilege, run the following SQL statements in the SQL editor:

```sql
GRANT USE CATALOG ON CATALOG system TO `<your-service-principal-client-id>`;
GRANT USE SCHEMA ON SCHEMA system.billing TO `<your-service-principal-client-id>`;
GRANT SELECT ON SCHEMA system.billing TO `<your-service-principal-client-id>`;
GRANT USE SCHEMA ON SCHEMA system.query TO `<your-service-principal-client-id>`;
GRANT SELECT ON SCHEMA system.query TO `<your-service-principal-client-id>`;
GRANT USE SCHEMA ON SCHEMA system.lakeflow TO `<your-service-principal-client-id>`;
GRANT SELECT ON SCHEMA system.lakeflow TO `<your-service-principal-client-id>`;
```

Replace `<your-service-principal-client-id>` with your Service Principal's Client ID.
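To confirm the grants took effect, you can run a quick check in the same editor. The `SHOW GRANTS` statements below are standard Unity Catalog SQL; the `system.billing.usage` table and its `usage_date` column are standard System Tables, but verify the names against your workspace.

```sql
-- Confirm the Service Principal appears in the grants on each schema
SHOW GRANTS ON SCHEMA system.billing;
SHOW GRANTS ON SCHEMA system.query;
SHOW GRANTS ON SCHEMA system.lakeflow;

-- Sanity-check that System Tables contain recent data for the exporter to read
SELECT count(*) AS usage_rows
FROM system.billing.usage
WHERE usage_date >= date_sub(current_date(), 1);
```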

Refer to the [Databricks documentation](https://docs.databricks.com/en/dev-tools/auth/oauth-m2m.html) for detailed OAuth2 M2M setup instructions.

## Alloy Configuration

### Simple Configuration

See [`configs/alloy-simple.alloy`](configs/alloy-simple.alloy) for a basic setup that collects all default metrics with recommended settings.

### Advanced Configuration

See [`configs/alloy-advanced.alloy`](configs/alloy-advanced.alloy) for a configuration with all optional parameters, tuning options, and metric filtering examples.

### Environment Variables

Store sensitive credentials as environment variables:

```bash
export DATABRICKS_CLIENT_ID="your-application-id"
export DATABRICKS_CLIENT_SECRET="your-client-secret"
export PROMETHEUS_URL="https://prometheus-prod-us-central1.grafana.net/api/prom/push"
export PROMETHEUS_USER="your-prometheus-username"
export PROMETHEUS_PASS="your-prometheus-password"
```
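Note that shell exports only reach Alloy if it inherits your shell's environment. If Alloy runs as a systemd service, one way to pass these values is a drop-in override; the sketch below assumes the unit is named `alloy`.

```bash
# Open an override file for the alloy unit (unit name assumed)
sudo systemctl edit alloy

# In the editor, add the following under [Service], then save and exit:
#   [Service]
#   Environment="DATABRICKS_CLIENT_ID=your-application-id"
#   Environment="DATABRICKS_CLIENT_SECRET=your-client-secret"
#   Environment="PROMETHEUS_URL=https://prometheus-prod-us-central1.grafana.net/api/prom/push"
#   Environment="PROMETHEUS_USER=your-prometheus-username"
#   Environment="PROMETHEUS_PASS=your-prometheus-password"

# Apply the change
sudo systemctl restart alloy
```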

### Configuration Options

| Parameter | Default | Description |
|-----------|---------|-------------|
| `server_hostname` | Required | Databricks workspace hostname (e.g., `dbc-abc123.cloud.databricks.com`) |
| `warehouse_http_path` | Required | SQL Warehouse HTTP path (e.g., `/sql/1.0/warehouses/xyz`) |
| `client_id` | Required | OAuth2 Application ID of your Service Principal |
| `client_secret` | Required | OAuth2 Client Secret |
| `query_timeout` | `5m` | Timeout for individual SQL queries |
| `billing_lookback` | `24h` | How far back to query billing data |
| `jobs_lookback` | `3h` | How far back to query job runs |
| `pipelines_lookback` | `3h` | How far back to query pipeline runs |
| `queries_lookback` | `2h` | How far back to query SQL warehouse queries |
| `sla_threshold_seconds` | `3600` | Duration threshold for job SLA miss detection |
| `collect_task_retries` | `false` | Collect task-level retry metrics (⚠️ high cardinality) |

### Tuning Recommendations

- **`scrape_interval`**: Use 10-30 minutes. The exporter queries Databricks System Tables which can be slow and costly. Increase the interval to reduce SQL Warehouse usage.
- **`scrape_timeout`**: Must be less than `scrape_interval`. Typical scrapes take 90-120 seconds depending on data volume.
- **Lookback windows**: Should be at least 2x the scrape interval to ensure data continuity between scrapes. The defaults (`3h` for jobs and pipelines, `2h` for queries) work well with 10-30 minute scrape intervals.
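For example, with a 30-minute scrape interval the exporter block from `alloy-simple.alloy` could be adjusted as in the excerpt below. The values are illustrative, not defaults, and the `prometheus.remote_write` component is assumed to exist as shown in the configs.

```alloy
prometheus.exporter.databricks "example" {
  server_hostname     = "dbc-abc123-def456.cloud.databricks.com"
  warehouse_http_path = "/sql/1.0/warehouses/abc123def456"
  client_id           = env("DATABRICKS_CLIENT_ID")
  client_secret       = env("DATABRICKS_CLIENT_SECRET")

  // Lookbacks raised to stay at least 2x the 30-minute scrape interval below
  jobs_lookback      = "1h"
  pipelines_lookback = "1h"
  queries_lookback   = "1h"
}

prometheus.scrape "databricks" {
  targets         = prometheus.exporter.databricks.example.targets
  forward_to      = [prometheus.remote_write.grafana_cloud.receiver]
  scrape_interval = "30m"
  scrape_timeout  = "5m" // comfortably under scrape_interval; typical scrapes take 90-120s
}
```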

## Validating Metrics

### Check Alloy Status

```bash
# Check Alloy service status
systemctl status alloy

# View Alloy logs
journalctl -u alloy -f

# Check metrics endpoint
curl http://localhost:12345/metrics | grep databricks
```

### Verify in Prometheus

Query for the health metric:

```promql
databricks_up{job="databricks"}
```

This query should return `1` if the exporter is healthy.
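If you want to alert on exporter health, a starting expression could be the following; the `job` label value mirrors the query above, so adjust it if your scrape job is named differently.

```promql
databricks_up{job="databricks"} == 0 or absent(databricks_up{job="databricks"})
```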

### Check Key Metrics

```promql
# Billing metrics
databricks_billing_dbus_total

# Job metrics
databricks_job_runs_total

# Query metrics
databricks_queries_total

# Exporter up/down
databricks_up
```
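As a sketch of how these might be sliced further, the queries below assume label names such as `sku` and `result_state`; these are assumptions, so check the labels on your actual series first.

```promql
# DBU consumption broken down by SKU (label name `sku` is an assumption)
sum by (sku) (databricks_billing_dbus_total)

# Job runs broken down by result state (label name `result_state` is an assumption)
sum by (result_state) (databricks_job_run_status_total)
```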

## Metrics Collected

The exporter collects 18 metrics across five categories:

### Billing Metrics
- `databricks_billing_dbus_total` - Daily DBU consumption per workspace and SKU
- `databricks_billing_cost_estimate_usd` - Estimated cost in USD
- `databricks_price_change_events_total` - Count of price changes per SKU

### Job Metrics
- `databricks_job_runs_total` - Total job runs
- `databricks_job_run_status_total` - Job run counts by result state
- `databricks_job_run_duration_seconds` - Job duration quantiles (p50, p95, p99)
- `databricks_task_retries_total` - Task retry counts (optional, high cardinality)
- `databricks_job_sla_miss_total` - Jobs exceeding SLA threshold

### Pipeline Metrics
- `databricks_pipeline_runs_total` - Total pipeline runs
- `databricks_pipeline_run_status_total` - Pipeline runs by result state
- `databricks_pipeline_run_duration_seconds` - Pipeline duration quantiles
- `databricks_pipeline_retry_events_total` - Pipeline retry counts
- `databricks_pipeline_freshness_lag_seconds` - Data freshness lag

### SQL Query Metrics
- `databricks_queries_total` - Total SQL queries executed
- `databricks_query_errors_total` - Failed query count
- `databricks_query_duration_seconds` - Query duration quantiles
- `databricks_queries_running` - Estimated concurrent queries

### System Metrics
- `databricks_up` - Exporter health (1 = healthy, 0 = unhealthy)

## Troubleshooting

### Common Issues

#### Authentication Errors (401)
**Symptom**: Alloy logs show `401 Unauthorized`

**Solution**:
- Verify Client ID and Client Secret are correct
- Ensure the Service Principal exists and hasn't expired (check OAuth secret lifetime)
- Verify the Service Principal has workspace Admin permission

#### No Metrics Appearing
**Symptom**: `databricks_up` returns no data or returns `0`

**Solution**:
- Check that the SQL Warehouse is running (or configured to auto-start)
- Verify the Service Principal has all required SQL permissions (re-run GRANT statements)
- Check Alloy logs for SQL query errors
- Verify network connectivity to `<your-workspace>.cloud.databricks.com`
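A quick reachability check from the Alloy host (replace the placeholder with your workspace hostname):

```bash
# Any HTTP status code in the response confirms basic network reachability
curl -sS -o /dev/null -w "%{http_code}\n" https://<your-workspace>.cloud.databricks.com
```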

#### SQL Permission Errors
**Symptom**: Alloy logs show `PERMISSION_DENIED` or `TABLE_OR_VIEW_NOT_FOUND`

**Solution**:
- Re-run the GRANT SQL statements as a metastore admin
- Verify Unity Catalog is enabled in your workspace
- Check that System Tables are enabled (they should be by default with Unity Catalog)

#### Connection Timeouts
**Symptom**: Queries take longer than `scrape_timeout`

**Solution**:
- Increase `scrape_timeout` (but keep it less than `scrape_interval`)
- Reduce lookback windows to query less data
- Use a larger SQL Warehouse size if queries are consistently slow
- Consider increasing `scrape_interval` to 20-30 minutes

#### High Cardinality Warning
**Symptom**: Too many time series, high storage costs

**Solution**:
- Disable `collect_task_retries` if enabled (this adds `task_key` label)
- Review metric cardinality with `databricks_*` queries in Prometheus (see the example query after this list)
- Consider metric relabeling to drop high-cardinality labels (see `alloy-advanced.alloy` for examples)
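For the cardinality review mentioned above, a query along these lines shows which metric names carry the most series:

```promql
# Active series per databricks metric name, largest first
sort_desc(count by (__name__) ({__name__=~"databricks_.*"}))
```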

## Additional Resources

- [Databricks OAuth2 M2M Documentation](https://docs.databricks.com/en/dev-tools/auth/oauth-m2m.html)
- [Databricks System Tables Documentation](https://docs.databricks.com/en/admin/system-tables/index.html)
- [Grafana Alloy Documentation](https://grafana.com/docs/alloy/latest/)
- [Databricks Exporter GitHub](https://github.com/grafana/databricks-prometheus-exporter)
- [Integration Documentation](https://grafana.com/docs/grafana-cloud/monitor-infrastructure/integrations/integration-reference/integration-databricks/)

## Platform Support

This runbook is platform-agnostic. Grafana Alloy can be installed on:
- Linux (systemd service)
- Docker (container)
- Kubernetes (Helm chart or operator)

Refer to the [Alloy installation documentation](https://grafana.com/docs/alloy/latest/get-started/install/) for your platform.
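For example, a container-based run might look like the sketch below; the `grafana/alloy` image and `run` subcommand follow the Alloy documentation, but pin a specific version and adjust flags for your setup.

```bash
docker run --rm \
  -e DATABRICKS_CLIENT_ID -e DATABRICKS_CLIENT_SECRET \
  -e PROMETHEUS_URL -e PROMETHEUS_USER -e PROMETHEUS_PASS \
  -v "$(pwd)/configs/alloy-simple.alloy:/etc/alloy/config.alloy" \
  -p 12345:12345 \
  grafana/alloy:latest run --server.http.listen-addr=0.0.0.0:12345 /etc/alloy/config.alloy
```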
83 changes: 83 additions & 0 deletions sample-apps/databricks/configs/alloy-advanced.alloy
@@ -0,0 +1,83 @@
// Advanced Databricks monitoring configuration for Grafana Alloy
//
// This configuration includes all optional parameters and tuning options.
// Use this as a reference for customizing your setup.
//
// Prerequisites:
// - Databricks workspace with Unity Catalog and System Tables enabled
// - Service Principal with OAuth2 M2M authentication configured
// - SQL Warehouse for querying System Tables (serverless recommended for cost efficiency)
//
// Tuning recommendations:
// - Lookback windows should be at least 2x the scrape_interval to ensure data continuity
// - With a 10-minute scrape interval, use at least 20 minutes of lookback
// - Increase scrape_interval to 20-30 minutes to reduce SQL Warehouse costs
//
// Set environment variables before starting Alloy:
// export DATABRICKS_CLIENT_ID="<your-service-principal-client-id>"
// export DATABRICKS_CLIENT_SECRET="<your-service-principal-client-secret>"
// export PROMETHEUS_URL="https://prometheus-prod-us-central1.grafana.net/api/prom/push"
// export PROMETHEUS_USER="<your-prometheus-username>"
// export PROMETHEUS_PASS="<your-prometheus-password>"

prometheus.exporter.databricks "example" {
  // Required parameters
  server_hostname     = "dbc-abc123-def456.cloud.databricks.com" // Replace with your workspace hostname
  warehouse_http_path = "/sql/1.0/warehouses/abc123def456"       // Replace with your SQL Warehouse HTTP path
  client_id           = env("DATABRICKS_CLIENT_ID")
  client_secret       = env("DATABRICKS_CLIENT_SECRET")

  // Optional tuning parameters (all have defaults)
  query_timeout         = "5m"  // Timeout for individual SQL queries (default: 5m)
  billing_lookback      = "24h" // How far back to query billing data (default: 24h, Databricks billing has 24-48h lag)
  jobs_lookback         = "3h"  // How far back to query job runs (default: 3h)
  pipelines_lookback    = "3h"  // How far back to query pipeline runs (default: 3h)
  queries_lookback      = "2h"  // How far back to query SQL warehouse queries (default: 2h)
  sla_threshold_seconds = 3600  // Duration threshold in seconds for job SLA miss detection (default: 3600)
  collect_task_retries  = false // Collect task retry metrics (default: false) ⚠️ HIGH CARDINALITY: adds task_key label
}

// Configure a prometheus.scrape component to collect databricks metrics.
prometheus.scrape "databricks" {
  targets         = prometheus.exporter.databricks.example.targets
  forward_to      = [prometheus.remote_write.grafana_cloud.receiver]
  scrape_interval = "10m" // Recommended: 10-30 minutes (System Table queries can be slow and costly)
  scrape_timeout  = "9m"  // Must be < scrape_interval (typical scrapes take 90-120s)

  // Optional: Enable clustering for high availability
  clustering {
    enabled = true
  }
}

prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = env("PROMETHEUS_URL")

    basic_auth {
      username = env("PROMETHEUS_USER")
      password = env("PROMETHEUS_PASS")
    }
  }
}

// Optional: Add metric relabeling to reduce cardinality or filter metrics
// To use this, change the prometheus.scrape forward_to to:
// forward_to = [prometheus.relabel.databricks_metrics.receiver]
//
// prometheus.relabel "databricks_metrics" {
//   forward_to = [prometheus.remote_write.grafana_cloud.receiver]
//
//   // Example: Drop the high-cardinality task_key label if needed
//   // (labeldrop matches label names against regex; it does not use source_labels)
//   rule {
//     regex  = "task_key"
//     action = "labeldrop"
//   }
//
//   // Example: Keep only specific metrics
//   rule {
//     source_labels = ["__name__"]
//     regex         = "databricks_(up|billing_.*|job_run_status_total)"
//     action        = "keep"
//   }
// }
44 changes: 44 additions & 0 deletions sample-apps/databricks/configs/alloy-simple.alloy
@@ -0,0 +1,44 @@
// Simple Databricks monitoring configuration for Grafana Alloy
//
// This configuration:
// - Scrapes metrics from Databricks System Tables using the built-in exporter
// - Forwards metrics to Grafana Cloud (or any Prometheus-compatible endpoint)
// - Uses environment variables for sensitive credentials
//
// Prerequisites:
// - Databricks workspace with Unity Catalog and System Tables enabled
// - Service Principal with OAuth2 M2M authentication configured
// - SQL Warehouse for querying System Tables (serverless recommended for cost efficiency)
//
// Set environment variables before starting Alloy:
// export DATABRICKS_CLIENT_ID="<your-service-principal-client-id>"
// export DATABRICKS_CLIENT_SECRET="<your-service-principal-client-secret>"
// export PROMETHEUS_URL="https://prometheus-prod-us-central1.grafana.net/api/prom/push"
// export PROMETHEUS_USER="<your-prometheus-username>"
// export PROMETHEUS_PASS="<your-prometheus-password>"

prometheus.exporter.databricks "example" {
  server_hostname     = "dbc-abc123-def456.cloud.databricks.com" // Replace with your workspace hostname
  warehouse_http_path = "/sql/1.0/warehouses/abc123def456"       // Replace with your SQL Warehouse HTTP path
  client_id           = env("DATABRICKS_CLIENT_ID")
  client_secret       = env("DATABRICKS_CLIENT_SECRET")
}

// Configure a prometheus.scrape component to collect databricks metrics.
prometheus.scrape "databricks" {
  targets         = prometheus.exporter.databricks.example.targets
  forward_to      = [prometheus.remote_write.grafana_cloud.receiver]
  scrape_interval = "10m" // Recommended: 10-30 minutes
  scrape_timeout  = "9m"  // Must be < scrape_interval
}

prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = env("PROMETHEUS_URL")

    basic_auth {
      username = env("PROMETHEUS_USER")
      password = env("PROMETHEUS_PASS")
    }
  }
}