diff --git a/sample-apps/databricks/.CI_BYPASS b/sample-apps/databricks/.CI_BYPASS
new file mode 100644
index 0000000..8b3415c
--- /dev/null
+++ b/sample-apps/databricks/.CI_BYPASS
@@ -0,0 +1,3 @@
+# Databricks is a cloud-hosted SaaS service that cannot be spun up locally.
+# This runbook provides manual setup instructions for existing Databricks workspaces.
+# Configuration validation tests can still run in CI.
diff --git a/sample-apps/databricks/README.md b/sample-apps/databricks/README.md
new file mode 100644
index 0000000..6ed8d88
--- /dev/null
+++ b/sample-apps/databricks/README.md
@@ -0,0 +1,260 @@
+# Databricks Monitoring Runbook
+
+This runbook guides you through setting up monitoring for Databricks using Grafana Alloy. Unlike automated sample apps, this requires manual configuration in your existing Databricks workspace.
+
+## Prerequisites
+
+Before you begin, ensure you have the following:
+
+- A Databricks workspace with Unity Catalog enabled
+- Administrative access to create Service Principals
+- A SQL Warehouse (serverless is recommended for cost efficiency)
+- Grafana Alloy installed on a host that can reach Databricks APIs
+- Grafana Cloud credentials (or any Prometheus-compatible endpoint)
+
+## Quick Start
+
+To get started with this runbook, follow these steps:
+
+1. **Clone the repository**:
+    ```sh
+    git clone https://github.com/grafana/integration-sample-apps.git
+    cd sample-apps/databricks
+    ```
+1. **Configure Databricks** (follow the Databricks Configuration section below)
+1. **Configure Alloy**:
+    - Copy `configs/alloy-simple.alloy` to your Alloy config directory
+    - Update it with your Databricks credentials and workspace details
+    - Restart the Alloy service
+1. **Verify metrics**:
+    - Query `databricks_up` in your Prometheus instance
+    - Check Alloy logs for successful scrapes
+
+## Databricks Configuration
+
+### Step 1: Get your workspace hostname
+
+1. Copy your workspace URL hostname (without the `https://` prefix), for example, `dbc-abc123-def456.cloud.databricks.com`.
+
+### Step 2: Create or configure a SQL Warehouse
+
+1. Go to **SQL Warehouses** in the sidebar.
+1. Either select an existing warehouse or click **Create SQL warehouse**:
+    - **Size**: 2X-Small (minimum size to reduce costs)
+    - **Auto stop**: After 10 minutes of inactivity
+    - **Scaling**: Min 1, Max 1 cluster
+1. Click **Create**, then go to the **Connection Details** tab.
+1. Copy the **HTTP path**, for example, `/sql/1.0/warehouses/abc123def456`.
+
+### Step 3: Create a Service Principal
+
+1. Click your workspace name (top-right) and select **Manage Account**.
+1. Go to **User Management** > **Service Principals** tab > **Add service principal**.
+1. Enter a name, for example, `grafana-cloud-integration`.
+1. Go to the **Credentials & secrets** tab > **OAuth secrets** > **Generate secret**.
+1. Select the maximum lifetime (730 days) and click **Generate**.
+1. Copy the **Client ID** and **Client Secret**. You will need both for the Alloy configuration.
+
+### Step 4: Assign the Service Principal to your workspace
+
+1. Go to **Workspaces** in the sidebar and select your workspace.
+1. Go to the **Permissions** tab and click **Add permissions**.
+1. Search for the Service Principal and assign it the **Admin** permission.
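+
+Before moving on, you can optionally sanity-check the new credentials by requesting an OAuth token directly. This is a minimal sketch assuming the token endpoint described in the Databricks OAuth2 M2M documentation linked below and the environment variables defined later in this runbook; substitute your own workspace hostname.
+
+```bash
+# Request a short-lived OAuth token for the Service Principal.
+# Assumes DATABRICKS_CLIENT_ID and DATABRICKS_CLIENT_SECRET are exported.
+curl --silent --request POST \
+  --user "$DATABRICKS_CLIENT_ID:$DATABRICKS_CLIENT_SECRET" \
+  --data 'grant_type=client_credentials&scope=all-apis' \
+  "https://dbc-abc123-def456.cloud.databricks.com/oidc/v1/token"
+```
+
+A successful response contains an `access_token` field; an error here usually means the Client ID or Client Secret is incorrect.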
+
+### Step 5: Grant SQL permissions to the Service Principal
+
+As a metastore admin or a user with the MANAGE privilege, run the following SQL statements in a query editor:
+
+```sql
+GRANT USE CATALOG ON CATALOG system TO `<client-id>`;
+GRANT USE SCHEMA ON SCHEMA system.billing TO `<client-id>`;
+GRANT SELECT ON SCHEMA system.billing TO `<client-id>`;
+GRANT USE SCHEMA ON SCHEMA system.query TO `<client-id>`;
+GRANT SELECT ON SCHEMA system.query TO `<client-id>`;
+GRANT USE SCHEMA ON SCHEMA system.lakeflow TO `<client-id>`;
+GRANT SELECT ON SCHEMA system.lakeflow TO `<client-id>`;
+```
+
+Replace `<client-id>` with your Service Principal's Client ID.
+
+Refer to the [Databricks documentation](https://docs.databricks.com/en/dev-tools/auth/oauth-m2m.html) for detailed OAuth2 M2M setup instructions.
+
+## Alloy Configuration
+
+### Simple Configuration
+
+See [`configs/alloy-simple.alloy`](configs/alloy-simple.alloy) for a basic setup that collects all default metrics with recommended settings.
+
+### Advanced Configuration
+
+See [`configs/alloy-advanced.alloy`](configs/alloy-advanced.alloy) for a configuration with all optional parameters, tuning options, and metric filtering examples.
+
+### Environment Variables
+
+Store sensitive credentials as environment variables:
+
+```bash
+export DATABRICKS_CLIENT_ID="your-application-id"
+export DATABRICKS_CLIENT_SECRET="your-client-secret"
+export PROMETHEUS_URL="https://prometheus-prod-us-central1.grafana.net/api/prom/push"
+export PROMETHEUS_USER="your-prometheus-username"
+export PROMETHEUS_PASS="your-prometheus-password"
+```
+
+### Configuration Options
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `server_hostname` | Required | Databricks workspace hostname (e.g., `dbc-abc123.cloud.databricks.com`) |
+| `warehouse_http_path` | Required | SQL Warehouse HTTP path (e.g., `/sql/1.0/warehouses/xyz`) |
+| `client_id` | Required | OAuth2 Application ID of your Service Principal |
+| `client_secret` | Required | OAuth2 Client Secret |
+| `query_timeout` | `5m` | Timeout for individual SQL queries |
+| `billing_lookback` | `24h` | How far back to query billing data |
+| `jobs_lookback` | `3h` | How far back to query job runs |
+| `pipelines_lookback` | `3h` | How far back to query pipeline runs |
+| `queries_lookback` | `2h` | How far back to query SQL warehouse queries |
+| `sla_threshold_seconds` | `3600` | Duration threshold for job SLA miss detection |
+| `collect_task_retries` | `false` | Collect task-level retry metrics (⚠️ high cardinality) |
+
+### Tuning Recommendations
+
+- **`scrape_interval`**: Use 10-30 minutes. The exporter queries Databricks System Tables, which can be slow and costly. Increase the interval to reduce SQL Warehouse usage.
+- **`scrape_timeout`**: Must be less than `scrape_interval`. Typical scrapes take 90-120 seconds depending on data volume.
+- **Lookback windows**: Should be at least 2x the scrape interval to ensure data continuity between scrapes. The defaults (`3h` for jobs and pipelines, `2h` for queries) work well with 10-30 minute scrape intervals.
+
+## Validating Metrics
+
+### Check Alloy Status
+
+```bash
+# Check the Alloy service status
+systemctl status alloy
+
+# View Alloy logs
+journalctl -u alloy -f
+
+# Check the metrics endpoint
+curl http://localhost:12345/metrics | grep databricks
+```
+
+### Verify in Prometheus
+
+Query for the health metric:
+
+```promql
+databricks_up{job="databricks"}
+```
+
+This should return `1` if the exporter is healthy.
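+
+If you prefer to check from the command line, the same query can be run against any Prometheus-compatible HTTP query API. This is a sketch only: `PROMETHEUS_QUERY_URL`, `PROMETHEUS_USER`, and `PROMETHEUS_PASS` are placeholders for your own query endpoint and credentials (the query endpoint is not the same as the remote-write push URL used above).
+
+```bash
+# Instant query for the exporter health metric via the standard
+# Prometheus HTTP API (placeholder endpoint and credentials).
+curl --silent -G --user "$PROMETHEUS_USER:$PROMETHEUS_PASS" \
+  --data-urlencode 'query=databricks_up{job="databricks"}' \
+  "$PROMETHEUS_QUERY_URL/api/v1/query"
+```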
+
+### Check Key Metrics
+
+```promql
+# Billing metrics
+databricks_billing_dbus_total
+
+# Job metrics
+databricks_job_runs_total
+
+# Query metrics
+databricks_queries_total
+
+# Exporter up/down
+databricks_up
+```
+
+## Metrics Collected
+
+The exporter collects 18 metrics across the following categories:
+
+### Billing Metrics
+- `databricks_billing_dbus_total` - Daily DBU consumption per workspace and SKU
+- `databricks_billing_cost_estimate_usd` - Estimated cost in USD
+- `databricks_price_change_events_total` - Count of price changes per SKU
+
+### Job Metrics
+- `databricks_job_runs_total` - Total job runs
+- `databricks_job_run_status_total` - Job run counts by result state
+- `databricks_job_run_duration_seconds` - Job duration quantiles (p50, p95, p99)
+- `databricks_task_retries_total` - Task retry counts (optional, high cardinality)
+- `databricks_job_sla_miss_total` - Jobs exceeding the SLA threshold
+
+### Pipeline Metrics
+- `databricks_pipeline_runs_total` - Total pipeline runs
+- `databricks_pipeline_run_status_total` - Pipeline runs by result state
+- `databricks_pipeline_run_duration_seconds` - Pipeline duration quantiles
+- `databricks_pipeline_retry_events_total` - Pipeline retry counts
+- `databricks_pipeline_freshness_lag_seconds` - Data freshness lag
+
+### SQL Query Metrics
+- `databricks_queries_total` - Total SQL queries executed
+- `databricks_query_errors_total` - Failed query count
+- `databricks_query_duration_seconds` - Query duration quantiles
+- `databricks_queries_running` - Estimated concurrent queries
+
+### System Metrics
+- `databricks_up` - Exporter health (1 = healthy, 0 = unhealthy)
+
+## Troubleshooting
+
+### Common Issues
+
+#### Authentication Errors (401)
+**Symptom**: Alloy logs show `401 Unauthorized`
+
+**Solution**:
+- Verify the Client ID and Client Secret are correct
+- Ensure the Service Principal still exists and its OAuth secret hasn't expired (check the secret lifetime)
+- Verify the Service Principal has the workspace Admin permission
+
+#### No Metrics Appearing
+**Symptom**: `databricks_up` returns no data or returns `0`
+
+**Solution**:
+- Check that the SQL Warehouse is running (or configured to auto-start)
+- Verify the Service Principal has all required SQL permissions (re-run the GRANT statements)
+- Check Alloy logs for SQL query errors
+- Verify network connectivity to your workspace hostname (for example, `dbc-abc123-def456.cloud.databricks.com`)
+
+#### SQL Permission Errors
+**Symptom**: Alloy logs show `PERMISSION_DENIED` or `TABLE_OR_VIEW_NOT_FOUND`
+
+**Solution**:
+- Re-run the GRANT SQL statements as a metastore admin
+- Verify Unity Catalog is enabled in your workspace
+- Check that System Tables are enabled (they should be by default with Unity Catalog)
+
+#### Connection Timeouts
+**Symptom**: Queries take longer than `scrape_timeout`
+
+**Solution**:
+- Increase `scrape_timeout` (but keep it less than `scrape_interval`)
+- Reduce the lookback windows to query less data
+- Use a larger SQL Warehouse size if queries are consistently slow
+- Consider increasing `scrape_interval` to 20-30 minutes
+
+#### High Cardinality Warning
+**Symptom**: Too many time series, high storage costs
+
+**Solution**:
+- Disable `collect_task_retries` if enabled (this adds the `task_key` label)
+- Review metric cardinality with `databricks_*` queries in Prometheus (see the example after this list)
+- Consider metric relabeling to drop high-cardinality labels (see `alloy-advanced.alloy` for examples)
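+
+The cardinality review can be done with a single instant query that counts active series per metric name. This is a sketch using the same placeholder query endpoint and credentials as in the Validating Metrics section.
+
+```bash
+# Count active series per databricks_* metric name, largest first
+# (placeholder endpoint and credentials).
+curl --silent -G --user "$PROMETHEUS_USER:$PROMETHEUS_PASS" \
+  --data-urlencode 'query=sort_desc(count by (__name__) ({__name__=~"databricks_.*"}))' \
+  "$PROMETHEUS_QUERY_URL/api/v1/query"
+```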
+
+## Additional Resources
+
+- [Databricks OAuth2 M2M Documentation](https://docs.databricks.com/en/dev-tools/auth/oauth-m2m.html)
+- [Databricks System Tables Documentation](https://docs.databricks.com/en/admin/system-tables/index.html)
+- [Grafana Alloy Documentation](https://grafana.com/docs/alloy/latest/)
+- [Databricks Exporter GitHub](https://github.com/grafana/databricks-prometheus-exporter)
+- [Integration Documentation](https://grafana.com/docs/grafana-cloud/monitor-infrastructure/integrations/integration-reference/integration-databricks/)
+
+## Platform Support
+
+This runbook is platform-agnostic. Grafana Alloy can be installed on:
+- Linux (systemd service)
+- Docker (container)
+- Kubernetes (Helm chart or operator)
+
+Refer to the [Alloy installation documentation](https://grafana.com/docs/alloy/latest/get-started/install/) for your platform.
diff --git a/sample-apps/databricks/configs/alloy-advanced.alloy b/sample-apps/databricks/configs/alloy-advanced.alloy
new file mode 100644
index 0000000..497c37f
--- /dev/null
+++ b/sample-apps/databricks/configs/alloy-advanced.alloy
@@ -0,0 +1,83 @@
+// Advanced Databricks monitoring configuration for Grafana Alloy
+//
+// This configuration includes all optional parameters and tuning options.
+// Use this as a reference for customizing your setup.
+//
+// Prerequisites:
+// - Databricks workspace with Unity Catalog and System Tables enabled
+// - Service Principal with OAuth2 M2M authentication configured
+// - SQL Warehouse for querying System Tables (serverless recommended for cost efficiency)
+//
+// Tuning recommendations:
+// - Lookback windows should be at least 2x the scrape_interval to ensure data continuity
+// - With a 10-minute scrape interval, use at least 20 minutes of lookback
+// - Increase scrape_interval to 20-30 minutes to reduce SQL Warehouse costs
+//
+// Set environment variables before starting Alloy:
+//   export DATABRICKS_CLIENT_ID="<client-id>"
+//   export DATABRICKS_CLIENT_SECRET="<client-secret>"
+//   export PROMETHEUS_URL="https://prometheus-prod-us-central1.grafana.net/api/prom/push"
+//   export PROMETHEUS_USER="<prometheus-username>"
+//   export PROMETHEUS_PASS="<prometheus-password>"
+
+prometheus.exporter.databricks "example" {
+  // Required parameters
+  server_hostname     = "dbc-abc123-def456.cloud.databricks.com" // Replace with your workspace hostname
+  warehouse_http_path = "/sql/1.0/warehouses/abc123def456"       // Replace with your SQL Warehouse HTTP path
+  client_id           = env("DATABRICKS_CLIENT_ID")
+  client_secret       = env("DATABRICKS_CLIENT_SECRET")
+
+  // Optional tuning parameters (all have defaults)
+  query_timeout         = "5m"  // Timeout for individual SQL queries (default: 5m)
+  billing_lookback      = "24h" // How far back to query billing data (default: 24h; Databricks billing has a 24-48h lag)
+  jobs_lookback         = "3h"  // How far back to query job runs (default: 3h)
+  pipelines_lookback    = "3h"  // How far back to query pipeline runs (default: 3h)
+  queries_lookback      = "2h"  // How far back to query SQL warehouse queries (default: 2h)
+  sla_threshold_seconds = 3600  // Duration threshold in seconds for job SLA miss detection (default: 3600)
+  collect_task_retries  = false // Collect task retry metrics (default: false) ⚠️ HIGH CARDINALITY: adds the task_key label
+}
+
+// Configure a prometheus.scrape component to collect databricks metrics.
+prometheus.scrape "databricks" {
+  targets         = prometheus.exporter.databricks.example.targets
+  forward_to      = [prometheus.remote_write.grafana_cloud.receiver]
+  scrape_interval = "10m" // Recommended: 10-30 minutes (System Table queries can be slow and costly)
+  scrape_timeout  = "9m"  // Must be < scrape_interval (typical scrapes take 90-120s)
+
+  // Optional: Enable clustering for high availability
+  clustering {
+    enabled = true
+  }
+}
+
+prometheus.remote_write "grafana_cloud" {
+  endpoint {
+    url = env("PROMETHEUS_URL")
+
+    basic_auth {
+      username = env("PROMETHEUS_USER")
+      password = env("PROMETHEUS_PASS")
+    }
+  }
+}
+
+// Optional: Add metric relabeling to reduce cardinality or filter metrics.
+// To use this, change the prometheus.scrape forward_to to:
+//   forward_to = [prometheus.relabel.databricks_metrics.receiver]
+//
+// prometheus.relabel "databricks_metrics" {
+//   forward_to = [prometheus.remote_write.grafana_cloud.receiver]
+//
+//   // Example: Drop the high-cardinality task_key label if needed
+//   // (labeldrop matches label names against regex, so no source_labels are used)
+//   rule {
+//     regex  = "task_key"
+//     action = "labeldrop"
+//   }
+//
+//   // Example: Keep only specific metrics
+//   rule {
+//     source_labels = ["__name__"]
+//     regex         = "databricks_(up|billing_.*|job_run_status_total)"
+//     action        = "keep"
+//   }
+// }
diff --git a/sample-apps/databricks/configs/alloy-simple.alloy b/sample-apps/databricks/configs/alloy-simple.alloy
new file mode 100644
index 0000000..3ee85eb
--- /dev/null
+++ b/sample-apps/databricks/configs/alloy-simple.alloy
@@ -0,0 +1,44 @@
+// Simple Databricks monitoring configuration for Grafana Alloy
+//
+// This configuration:
+// - Scrapes metrics from Databricks System Tables using the built-in exporter
+// - Forwards metrics to Grafana Cloud (or any Prometheus-compatible endpoint)
+// - Uses environment variables for sensitive credentials
+//
+// Prerequisites:
+// - Databricks workspace with Unity Catalog and System Tables enabled
+// - Service Principal with OAuth2 M2M authentication configured
+// - SQL Warehouse for querying System Tables (serverless recommended for cost efficiency)
+//
+// Set environment variables before starting Alloy:
+//   export DATABRICKS_CLIENT_ID="<client-id>"
+//   export DATABRICKS_CLIENT_SECRET="<client-secret>"
+//   export PROMETHEUS_URL="https://prometheus-prod-us-central1.grafana.net/api/prom/push"
+//   export PROMETHEUS_USER="<prometheus-username>"
+//   export PROMETHEUS_PASS="<prometheus-password>"
+
+prometheus.exporter.databricks "example" {
+  server_hostname     = "dbc-abc123-def456.cloud.databricks.com" // Replace with your workspace hostname
+  warehouse_http_path = "/sql/1.0/warehouses/abc123def456"       // Replace with your SQL Warehouse HTTP path
+  client_id           = env("DATABRICKS_CLIENT_ID")
+  client_secret       = env("DATABRICKS_CLIENT_SECRET")
+}
+
+// Configure a prometheus.scrape component to collect databricks metrics.
+prometheus.scrape "databricks" {
+  targets         = prometheus.exporter.databricks.example.targets
+  forward_to      = [prometheus.remote_write.grafana_cloud.receiver]
+  scrape_interval = "10m" // Recommended: 10-30 minutes
+  scrape_timeout  = "9m"  // Must be < scrape_interval
+}
+
+prometheus.remote_write "grafana_cloud" {
+  endpoint {
+    url = env("PROMETHEUS_URL")
+
+    basic_auth {
+      username = env("PROMETHEUS_USER")
+      password = env("PROMETHEUS_PASS")
+    }
+  }
+}