Healthcheck and stats for monitoring #181

@ThomDietrich

Hey @djmaze and all,

not sure if you remember me. We worked on a few nice little improvements together some years ago.
Since then, I've been a happy user of your image. One constant problem I've had is a lack of visibility. Backups could be stalled by an issue for months before I eventually noticed. This is of course partially my fault, but it is also the reason why the observability industry is thriving :)

I would like to discuss how this image could provide users with actionable and informative data on the activities of their backup jobs. Specifically,

  • A counter for consecutive errors (to delay warnings and notifications beyond a single hiccup)
  • An indicator for permanent errors (if possible)
  • Timestamp for the last successful sync (for downtime detection and notification)
  • Performance stats (because 📊🤩)
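
To make the list above a bit more concrete, here is a rough Python sketch of the state I have in mind. It is only an illustration: `BackupState`, `record_run`, `should_alert`, and the threshold of 3 are names and values I made up, nothing like this exists in the image today, and the "permanent error" indicator is left out because I am not sure how to detect it.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackupState:
    """Hypothetical per-job state; these fields are assumptions, not part of the image."""
    consecutive_errors: int = 0
    last_success_ts: Optional[float] = None  # Unix timestamp of the last successful run
    last_duration_s: Optional[float] = None  # a very simple performance stat


def record_run(state: BackupState, success: bool, duration_s: float,
               now: Optional[float] = None) -> BackupState:
    """Update the state after a single backup run."""
    now = time.time() if now is None else now
    state.last_duration_s = duration_s
    if success:
        # A success resets the error streak and moves the timestamp forward.
        state.consecutive_errors = 0
        state.last_success_ts = now
    else:
        state.consecutive_errors += 1
    return state


def should_alert(state: BackupState, threshold: int = 3) -> bool:
    """Warn only once errors repeat, so a single hiccup stays quiet."""
    return state.consecutive_errors >= threshold
```

Downtime detection would then be a matter of comparing `last_success_ts` against the current time in whatever monitoring tool is used.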

How does that sound?

In #171 (comment) you mentioned your solution to some of these points: Healthchecks.io. The service looks good but does not (I believe) solve all of the above. Also, everyone tends to use different tools (I'd like to connect Grafana); luckily, most of them cater to the same needs.

Long story short, I propose to

  1. Generate a string of stats after each backup run. This string could be in any of the common formats, like JSON, Prometheus, or ..., or even be translated into several of them
  2. Provide the stats string as an env variable to POST_COMMANDS_SUCCESS etc., e.g. for healthchecks.io, telegraf, or prometheus (push principle)
  3. Provide the stats string via an HTTP endpoint (pull principle)
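
As a rough illustration of the three steps, here is a minimal Python sketch. Everything in it is an assumption on my side: the `BACKUP_STATS` variable name, the `backup_` metric prefix, and all function names are hypothetical, and the actual image may run its hooks quite differently. The Prometheus output follows the plain text exposition format (`# TYPE` line plus `name value`).

```python
import json
import os
import subprocess
from http.server import BaseHTTPRequestHandler


def to_json(stats: dict) -> str:
    """Step 1a: render the stats as a JSON string."""
    return json.dumps(stats, sort_keys=True)


def to_prometheus(stats: dict, prefix: str = "backup_") -> str:
    """Step 1b: render numeric stats in the Prometheus text format."""
    lines = []
    for name, value in sorted(stats.items()):
        if value is None:
            continue  # no sample yet, e.g. no successful run so far
        metric = prefix + name
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"


def run_hook(command: str, stats: dict) -> int:
    """Step 2 (push): run a hook command with the stats in an env variable.

    BACKUP_STATS is a made-up name used only for this sketch.
    """
    env = dict(os.environ, BACKUP_STATS=to_json(stats))
    return subprocess.run(command, shell=True, env=env).returncode


def make_handler(get_stats):
    """Step 3 (pull): build an HTTP handler that serves the current stats."""
    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = to_prometheus(get_stats()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
    return MetricsHandler
```

A hook such as `curl --data "$BACKUP_STATS" <some URL>` could then forward the JSON to any service, and `http.server.HTTPServer(("", port), make_handler(...))` would expose the same data for scraping.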

What do you think? Cheers!
