diff --git a/README.md b/README.md index ec5adbf7ac..b58a186a74 100644 --- a/README.md +++ b/README.md @@ -76,6 +76,8 @@ On some systems, the `timex` collector requires an additional Docker flag, There is varying support for collectors on each operating system. The tables below list all existing collectors and the supported systems. +For detailed per-collector documentation including metrics, labels, and configuration flags, see [docs/collectors/](./docs/collectors/). + Collectors are enabled by providing a `--collector.` flag. Collectors that are enabled by default can be disabled by providing a `--no-collector.` flag. To enable only some specific collector(s), use `--collector.disable-defaults --collector. ...`. diff --git a/docs/collectors/README.md b/docs/collectors/README.md new file mode 100644 index 0000000000..18c0ae4de8 --- /dev/null +++ b/docs/collectors/README.md @@ -0,0 +1,30 @@ +# Collector Documentation + +Per-collector metric documentation. Each file documents one collector. + +## Available Documentation + +- [cpu](cpu.md) - CPU time statistics and metadata +- [cpufreq](cpufreq.md) - CPU frequency scaling statistics +- [diskstats](diskstats.md) - Disk I/O statistics +- [filesystem](filesystem.md) - Filesystem space and inode statistics +- [hwmon](hwmon.md) - Hardware monitoring sensors +- [meminfo](meminfo.md) - Memory statistics +- [netdev](netdev.md) - Network interface statistics +- [netstat](netstat.md) - Network protocol statistics +- [stat](stat.md) - Kernel/system statistics + +## Structure + +See [_TEMPLATE.md](_TEMPLATE.md) for the documentation template. + +## Naming + +Files are named `.md` matching the collector registration name (e.g., `cpu.md`, `filesystem.md`). + +## Contributing + +When adding or modifying a collector: +1. Update or create the corresponding documentation file +2. Ensure all metrics are listed with correct types and labels +3. 
Document any configuration flags diff --git a/docs/collectors/_TEMPLATE.md b/docs/collectors/_TEMPLATE.md new file mode 100644 index 0000000000..668f97d98e --- /dev/null +++ b/docs/collectors/_TEMPLATE.md @@ -0,0 +1,58 @@ +# collector_name + +Brief description of what this collector exposes. + +Status: enabled|disabled by default + +## Platforms + +- Linux +- Darwin +- FreeBSD +- ... + +## Configuration + +``` +--collector.name.flag-name Description (default: value) +--collector.name.other-flag Description (default: value) +``` + +Omit this section if the collector has no flags. + +## Data Sources + +| Source | Description | +|--------|-------------| +| `/proc/example` | Brief description | +| `/sys/class/example` | Brief description | +| `syscall(2)` | Brief description | + +## Metrics + +| Metric | Type | Labels | Description | +|--------|------|--------|-------------| +| `node_example_total` | counter | `label1`, `label2` | Description | +| `node_example_bytes` | gauge | | Description | +| `node_example_info` | gauge | `key`, `value` | Info metric, always 1 | + +For collectors with dynamic metrics (e.g., meminfo), use: + +Metrics are derived from `/proc/meminfo`. Each field `FieldName` becomes `node_memory_fieldname_bytes`. + +## Labels + +| Label | Description | +|-------|-------------| +| `device` | Device name | +| `mountpoint` | Mount path | + +Omit this section if metrics have no labels or labels are self-explanatory. + +## Notes + +- Special behaviors, caveats, kernel version requirements +- Known issues or limitations +- Related collectors + +Omit this section if not applicable. diff --git a/docs/collectors/cpu.md b/docs/collectors/cpu.md new file mode 100644 index 0000000000..4989d1421c --- /dev/null +++ b/docs/collectors/cpu.md @@ -0,0 +1,70 @@ +# cpu + +Exposes CPU time statistics from `/proc/stat` and CPU metadata from `/proc/cpuinfo` and sysfs. 
+ +Status: enabled by default + +## Platforms + +- Linux +- Darwin +- Dragonfly +- FreeBSD +- NetBSD +- OpenBSD +- Solaris +- AIX + +## Configuration + +``` +--collector.cpu.guest Enable node_cpu_guest_seconds_total metric (default: true) +--collector.cpu.info Enable node_cpu_info metric (default: false) +--collector.cpu.info.flags-include Regex filter for CPU flags to include in node_cpu_flag_info +--collector.cpu.info.bugs-include Regex filter for CPU bugs to include in node_cpu_bug_info +``` + +Setting `--collector.cpu.info.flags-include` or `--collector.cpu.info.bugs-include` implicitly enables `--collector.cpu.info`. + +## Data Sources + +| Source | Description | +|--------|-------------| +| `/proc/stat` | CPU time counters per core and mode | +| `/proc/cpuinfo` | CPU metadata (vendor, model, flags, bugs) | +| `/sys/devices/system/cpu/cpu*/topology/` | Physical package and core IDs | +| `/sys/devices/system/cpu/cpu*/thermal_throttle/` | Thermal throttling counters | +| `/sys/devices/system/cpu/cpu*/online` | CPU online status | +| `/sys/devices/system/cpu/isolated` | Isolated CPUs list | + +## Metrics + +| Metric | Type | Labels | Description | +|--------|------|--------|-------------| +| `node_cpu_seconds_total` | counter | `cpu`, `mode` | Seconds the CPUs spent in each mode | +| `node_cpu_guest_seconds_total` | counter | `cpu`, `mode` | Seconds the CPUs spent in guest (VM) mode | +| `node_cpu_info` | gauge | `package`, `core`, `cpu`, `vendor`, `family`, `model`, `model_name`, `microcode`, `stepping`, `cachesize` | CPU metadata, always 1 | +| `node_cpu_frequency_hertz` | gauge | `package`, `core`, `cpu` | CPU frequency from /proc/cpuinfo (only when cpufreq collector disabled) | +| `node_cpu_flag_info` | gauge | `flag` | CPU flag presence from first core, always 1 | +| `node_cpu_bug_info` | gauge | `bug` | CPU bug presence from first core, always 1 | +| `node_cpu_core_throttles_total` | counter | `package`, `core` | Thermal throttle events per core | +| 
`node_cpu_package_throttles_total` | counter | `package` | Thermal throttle events per package | +| `node_cpu_isolated` | gauge | `cpu` | CPU isolation status (1 if isolated) | +| `node_cpu_online` | gauge | `cpu` | CPU online status (1 if online) | + +## Labels + +| Label | Description | +|-------|-------------| +| `cpu` | Logical CPU number (0-indexed) | +| `mode` | CPU time mode: `user`, `nice`, `system`, `idle`, `iowait`, `irq`, `softirq`, `steal` | +| `package` | Physical CPU package ID | +| `core` | Physical core ID within package | + +## Notes + +- `node_cpu_guest_seconds_total` values are also included in `node_cpu_seconds_total` (user and nice modes) +- Counter values may jump backwards on CPU hotplug events; the collector handles this by resetting stats when idle jumps back more than 3 seconds +- `node_cpu_flag_info` and `node_cpu_bug_info` are only exposed from the first CPU core +- `node_cpu_frequency_hertz` is only exposed when the `cpufreq` collector is disabled to avoid duplicate metrics +- Linux-specific metrics: throttle counters, isolated, online status diff --git a/docs/collectors/cpufreq.md b/docs/collectors/cpufreq.md new file mode 100644 index 0000000000..96a629aff4 --- /dev/null +++ b/docs/collectors/cpufreq.md @@ -0,0 +1,46 @@ +# cpufreq + +Exposes CPU frequency scaling statistics from sysfs. 
+ +Status: enabled by default + +## Platforms + +- Linux +- Solaris + +## Data Sources + +| Source | Description | +|--------|-------------| +| `/sys/devices/system/cpu/cpu*/cpufreq/` | Per-CPU frequency scaling data | + +Kernel documentation: +- https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt +- https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt + +## Metrics + +| Metric | Type | Labels | Description | +|--------|------|--------|-------------| +| `node_cpu_frequency_hertz` | gauge | `cpu` | Current CPU thread frequency in hertz | +| `node_cpu_frequency_min_hertz` | gauge | `cpu` | Minimum CPU thread frequency in hertz | +| `node_cpu_frequency_max_hertz` | gauge | `cpu` | Maximum CPU thread frequency in hertz | +| `node_cpu_scaling_frequency_hertz` | gauge | `cpu` | Current scaled CPU thread frequency in hertz | +| `node_cpu_scaling_frequency_min_hertz` | gauge | `cpu` | Minimum scaled CPU thread frequency in hertz | +| `node_cpu_scaling_frequency_max_hertz` | gauge | `cpu` | Maximum scaled CPU thread frequency in hertz | +| `node_cpu_scaling_governor` | gauge | `cpu`, `governor` | Current CPU frequency governor (1 if active, 0 otherwise) | + +## Labels + +| Label | Description | +|-------|-------------| +| `cpu` | CPU name from sysfs (e.g., `cpu0`) | +| `governor` | Frequency governor name (e.g., `performance`, `powersave`, `ondemand`) | + +## Notes + +- Sysfs values are in kHz; the collector converts to Hz +- Metrics without `scaling` in the name reflect hardware limits from cpuinfo files; `scaling_*` metrics reflect current governor policy limits +- `node_cpu_scaling_governor` emits one metric per available governor per CPU, with value 1 for the active governor +- When this collector is enabled, the `cpu` collector does not expose `node_cpu_frequency_hertz` to avoid duplication diff --git a/docs/collectors/diskstats.md b/docs/collectors/diskstats.md new file mode 100644 index 0000000000..9b07717490 --- /dev/null +++ 
b/docs/collectors/diskstats.md @@ -0,0 +1,115 @@ +# diskstats + +Exposes disk I/O statistics from `/proc/diskstats` and block device metadata from sysfs and udev. + +Status: enabled by default + +## Platforms + +- Linux +- Darwin +- OpenBSD +- AIX + +## Configuration + +``` +--collector.diskstats.device-include Regexp of devices to include (mutually exclusive with device-exclude) +--collector.diskstats.device-exclude Regexp of devices to exclude (default: ^(z?ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$) +``` + +### Examples + +Monitor only physical disks (exclude partitions, loop, ram): +``` +--collector.diskstats.device-exclude="^(z?ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$" +``` + +Monitor only NVMe devices: +``` +--collector.diskstats.device-include="^nvme[0-9]+n[0-9]+$" +``` + +Monitor only SCSI/SATA disks (sd*): +``` +--collector.diskstats.device-include="^sd[a-z]+$" +``` + +Exclude virtual and removable devices: +``` +--collector.diskstats.device-exclude="^(z?ram|loop|fd|sr|cd)[0-9]*$" +``` + +Include partitions for a specific disk: +``` +--collector.diskstats.device-include="^sda[0-9]*$" +``` + +## Data Sources + +| Source | Description | +|--------|-------------| +| `/proc/diskstats` | Disk I/O statistics | +| `/sys/block//` | Block device attributes | +| `/sys/block//queue/` | Block device queue stats | +| `/run/udev/data/b:` | Udev device properties | + +Kernel documentation: https://www.kernel.org/doc/Documentation/iostats.txt + +## Metrics + +### I/O Statistics + +| Metric | Type | Labels | Description | +|--------|------|--------|-------------| +| `node_disk_reads_completed_total` | counter | `device` | Total number of reads completed successfully | +| `node_disk_reads_merged_total` | counter | `device` | Total number of reads merged | +| `node_disk_read_bytes_total` | counter | `device` | Total number of bytes read successfully | +| `node_disk_read_time_seconds_total` | counter | `device` | Total seconds spent by all reads | +| 
`node_disk_writes_completed_total` | counter | `device` | Total number of writes completed successfully | +| `node_disk_writes_merged_total` | counter | `device` | Total number of writes merged | +| `node_disk_written_bytes_total` | counter | `device` | Total number of bytes written successfully | +| `node_disk_write_time_seconds_total` | counter | `device` | Total seconds spent by all writes | +| `node_disk_io_now` | gauge | `device` | Number of I/Os currently in progress | +| `node_disk_io_time_seconds_total` | counter | `device` | Total seconds spent doing I/Os | +| `node_disk_io_time_weighted_seconds_total` | counter | `device` | Weighted seconds spent doing I/Os | + +### Discard Statistics (Linux 4.18+) + +| Metric | Type | Labels | Description | +|--------|------|--------|-------------| +| `node_disk_discards_completed_total` | counter | `device` | Total number of discards completed successfully | +| `node_disk_discards_merged_total` | counter | `device` | Total number of discards merged | +| `node_disk_discarded_sectors_total` | counter | `device` | Total number of sectors discarded successfully | +| `node_disk_discard_time_seconds_total` | counter | `device` | Total seconds spent by all discards | + +### Flush Statistics (Linux 5.5+) + +| Metric | Type | Labels | Description | +|--------|------|--------|-------------| +| `node_disk_flush_requests_total` | counter | `device` | Total number of flush requests completed successfully | +| `node_disk_flush_requests_time_seconds_total` | counter | `device` | Total seconds spent by all flush requests | + +### Device Info + +| Metric | Type | Labels | Description | +|--------|------|--------|-------------| +| `node_disk_info` | gauge | `device`, `major`, `minor`, `path`, `wwn`, `model`, `serial`, `revision`, `rotational` | Block device info, always 1 | +| `node_disk_filesystem_info` | gauge | `device`, `type`, `usage`, `uuid`, `version` | Filesystem info from udev, always 1 | +| `node_disk_device_mapper_info` | 
gauge | `device`, `name`, `uuid`, `vg_name`, `lv_name`, `lv_layer` | Device mapper info, always 1 | + +### ATA Device Attributes + +| Metric | Type | Labels | Description | +|--------|------|--------|-------------| +| `node_disk_ata_write_cache` | gauge | `device` | ATA disk has a write cache (1 if true) | +| `node_disk_ata_write_cache_enabled` | gauge | `device` | ATA disk write cache is enabled (1 if true) | +| `node_disk_ata_rotation_rate_rpm` | gauge | `device` | ATA disk rotation rate in RPM (0 for SSDs) | + +## Notes + +- Sector sizes in `/proc/diskstats` are always 512 bytes regardless of actual device sector size +- Time values in the kernel are in milliseconds; the collector converts to seconds +- Udev info metrics require readable `/run/udev/data/` directory +- Discard and flush metrics availability depends on kernel version +- The default exclude pattern filters out partition devices and RAM/loop devices diff --git a/docs/collectors/filesystem.md b/docs/collectors/filesystem.md new file mode 100644 index 0000000000..6249b6a4bb --- /dev/null +++ b/docs/collectors/filesystem.md @@ -0,0 +1,71 @@ +# filesystem + +Exposes filesystem statistics including space usage and inode counts. + +Status: enabled by default + +## Platforms + +- Linux +- Darwin +- FreeBSD +- NetBSD +- OpenBSD +- Dragonfly +- AIX + +## Configuration + +``` +--collector.filesystem.mount-points-exclude Regexp of mount points to exclude (mutually exclusive to mount-points-include) +--collector.filesystem.mount-points-include Regexp of mount points to include (mutually exclusive to mount-points-exclude) +--collector.filesystem.fs-types-exclude Regexp of filesystem types to exclude (mutually exclusive to fs-types-include) +--collector.filesystem.fs-types-include Regexp of filesystem types to include (mutually exclusive to fs-types-exclude) +``` + +Default exclusions vary by platform. On Linux, virtual filesystems like `tmpfs`, `devtmpfs`, `sysfs`, `proc` are excluded by default. 
+ +## Data Sources + +| Source | Description | +|--------|-------------| +| `/proc/self/mounts` | Mount points (Linux) | +| `/proc/self/mountinfo` | Mount info with major/minor device numbers (Linux) | +| `statfs(2)` | Filesystem statistics syscall | + +Documentation: +- https://docs.kernel.org/filesystems/proc.html +- `statfs(2)` manpage + +## Metrics + +| Metric | Type | Labels | Description | +|--------|------|--------|-------------| +| `node_filesystem_size_bytes` | gauge | `device`, `mountpoint`, `fstype`, `device_error` | Filesystem size in bytes | +| `node_filesystem_free_bytes` | gauge | `device`, `mountpoint`, `fstype`, `device_error` | Filesystem free space in bytes | +| `node_filesystem_avail_bytes` | gauge | `device`, `mountpoint`, `fstype`, `device_error` | Filesystem space available to non-root users in bytes | +| `node_filesystem_files` | gauge | `device`, `mountpoint`, `fstype`, `device_error` | Filesystem total file nodes (inodes) | +| `node_filesystem_files_free` | gauge | `device`, `mountpoint`, `fstype`, `device_error` | Filesystem free file nodes (inodes) | +| `node_filesystem_readonly` | gauge | `device`, `mountpoint`, `fstype`, `device_error` | Filesystem read-only status (1 = read-only) | +| `node_filesystem_device_error` | gauge | `device`, `mountpoint`, `fstype`, `device_error` | Error occurred getting statistics (1 = error) | +| `node_filesystem_mount_info` | gauge | `device`, `major`, `minor`, `mountpoint` | Filesystem mount information (always 1) | +| `node_filesystem_purgeable_bytes` | gauge | `device`, `mountpoint`, `fstype`, `device_error` | Purgeable space in bytes (Darwin only) | + +## Labels + +| Label | Description | +|-------|-------------| +| `device` | Block device path (e.g., `/dev/sda1`) | +| `mountpoint` | Mount path (e.g., `/`, `/home`) | +| `fstype` | Filesystem type (e.g., `ext4`, `xfs`, `btrfs`) | +| `device_error` | Error message if device stat failed, empty otherwise | +| `major` | Device major number (mount_info 
only) | +| `minor` | Device minor number (mount_info only) | + +## Notes + +- `free_bytes` includes reserved blocks; `avail_bytes` is what non-root users can use +- When `device_error` is set (value = 1), only `readonly` and `device_error` metrics are emitted +- Duplicate mounts (same device, mountpoint, fstype) are deduplicated +- Network filesystems may cause hangs if unreachable; consider excluding with `--collector.filesystem.fs-types-exclude` +- `purgeable_bytes` is Darwin-specific and includes space reclaimable by the OS diff --git a/docs/collectors/hwmon.md b/docs/collectors/hwmon.md new file mode 100644 index 0000000000..0144afbb84 --- /dev/null +++ b/docs/collectors/hwmon.md @@ -0,0 +1,152 @@ +# hwmon + +Exposes hardware monitoring statistics from `/sys/class/hwmon/`, similar to `lm-sensors`. + +Status: enabled by default + +## Platforms + +- Linux + +## Configuration + +``` +--collector.hwmon.chip-include Regexp of chips to include (mutually exclusive to chip-exclude) +--collector.hwmon.chip-exclude Regexp of chips to exclude (mutually exclusive to chip-include) +--collector.hwmon.sensor-include Regexp of sensors to include (mutually exclusive to sensor-exclude) +--collector.hwmon.sensor-exclude Regexp of sensors to exclude (mutually exclusive to sensor-include) +``` + +### Examples + +Exclude a specific chip: +``` +--collector.hwmon.chip-exclude="^platform_thinkpad_hwmon$" +``` + +Monitor only coretemp sensors: +``` +--collector.hwmon.chip-include="^platform_coretemp.*" +``` + +Exclude specific sensor on specific chip (format: `chip;sensor`): +``` +--collector.hwmon.sensor-exclude="platform_coretemp_0;temp3" +``` + +Monitor only temperature sensors: +``` +--collector.hwmon.sensor-include=";temp[0-9]+" +``` + +## Data Sources + +| Source | Description | +|--------|-------------| +| `/sys/class/hwmon/` | Hardware monitoring chips and sensors | + +Documentation: +- https://www.kernel.org/doc/Documentation/hwmon/sysfs-interface +- `sensors(1)` manpage 
(lm-sensors) + +## Metrics + +All sensor metrics below have `chip` and `sensor` labels; the metadata metrics carry the labels listed in their table. + +### Metadata + +| Metric | Type | Labels | Description | +|--------|------|--------|-------------| +| `node_hwmon_chip_names` | gauge | `chip`, `chip_name` | Human-readable chip name annotation (always 1) | +| `node_hwmon_sensor_label` | gauge | `chip`, `sensor`, `label` | Sensor label annotation (always 1) | + +### Temperature + +| Metric | Type | Description | +|--------|------|-------------| +| `node_hwmon_temp_celsius` | gauge | Temperature reading in Celsius | +| `node_hwmon_temp_crit_celsius` | gauge | Critical temperature threshold | +| `node_hwmon_temp_crit_alarm_celsius` | gauge | Critical alarm temperature | +| `node_hwmon_temp_max_celsius` | gauge | Maximum temperature threshold | +| `node_hwmon_temp_min_celsius` | gauge | Minimum temperature threshold | + +### Voltage + +| Metric | Type | Description | +|--------|------|-------------| +| `node_hwmon_in_volts` | gauge | Voltage reading in volts | +| `node_hwmon_in_min_volts` | gauge | Minimum voltage threshold | +| `node_hwmon_in_max_volts` | gauge | Maximum voltage threshold | +| `node_hwmon_in_crit_volts` | gauge | Critical voltage threshold | +| `node_hwmon_cpu_volts` | gauge | CPU voltage in volts | + +### Fan + +| Metric | Type | Description | +|--------|------|-------------| +| `node_hwmon_fan_rpm` | gauge | Fan speed in RPM | +| `node_hwmon_fan_min_rpm` | gauge | Minimum fan speed | +| `node_hwmon_fan_max_rpm` | gauge | Maximum fan speed | +| `node_hwmon_fan_target_rpm` | gauge | Target fan speed | +| `node_hwmon_fan_alarm` | gauge | Fan alarm status | +| `node_hwmon_fan_fault` | gauge | Fan fault status | + +### Power + +| Metric | Type | Description | +|--------|------|-------------| +| `node_hwmon_power_watt` | gauge | Power usage in watts | +| `node_hwmon_power_max_watt` | gauge | Maximum power | +| `node_hwmon_power_crit_watt` | gauge | Critical power threshold | +| `node_hwmon_power_accuracy` | gauge | Power meter 
accuracy ratio | +| `node_hwmon_power_average_interval_seconds` | gauge | Power averaging interval | + +### Current + +| Metric | Type | Description | +|--------|------|-------------| +| `node_hwmon_curr_amps` | gauge | Current in amperes | +| `node_hwmon_curr_min_amps` | gauge | Minimum current threshold | +| `node_hwmon_curr_max_amps` | gauge | Maximum current threshold | +| `node_hwmon_curr_crit_amps` | gauge | Critical current threshold | + +### Energy + +| Metric | Type | Description | +|--------|------|-------------| +| `node_hwmon_energy_joule_total` | counter | Total energy consumed in joules | + +### PWM + +| Metric | Type | Description | +|--------|------|-------------| +| `node_hwmon_pwm` | gauge | PWM value (0-255) | +| `node_hwmon_pwm_enable` | gauge | PWM control mode | + +### Other + +| Metric | Type | Description | +|--------|------|-------------| +| `node_hwmon_humidity` | gauge | Humidity as ratio (multiply by 100 for percentage) | +| `node_hwmon_intrusion_alarm` | gauge | Chassis intrusion detection | +| `node_hwmon_freq_freq_mhz` | gauge | GPU frequency in MHz | +| `node_hwmon_beep_enabled` | gauge | Beep enabled status | +| `node_hwmon_voltage_regulator_version` | gauge | VRM version | +| `node_hwmon_update_interval_seconds` | gauge | Sensor update interval | + +## Labels + +| Label | Description | +|-------|-------------| +| `chip` | Chip identifier derived from device path or name (e.g., `platform_coretemp_0`, `pci0000_00_1f_3`) | +| `sensor` | Sensor identifier (e.g., `temp1`, `fan2`, `in0`) | +| `chip_name` | Human-readable chip name from sysfs (chip_names metric only) | +| `label` | Sensor label from sysfs if available (sensor_label metric only) | + +## Notes + +- Chip names are derived from device paths to ensure stability across reboots (hwmon numbering can change) +- Sensor filtering uses format `chip;sensor` to allow per-chip sensor exclusion +- Raw sysfs values are converted to standard units (millivolts -> volts, millidegrees -> 
degrees) +- Some drivers return EAGAIN; the collector handles this gracefully +- Use the `sensors` command from lm-sensors to explore available sensors diff --git a/docs/collectors/meminfo.md b/docs/collectors/meminfo.md new file mode 100644 index 0000000000..f1bcc20b85 --- /dev/null +++ b/docs/collectors/meminfo.md @@ -0,0 +1,96 @@ +# meminfo + +Exposes memory statistics from `/proc/meminfo`. + +Status: enabled by default + +## Platforms + +- Linux +- Darwin +- OpenBSD +- NetBSD +- AIX + +## Data Sources + +| Source | Description | +|--------|-------------| +| `/proc/meminfo` | Memory statistics | + +Kernel documentation: https://www.kernel.org/doc/Documentation/filesystems/proc.txt (search for "meminfo") + +## Metrics + +Metrics are dynamically generated from `/proc/meminfo` fields. Each field `FieldName` with value in kB becomes `node_memory_FieldName_bytes` (converted to bytes). + +### Common Metrics + +| Metric | Type | Description | +|--------|------|-------------| +| `node_memory_MemTotal_bytes` | gauge | Total usable RAM | +| `node_memory_MemFree_bytes` | gauge | Free RAM | +| `node_memory_MemAvailable_bytes` | gauge | Available memory for starting new applications | +| `node_memory_Buffers_bytes` | gauge | Memory used by kernel buffers | +| `node_memory_Cached_bytes` | gauge | Memory used by the page cache (slab memory is reported separately) | +| `node_memory_SwapTotal_bytes` | gauge | Total swap space | +| `node_memory_SwapFree_bytes` | gauge | Free swap space | +| `node_memory_SwapCached_bytes` | gauge | Swap space cached in RAM | + +### Active/Inactive Memory + +| Metric | Type | Description | +|--------|------|-------------| +| `node_memory_Active_bytes` | gauge | Memory recently used | +| `node_memory_Inactive_bytes` | gauge | Memory not recently used | +| `node_memory_Active_anon_bytes` | gauge | Active anonymous memory | +| `node_memory_Inactive_anon_bytes` | gauge | Inactive anonymous memory | +| `node_memory_Active_file_bytes` | gauge | Active file-backed memory | +| 
`node_memory_Inactive_file_bytes` | gauge | Inactive file-backed memory | + +### Slab Memory + +| Metric | Type | Description | +|--------|------|-------------| +| `node_memory_Slab_bytes` | gauge | Kernel slab memory | +| `node_memory_SReclaimable_bytes` | gauge | Reclaimable slab memory | +| `node_memory_SUnreclaim_bytes` | gauge | Unreclaimable slab memory | + +### Huge Pages + +| Metric | Type | Description | +|--------|------|-------------| +| `node_memory_HugePages_Total` | gauge | Total huge pages (count, not bytes) | +| `node_memory_HugePages_Free` | gauge | Free huge pages (count) | +| `node_memory_HugePages_Rsvd` | gauge | Reserved huge pages (count) | +| `node_memory_HugePages_Surp` | gauge | Surplus huge pages (count) | +| `node_memory_Hugepagesize_bytes` | gauge | Size of each huge page | + +### Virtual Memory + +| Metric | Type | Description | +|--------|------|-------------| +| `node_memory_VmallocTotal_bytes` | gauge | Total vmalloc address space | +| `node_memory_VmallocUsed_bytes` | gauge | Used vmalloc address space | +| `node_memory_VmallocChunk_bytes` | gauge | Largest contiguous vmalloc block | + +### Other + +| Metric | Type | Description | +|--------|------|-------------| +| `node_memory_Dirty_bytes` | gauge | Memory waiting to be written to disk | +| `node_memory_Writeback_bytes` | gauge | Memory being written to disk | +| `node_memory_Mapped_bytes` | gauge | Files mapped into memory | +| `node_memory_Shmem_bytes` | gauge | Shared memory | +| `node_memory_KernelStack_bytes` | gauge | Kernel stack memory | +| `node_memory_PageTables_bytes` | gauge | Page table memory | +| `node_memory_CommitLimit_bytes` | gauge | Total memory available for allocation | +| `node_memory_Committed_AS_bytes` | gauge | Total memory allocated | + +## Notes + +- Available metrics vary by kernel version and configuration +- `MemAvailable` requires Linux 3.14+ +- HugePages metrics are counts, not byte values +- All meminfo metrics are gauges +- Darwin, OpenBSD, 
NetBSD, and AIX have platform-specific implementations with different available metrics diff --git a/docs/collectors/netdev.md b/docs/collectors/netdev.md new file mode 100644 index 0000000000..cba4638504 --- /dev/null +++ b/docs/collectors/netdev.md @@ -0,0 +1,137 @@ +# netdev + +Exposes network interface statistics such as bytes transferred, packets, errors, and drops. + +Status: enabled by default + +## Platforms + +- Linux +- Darwin +- FreeBSD +- OpenBSD +- Dragonfly +- AIX + +## Configuration + +``` +--collector.netdev.device-include Regexp of devices to include (mutually exclusive to device-exclude) +--collector.netdev.device-exclude Regexp of devices to exclude (mutually exclusive to device-include) +--collector.netdev.address-info Collect address info for every device (default: false) +--collector.netdev.enable-detailed-metrics Use detailed metric names on Linux (default: false) +``` + +### Examples + +Exclude virtual and container interfaces: +``` +--collector.netdev.device-exclude="^(veth|docker|br-|virbr|cni|flannel|cali).*" +``` + +Monitor only physical ethernet interfaces: +``` +--collector.netdev.device-include="^(eth|ens|enp|eno)[0-9]+" +``` + +Exclude loopback only: +``` +--collector.netdev.device-exclude="^lo$" +``` + +Include bonded interfaces and their members: +``` +--collector.netdev.device-include="^(bond[0-9]+|eth[0-9]+)$" +``` + +Enable IP address information for all interfaces: +``` +--collector.netdev.address-info +``` + +Use detailed error metrics (breaks compatibility with default metric names): +``` +--collector.netdev.enable-detailed-metrics +``` + +## Data Sources + +| Source | Description | +|--------|-------------| +| `/proc/net/dev` | Network device statistics (Linux) | +| `/sys/class/net/` | Network device info (Linux) | +| `getifaddrs(3)` | Interface addresses (all platforms) | + +Documentation: +- https://docs.kernel.org/networking/statistics.html +- `netdevice(7)` manpage + +## Metrics + +All metrics have the `device` label and 
are counters with `_total` suffix.
+
+### Standard Metrics (default)
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `node_network_receive_bytes_total` | counter | Bytes received |
+| `node_network_receive_packets_total` | counter | Packets received |
+| `node_network_receive_errs_total` | counter | Receive errors |
+| `node_network_receive_drop_total` | counter | Packets dropped on receive |
+| `node_network_receive_fifo_total` | counter | FIFO buffer errors on receive |
+| `node_network_receive_frame_total` | counter | Frame errors on receive |
+| `node_network_receive_compressed_total` | counter | Compressed packets received |
+| `node_network_receive_multicast_total` | counter | Multicast packets received |
+| `node_network_transmit_bytes_total` | counter | Bytes transmitted |
+| `node_network_transmit_packets_total` | counter | Packets transmitted |
+| `node_network_transmit_errs_total` | counter | Transmit errors |
+| `node_network_transmit_drop_total` | counter | Packets dropped on transmit |
+| `node_network_transmit_fifo_total` | counter | FIFO buffer errors on transmit |
+| `node_network_transmit_colls_total` | counter | Collisions detected |
+| `node_network_transmit_carrier_total` | counter | Carrier errors on transmit |
+| `node_network_transmit_compressed_total` | counter | Compressed packets transmitted |
+
+### Detailed Metrics (--collector.netdev.enable-detailed-metrics)
+
+When enabled, exposes more granular error counters instead of aggregated values:
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `node_network_receive_errors_total` | counter | Total receive errors |
+| `node_network_receive_dropped_total` | counter | Dropped packets (excludes missed) |
+| `node_network_receive_missed_errors_total` | counter | Missed packets |
+| `node_network_receive_fifo_errors_total` | counter | FIFO overrun errors |
+| `node_network_receive_length_errors_total` | counter | Length errors |
+| `node_network_receive_over_errors_total` | counter | Ring buffer overflow |
+| `node_network_receive_crc_errors_total` | counter | CRC errors |
+| `node_network_receive_frame_errors_total` | counter | Frame alignment errors |
+| `node_network_transmit_errors_total` | counter | Total transmit errors |
+| `node_network_transmit_dropped_total` | counter | Dropped packets |
+| `node_network_transmit_fifo_errors_total` | counter | FIFO errors |
+| `node_network_transmit_aborted_errors_total` | counter | Aborted transmissions |
+| `node_network_transmit_carrier_errors_total` | counter | Carrier errors |
+| `node_network_transmit_heartbeat_errors_total` | counter | Heartbeat errors |
+| `node_network_transmit_window_errors_total` | counter | Window errors |
+| `node_network_collisions_total` | counter | Collision count |
+
+### Address Info (--collector.netdev.address-info)
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `node_network_address_info` | gauge | `device`, `address`, `netmask`, `scope` | Network address info (always 1) |
+
+## Labels
+
+| Label | Description |
+|-------|-------------|
+| `device` | Interface name (e.g., `eth0`, `ens192`, `lo`) |
+| `address` | IP address (address_info only) |
+| `netmask` | CIDR prefix length (address_info only) |
+| `scope` | Address scope: `global`, `link-local`, `interface-local` (address_info only) |
+
+## Notes
+
+- Default metrics match `/proc/net/dev` column names for compatibility
+- Detailed metrics provide per-error-type breakdown but change metric names
+- Virtual interfaces (veth, docker, etc.) are included by default; use `--collector.netdev.device-exclude` to filter
+- Loopback (`lo`) is included by default
diff --git a/docs/collectors/netstat.md b/docs/collectors/netstat.md
new file mode 100644
index 0000000000..34ca1a9dfe
--- /dev/null
+++ b/docs/collectors/netstat.md
@@ -0,0 +1,134 @@
+# netstat
+
+Exposes network statistics from `/proc/net/netstat`, `/proc/net/snmp`, and `/proc/net/snmp6`.
+
+Status: enabled by default
+
+## Platforms
+
+- Linux
+
+## Configuration
+
+```
+--collector.netstat.fields  Regexp of fields to return (default: see below)
+```
+
+Default pattern:
+```
+^(.*_(InErrors|InErrs)|Ip_Forwarding|Ip(6|Ext)_(InOctets|OutOctets)|Icmp6?_(InMsgs|OutMsgs)|TcpExt_(Listen.*|Syncookies.*|TCPSynRetrans|TCPTimeouts|TCPOFOQueue|TCPRcvQDrop)|Tcp_(ActiveOpens|InSegs|OutSegs|OutRsts|PassiveOpens|RetransSegs|CurrEstab)|Udp6?_(InDatagrams|OutDatagrams|NoPorts|RcvbufErrors|SndbufErrors))$
+```
+
+### Examples
+
+Expose all available metrics:
+```
+--collector.netstat.fields=".*"
+```
+
+TCP metrics only (basic and extended):
+```
+--collector.netstat.fields="^Tcp(Ext)?_.*"
+```
+
+Only error-related metrics:
+```
+--collector.netstat.fields=".*_(InErrors|InErrs|Drops|Timeouts|Retrans).*"
+```
+
+Minimal set (bytes in/out and established connections):
+```
+--collector.netstat.fields="^(IpExt_(InOctets|OutOctets)|Tcp_CurrEstab)$"
+```
+
+Add memory pressure metrics to the default set:
+```
+--collector.netstat.fields="^(.*_(InErrors|InErrs)|Ip_Forwarding|Ip(6|Ext)_(InOctets|OutOctets)|Icmp6?_(InMsgs|OutMsgs)|TcpExt_(Listen.*|Syncookies.*|TCPSynRetrans|TCPTimeouts|TCPOFOQueue|TCPRcvQDrop|TCPMemoryPressures.*)|Tcp_(ActiveOpens|InSegs|OutSegs|OutRsts|PassiveOpens|RetransSegs|CurrEstab)|Udp6?_(InDatagrams|OutDatagrams|NoPorts|RcvbufErrors|SndbufErrors))$"
+```
+
+## Data Sources
+
+| Source | Description |
+|--------|-------------|
+| `/proc/net/netstat` | Extended TCP statistics (TcpExt, IpExt) |
+| `/proc/net/snmp` | SNMP MIB statistics (Ip, Icmp, Tcp, Udp) |
+| `/proc/net/snmp6` | IPv6 SNMP statistics (Ip6, Icmp6, Udp6) |
+
+Documentation:
+- https://docs.kernel.org/networking/snmp_counter.html
+- https://docs.kernel.org/filesystems/proc.html (Table 1-9: Network info in /proc/net)
+- `netstat(8)` manpage (`netstat -s` displays the same statistics)
+
+## Metrics
+
+Metrics are dynamically generated as `node_netstat_<Protocol>_<Field>` based on the fields regex filter.
+
+### Default Exposed Metrics
+
+#### IP
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `node_netstat_Ip_Forwarding` | untyped | IP forwarding status |
+| `node_netstat_IpExt_InOctets` | untyped | Total incoming bytes |
+| `node_netstat_IpExt_OutOctets` | untyped | Total outgoing bytes |
+| `node_netstat_Ip6_InOctets` | untyped | Total incoming IPv6 bytes |
+| `node_netstat_Ip6_OutOctets` | untyped | Total outgoing IPv6 bytes |
+
+#### ICMP
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `node_netstat_Icmp_InMsgs` | untyped | Total incoming ICMP messages |
+| `node_netstat_Icmp_OutMsgs` | untyped | Total outgoing ICMP messages |
+| `node_netstat_Icmp6_InMsgs` | untyped | Total incoming ICMPv6 messages |
+| `node_netstat_Icmp6_OutMsgs` | untyped | Total outgoing ICMPv6 messages |
+
+#### TCP
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `node_netstat_Tcp_ActiveOpens` | untyped | Active connection openings |
+| `node_netstat_Tcp_PassiveOpens` | untyped | Passive connection openings |
+| `node_netstat_Tcp_InSegs` | untyped | Incoming segments |
+| `node_netstat_Tcp_OutSegs` | untyped | Outgoing segments |
+| `node_netstat_Tcp_RetransSegs` | untyped | Retransmitted segments |
+| `node_netstat_Tcp_OutRsts` | untyped | Outgoing resets |
+| `node_netstat_Tcp_CurrEstab` | untyped | Currently established connections |
+
+#### TCP Extended
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `node_netstat_TcpExt_ListenOverflows` | untyped | Listen queue overflows |
+| `node_netstat_TcpExt_ListenDrops` | untyped | Dropped incoming connections |
+| `node_netstat_TcpExt_SyncookiesSent` | untyped | SYN cookies sent |
+| `node_netstat_TcpExt_SyncookiesRecv` | untyped | SYN cookies received |
+| `node_netstat_TcpExt_SyncookiesFailed` | untyped | SYN cookies failed |
+| `node_netstat_TcpExt_TCPSynRetrans` | untyped | SYN retransmissions |
+| `node_netstat_TcpExt_TCPTimeouts` | untyped | TCP timeouts |
+| `node_netstat_TcpExt_TCPOFOQueue` | untyped | Out-of-order queue usage |
+| `node_netstat_TcpExt_TCPRcvQDrop` | untyped | Receive queue drops |
+
+#### UDP
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `node_netstat_Udp_InDatagrams` | untyped | Incoming UDP datagrams |
+| `node_netstat_Udp_OutDatagrams` | untyped | Outgoing UDP datagrams |
+| `node_netstat_Udp_NoPorts` | untyped | Datagrams to unknown ports |
+| `node_netstat_Udp_RcvbufErrors` | untyped | Receive buffer errors |
+| `node_netstat_Udp_SndbufErrors` | untyped | Send buffer errors |
+| `node_netstat_Udp6_InDatagrams` | untyped | Incoming UDPv6 datagrams |
+| `node_netstat_Udp6_OutDatagrams` | untyped | Outgoing UDPv6 datagrams |
+
+#### Error Metrics
+
+Any field matching `.*_(InErrors|InErrs)` is included by default.
+
+## Notes
+
+- All metrics are exposed as `untyped` since the kernel doesn't indicate whether values are counters or gauges
+- `/proc/net/snmp6` may not exist on systems with IPv6 disabled
+- Customize `--collector.netstat.fields` to expose additional or fewer metrics
+- Field names match the kernel's naming convention exactly
diff --git a/docs/collectors/stat.md b/docs/collectors/stat.md
new file mode 100644
index 0000000000..da276553c2
--- /dev/null
+++ b/docs/collectors/stat.md
@@ -0,0 +1,64 @@
+# stat
+
+Exposes kernel and system statistics from `/proc/stat`.
+
+Status: enabled by default
+
+## Platforms
+
+- Linux
+
+## Configuration
+
+```
+--collector.stat.softirq  Export softirq calls per vector (default: false)
+```
+
+## Data Sources
+
+| Source | Description |
+|--------|-------------|
+| `/proc/stat` | Kernel/system statistics |
+
+Documentation:
+- https://docs.kernel.org/filesystems/proc.html
+- `proc(5)` manpage
+
+## Metrics
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `node_intr_total` | counter | | Total number of interrupts serviced |
+| `node_context_switches_total` | counter | | Total number of context switches |
+| `node_forks_total` | counter | | Total number of forks |
+| `node_boot_time_seconds` | gauge | | Node boot time in Unix timestamp |
+| `node_procs_running` | gauge | | Number of processes in runnable state |
+| `node_procs_blocked` | gauge | | Number of processes blocked waiting for I/O |
+
+### Softirq Metrics (--collector.stat.softirq)
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `node_softirqs_total` | counter | `vector` | Number of softirq calls per vector |
+
+Softirq vectors:
+
+| Vector | Description |
+|--------|-------------|
+| `hi` | High-priority tasklets |
+| `timer` | Timer interrupts |
+| `net_tx` | Network transmit |
+| `net_rx` | Network receive |
+| `block` | Block device |
+| `block_iopoll` | Block I/O polling |
+| `tasklet` | Tasklet processing |
+| `sched` | Scheduler |
+| `hrtimer` | High-resolution timer |
+| `rcu` | Read-copy-update |
+
+## Notes
+
+- `node_boot_time_seconds` is a Unix timestamp; use `time() - node_boot_time_seconds` for uptime
+- `node_intr_total` is the sum of all interrupt counts; for per-interrupt details, use the `interrupts` collector
+- `node_procs_running` and `node_procs_blocked` are instantaneous values
+- Softirq metrics are disabled by default to reduce cardinality; for detailed softirq stats, see the `softirqs` collector
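
The mapping from `/proc/stat` lines to the stat metrics listed above can be sketched in Python. This is an illustrative sketch only, not the collector's actual code (node_exporter itself is written in Go). The function name `parse_proc_stat`, the `softirq` parameter, and the `SAMPLE` text are hypothetical; the line keys (`intr`, `ctxt`, `btime`, `processes`, `procs_running`, `procs_blocked`, `softirq`) are assumed to follow the layout described in `proc(5)`.

```python
# Hypothetical sketch: derive the stat collector's metric values from
# /proc/stat-formatted text. Names here are illustrative assumptions.

# Order of the per-vector counts on the "softirq" line (after the total).
SOFTIRQ_VECTORS = [
    "hi", "timer", "net_tx", "net_rx", "block",
    "block_iopoll", "tasklet", "sched", "hrtimer", "rcu",
]

# /proc/stat key -> metric name, for lines that carry a single value.
SIMPLE_KEYS = {
    "ctxt": "node_context_switches_total",
    "processes": "node_forks_total",
    "btime": "node_boot_time_seconds",
    "procs_running": "node_procs_running",
    "procs_blocked": "node_procs_blocked",
}


def parse_proc_stat(text, softirq=False):
    """Return {metric: value}; softirq entries are keyed by (metric, vector)."""
    metrics = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        key = fields[0]
        if key in SIMPLE_KEYS:
            metrics[SIMPLE_KEYS[key]] = int(fields[1])
        elif key == "intr":
            # The first value is the total across all interrupts.
            metrics["node_intr_total"] = int(fields[1])
        elif key == "softirq" and softirq:
            # Skip the leading total; the rest are per-vector counts.
            for vector, count in zip(SOFTIRQ_VECTORS, fields[2:]):
                metrics[("node_softirqs_total", vector)] = int(count)
    return metrics


# Made-up sample data in /proc/stat layout.
SAMPLE = """\
cpu  100 0 50 1000 5 0 2 0 0 0
intr 12345 10 0 0
ctxt 67890
btime 1700000000
processes 4321
procs_running 2
procs_blocked 0
softirq 550 1 20 3 40 5 0 60 70 0 351
"""

if __name__ == "__main__":
    for name, value in parse_proc_stat(SAMPLE, softirq=True).items():
        print(name, value)
```

Passing `softirq=False` (the default in this sketch) mirrors the flag's default: the per-vector series are simply not produced, which keeps the number of exposed series small.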