An unsupervised anomaly detection model for identifying outliers in tabular data using graph-based similarity analysis. It applies across diverse domains, such as cybersecurity, materials testing, and GIS, detecting unusual patterns without labeled data while reducing false positives through a multi-filter process.
Copyright 2025, Battelle Energy Alliance, LLC, ALL RIGHTS RESERVED
```python
import HSA_Classes.HSA_Pipeline as HSA_pipe
import pandas as pd
from loguru import logger

# Load your preprocessed data
data = pd.read_pickle("data.pkl")

# Initialize the model with default parameters
model = HSA_pipe.HSA_pipeline(
    data,
    penalty_ratio=0.75,
    cutoff_distance=3,
    lr=2.73,
    anomaly_std_tolerance=1.5,
    bin_count=14,
    max_spawn_dummies=500,
    percent_variance_explained=0.95,
    min_additional_percent_variance_exp=0.005,
    logger=logger,
    logging_level="INFO",
)

# Run anomaly detection
results_df = model.infer()
```

Key parameters:

- max_spawn_dummies (default: 500): Upper limit on the encoding space size per feature
- percent_variance_explained (default: 0.95): PCA variance threshold for feature selection
- min_additional_percent_variance_exp (default: 0.005): Minimum additional variance a feature must explain to be retained
- penalty_ratio (default: 0.75): Balances model generalizability against specificity
- cutoff_distance (default: 3): Similarity threshold for data relationships
- lr (default: 2.73): Learning rate for the optimizer
- anomaly_std_tolerance (default: 1.5): Number of standard deviations from the mean score required to classify a point as anomalous
- bin_count (default: 14): Minimum number of times a point must be flagged during the multifilter stage to remain classified as anomalous
The HSA pipeline consists of three main components:
**Preprocessing**: Handles categorical and object data encoding, converts time-like objects to PyTorch-compatible data types, and scales the data. Uses PCA via sklearn.decomposition to select relevant features based on variance-explained thresholds. Returns a pandas DataFrame of raw data for reporting, a scaled/encoded/PCA numpy array for model training, and the fitted sklearn decomposer and scaler objects.
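The flow of this stage can be sketched as follows; the function and variable names here are illustrative, not the pipeline's actual API:

```python
# Illustrative sketch of the preprocessing stage (not the pipeline's actual API).
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def preprocess(raw_df: pd.DataFrame, percent_variance_explained: float = 0.95):
    """One-hot encode categoricals, scale, and reduce with PCA."""
    df = raw_df.copy()
    # Convert time-like columns to a numeric type (nanoseconds since epoch)
    for col in df.select_dtypes(include=["datetime64[ns]"]).columns:
        df[col] = df[col].astype("int64")
    # One-hot encode categorical/object columns
    df = pd.get_dummies(df, dtype=float)
    # Scale features to zero mean and unit variance
    scaler = StandardScaler()
    scaled = scaler.fit_transform(df.to_numpy(dtype=float))
    # Keep enough principal components to reach the variance threshold
    decomposer = PCA(n_components=percent_variance_explained)
    reduced = decomposer.fit_transform(scaled)
    return raw_df, reduced, decomposer, scaler
```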
**Anomaly Detection Core**: The core anomaly detection class generates model weights through vertex weight calculations and distance computations. It deviates from the original paper's approach by using matrix operations instead of loops, yielding significant performance improvements. It creates affinity matrices, similarity matrices, and related mathematical constructs, then evolves these matrices through different powers to analyze relationships across multiple topological scales. PyTorch's Adam optimizer minimizes the penalized objective function to determine anomaly scores.
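A condensed sketch of this optimization loop, assuming a simple mass-conservation penalty (the class's actual penalty and weight calculations are more involved):

```python
# Condensed sketch of the core idea; the exact penalty form is an assumption.
import torch

def anomaly_scores(x: torch.Tensor, cutoff_distance: float = 3.0,
                   penalty_ratio: float = 0.75, lr: float = 2.73,
                   steps: int = 200) -> torch.Tensor:
    # Pairwise Euclidean distances via matrix operations (no Python loops)
    dists = torch.cdist(x, x)
    # Gaussian RBF edge weights with the cutoff distance as bandwidth
    w = torch.exp(-(dists / cutoff_distance) ** 2)
    # Symmetric degree normalization: S = D^{-1/2} W D^{-1/2}
    d_inv_sqrt = torch.diag(w.sum(dim=1).rsqrt())
    s = d_inv_sqrt @ w @ d_inv_sqrt
    # Evolve the graph to a higher topological scale via a matrix power
    s_evolved = torch.matrix_power(s, 3)
    # Optimize anomaly scores with Adam against a penalized objective:
    # mass concentrates on weakly connected (anomalous) vertices
    y = torch.full((x.shape[0],), 0.5, requires_grad=True)
    opt = torch.optim.Adam([y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = y @ s_evolved @ y + penalty_ratio * (y.sum() - 1.0) ** 2
        loss.backward()
        opt.step()
        with torch.no_grad():
            y.clamp_(0.0, 1.0)  # keep scores in [0, 1]
    return y.detach()
```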
To reduce false positives, the model employs a multi-filter strategy:
- Collect all anomalous predictions from individual batches
- Create a balanced dataset (10% anomalous, 90% normal)
- Re-run detection on this mixed dataset multiple times
- Count prediction frequency for each potential anomaly
- Filter out inconsistent predictions based on the `bin_count` threshold
This approach allows local anomalies to be compared against the global dataset, reducing false positives from batch-specific artifacts.
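In pseudocode terms, the filtering stage looks roughly like this; `detect_batch` is a hypothetical stand-in for one run of the core detector:

```python
# Hypothetical sketch of the multifilter stage; detect_batch() stands in for a
# single run of the core detector and is not the pipeline's actual API.
import random
from collections import Counter

def multifilter(candidate_ids, normal_ids, detect_batch, runs=30, bin_count=14):
    """Keep only candidates that are repeatedly re-flagged in a global context."""
    hits = Counter()
    for _ in range(runs):
        # Balanced mix: ~10% candidate anomalies, ~90% normal points
        sample = candidate_ids + random.sample(normal_ids, 9 * len(candidate_ids))
        # Count each candidate that the detector flags again on the mixed data
        hits.update(set(detect_batch(sample)) & set(candidate_ids))
    # A candidate survives only if flagged at least bin_count times
    return [cid for cid in candidate_ids if hits[cid] >= bin_count]
```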
**Visualization**: Provides model explainability through visualization tools. The heatmap_weights_matrix method generates heatmaps of model weights, matrices, and preprocessed data to show how anomalous events create "hot spots" that propagate through optimization. The heatmap_bin_predictions method visualizes multifilter data, anomaly score locations, and score values for visual inspection of results.
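Usage might look like the following; the method signatures are not documented here, so treat the argument-free calls as placeholders:

```python
# Hypothetical usage; check the methods in HSA_Classes for the real signatures.
model.heatmap_weights_matrix()   # hot spots in weights, matrices, preprocessed data
model.heatmap_bin_predictions()  # multifilter counts and anomaly score locations
```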
HSA detects anomalies by:
- Computing similarity relationships between data points using Euclidean distances
- Building affinity matrices that capture multi-scale topological relationships
- Optimizing anomaly scores through a penalized objective function
- Using a multifilter approach to reduce false positives
The model is particularly effective at finding anomalies that are dissimilar in feature space, not just those that are geometrically distant.
See Documentation/Examples/ for detailed usage examples and tutorials.
The model returns a dataframe containing:
- Raw data for detected anomalous events
- Anomaly scores
- Prediction confidence metrics from the multifilter stage
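For instance, to review the strongest detections first (the column name below is illustrative, not guaranteed by the API):

```python
# "anomaly_score" is an illustrative column name; inspect results_df.columns
# for the actual schema returned by model.infer().
top = results_df.sort_values("anomaly_score", ascending=False).head(10)
print(top)
```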
This implementation is based on graph evolution techniques from hyperspectral image analysis, adapted for general tabular anomaly detection. The approach uses graph theory principles to analyze data relationships across multiple topological scales.
The HSA model is based on "Graph Evolution-Based Vertex Extraction for Hyperspectral Anomaly Detection" by Xianchang et al. Originally designed for hyperspectral image analysis, this method has been adapted for general anomaly detection by comparing data points in function space rather than focusing solely on geometric proximity.
For data points $X = \{x_1, x_2, \ldots, x_n\}$, edge weights use a Gaussian radial basis function:

$$w_{ij} = \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{d_c^2}\right)$$

where $\lVert x_i - x_j \rVert$ is the Euclidean distance between points $x_i$ and $x_j$, and $d_c$ is the cutoff distance (the `cutoff_distance` parameter).

The similarity matrix $W = [w_{ij}]_{n \times n}$ collects these edge weights. This is normalized using the diagonal degree matrix $D$, whose entries are $D_{ii} = \sum_{j} w_{ij}$.

The normalized similarity matrix is:

$$S = D^{-1/2} W D^{-1/2}$$

The affinity matrix $A$ is built from the normalized similarities,

$$A = S S^{\top}$$

where entry $a_{ij}$ measures how similarly points $i$ and $j$ relate to the rest of the graph.

To analyze relationships across multiple topological scales, we compute matrix powers:

$$S^k \qquad \text{and} \qquad A^k, \qquad k = 1, 2, \ldots$$

Anomaly scores $y = (y_1, \ldots, y_n)^{\top}$ assign each point $x_i$ a score $y_i$, with constraints

$$0 \le y_i \le 1, \qquad i = 1, \ldots, n$$

where larger $y_i$ marks a more anomalous point. The complete objective function combines an evolved-graph smoothness term with a penalty term $\Omega(y)$ weighted by the penalty ratio $\lambda$ (the `penalty_ratio` parameter),

$$\mathcal{L}(y) = y^{\top} A^{k}\, y + \lambda\, \Omega(y)$$

and is minimized with PyTorch's Adam optimizer.
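These steps map directly onto a few lines of linear algebra; the following sketch mirrors the formulas above on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))                             # 100 points, 5 features
dists = np.linalg.norm(x[:, None] - x[None, :], axis=-1)  # pairwise Euclidean distances
d_c = 3.0                                                 # cutoff distance
w = np.exp(-(dists / d_c) ** 2)                           # Gaussian RBF edge weights
d_inv_sqrt = np.diag(1.0 / np.sqrt(w.sum(axis=1)))        # D^{-1/2}
s = d_inv_sqrt @ w @ d_inv_sqrt                           # normalized similarity matrix S
a = s @ s.T                                               # affinity matrix A
a_k = np.linalg.matrix_power(a, 3)                        # evolve to a higher topological scale
```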
Anomaly scores are converted to binary classifications using standard deviation thresholds. Points whose scores lie more than `anomaly_std_tolerance` standard deviations from the mean are classified as anomalous.
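Concretely, the rule can be sketched as follows, reading the cutoff two-sided (the pipeline's actual implementation may differ in detail):

```python
import numpy as np

def classify(scores: np.ndarray, anomaly_std_tolerance: float = 1.5) -> np.ndarray:
    """Flag points whose scores deviate from the mean by more than the tolerance."""
    deviation = np.abs(scores - scores.mean())
    return deviation > anomaly_std_tolerance * scores.std()
```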