Clustering
All clustering-related classes are contained within the de.jplag.clustering(.*) packages.
The central idea behind the structure of clustering is ease of use: calling code should only ever interact with the ClusteringOptions, ClusteringFactory, and ClusteringResult classes.
New clustering algorithms and preprocessors can be implemented using the GenericClusteringAlgorithm and ClusteringPreprocessor interfaces which operate on similarity matrices only. ClusteringAdapter handles the conversion between de.jplag classes and matrices. PreprocessedClusteringAlgorithm adds a preprocessor onto another ClusteringAlgorithm.
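The split between matrix-only algorithms, preprocessors, and the wrapper that combines them can be illustrated with a minimal, self-contained sketch. It does not use the real JPlag interfaces: MatrixClustering, MatrixPreprocessor, ThresholdPreprocessor, ConnectedComponentsClustering and PreprocessedClustering are hypothetical stand-ins invented for this example, and the actual signatures of GenericClusteringAlgorithm and ClusteringPreprocessor may differ.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for a matrix-only clustering algorithm (not JPlag's actual interface). */
interface MatrixClustering {
    /** Returns clusters as lists of row indices of the symmetric similarity matrix. */
    List<List<Integer>> cluster(double[][] similarity);
}

/** Hypothetical stand-in for a preprocessor that maps one similarity matrix to another. */
interface MatrixPreprocessor {
    double[][] preprocess(double[][] similarity);
}

/** Sets every similarity below the threshold to zero; the input matrix is left untouched. */
final class ThresholdPreprocessor implements MatrixPreprocessor {
    private final double threshold;

    ThresholdPreprocessor(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public double[][] preprocess(double[][] similarity) {
        double[][] result = new double[similarity.length][];
        for (int i = 0; i < similarity.length; i++) {
            result[i] = similarity[i].clone();
            for (int j = 0; j < result[i].length; j++) {
                if (result[i][j] < threshold) {
                    result[i][j] = 0.0;
                }
            }
        }
        return result;
    }
}

/** Deliberately simple algorithm: items joined by a non-zero similarity share a cluster. */
final class ConnectedComponentsClustering implements MatrixClustering {
    @Override
    public List<List<Integer>> cluster(double[][] similarity) {
        int n = similarity.length;
        boolean[] visited = new boolean[n];
        List<List<Integer>> clusters = new ArrayList<>();
        for (int start = 0; start < n; start++) {
            if (visited[start]) {
                continue;
            }
            // Depth-first traversal collecting everything reachable from 'start'.
            List<Integer> cluster = new ArrayList<>();
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(start);
            visited[start] = true;
            while (!stack.isEmpty()) {
                int current = stack.pop();
                cluster.add(current);
                for (int other = 0; other < n; other++) {
                    if (!visited[other] && similarity[current][other] > 0) {
                        visited[other] = true;
                        stack.push(other);
                    }
                }
            }
            clusters.add(cluster);
        }
        return clusters;
    }
}

/** Decorator: runs the preprocessor first, then delegates to the wrapped algorithm. */
final class PreprocessedClustering implements MatrixClustering {
    private final MatrixPreprocessor preprocessor;
    private final MatrixClustering delegate;

    PreprocessedClustering(MatrixPreprocessor preprocessor, MatrixClustering delegate) {
        this.preprocessor = preprocessor;
        this.delegate = delegate;
    }

    @Override
    public List<List<Integer>> cluster(double[][] similarity) {
        return delegate.cluster(preprocessor.preprocess(similarity));
    }
}
```

Because the wrapper implements the same interface as the algorithm it wraps, callers cannot tell a preprocessed algorithm from a plain one, which is the property the PreprocessedClusteringAlgorithm description above relies on. The sketch leaves out the adapter step that maps matrix indices back to submissions.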
Remarks on Spectral Clustering
- based on "On Spectral Clustering: Analysis and an Algorithm" (Ng, Jordan & Weiss, 2001)
- automatic hyper-parameter search using Bayesian Optimization with a Gaussian Process as the surrogate model and L-BFGS for optimization on the surrogate
- the L-BFGS implementation is a pit of technical debt, see here.
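For orientation, the core steps of the algorithm from the cited paper can be summarized as follows. This is a sketch of the paper's procedure, not a description of the JPlag implementation: how JPlag derives the affinity matrix A from its similarities and how the number of clusters k is chosen are not covered here.

```latex
% Input: symmetric affinity matrix A (n x n) with A_{ii} = 0, number of clusters k
\begin{align*}
D_{ii} &= \sum_{j=1}^{n} A_{ij}
  && \text{(diagonal degree matrix)} \\
L &= D^{-1/2} A D^{-1/2}
  && \text{(normalized affinity matrix)} \\
X &= [\, x_1 \; x_2 \; \cdots \; x_k \,] \in \mathbb{R}^{n \times k}
  && \text{(eigenvectors of the $k$ largest eigenvalues of $L$)} \\
Y_{ij} &= \frac{X_{ij}}{\bigl( \sum_{l=1}^{k} X_{il}^{2} \bigr)^{1/2}}
  && \text{(rows rescaled to unit length)}
\end{align*}
% The rows of Y are then clustered with k-means; item i is assigned
% to cluster c if and only if row i of Y is assigned to cluster c.
```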
Integration Tests
There are integration tests for the Spectral Clustering that verify that, at least for two known sets of similarities, the groups known to be colluders are found. However, these datasets are considered sensitive and are not available to the public, so the tests can only be run by maintainers with access.
To run these tests, the contents of the PseudonymizedReports repository must be added to the folder jplag/src/test/resources/de/jplag/PseudonymizedReports.