forked from jplag/JPlag
Open
Description
This CLI has new options:
Clustering:
--cluster-skip Skips the clustering (Standard: false)
--cluster-alg {AGGLOMERATIVE,SPECTRAL}
Which clustering algorithm to use. Agglomerative merges similar submissions bottom up. Spectral clustering is combined with Bayesian Optimization to execute the k-Means clustering
algorithm multiple times, hopefully finding a "good" clustering automatically. (Standard: SPECTRAL)
--cluster-metric {AVG,MIN,MAX,INTERSECTION}
The metric used for clustering. AVG is Dice's coefficient, MAX is the overlap coefficient and can prevent some methods of obfuscation. (Standard: MAX)
--cluster-spectral-bandwidth bandwidth
Bandwidth of the Matérn kernel in the Gaussian Process used during the search for a good number of clusters for spectral clustering. If a good clustering result is found during the
search, numbers of clusters that differ from it by an amount within the bandwidth are also expected to be good. (Standard: 20.0)
--cluster-spectral-noise noise
The results of each run in the search for good clusterings are random. The noise level models the variance in the "worth" of these results. It also acts as a regularization constant.
(Standard: 0.0025000002)
--cluster-spectral-min-runs min
Minimum number of k-Means executions during spectral clustering. These initial runs explore a first set of clustering sizes. (Standard: 5)
--cluster-spectral-max-runs max
Maximum number of k-Means executions during spectral clustering. Any execution after the initial runs tries to balance between exploration of unknown clustering sizes and exploitation
of clustering sizes known to be good. (Standard: 50)
--cluster-spectral-kmeans-interations iterations
Maximum number of iterations during each execution of the k-Means algorithm. (Standard: 200)
--cluster-agglomerative-threshold threshold
Only clusters with an inter-cluster-similarity greater than this threshold are merged during agglomerative clustering. (Standard: 0.2)
--cluster-agglomerative-inter-cluster-similarity {MIN,MAX,AVERAGE}
How to measure the similarity of two clusters during agglomerative clustering: the minimum, maximum or average similarity between the submissions of the two clusters. (Standard: AVERAGE)
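To make the metric and agglomerative options above more concrete, here is a minimal Python sketch. It is not JPlag's implementation: the function names, the token counts and the similarity table are hypothetical, and the merge loop is only one plausible reading of the threshold and inter-cluster-similarity descriptions.

```python
from itertools import combinations


def dice(matched, len_a, len_b):
    """Dice's coefficient: matched tokens relative to the average length (AVG)."""
    return 2 * matched / (len_a + len_b)


def overlap(matched, len_a, len_b):
    """Overlap coefficient: matched tokens relative to the shorter submission (MAX)."""
    return matched / min(len_a, len_b)


def inter_cluster_similarity(c1, c2, sim, mode="AVERAGE"):
    """Similarity of two clusters from the pairwise similarities of their members."""
    values = [sim[frozenset((a, b))] for a in c1 for b in c2]
    if mode == "MIN":
        return min(values)
    if mode == "MAX":
        return max(values)
    return sum(values) / len(values)


def agglomerative(items, sim, threshold=0.2, mode="AVERAGE"):
    """Merge the most similar pair of clusters until none exceeds the threshold."""
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        # Best-scoring pair of clusters; i < j always holds here.
        score, i, j = max(
            (inter_cluster_similarity(clusters[i], clusters[j], sim, mode), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if score <= threshold:
            break  # assumption: only similarities strictly above the threshold merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical data: 40 matched tokens between a 100-token and a 50-token submission.
print(dice(40, 100, 50))
print(overlap(40, 100, 50))

# Hypothetical pairwise similarities between three submissions.
sim = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.1,
    frozenset(("b", "c")): 0.15,
}
print(agglomerative(["a", "b", "c"], sim))
```

In this toy data the overlap coefficient is higher than Dice's coefficient because the shorter submission is mostly covered by matches, which is one way to see why MAX can be harder to evade by padding a submission with unrelated code.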
Clustering - Preprocessing:
--cluster-pp-none Do not use any preprocessing before clustering. Not recommended for spectral clustering. (Standard: false)
--cluster-pp-cdf Before clustering, the value of the cumulative distribution function of all similarities is estimated. The similarities are multiplied by these estimates. This has the effect of
suppressing similarities that are low compared to other similarities. (Standard: false)
--cluster-pp-percentile percentile
Any similarity smaller than the given percentile will be suppressed during clustering.
--cluster-pp-threshold threshold
Any similarity smaller than the given threshold value will be suppressed during clustering.
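The three preprocessing modes above can be sketched in a few lines of Python. This is an illustration under assumptions, not JPlag's code: "suppressed" is taken to mean "set to zero", the percentile is assumed to be a fraction between 0 and 1, the CDF is estimated empirically, and all function names are hypothetical.

```python
from bisect import bisect_right


def pp_threshold(sims, threshold):
    """Zero out every similarity below a fixed threshold."""
    return [s if s >= threshold else 0.0 for s in sims]


def pp_percentile(sims, percentile):
    """Zero out every similarity below the value at the given percentile (0..1)."""
    ordered = sorted(sims)
    index = min(int(percentile * len(ordered)), len(ordered) - 1)
    return pp_threshold(sims, ordered[index])


def pp_cdf(sims):
    """Scale each similarity by the empirical CDF evaluated at that similarity."""
    ordered = sorted(sims)
    n = len(ordered)
    # bisect_right counts how many similarities are <= s.
    return [s * bisect_right(ordered, s) / n for s in sims]


# Hypothetical pairwise similarities.
sims = [0.1, 0.2, 0.4, 0.8]
print(pp_threshold(sims, 0.3))
print(pp_percentile(sims, 0.5))
print(pp_cdf(sims))
```

The CDF variant leaves the largest similarity unchanged and shrinks the others in proportion to their rank, which matches the described effect of suppressing values that are low relative to the rest.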