Readme #9

@SimDing

This CLI has new options:

Clustering:
  --cluster-skip         Skips the clustering (Standard: false)
  --cluster-alg {AGGLOMERATIVE,SPECTRAL}
                         Which clustering algorithm to use. Agglomerative merges similar submissions bottom up. Spectral clustering is combined with Bayesian Optimization to execute the k-Means clustering
                         algorithm multiple times, hopefully finding a "good" clustering automatically. (Standard: SPECTRAL)
  --cluster-metric {AVG,MIN,MAX,INTERSECTION}
                         The metric used for clustering. AVG is Dice's coefficient, MAX is the overlap coefficient and can prevent some methods of obfuscation. (Standard: MAX)
  --cluster-spectral-bandwidth bandwidth
                         Bandwidth of the Matérn kernel in the Gaussian Process used during the search for a good number of clusters for spectral clustering. If a good clustering result is found during the
                         search, numbers of clusters that differ by something in the range of the bandwidth are also expected to be good. (Standard: 20.0)
  --cluster-spectral-noise noise
                         The result of each run in the search for good clusterings is random. The noise level models the variance in the "worth" of these results. It also acts as a regularization constant.
                         (Standard: 0.0025000002)
  --cluster-spectral-min-runs min
                         Minimum number of k-Means executions during spectral clustering. These initial runs explore different clustering sizes. (Standard: 5)
  --cluster-spectral-max-runs max
                         Maximum number of k-Means executions during spectral clustering. Any execution after the initial runs tries to balance between exploration of unknown clustering sizes and exploitation
                         of clustering sizes known as good. (Standard: 50)
  --cluster-spectral-kmeans-interations iterations
                         Maximum number of iterations during each execution of the k-Means algorithm. (Standard: 200)
  --cluster-agglomerative-threshold threshold
                         Only clusters with an inter-cluster-similarity greater than this threshold are merged during agglomerative clustering. (Standard: 0.2)
  --cluster-agglomerative-inter-cluster-similarity {MIN,MAX,AVERAGE}
                         How to measure the similarity of two clusters during agglomerative clustering. Minimum, maximum or average similarity between the submissions in each cluster. (Standard: AVERAGE)
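
The agglomerative options can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function names, the similarity-matrix input, and the exact merge loop are assumptions made for illustration. Only the threshold rule (merge while the inter-cluster similarity is greater than the threshold) and the MIN/MAX/AVERAGE choices come from the help text above.

```python
from itertools import combinations


def inter_cluster_similarity(a, b, sim, mode="AVERAGE"):
    """Similarity of two clusters, mirroring the MIN/MAX/AVERAGE choices."""
    values = [sim[i][j] for i in a for j in b]
    if mode == "MIN":
        return min(values)
    if mode == "MAX":
        return max(values)
    return sum(values) / len(values)


def agglomerative(sim, threshold=0.2, mode="AVERAGE"):
    """Hypothetical bottom-up clustering over a symmetric similarity matrix."""
    # Start with every submission in its own cluster.
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best_score, (x, y) = max(
            (inter_cluster_similarity(a, b, sim, mode), (x, y))
            for (x, a), (y, b) in combinations(enumerate(clusters), 2)
        )
        # Only merge if the similarity is greater than the threshold.
        if best_score <= threshold:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters


# Example: submissions 0/1 and 2/3 are similar to each other, but not across pairs.
example = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.05, 0.1],
    [0.1, 0.05, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
print(agglomerative(example, threshold=0.2, mode="AVERAGE"))
```

In this sketch, lowering the threshold far enough lets the two remaining clusters merge into one, which is the trade-off `--cluster-agglomerative-threshold` controls.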

Clustering - Preprocessing:
  --cluster-pp-none      Do not use any preprocessing before clustering. Not recommended for spectral clustering. (Standard: false)
  --cluster-pp-cdf       Before clustering, the value of the cumulative distribution function of all similarities is estimated. The similarities are multiplied with these estimates. This has the effect of
                         suppressing similarities that are low compared to other similarities. (Standard: false)
  --cluster-pp-percentile percentile
                         Any similarity smaller than the given percentile will be suppressed during clustering.
  --cluster-pp-threshold threshold
                         Any similarity smaller than the given threshold value will be suppressed during clustering.
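
As a rough illustration of the three preprocessing modes, here is a small Python sketch. It rests on several assumptions that the help text does not confirm: "suppressed" is taken to mean "set to zero", the percentile is taken as a fraction between 0 and 1, and the CDF is estimated empirically from the ranks of the similarities. All function names are hypothetical.

```python
from bisect import bisect_right


def suppress_below_threshold(similarities, threshold):
    """Assumed behavior: similarities below the threshold become 0."""
    return [0.0 if s < threshold else s for s in similarities]


def suppress_below_percentile(similarities, percentile):
    """Assumed behavior: use the value at the given percentile (0..1) as threshold."""
    ranked = sorted(similarities)
    index = min(int(percentile * len(ranked)), len(ranked) - 1)
    return suppress_below_threshold(similarities, ranked[index])


def scale_by_empirical_cdf(similarities):
    """Multiply each similarity by the fraction of similarities that are <= it."""
    ranked = sorted(similarities)
    n = len(ranked)
    return [s * bisect_right(ranked, s) / n for s in similarities]


sims = [0.1, 0.2, 0.4, 0.8]
print(suppress_below_threshold(sims, 0.3))
print(suppress_below_percentile(sims, 0.5))
print(scale_by_empirical_cdf(sims))
```

Under these assumptions, the CDF variant leaves the largest similarity unchanged and shrinks the smaller ones more strongly, which matches the described effect of suppressing similarities that are low compared to the others.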
