This repository was archived by the owner on Jan 21, 2024. It is now read-only.

User Guide

iskander-m edited this page Jul 9, 2020 · 3 revisions

Post installation instructions

Assign the "ClusterPac Admin" permission set to administrators. This permission set allows users to create cluster models.

Assign the "ClusterPac User" permission set to users. This permission set allows users to run models and view job results.

Overview

Cluster analysis is the task of grouping similar objects into groups called clusters. The Cluster Analysis application groups your data into clusters using unsupervised machine learning algorithms such as K-Means and K-Medoids. For more details on these algorithms and the methodology, refer to the "Cluster Analysis in Salesforce" document attached to the AppExchange listing.
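To make the idea of K-Medoids more concrete, here is a minimal Python sketch of the general technique. It is an illustration only, not the application's Apex implementation: the function names, the random initialization, the stopping rule and the choice of Manhattan distance are all assumptions made for this example.

```python
import random


def manhattan(a, b):
    """Manhattan distance between two numeric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_medoids(points, k, distance=manhattan, max_iter=20, seed=0):
    """Return (labels, medoid_indices) using a simple alternating K-Medoids.

    Hypothetical helper for illustration; not the application's code.
    """
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)  # random initial medoids

    def assign(current):
        # Each point joins the cluster of its nearest medoid.
        return [
            min(range(k), key=lambda c: distance(p, points[current[c]]))
            for p in points
        ]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # empty cluster: keep the previous medoid
                new_medoids.append(medoids[c])
                continue
            # The new medoid is the member with the lowest total distance
            # to all members of its cluster.
            new_medoids.append(min(
                members,
                key=lambda i: sum(distance(points[i], points[j]) for j in members),
            ))
        if new_medoids == medoids:  # converged
            break
        medoids = new_medoids
    return assign(medoids), medoids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, medoids = k_medoids(points, k=2)
print(labels, [points[m] for m in medoids])
```

Unlike K-Means, the cluster center here is always an actual record (a medoid), which is why this family of algorithms can work with any distance function, including non-numeric ones.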

Clustering of your data is performed in 3 steps:

  1. Create cluster model

In this step you choose the source object, select the fields used in record similarity (distance) calculations, specify filter criteria and a distance type for each field, and choose the clustering algorithm.

  2. Run cluster model

Here you specify the clustering algorithm parameters, such as the number of clusters to group your data into. You then start the asynchronous job that performs the calculations.

  3. View and export results

After the calculations are complete you will see the resulting clusters and their centers (centroids). Each source record is mapped to its cluster, and this mapping is stored in the clustan__ClusterJobResult__c object. A visual representation of all clusters is displayed using the t-SNE technique.

Create cluster model

  1. In the App Launcher navigate to the "Cluster Analysis" application. Go to the "Cluster Models" tab and click the "New" button. The create model screen will appear:

  2. In the dropdown select the Salesforce object whose records you want to cluster. A list of fields, filter options and a SOQL query editor will appear, with the Id and Name fields pre-selected:

  3. From the list of fields choose the fields you want to use for clustering. You can use fields from the current object and from its parent objects. The SOQL query is updated as you check or uncheck fields:

  4. Optionally, specify filter conditions. Click the "Add Condition" button to create a new filter condition, then choose a field, a condition operation and a filter value:

You can also edit the SOQL query directly in the SOQL query editor. Please note that not all SOQL queries can be processed. Some SOQL functions, such as aggregate functions (COUNT, SUM, etc.) and FORMAT, are not supported. A SOQL parsing error is displayed if there is a problem with the query.

  5. Click "Next" to go to the model configuration screen.

  6. On this screen enter the model name, the clustering algorithm and the default number of clusters. The "K-Medoids" algorithm can be used in most cases. You can use "K-Means" if your model has only numeric clustering fields.

  7. Distance functions are populated automatically based on the field type. You can change the distance function/field type using the dropdown beside each field. The following values are available:

| Field Type | Description |
| --- | --- |
| Numeric | Manhattan or Euclidean distance function will be used to measure similarity of values |
| Text | Levenshtein distance function will be used to measure similarity of values. Not supported in K-Means |
| Category | An Equal (0)/Not equal (1) comparison will be used to measure similarity of values. Not supported in K-Means |
| Long Text | TF-IDF algorithm (based on identical keywords) will be used to measure similarity |
| None | The field will be skipped for similarity calculation. This is usually set for ID and name fields which are required to identify records |
| Cluster result output | A cluster number will be stored in this field during cluster calculations. The field name must start with 'ClusterNumber'. Warning: Any existing values in this field will be overwritten. Refer to the "Populating cluster number into the source object" section for more details |

Optionally, you can specify a weight for each field. To do this, enter a decimal value in the 0..1 range. The default weight is 1, which corresponds to 100%. This value is used in distance calculations.

  8. Click "Save".
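The field types and weights described above can be sketched as a weighted per-field distance. This is only an illustration in Python: the guide does not document how the application scales or combines per-field distances, so the normalization of Text distances, the assumption that Numeric values are pre-scaled to 0..1, the weighted sum, and all names below (`field_distance`, `record_distance`, the sample field names) are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def field_distance(field_type, x, y):
    """Distance between two values of one field, by field type."""
    if field_type == "Numeric":
        return abs(x - y)  # one-dimensional Manhattan distance
    if field_type == "Text":
        longest = max(len(x), len(y))
        # Assumption: scale the edit distance to 0..1 by the longer string.
        return levenshtein(x, y) / longest if longest else 0.0
    if field_type == "Category":
        return 0.0 if x == y else 1.0  # Equal (0) / Not equal (1)
    if field_type == "None":
        return 0.0  # field is skipped in similarity calculations
    raise ValueError(f"Unsupported field type: {field_type}")


def record_distance(rec_a, rec_b, fields):
    """Weighted sum of per-field distances; fields = [(name, type, weight)]."""
    return sum(
        weight * field_distance(ftype, rec_a[name], rec_b[name])
        for name, ftype, weight in fields
    )


# Hypothetical model definition and records (Numeric values pre-scaled to 0..1).
fields = [
    ("Id", "None", 1.0),
    ("Revenue", "Numeric", 1.0),
    ("Industry", "Category", 0.5),
    ("Company", "Text", 1.0),
]
lead_a = {"Id": "1", "Revenue": 0.2, "Industry": "Tech", "Company": "kitten"}
lead_b = {"Id": "2", "Revenue": 0.5, "Industry": "Retail", "Company": "sitting"}
print(record_distance(lead_a, lead_b, fields))
```

Lowering a field's weight toward 0 reduces its influence on the overall distance; in this sketch a weight of 0.5 halves the contribution of the Industry mismatch.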

Run cluster model

There are two ways to run a model. The first option is to run it from the view model screen. Go to the "Cluster Models" tab and click the model you want to run. You will be navigated to the view model screen. Click the "Run model" button in the top right corner:

A dialog box will appear where you can change clustering algorithm parameters:

You can change the number of clusters and specify whether to run the refinement step. Click "Run" once you have entered the parameters.

The second option is to create a new cluster job. Go to the "Cluster Jobs" tab and click the "New" button:

You will be navigated to the new cluster job screen. On this screen choose a model from the list of models on the left, then enter algorithm parameters on the right and click "Run":

After you click "Run" a new cluster job will be created and you will be navigated to the job view screen:

The job is executed asynchronously by Batch Apex classes. The calculations might take some time depending on the number of records in the dataset, the clustering algorithm and the number of clustering fields. A dataset of 1,000 records is typically processed in 5-10 minutes, depending on the current instance load. The processing time for a dataset of 100K records varies from 40 minutes to 2 hours. The K-Medoids algorithm is usually faster than K-Means because it uses a subset of records in most of the calculations.

Clustering Long Text fields takes significantly more time than clustering Numeric, Category or Text fields due to the complexity of TF-IDF calculations. Processing 1,000 records with long text data might take up to 1 hour with K-Medoids and up to several hours with K-Means.
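To give a feel for what a TF-IDF comparison of two long text values involves, here is a minimal Python sketch. It is not the application's implementation: the whitespace tokenizer, the smoothed IDF formula and the use of cosine distance are assumptions chosen for this example, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # term frequency * smoothed inverse document frequency
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


docs = [
    "printer will not print",
    "printer fails to print pages",
    "invoice payment overdue",
]
vecs = tfidf_vectors(docs)
print(cosine_distance(vecs[0], vecs[1]))  # shares "printer" and "print"
print(cosine_distance(vecs[0], vecs[2]))  # no shared keywords
```

Every pairwise comparison has to look at the keywords of both records, and the term weights depend on the whole dataset, which gives some intuition for why long text fields are more expensive than numeric or category fields.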

You can cancel the job using the "Cancel job" button. When calculations are finished the results will be displayed automatically.

View job details

You can see a list of "In Progress" or "Completed" jobs in the "Cluster Jobs" tab. Click on any job from the list to display the results. You will be navigated to the view job screen:

If the job is completed you will see the calculation details such as silhouette score and the list of clusters with the record count and centroid records for each cluster.
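For readers unfamiliar with the silhouette score, the following Python sketch shows the standard definition: for each record, compare its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), and average (b - a) / max(a, b) over all records. How the application computes or samples this value internally is not described here, so treat the function below as a generic illustration with hypothetical names.

```python
def manhattan(a, b):
    """Manhattan distance between two numeric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def silhouette_score(points, labels, distance=manhattan):
    """Mean silhouette coefficient; ranges from -1 (poor) to 1 (well separated)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("Silhouette score needs at least two clusters")

    def mean_distance(i, members):
        return sum(distance(points[i], points[j]) for j in members) / len(members)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention for single-record clusters
            continue
        a = mean_distance(i, own)
        b = min(
            mean_distance(i, members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(good, bad)
```

A value close to 1 means records sit much closer to their own cluster than to any other; values near 0 or below suggest overlapping or poorly assigned clusters, which can be a hint to try a different number of clusters.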

By default the cluster job name is the same as the cluster model name. You can rename the job using the pencil icon beside the job name.

A t-SNE graph is displayed below the job details. t-SNE calculations might take a few minutes. The progress is shown on the graph itself: you will see the data points changing positions. Once the data points on the screen are stable, the calculation is complete:

Please note that t-SNE reduces the dimensionality of the data for visualization in a low-dimensional space, 2D in this case. t-SNE does not preserve distances; it only preserves nearest neighbors, so the visualization of clusters might not match the clustering algorithm results. t-SNE is also a probabilistic algorithm, so each consecutive run might produce a different output. You can change t-SNE graph parameters such as the collide option, epsilon and perplexity. The graph will be recalculated accordingly. For more information on this technique and its parameters, refer to the t-SNE wiki page.

The graph displays 500 random data points (records) by default. You can change this number in Setup, in the "Cluster Setting" custom metadata type, parameter name "TSNE Size". Please note that a larger value requires more memory and CPU, which might slow down your system, because the calculations are performed by JavaScript in the browser. The maximum recommended value is 1500.

Cluster job results

Clustering results are stored in the clustan__ClusterJobResult__c object. The object contains associations between the source object records and the calculated clusters. This object has the following fields:

| Field name | Description |
| --- | --- |
| clustan__ClusterJob__c | Contains the id of the cluster job record which was created when you ran the model. Reference to clustan__ClusterJob__c |
| clustan__ClusterNumber__c | Contains the number of the record's cluster. Numbering starts at 0 |
| clustan__Cluster__c | Contains the id of the record's cluster. Reference to clustan__ClusterJobCluster__c |
| clustan__RecordId__c | Contains the record id of the cluster model object record. If the Lead object was specified in the cluster model, this field will contain a lead record id |
| clustan__RecordName__c | Contains the record name |
| clustan__Json__c | Contains the record data in JSON format. The data will only contain fields specified in the cluster model |

Cluster job results are displayed on the job details page. You can also export cluster job results to a CSV file using the Salesforce report builder or the Data Loader tool. In most cases you will want to filter the data on the job id field (clustan__ClusterJob__c) to retrieve the results for a specific job.
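Once the results are exported, they can be post-processed outside Salesforce. The Python sketch below filters an exported CSV by job id and groups the records by cluster number. It assumes the CSV column headers are the field API names listed above (a report export may use field labels instead); the sample rows, ids and the function name are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def records_by_cluster(csv_text, job_id):
    """Group (record id, record name) pairs by cluster number for one job."""
    clusters = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["clustan__ClusterJob__c"] != job_id:
            continue  # keep only results of the requested job
        # Number fields may be exported as "0" or "0.0", so go through float.
        cluster_number = int(float(row["clustan__ClusterNumber__c"]))
        clusters[cluster_number].append(
            (row["clustan__RecordId__c"], row["clustan__RecordName__c"])
        )
    return dict(clusters)


# Hypothetical export with placeholder ids (real Salesforce ids look different).
SAMPLE_EXPORT = """clustan__ClusterJob__c,clustan__ClusterNumber__c,clustan__RecordId__c,clustan__RecordName__c
JOB_A,0,LEAD_1,Alice Example
JOB_A,1,LEAD_2,Bob Example
JOB_B,0,LEAD_3,Carol Example
JOB_A,0,LEAD_4,Dan Example
"""

grouped = records_by_cluster(SAMPLE_EXPORT, "JOB_A")
for number, records in sorted(grouped.items()):
    print(number, records)
```

To read a real export, replace `io.StringIO(csv_text)` with an opened file; the grouping logic stays the same.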

Populating cluster number into the source object

It is possible to populate the cluster number into the source (model) object records. To do this, create a custom field in the source object whose name starts with "ClusterNumber" and add this field to the model with the "Cluster result output" field type. Here is how to populate the cluster number into Lead records:

  1. In the setup go to "Object Manager", click on "Lead" object and go to "Fields & Relationships"

  2. Click "New" to create a new custom field.

  3. Select "Number" data type and click "Next"

  4. Enter "ClusterNumber" into the field label. Make sure that field name is also set to "ClusterNumber". You can change the field label if you want but the field name must start with "ClusterNumber":

  5. Click "Next".

  6. Set field level security for profiles and add this field to your page layout(s).

  7. Click "Save". The field is created.

  8. Go to "Cluster Models" tab and edit an existing Lead model. Create a new Lead model if you don't have one.

  9. In the edit model screen find and select the "ClusterNumber__c" field. Click "Next".

  10. Specify "Cluster Result Output" for "ClusterNumber__c" field and click "Save".

  11. Click "Run model" to run the model and populate the cluster number into Lead.ClusterNumber__c field. Once the clustering job is finished the cluster number will be populated into ClusterNumber__c field for all processed Lead records:

Please note that each time you run the model the previous values in this field will be overwritten.

Delete cluster job or model

To delete a cluster job go to the cluster job details page and click the "Delete" button. You will be redirected to a confirmation page. After confirming your action an asynchronous batch job will start which will delete this cluster job. Warning: all associated clusters and job results will also be deleted.

To delete a model go to the model details page and click the "Delete" button. After confirming your action an asynchronous batch job will start which will remove this cluster model. Warning: all associated cluster jobs, clusters and job results will also be deleted.

In both cases the delete operation might take several minutes, depending on the number of records being deleted.

Appendix

Cluster objects ER Diagram