Link: https://bit.ly/3oBRM3z
With the abundance of data available today, we need efficient ways to select only the data actually required to train models and deploy them on edge devices. In this project we explore a data-distillation method that uses Variance Decomposition and Condition Indices to reduce the original dataset by 50-60% while maintaining performance across various metrics.
Python 3.6 and above
To download this project locally, use:
git clone https://github.com/shubhangseth/data-distillation.git
To run this project, use:
python main.py
To alter any parameters of the project, edit the configuration file at /config/config.json:
{
"data_preprocessing":
{
"cols_filename": "/home/ubuntu/proj/psam_pusa_colnames.csv",
"cols_to_drop_csv_file": "config/drop_cols.csv"
},
"sampling_params":
{
"n" : 5,
"m" : 6,
"min_rows_per_strata" : 500,
"distillation": false
},
"model_params":
{
"output_dims" : 1,
"learning_rate" : 1e-3,
"epochs" : 80,
"batch_size" : 128,
"optimizer_weight_decay": 5e-4,
"scheduler":
{
"mode":"min",
"factor": 0.5,
"patience" : 3,
"verbose" : true,
"threshold" : 0.1
},
"type": "NeuralNet",
"data_size": 100000
},
"run_workspace": "/home/ubuntu/proj/run/",
"data_filepath": "/home/ubuntu/proj/psam_pusa/",
"wandb": {
"project": "11785-project",
"entity": "shubhang"
}
}
`data_preprocessing`: Contains the paths to files that are necessary to preprocess the data (e.g. dropping columns)
- `cols_filename`: path to a csv containing the list of all the columns in the tabular dataset
- `cols_to_drop_csv_file`: path to a csv listing all the columns that need to be dropped from the tabular dataset
`sampling_params`: Contains the parameters that control how the data is sampled (see the sketch after this list)
- `n`: chooses the top n difficult-to-learn features, which are increasingly represented in the distilled dataset
- `m`: chooses the top m most highly related columns for each of the top n features selected
- `min_rows_per_strata`: defines the minimum rows per stratum for the stratified clustering of the grouped data
- `distillation`: set to `true` to enable data distillation, else set to `false`. This can be used to compare results before and after distillation
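To make the sampling parameters concrete, here is a minimal sketch of stratified sampling with a per-stratum floor. The `stratified_sample` helper, the DataFrame `df`, and its `strata` column are hypothetical illustrations, not the project's actual implementation (which lives in /util/variance_decomposition.py):

```python
import pandas as pd

def stratified_sample(df, strata_col="strata", frac=0.5, min_rows_per_strata=500):
    # Hypothetical sketch: sample a fraction of each stratum, but keep at
    # least `min_rows_per_strata` rows (capped at the stratum's size).
    def take(group):
        n = max(min(min_rows_per_strata, len(group)), int(len(group) * frac))
        return group.sample(n=n, random_state=0)
    return df.groupby(strata_col, group_keys=False).apply(take)
```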
`model_params`: Defines the parameters that are necessary for model initialization (see the PyTorch sketch after this list)
- `output_dims`: defines the number of output dimensions
- `learning_rate`: sets the learning rate for the model
- `epochs`: sets the number of training epochs
- `batch_size`: sets the batch size for model training and evaluation
- `optimizer_weight_decay`: weight decay (L2 penalty) applied by the optimizer
- `scheduler`: describes the settings required for the learning-rate scheduler
  - `mode`: one of `min`, `max`. In `min` mode, the lr is reduced when the monitored quantity has stopped decreasing; in `max` mode, when it has stopped increasing. Default: `min`
  - `factor`: factor by which the learning rate will be reduced: new_lr = lr * factor
  - `patience`: number of epochs with no improvement after which the learning rate will be reduced
  - `verbose`: if `true`, prints a message to stdout for each update
  - `threshold`: threshold for measuring the new optimum, to only focus on significant changes
- `type`: if `ols`, `/models/RegressionNet` is used to initialize a simple least-squares linear regression predictor; if `NeuralNet`, a neural network is used as the regression predictor
- `data_size`: defines the number of samples drawn for random sampling
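For reference, a minimal sketch of how these `model_params` typically map onto PyTorch. The scheduler settings match `torch.optim.lr_scheduler.ReduceLROnPlateau`; the placeholder model, the choice of Adam, and the dummy batch are assumptions for illustration, not the project's actual training loop:

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder for NeuralNet / RegressionNet
# optimizer_weight_decay is the L2 penalty handed to the optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3, verbose=True, threshold=0.1
)

x, y = torch.randn(128, 10), torch.randn(128, 1)  # dummy batch (batch_size 128)
loss_fn = torch.nn.MSELoss()
for epoch in range(80):  # epochs
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())  # halve the lr after 3 epochs without improvement
```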
`run_workspace`: Contains the path where the different log files generated by the project are saved
`data_filepath`: Contains the path to the location of the dataset
`wandb`: Sets the environment variables for WandB (see the sketch below)
- `project`: defines the name of the project configured on WandB
- `entity`: defines the WandB username
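A minimal sketch of how the `wandb` block is typically consumed (the logged metric name is a hypothetical example):

```python
import wandb

# Start a run against the project/entity configured above.
run = wandb.init(project="11785-project", entity="shubhang")
wandb.log({"train_loss": 0.42})  # log metrics as training progresses
run.finish()
```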
- JSON parameters to easily change distillation mode
- JSON parameters to configure WandB
/config: Contains the configuration files
- config.json: file to change the different options of our project
- config.py: contains a method to read the json file as a json object (see the sketch below)
- drop_cols.csv: csv file that houses the columns to be dropped during data preprocessing
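A minimal sketch of reading config.json into a Python object, in the spirit of config.py (the actual helper may differ):

```python
import json

# Load the configuration file as a nested dict.
with open("config/config.json") as f:
    config = json.load(f)

print(config["sampling_params"]["min_rows_per_strata"])  # -> 500
```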
/models: Contains the different model files
/util: Contains the python scripts that implement the stratified clustering and data handling
- variance_decomposition.py: script that calculates variance decomposition proportions and condition indices, and implements stratified sampling (see the sketch below)
- proj_data_utils.py: contains methods for data parsing and data pre-processing
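To make the core diagnostic concrete, here is a minimal sketch of Belsley-style condition indices and variance decomposition proportions for a numeric feature matrix `X`. It illustrates the technique only and is not the project's exact implementation; the thresholds in the final comment are conventional rules of thumb:

```python
import numpy as np

def condition_indices_and_vdp(X):
    # Scale each column to unit length so the diagnostics are scale-invariant.
    Xs = X / np.linalg.norm(X, axis=0)
    # SVD of the scaled design matrix (singular values come back descending).
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    # Condition indices: largest singular value over each singular value.
    cond_idx = s.max() / s
    # Variance decomposition proportions: share of each coefficient's
    # variance attributable to each singular component.
    phi = (Vt.T ** 2) / (s ** 2)                # (n_features, n_components)
    vdp = phi / phi.sum(axis=1, keepdims=True)  # each row sums to 1
    return cond_idx, vdp

# Rule of thumb: columns whose variance loads heavily (> 0.5) on a component
# with a large condition index (> 30) are flagged as near-collinear.
```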
