Data Distillation using Variance Decomposition and Condition Indices

Link: https://bit.ly/3oBRM3z

With the abundance of data available today, there is a need for efficient ways to select the minimum data required to train models and deploy them on edge devices. This project explores one method of data distillation, using variance decomposition and condition indices, to reduce the original dataset by 50-60% while maintaining performance across various metrics.

Requirements:

Python 3.6 or later

Setup

To download this project locally, run:
git clone https://github.com/shubhangseth/data-distillation.git

To run the project:
python main.py

To change any run parameters, edit config/config.json.

Inputs for data distillation

{
  "data_preprocessing": 
  {
    "cols_filename": "/home/ubuntu/proj/psam_pusa_colnames.csv",
    "cols_to_drop_csv_file": "config/drop_cols.csv"
  },
  "sampling_params": 
  {
    "n" : 5,
    "m" : 6,
    "min_rows_per_strata" : 500,
    "distillation": false
  },
  "model_params": 
  {
    "output_dims" : 1,
    "learning_rate" : 1e-3,
    "epochs" : 80,
    "batch_size" : 128,
    "optimizer_weight_decay": 5e-4,
    "scheduler": 
    {
      "mode":"min",
      "factor": 0.5,
      "patience" : 3,
      "verbose" : true,
      "threshold" : 0.1
    },
    "type": "NeuralNet",
    "data_size": 100000
  },
  "run_workspace": "/home/ubuntu/proj/run/",
  "data_filepath": "/home/ubuntu/proj/psam_pusa/",
  "wandb": {
    "project": "11785-project",
    "entity": "shubhang"
  }
}
  • data_preprocessing: Contains the paths to files needed to preprocess the data (e.g., dropping columns)
    • cols_filename: path to a CSV listing all the columns in the tabular dataset
    • cols_to_drop_csv_file: path to a CSV listing the columns to be dropped from the tabular dataset
  • sampling_params: Contains the parameters that control how the data is sampled
    • n: number of hardest-to-learn features (top n) to over-represent in the distilled dataset
    • m: number of columns (top m) most strongly related to each of the top n features
    • min_rows_per_strata: minimum number of rows per stratum for the stratified clustering of the grouped data
    • distillation: set to true to enable data distillation, false to disable it; useful for comparing results before and after distillation
  • model_params: Defines the parameters that are necessary for model initialization
    • output_dims: Defines the number of output dimensions
    • learning_rate: sets the learning rate for the model
    • epochs: sets the number of training epochs
    • batch_size: sets the batch size for model training and evaluation
    • optimizer_weight_decay: L2 penalty (weight decay) applied by the optimizer
    • scheduler: Describes the settings required for the scheduler
      • mode: One of min, max. In min mode, lr will be reduced when the quantity monitored has stopped decreasing; in max mode it will be reduced when the quantity monitored has stopped increasing. Default: ‘min’.
      • factor: Factor by which the learning rate will be reduced. new_lr = lr * factor.
      • patience: Number of epochs with no improvement after which learning rate will be reduced.
      • verbose: If True, prints a message to stdout for each update.
      • threshold: Threshold for measuring the new optimum, to only focus on significant changes.
    • type: If ols, /models/regressionNet.py initializes a simple least-squares linear regression predictor. If NeuralNet, a neural network is used as the regression predictor
    • data_size: defines the number of samples drawn during random sampling
  • run_workspace: Contains the path for the location to save different log files that are generated by the project
  • data_filepath: Contains the path to the location of the dataset
  • wandb: sets the environment variables for WandB
    • project: defines the name of the project configured on WandB
    • entity: defines the username of WandB
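The sampling parameters above drive the stratified sampling step: strata with enough rows are down-sampled, while small strata are kept whole so rare groups survive. A minimal sketch of this idea (a hypothetical illustration using only the standard library — the project's actual algorithm lives in /models/distill.py):

```python
import random
from collections import defaultdict

def sample_strata(rows, key, min_rows_per_strata=500, frac=0.4, seed=0):
    """Down-sample each sufficiently large stratum to `frac` of its rows;
    keep strata smaller than min_rows_per_strata intact.
    (Hypothetical sketch -- see /models/distill.py for the real logic.)"""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    kept = []
    for group in strata.values():
        if len(group) >= min_rows_per_strata:
            kept.extend(rng.sample(group, int(len(group) * frac)))
        else:
            kept.extend(group)  # stratum too small: keep every row
    return kept

# e.g. 1000 rows of stratum "A" and 10 of "B", with min_rows_per_strata=100:
rows = [("A", i) for i in range(1000)] + [("B", i) for i in range(10)]
sampled = sample_strata(rows, key=lambda r: r[0], min_rows_per_strata=100, frac=0.5)
```

Here "A" is halved while "B" is kept whole, which is what min_rows_per_strata is protecting against in the real pipeline.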

Features

  1. JSON parameters to easily change the distillation mode
  2. JSON parameters to configure WandB

Folder Description

  • /config: Contains the configuration files
    • config.json: File to change different options of our project
    • config.py: contains method to read json file as json object
    • drop_cols.csv: csv file that houses the columns to be dropped during data preprocessing
  • /models: contains different model files
    • regressionNet.py: Implements ordinary least squares regression using pytorch
    • neuralNet.py: Implements a neural network (see the architecture diagram image in the repository)
    • distill.py: Implements sampling algorithm using stratified clustering
  • /util: Contains the Python scripts that support the stratified clustering implementation
    • variance_decomposition.py: Script that calculates variance decomposition proportions, condition indices and implements stratified sampling
    • proj_data_utils.py: contains methods for data parsing and pre-processing
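For reference, the diagnostics computed by variance_decomposition.py — condition indices and variance-decomposition proportions — follow the standard SVD-based recipe: scale each column of the design matrix to unit length, take its singular value decomposition, and read off the ratios of singular values and the per-coefficient variance shares. A minimal NumPy sketch (an illustration of the technique, not the repo's exact code):

```python
import numpy as np

def condition_indices_and_vdp(X):
    """Collinearity diagnostics from an SVD of the column-scaled matrix.

    Returns condition indices (s_max / s_k) and the variance-decomposition
    proportions: row j gives how coefficient j's variance splits across
    the singular values (each row sums to 1)."""
    Xs = X / np.linalg.norm(X, axis=0)            # scale columns to unit length
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    cond_idx = s.max() / s                        # one index per singular value
    phi = (Vt.T ** 2) / (s ** 2)                  # phi[j, k] = v_jk^2 / s_k^2
    vdp = phi / phi.sum(axis=1, keepdims=True)    # normalize each coefficient row
    return cond_idx, vdp

# Two nearly collinear columns yield one large condition index
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x + 1e-4 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])
ci, vdp = condition_indices_and_vdp(X)
```

High condition indices paired with large variance-decomposition proportions flag groups of near-dependent columns, which is what drives the column grouping behind the sampling_params n and m settings.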

About

Data Distillation for Neural Networks
