Link: https://bit.ly/3oBRM3z
With the abundance of data available today, we need efficient ways to select only the data actually required to train models and deploy them on edge devices. In this project we explore a data-distillation method that uses Variance Decomposition and Condition Indices to reduce the original dataset by 50-60% while maintaining performance across various metrics.
Python 3.6 and above
To download this project locally, use:
git clone https://github.com/shubhangseth/data-distillation.git
To run this project, use:
python main.py
To alter any parameters of the project, edit the configuration file at /config/config.json:
{
"data_preprocessing":
{
"cols_filename": "/home/ubuntu/proj/psam_pusa_colnames.csv",
"cols_to_drop_csv_file": "config/drop_cols.csv"
},
"sampling_params":
{
"n" : 5,
"m" : 6,
"min_rows_per_strata" : 500,
"distillation": false
},
"model_params":
{
"output_dims" : 1,
"learning_rate" : 1e-3,
"epochs" : 80,
"batch_size" : 128,
"optimizer_weight_decay": 5e-4,
"scheduler":
{
"mode":"min",
"factor": 0.5,
"patience" : 3,
"verbose" : true,
"threshold" : 0.1
},
"type": "NeuralNet",
"data_size": 100000
},
"run_workspace": "/home/ubuntu/proj/run/",
"data_filepath": "/home/ubuntu/proj/psam_pusa/",
"wandb": {
"project": "11785-project",
"entity": "shubhang"
}
}
`data_preprocessing`: Contains the paths to files that are necessary to preprocess the data (e.g. dropping columns)
- `cols_filename`: path to a csv containing the list of all the columns in the tabular dataset
- `cols_to_drop_csv_file`: path to a csv listing all the columns that need to be dropped from the tabular dataset
`sampling_params`: Contains the parameters that control how the data is sampled (see the sketch after this list)
- `n`: chooses the top n difficult-to-learn features, which are increasingly represented in the distilled dataset
- `m`: chooses the top m most highly related columns for each of the top n features selected
- `min_rows_per_strata`: defines the minimum rows per stratum for the stratified clustering of the grouped data
- `distillation`: set to `true` to enable data distillation, else set to `false`. This can be used to compare results before and after distillation
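To make the sampling parameters concrete, here is a minimal sketch of stratified sampling with a per-stratum floor. The `stratified_sample` helper, the DataFrame `df`, and its `strata` column are hypothetical illustrations, not the project's actual implementation (which lives in /util/variance_decomposition.py):

```python
import pandas as pd

def stratified_sample(df, strata_col="strata", frac=0.5, min_rows_per_strata=500):
    # Hypothetical sketch: sample a fraction of each stratum, but keep at
    # least `min_rows_per_strata` rows (capped at the stratum's size).
    def take(group):
        n = max(min(min_rows_per_strata, len(group)), int(len(group) * frac))
        return group.sample(n=n, random_state=0)
    return df.groupby(strata_col, group_keys=False).apply(take)
```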
`model_params`: Defines the parameters that are necessary for model initialization (see the PyTorch sketch after this list)
- `output_dims`: defines the number of output dimensions
- `learning_rate`: sets the learning rate for the model
- `epochs`: sets the number of training epochs
- `batch_size`: sets the batch size for model training and evaluation
- `optimizer_weight_decay`: weight decay (L2 penalty) applied by the optimizer
- `scheduler`: describes the settings required for the learning-rate scheduler
  - `mode`: one of `min`, `max`. In `min` mode, the lr is reduced when the monitored quantity has stopped decreasing; in `max` mode, when it has stopped increasing. Default: `min`
  - `factor`: factor by which the learning rate will be reduced: new_lr = lr * factor
  - `patience`: number of epochs with no improvement after which the learning rate will be reduced
  - `verbose`: if `true`, prints a message to stdout for each update
  - `threshold`: threshold for measuring the new optimum, to only focus on significant changes
- `type`: if `ols`, `/models/RegressionNet` is used to initialize a simple least-squares linear regression predictor; if `NeuralNet`, a neural network is used as the regression predictor
- `data_size`: defines the number of samples drawn for random sampling
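For reference, a minimal sketch of how these `model_params` typically map onto PyTorch. The scheduler settings match `torch.optim.lr_scheduler.ReduceLROnPlateau`; the placeholder model, the choice of Adam, and the dummy batch are assumptions for illustration, not the project's actual training loop:

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder for NeuralNet / RegressionNet
# optimizer_weight_decay is the L2 penalty handed to the optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3, verbose=True, threshold=0.1
)

x, y = torch.randn(128, 10), torch.randn(128, 1)  # dummy batch (batch_size 128)
loss_fn = torch.nn.MSELoss()
for epoch in range(80):  # epochs
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())  # halve the lr after 3 epochs without improvement
```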
`run_workspace`: Contains the path where the different log files generated by the project are saved
`data_filepath`: Contains the path to the location of the dataset
`wandb`: Sets the environment variables for WandB (see the sketch below)
- `project`: defines the name of the project configured on WandB
- `entity`: defines the WandB username
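A minimal sketch of how the `wandb` block is typically consumed (the logged metric name is a hypothetical example):

```python
import wandb

# Start a run against the project/entity configured above.
run = wandb.init(project="11785-project", entity="shubhang")
wandb.log({"train_loss": 0.42})  # log metrics as training progresses
run.finish()
```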
- JSON parameters to easily change distillation mode
- JSON parameters to configure WandB
/config: Contains the configuration files
- config.json: file to change the different options of our project
- config.py: contains a method to read the json file as a json object (see the sketch below)
- drop_cols.csv: csv file that houses the columns to be dropped during data preprocessing
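A minimal sketch of reading config.json into a Python object, in the spirit of config.py (the actual helper may differ):

```python
import json

# Load the configuration file as a nested dict.
with open("config/config.json") as f:
    config = json.load(f)

print(config["sampling_params"]["min_rows_per_strata"])  # -> 500
```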
/models: Contains the different model files
/util: Contains the python scripts that implement the stratified clustering and data handling
- variance_decomposition.py: script that calculates variance decomposition proportions and condition indices, and implements stratified sampling (see the sketch below)
- proj_data_utils.py: contains methods for data parsing and data pre-processing
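To make the core diagnostic concrete, here is a minimal sketch of Belsley-style condition indices and variance decomposition proportions for a numeric feature matrix `X`. It illustrates the technique only and is not the project's exact implementation; the thresholds in the final comment are conventional rules of thumb:

```python
import numpy as np

def condition_indices_and_vdp(X):
    # Scale each column to unit length so the diagnostics are scale-invariant.
    Xs = X / np.linalg.norm(X, axis=0)
    # SVD of the scaled design matrix (singular values come back descending).
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    # Condition indices: largest singular value over each singular value.
    cond_idx = s.max() / s
    # Variance decomposition proportions: share of each coefficient's
    # variance attributable to each singular component.
    phi = (Vt.T ** 2) / (s ** 2)                # (n_features, n_components)
    vdp = phi / phi.sum(axis=1, keepdims=True)  # each row sums to 1
    return cond_idx, vdp

# Rule of thumb: columns whose variance loads heavily (> 0.5) on a component
# with a large condition index (> 30) are flagged as near-collinear.
```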
