M3: Masked Mineral Modeling for continent-scale geological surveying via geospatial infilling

This directory contains the anonymized data scraping and analysis code for M3, a method for inferring mineral deposit presence within the conterminous US via geospatial infilling. The scripts and sample data have been reduced so that the zipped archive stays under 100 MB while still allowing the reproducibility of our results to be evaluated.

Project Structure

The repository is organized into the following directories:

  • data_generation/: Contains scripts for processing raw geographical and mineral data into a format suitable for the models.
  • analysis/: Contains scripts and notebooks for training the models, running experiments, and performing analysis.
  • derivations/: Contains a Mathematica notebook with derivations related to the project.
  • environment.yml: A conda environment file with all the necessary dependencies.

File Descriptions

data_generation/

These scripts create the datasets used in the analysis and should be run in the order listed below. Note that file paths are hardcoded in these scripts, and all scripts have been scrubbed of identifying information to preserve anonymity during the review process.

  • prepareDataTiling.py: The first step in the data generation pipeline. It processes a CSV file of mineral data (mrds.csv), filters it for specific minerals within the US, and generates a dataset of 50x50 mile squares. For each square, it creates a 50x50 grid and records the count and quality of mineral deposits. The output is an HDF5 file (tinyMineralDataWithCoords.h5).
  • createElevation_DEC.py: Creates a 50x50 grid of elevation data for each square generated by prepareDataTiling.py.
  • createFaults_DEC.py: Creates a 50x50 grid indicating the presence of faults and their slip rates for each square.
  • createGEOAGE_DEC.py: Creates a 50x50 grid with the minimum and maximum geological age for each cell in each square.
  • createRocktype_DEC.py: Creates a 50x50 grid of rock type data for each square.

Running these scripts will not overwrite the provided sample files, which are 50-tile subsets of the full dataset used in the analysis. A quick way to inspect the generated HDF5 output is sketched below.
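For reference, the HDF5 output of prepareDataTiling.py can be inspected with h5py. The snippet below is a minimal sketch that simply lists whatever datasets the file contains; the path assumes you run it from the directory holding tinyMineralDataWithCoords.h5 and may need adjusting to match the hardcoded paths in the scripts:

import h5py

# Adjust this path to wherever prepareDataTiling.py wrote its output.
path = "tinyMineralDataWithCoords.h5"

with h5py.File(path, "r") as f:
    # Print every dataset in the file along with its shape and dtype.
    def describe(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(describe)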

analysis/

This directory contains the core modeling and analysis code.

  • models.py: Defines the neural network architectures used for mineral prediction, including UNet, SpatialTransformer, MineralTransformer, and simpler MLP-based models. It also contains the MineralDataset class for loading and preparing the data.
  • utils.py: A collection of utility functions for training, evaluation, and visualization, including loss functions, evaluation metrics (like the Dice coefficient), and plotting functions.
  • train.py: The main training script for the deep learning models. It can be configured with a wide range of command-line arguments to specify the model, data, and training parameters. It supports different data splitting strategies, including a geographic out-of-distribution (OOD) split, and logs results to wandb.
  • runTrain.py: A script for orchestrating large-scale experiments. It defines sweeps over various hyperparameters and can submit training jobs to a Slurm cluster.
  • kriging_models.py: Implements a Kriging model using gpytorch as a baseline for comparison. It includes a MultitaskBernoulliLikelihood and a MTGP (Multitask Gaussian Process) classification model, along with data preparation functions specific to the Kriging approach.
  • run_kriging.ipynb: A Jupyter notebook that demonstrates how to use the Kriging models. It shows the data preparation pipeline and how to set up and run Kriging experiments.

derivations/

  • AppendixG.nb: A Mathematica notebook containing derivations relevant to Appendix G and directions for future work described in the paper.

environment.yml

This file specifies the conda environment required to run the code. You can create the dataAnalysis environment using the following command:

conda env create -f environment.yml

How to Run

Setup

  1. Ensure you have created the conda environment and activated it via:
conda activate dataAnalysis
  2. We use Weights & Biases for metrics dashboards and logging. Log in with wandb login and create a wandb project named M3_review to set up logging for training runs.

  3. Run all commands below from the highest-level directory in this collection.

We have provided the files necessary to test the run scripts and the main analysis code. Optionally, if you would like to run the data scraping and aggregation pipeline yourself, you will need to download the following datasets:

  1. Unzip the MRDS dataset provided under data/ and ensure mrds.csv is directly under data/, e.g.:
cd data/
tar -xvf mrds-csv.zip
  2. Download the USGS Quaternary Faults & Folds database:
cd data/
curl -O https://earthquake.usgs.gov/static/lfs/nshm/qfaults/Qfaults_GIS.zip
tar -xvf Qfaults_GIS.zip
cp SHP/Qfaults_US_Database.* .
rm -rf GDB SHP Symbology Qfaults_GIS.zip
  3. Download the USGS GMNA shapefiles:
cd data/
curl -O https://ngmdb.usgs.gov/gmna/gis_files/gmna_shapefiles.zip
tar -xvf gmna_shapefiles.zip
cp GMNA_SHAPES/Geologic_units.* .
rm -rf gmna_shapefiles.zip GMNA_SHAPES
  4. If you have a NASA EarthData account, you can download the National Map elevation data (CAUTION: >20 GB in size!) as follows:
cd data/elevation_data
sh download_elevations.sh
unzip '*.zip'
rm -rf *.hgt.zip
  5. Once you have unzipped all downloads, copied the necessary files, and deleted the rest, the Data Generation scripts are ready to run. A small sanity check for the expected files is sketched below.
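As a quick check before running the pipeline (see step 5 above), the snippet below looks for the main files the download steps should have left under data/. The .shp extensions are an assumption about which component of each shapefile set to test for:

import glob, os

# Files expected under data/ after the download steps above.
expected = ["data/mrds.csv", "data/Qfaults_US_Database.shp", "data/Geologic_units.shp"]
for path in expected:
    print(path, "OK" if os.path.exists(path) else "MISSING")

# Elevation tiles are optional and large; count whatever .hgt files were unpacked.
print("elevation tiles:", len(glob.glob("data/elevation_data/*.hgt")))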

1. Data Generation

Run the scripts in the data_generation/ directory in the following order:

  1. python data_generation/prepareDataTiling.py
  2. python data_generation/createElevation_DEC.py
  3. python data_generation/createFaults_DEC.py
  4. python data_generation/createGEOAGE_DEC.py
  5. python data_generation/createRocktype_DEC.py

2. Training Deep Learning Models

The analysis/train.py script is used to train the deep learning models. It accepts a variety of command-line arguments to control the training process. For example, to train a U-Net model, you can run:

python analysis/train.py --model_type u --unetArch resnet152 --use_minerals --use_geophys --batch_size 5

Check the argparse flags in analysis/train.py for additional options.

For more advanced use cases, such as running hyperparameter sweeps on a Slurm cluster, use the analysis/runTrain.py script. You may need to modify it to fit your cluster environment. By default it runs the auxiliary data sweep performed in the paper; see the boolean flags at the top of the script for other sweeps.

Logs from wandb will be stored under project='M3_review'. If you do not need the metric dashboards and only want to check that the code runs, comment out all wandb calls in these scripts and everything will run without issue; the sketch below shows the kind of calls to look for.
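For reference, wandb logging usually boils down to a handful of calls like those below. This is a minimal, self-contained sketch, not the exact calls in analysis/train.py; the metric names and config keys are illustrative, while the project name and the model_type/batch_size flags come from the examples above:

import wandb

# Minimal sketch of the typical wandb pattern; metric names here are illustrative.
run = wandb.init(project="M3_review", config={"model_type": "u", "batch_size": 5})
for epoch in range(3):
    train_loss = 1.0 / (epoch + 1)   # placeholder standing in for a real training loss
    wandb.log({"epoch": epoch, "train_loss": train_loss})
wandb.finish()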

3. Training the Kriging Models

The analysis/run_kriging.ipynb notebook provides a step-by-step guide to running the Kriging models. Open the notebook in a Jupyter environment and execute the cells to prepare the data and run the Kriging experiments. For rapid testing, the number of epochs has been set to 2 and the number of inducing points to 16. To adjust the sweep parameters, e.g. to set them to the values used in the full analysis, edit the create_args() function at the bottom of analysis/kriging_models.py; the relevant values are commented out in that method.
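For orientation, the sketch below shows the general variational GP classification pattern in gpytorch that the Kriging baseline builds on. It is a simplified, single-task illustration on synthetic data, not the repository's MTGP / MultitaskBernoulliLikelihood implementation; the 16 inducing points and 2 epochs mirror the rapid-testing defaults mentioned above.

import torch
import gpytorch

# Simplified single-task stand-in for the multitask GP classifier in kriging_models.py.
class GPClassifier(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

# Synthetic stand-in data: 2-D locations with binary "deposit present" labels.
train_x = torch.rand(200, 2)
train_y = (train_x.sum(dim=1) > 1.0).float()

model = GPClassifier(inducing_points=train_x[:16])   # 16 inducing points, as in the notebook
likelihood = gpytorch.likelihoods.BernoulliLikelihood()
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.numel())
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

model.train(); likelihood.train()
for epoch in range(2):                               # 2 epochs, as in the rapid-testing setup
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()

model.eval(); likelihood.eval()
with torch.no_grad():
    probs = likelihood(model(train_x)).mean          # predicted deposit probabilities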
