This directory contains the anonymized data scraping and analysis code for M3, a method to infer mineral deposit presence within the conterminous US using infilling methods. The scripts have been reduced so that the reproducibility of our results can be evaluated within the constraint that the zipped archive be <100MB.
The repository is organized into the following directories:
- `data_generation/`: Contains scripts for processing raw geographical and mineral data into a format suitable for the models.
- `analysis/`: Contains scripts and notebooks for training the models, running experiments, and performing analysis.
- `derivations/`: Contains a Mathematica notebook with derivations related to the project.
- `environment.yml`: A conda environment file with all the necessary dependencies.
These scripts are responsible for creating the datasets used in the analysis. They should be run in the order listed below. Note that file paths are hardcoded in these scripts, and all scripts have been stripped of identifying information to preserve anonymity during the review process.
- `prepareDataTiling.py`: The first step in the data generation pipeline. It processes a CSV file of mineral data (`mrds.csv`), filters it for specific minerals within the US, and generates a dataset of 50x50 mile squares. For each square, it creates a 50x50 grid and records the count and quality of mineral deposits. The output is an HDF5 file (`tinyMineralDataWithCoords.h5`); a quick way to inspect it is sketched after this list.
- `createElevation_DEC.py`: Creates a 50x50 grid of elevation data for each square generated by `prepareDataTiling.py`.
- `createFaults_DEC.py`: Creates a 50x50 grid indicating the presence of faults and their slip rates for each square.
- `createGEOAGE_DEC.py`: Creates a 50x50 grid with the minimum and maximum geological age for each cell in each square.
- `createRocktype_DEC.py`: Creates a 50x50 grid of rock type data for each square.
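To verify that the tiling output loaded correctly, the HDF5 file can be inspected with `h5py`. This is a minimal sketch, assuming the sample file sits under `data/` (adjust the path to wherever your copy of `tinyMineralDataWithCoords.h5` lives); no particular dataset names inside the file are assumed.

```python
# Minimal inspection sketch for the tiling output (path is an assumption).
import h5py

with h5py.File("data/tinyMineralDataWithCoords.h5", "r") as f:
    # Walk the file and print every group/dataset name plus dataset shapes,
    # which should reflect the 50x50 per-square grids described above.
    f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))
```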
If you choose to run these scripts, you will not overwrite the sample files provided, which are 50-tile subsets of the actual dataset used in the analysis.
This directory contains the core modeling and analysis code.
- `models.py`: Defines the neural network architectures used for mineral prediction, including `UNet`, `SpatialTransformer`, `MineralTransformer`, and simpler MLP-based models. It also contains the `MineralDataset` class for loading and preparing the data.
- `utils.py`: A collection of utility functions for training, evaluation, and visualization, including loss functions, evaluation metrics (like the Dice coefficient), and plotting functions. An illustrative version of the Dice coefficient is sketched after this list.
- `train.py`: The main training script for the deep learning models. It can be configured with a wide range of command-line arguments to specify the model, data, and training parameters. It supports different data splitting strategies, including a geographic out-of-distribution (OOD) split, and logs results to `wandb`.
- `runTrain.py`: A script for orchestrating large-scale experiments. It defines sweeps over various hyperparameters and can submit training jobs to a Slurm cluster.
- `kriging_models.py`: Implements a Kriging model using `gpytorch` as a baseline for comparison. It includes a `MultitaskBernoulliLikelihood` and an `MTGP` (Multitask Gaussian Process) classification model, along with data preparation functions specific to the Kriging approach.
- `run_kriging.ipynb`: A Jupyter notebook that demonstrates how to use the Kriging models. It shows the data preparation pipeline and how to set up and run Kriging experiments.
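For reference, the Dice coefficient mentioned above measures the overlap between a predicted and a true binary presence mask. The sketch below is illustrative only; the actual implementation in `analysis/utils.py` may differ in thresholding, smoothing, or batch reduction.

```python
# Illustrative Dice coefficient for binary presence grids (a sketch; the
# implementation in analysis/utils.py may differ).
import torch

def dice_coefficient(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice overlap between predicted probabilities and a binary target mask."""
    pred = (pred > 0.5).float()   # threshold probabilities to a hard mask
    target = target.float()
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```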
AppendixG.nb: A Mathematica notebook containing derivations relevant to Appendix G and directions for future work described in the paper.
This file specifies the conda environment required to run the code. You can create the dataAnalysis environment using the following command:
    conda env create -f environment.yml

- First, ensure you have created the conda environment and activated it via:

      conda activate dataAnalysis

- We use Weights & Biases for metrics dashboards and logging. Create a `wandb` project named `M3_review` using `wandb login` to set up logging for training runs (a minimal logging sketch follows this list).
- Run all commands below from the highest-level directory in this collection.
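For orientation, logging to the `M3_review` project follows the standard Weights & Biases pattern sketched below; the exact calls inside the training scripts may differ. Passing `mode="disabled"` to `wandb.init` is one way to turn logging off entirely if you only want to verify that the code runs.

```python
# Minimal wandb logging sketch (illustrative; the training scripts' calls may differ).
import wandb

run = wandb.init(project="M3_review")     # use mode="disabled" to skip logging entirely
for step in range(3):
    wandb.log({"loss": 1.0 / (step + 1)})  # dummy metric, just to show the pattern
run.finish()
```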
Additionally, we have provided the necessary files to test the run scripts and code for the main analysis. Optionally, if you would like to run the data scraping and aggregation pipeline, you will need to download the following datasets:
- Unzip the MRDS dataset provided under `data/` and ensure `mrds.csv` is directly under `data/`, e.g.:

      cd data/
      tar -xvf mrds-csv.zip

- Download the USGS Quaternary Faults & Folds database:
      cd data/
      curl -O https://earthquake.usgs.gov/static/lfs/nshm/qfaults/Qfaults_GIS.zip
      tar -xvf Qfaults_GIS.zip
      cp SHP/Qfaults_US_Database.* .
      rm -rf GDB SHP Symbology Qfaults_GIS.zip

- Download the USGS GMNA shapefiles:
      cd data/
      curl -O https://ngmdb.usgs.gov/gmna/gis_files/gmna_shapefiles.zip
      tar -xvf gmna_shapefiles.zip
      cp GMNA_SHAPES/Geologic_units.* .
      rm -rf gmna_shapefiles.zip GMNA_SHAPES

- If you have an account at NASA EarthData, you can download the National Map elevation data (CAUTION: >20GB in size!) as follows:
      cd data/elevation_data
      sh download_elevations.sh
      unzip '*.zip'
      rm -rf *.hgt.zip

- Once you have unzipped all downloads, pulled the necessary files, and deleted the rest, the data generation scripts will be available to you.
Run the scripts in the data_generation/ directory in the following order:
    python data_generation/prepareDataTiling.py
    python data_generation/createElevation_DEC.py
    python data_generation/createFaults_DEC.py
    python data_generation/createGEOAGE_DEC.py
    python data_generation/createRocktype_DEC.py
The analysis/train.py script is used to train the deep learning models. It accepts a variety of command-line arguments to control the training process. For example, to train a U-Net model, you can run:
    python analysis/train.py --model_type u --unetArch resnet152 --use_minerals --use_geophys --batch_size 5

Check the argparse flags in `analysis/train.py` (e.g. via `python analysis/train.py --help`) for additional options.
For more advanced use cases, such as running hyperparameter sweeps on a Slurm cluster, you can use the `analysis/runTrain.py` script. You may need to modify this script to fit your cluster environment. By default it is set to run the auxiliary-data sweep performed in the paper; see the boolean flags at the top of the script for the other sweeps. A hypothetical illustration of the sweep pattern is sketched below.
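As a rough illustration only, the kind of sweep orchestration `analysis/runTrain.py` performs can be pictured as follows. Every sweep axis and value here is a hypothetical placeholder rather than the repository's actual configuration, and the Slurm submission step is elided.

```python
# Hypothetical sketch of a hyperparameter sweep driver (NOT the repository's
# actual sweep configuration; the axes and values below are placeholders).
import itertools
import subprocess

unet_archs = ["resnet152", "resnet50"]   # hypothetical sweep axis
batch_sizes = [5, 10]                    # hypothetical sweep axis

for arch, bs in itertools.product(unet_archs, batch_sizes):
    cmd = [
        "python", "analysis/train.py",
        "--model_type", "u",
        "--unetArch", arch,
        "--batch_size", str(bs),
    ]
    # On a Slurm cluster this command would typically be wrapped in an sbatch
    # submission; here it is run directly for illustration.
    subprocess.run(cmd, check=True)
```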
Logs from wandb will be stored under project='M3_review'. If you do not wish to view metrics and simply want to check that the code runs, comment out all wandb commands in these scripts and everything will run without issue.
The analysis/run_kriging.ipynb notebook provides a step-by-step guide to running the Kriging models. Open the notebook in a Jupyter environment and execute the cells to prepare the data and run the Kriging experiments. For rapid testing, the number of epochs has been set to 2 and the number of inducing points to 16. To change the sweep parameters, e.g. to set them to the values used in the full analysis, edit the create_args() function at the bottom of analysis/kriging_models.py; the relevant values are commented out in that method.
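To make the Kriging baseline concrete, the sketch below sets up a single-task variational GP classifier in `gpytorch` with 16 inducing points, matching the quick-test setting. It is a simplification under assumed interfaces: the repository's `MTGP` and `MultitaskBernoulliLikelihood` handle multiple mineral types jointly and may be structured quite differently.

```python
# Simplified single-task sketch of a variational GP classifier with 16 inducing
# points (the repository's MTGP / MultitaskBernoulliLikelihood are multitask
# and may differ substantially).
import torch
import gpytorch

class SketchGPClassifier(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        # Latent GP over inputs; a Bernoulli likelihood squashes it to presence probabilities.
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

inducing = torch.rand(16, 2)   # 16 inducing points in 2D (e.g. normalized coordinates)
model = SketchGPClassifier(inducing)
likelihood = gpytorch.likelihoods.BernoulliLikelihood()
```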