This repository contains code to index a directory of HAND outputs in a way that allows for efficient spatial querying. One way to think of a HAND index is as a set of lookup tables where each entry is linked to a branch-level catchment polygon. This allows us to flexibly group HAND outputs according to the spatial relationship of a HAND catchment polygon to a region of interest (ROI) instead of being constrained by the directory structure of a given HAND output. One immediate benefit of this approach is that a HAND index allows us to operate over HAND outputs at a variety of HUC scales without having to think about the HUC scale that the HAND outputs were produced at.
The main script that creates a HAND index is called load.py. This script uses DuckDB to create four parquet files that are intended to be queried client-side. To create the parquet files, data is first loaded into a local DuckDB database that conforms to the schema in the schema directory. After this load step, parquet files that reflect the database contents are written either to S3 or to a local directory. The benefit of this approach is that it allows for distributed, performant access to the HAND index from object storage without the need for a database server.
Currently there are four tables in the HAND index schema:
Catchments
Hydrotables
HAND_REM_Rasters
HAND_Catchment_Rasters
The Catchments table has columns that record: 1. a UUID that is generated by load.py and assigned to each catchment, 2. the HAND version that generated that catchment's data, 3. a geometry column that contains the catchment polygon, 4. a spatial index generated using the H3 tiling system, and 5. a path to the directory that contains the HAND outputs for that catchment/branch. The spatial index is used to partition the parquet files generated from this table so that DuckDB can efficiently query the index.
The Hydrotables table contains two columns: a UUID column that references the UUID column in the Catchments table, and a path to the branch-level uncalibrated hydrotable CSV file that is relevant to that catchment.
The HAND_REM_Rasters table contains two columns: a UUID column that references the UUID column in the Catchments table, and a path to the REM raster for that catchment.
The HAND_Catchment_Rasters table contains two columns: a UUID column that references the UUID column in the Catchments table, and a path to the pixel-mapped HydroID raster for that catchment.
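To make the relationship between the four tables concrete, here is a minimal in-memory sketch in Python. It is not the schema shipped in the schema directory: the class names, field names, bucket, and file names below are all hypothetical, chosen only to mirror the columns described above.

```python
from dataclasses import dataclass
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Catchment:
    catchment_id: UUID   # generated by load.py in the real index
    hand_version: str    # HAND version that produced the data
    geometry_wkt: str    # catchment polygon (WKT used here only for illustration)
    h3_index: str        # spatial index value used for partitioning
    branch_path: str     # directory holding this branch's HAND outputs


@dataclass(frozen=True)
class AssetPath:
    catchment_id: UUID   # references Catchment.catchment_id
    path: str            # location of one indexed HAND output


def assets_for(catchment_id, hydrotables, rem_rasters, catchment_rasters):
    """Collect every indexed path that belongs to one catchment."""
    def pick(rows):
        return [row.path for row in rows if row.catchment_id == catchment_id]

    return {
        "hydrotables": pick(hydrotables),
        "rem_rasters": pick(rem_rasters),
        "catchment_rasters": pick(catchment_rasters),
    }


# Hypothetical rows: the bucket, directory, and file names are placeholders.
branch_dir = "s3://example-bucket/hand/example_huc/branches/0"
catchment = Catchment(
    uuid4(), "fim100", "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
    "example_h3_cell", branch_dir,
)
hydrotables = [AssetPath(catchment.catchment_id, f"{branch_dir}/hydrotable.csv")]
rem_rasters = [AssetPath(catchment.catchment_id, f"{branch_dir}/rem.tif")]
catchment_rasters = [AssetPath(catchment.catchment_id, f"{branch_dir}/catchments.tif")]

paths = assets_for(catchment.catchment_id, hydrotables, rem_rasters, catchment_rasters)
```

The point of the sketch is that a spatial query only needs to touch the Catchments table; once the matching UUIDs are known, the other three tables are plain key lookups.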
These data assets were the first to be included in the index because they are the HAND outputs necessary to inundate a HAND REM to produce an extent or depth FIM to be used by NGWPC's autoeval functionality. There are currently plans to add an additional table to index calibrated hydrotables. As more functionality is added to autoeval-coordinator and autoeval-jobs, more HAND outputs could also be added to the index. With sufficient modification to the schema and load.py, these outputs could be at either the branch or the HUC level.
To create a HAND index you would first build a docker image using:
docker build -t hand-index:latest .
Then you should create a .env file with AWS credentials that can read from the bucket where your HAND outputs are stored and write to the bucket where the parquet files will be saved. This will look something like:
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_SESSION_TOKEN=your_session_token_if_using_temp_credentials
AWS_DEFAULT_REGION=us-east-1
Note: make sure your AWS credentials aren't quoted! This will lead to credential errors.
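If you want to catch this before starting a container, a small pre-flight check along the following lines may help. It is only a sketch and is not part of this repository; the function name find_quoted_values is hypothetical. It flags any KEY=VALUE line whose value is wrapped in matching single or double quotes.

```python
def find_quoted_values(env_text):
    """Return the names of variables whose values are wrapped in quotes."""
    quoted = []
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        value = value.strip()
        # A value counts as quoted when it starts and ends with the same quote.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            quoted.append(key.strip())
    return quoted


# Placeholder credentials for illustration only.
sample_env = (
    'AWS_ACCESS_KEY_ID="your_access_key"\n'
    "AWS_SECRET_ACCESS_KEY=your_secret_key\n"
    "AWS_DEFAULT_REGION='us-east-1'\n"
)
problems = find_quoted_values(sample_env)
```

To check a real file, you could pass the contents of .env to the function, for example with `find_quoted_values(open(".env").read())`, and fix any variable it reports.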
Then you would run load.py using the built docker image with the following command from the repository root:
docker run -v $(pwd)/data:/data \
-v $(pwd)/schema:/schema \
--env-file .env \
hand-index:latest \
python load.py \
--db-path /data/database.ddb \
--schema-path /schema/hand-index-ver-fim100.sql \
--hand-dir s3://fimc-data/hand_fim/outputs/output_directory/ \
--hand-version fim100 \
--h3-resolution 1 \
--output-dir s3://fimc-data/autoeval/hand_output_indices/output_index_name/ \
--batch-size 20
This should initiate the index creation script. Upon successful index creation you should see the message Workflow Complete printed to the terminal.
The key arguments in load.py are:
--db-path: Path to the intermediate DuckDB database file (mounted inside the container).
--schema-path: Path to the schema to use for the HAND version being indexed.
--hand-dir: S3 path to the HAND data source. Local path support will be added in the future.
--hand-version: HAND version identifier. This will be recorded in the Catchments table.
--output-dir: S3 or local path for the parquet output.
--h3-resolution: Optional. The resolution of the spatial index used to partition the Catchments table when it is written to parquet files. Higher resolutions will result in more partitioned Catchments parquet files.
--skip-load: Optional. Add this argument if the database already has data and you only want to export the tables to parquet.
--batch-size: Optional. The number of catchment directories processed at once during the creation of the Catchments table.
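To get a feel for how quickly the number of partitions can grow with --h3-resolution, the sketch below computes the total number of H3 cells that exist globally at a given resolution, using the standard H3 cell count of 2 + 120 * 7^r. Assuming load.py writes one partition per distinct H3 cell value (an assumption, not something confirmed here), this is an upper bound: only cells that actually contain catchments would produce files. The function name h3_cell_count is hypothetical.

```python
def h3_cell_count(resolution):
    """Total number of H3 cells on the globe at the given resolution (0-15)."""
    if not 0 <= resolution <= 15:
        raise ValueError("H3 resolutions range from 0 to 15")
    # 122 base cells at resolution 0; each finer level multiplies the
    # hexagon count by 7 while the 12 pentagons stay fixed.
    return 2 + 120 * 7 ** resolution


counts = {res: h3_cell_count(res) for res in range(4)}
```

At resolution 1, the value used in the example command above, there are 842 cells globally, so the partition count stays modest; each additional level multiplies the ceiling by roughly seven.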
Once a HAND index is created, it can be read by any library that knows how to work with parquet files. Because we partition the Catchments table using DuckDB's hive partitioning, we use DuckDB to perform client-side queries of HAND indexes. An example of querying a HAND index is provided in query_geojson.py. The logic in this script is very similar to that used by the code in the autoeval-coordinator repository.
To run this query script you would use the same Docker image and .env file and the following command:
docker run -v $(pwd)/data:/data \
-v $(pwd)/queries:/queries \
-v $(pwd)/output:/output \
--env-file .env \
hand-index:latest \
python query_geojson.py \
--geojson /queries/query_polygon.geojson \
--partitioned-path s3://fimc-data/autoeval/hand_output_indices/output_index_name/ \
--threshold 10.0 \
--outdir /output/catchment_results
The arguments for query_geojson.py are:
--geojson: Path to a GeoJSON-formatted polygon that is used as the ROI over which to aggregate HAND data.
--partitioned-path: Path to the directory where the parquet tables that comprise the HAND index are stored.
--threshold: An "overlap threshold" that determines how aggressively catchments are excluded from the final data returned by a query. If the overlap threshold is very low, then even catchments that share only a small percentage of their area with the ROI are included in the query results. If the overlap threshold is high, then catchments need to be contained within the ROI or share a large amount of area with it to be included in the results.
--outdir: Directory where the results from the query will be stored.
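The effect of --threshold can be illustrated with a small standalone sketch. It makes two assumptions that are not confirmed by query_geojson.py: that the threshold is a percentage of each catchment's own area, and that a catchment is kept when its overlap is at or above the threshold. To stay dependency-free it uses axis-aligned boxes as stand-ins for real catchment polygons; all names are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def area(box):
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box.xmax - box.xmin) * max(0.0, box.ymax - box.ymin)


def overlap_percent(catchment, roi):
    """Percentage of the catchment's area that falls inside the ROI."""
    intersection = Box(
        max(catchment.xmin, roi.xmin), max(catchment.ymin, roi.ymin),
        min(catchment.xmax, roi.xmax), min(catchment.ymax, roi.ymax),
    )
    catchment_area = area(catchment)
    if catchment_area == 0:
        return 0.0
    return 100.0 * area(intersection) / catchment_area


def select_catchments(catchments, roi, threshold):
    """Keep the names of catchments whose overlap meets the threshold."""
    return [name for name, box in catchments.items()
            if overlap_percent(box, roi) >= threshold]


roi = Box(0, 0, 10, 10)
catchments = {
    "inside": Box(2, 2, 4, 4),        # fully contained in the ROI
    "edge": Box(8, 0, 18, 10),        # 20% of its area is inside the ROI
    "sliver": Box(9.5, 0, 19.5, 10),  # 5% of its area is inside the ROI
    "outside": Box(20, 20, 30, 30),   # no overlap at all
}
loose = select_catchments(catchments, roi, threshold=10.0)
strict = select_catchments(catchments, roi, threshold=50.0)
```

With a threshold of 10.0 the "inside" and "edge" catchments are kept and the "sliver" is dropped; raising the threshold to 50.0 keeps only the fully contained catchment.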
Once again this is just an example script illustrating how to work with the outputs of load.py. Currently the main use case for a HAND index is for the query performed by the code in the autoeval-coordinator repository.
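For readers who want to write their own client-side query, the sketch below builds a DuckDB SQL statement that reads hive-partitioned parquet files and filters on the partition column. It relies on several assumptions that should be checked against an actual index: that the Catchments files live under a catchments/ subdirectory with one level of partition directories, and that the partition column is named h3_index. The function name and the cell values are placeholders. Only read_parquet and its hive_partitioning option are real DuckDB features here.

```python
def build_catchments_query(partitioned_path, h3_cells):
    """Build SQL that reads hive-partitioned Catchments parquet files."""
    if not h3_cells:
        raise ValueError("at least one H3 cell is required")
    base = partitioned_path.rstrip("/")
    # ASSUMPTION: layout is <index>/catchments/<partition dir>/*.parquet
    glob = f"{base}/catchments/*/*.parquet"
    # Quote each cell value, doubling any embedded single quotes.
    cell_list = ", ".join("'" + cell.replace("'", "''") + "'" for cell in h3_cells)
    # ASSUMPTION: the hive partition column is called h3_index.
    return (
        f"SELECT * FROM read_parquet('{glob}', hive_partitioning = true) "
        f"WHERE h3_index IN ({cell_list})"
    )


sql = build_catchments_query(
    "s3://fimc-data/autoeval/hand_output_indices/output_index_name/",
    ["cell_a", "cell_b"],  # placeholder cell identifiers
)
```

With the duckdb Python package installed, the statement could be run with `duckdb.connect().execute(sql).fetchall()`. Reading from S3 would additionally require DuckDB's httpfs extension and valid credentials. Filtering on the partition column lets DuckDB skip partition directories that cannot match, which is the reason the Catchments table is partitioned in the first place.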
Currently this repository is designed to work with HAND versions that write their outputs following the HUCS -> branch pattern and that can be inundated using the data indexed by the four tables listed above. In the future, if the output format or output data from HAND changes, then a new schema and a new version of load.py would be created. In that case there would likely be different versions of load.py depending on which schema is being used.