
Conversation

Contributor @claire-simpson commented Dec 10, 2025

Description

Implements LULC-based option (scenario 2) in the Urban Mental Health model and unifies NDVI preprocessing across scenarios.

Adds workflow for LULC reclassification to NDVI:

  • build the LULC-to-NDVI mapping from the attribute table, or, if no ndvi column is present, calculate mean NDVI (from ndvi_base) by lucode
  • auto-masking: map excluded LULC classes to NODATA
  • write lulc_to_ndvi_map.csv to store the final mapping used
  • reclassify LULC to mean NDVI
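The reclassification step above can be sketched with plain numpy (the model itself would use a raster-based reclassify such as pygeoprocessing's; the class codes, NDVI values, and nodata value here are invented for illustration):

```python
import numpy

FLOAT32_NODATA = float(numpy.finfo(numpy.float32).min)  # assumed nodata value

# Hypothetical LULC-to-NDVI value map; excluded class 7 maps straight to nodata
value_map = {1: 0.2, 2: 0.5, 7: FLOAT32_NODATA}

lulc = numpy.array([[1, 2], [7, 2]])
# Reclassify each pixel through the mapping
ndvi = numpy.vectorize(value_map.get)(lulc).astype(numpy.float32)
```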

Other changes:

  • Refactor preprocessing so LULC and NDVI scenarios share a common align --> mask (and reclassify if lulc) --> buffer-mean --> delta_NDVI pipeline.
  • Raise a validation error for option 2 when the lulc_attr_csv has no ndvi column and base_ndvi is also not provided
  • Add unit tests

Fixes #2142

Checklist

  • Updated HISTORY.rst and link to any relevant issue (if these changes are user-facing)
  • Updated the user's guide (if needed)
  • Tested the Workbench UI (if relevant)

task_name="calculate delta ndvi" # change in nature exposure
buffer_alt_dependencies.append(mask_alt_ndvi_task)

kernel_task = task_graph.add_task(
Contributor Author (@claire-simpson):

I'd recommend hiding whitespace when reviewing this, as most (all?) of the changes between kernel_task and zonal stats are just unindenting

@claire-simpson marked this pull request as ready for review December 12, 2025 22:48
Member @emilyanndavis left a comment:

Looks good!

id="lulc_to_ndvi_csv",
path="intermediate/lulc_to_ndvi_map.csv",
about=gettext(
"Table giving mean NDVI by LULC codes, with excluded LULC "
Member:

Suggested change:
- "Table giving mean NDVI by LULC codes, with excluded LULC "
+ "Table giving mean NDVI by LULC code, with excluded LULC "

Comment on lines 1079 to 1088
unique_lucodes, inverse_indices = numpy.unique(
    masked_lulc.astype(numpy.int64), return_inverse=True)

sums = numpy.bincount(inverse_indices,
                      weights=masked_ndvi.astype(numpy.float32))
counts = numpy.bincount(inverse_indices)
means = sums / counts

mean_ndvi_by_lulc_dict = {lucode: mean for lucode, mean in zip(
    unique_lucodes, means)}
Member:

Wow, very clever solution! I've never used bincount before, so I had to go to the docs and puzzle over this for a bit before I could grasp exactly what's going on here. Not sure if it would be worth adding a comment to help explain it up front? On the one hand, a gloss might have helped me (though I probably would have looked it up in the docs as well); on the other hand, it already seems pretty clear for people who (unlike me) are fluent in numpy. Up to you. Just thought I'd mention it on the off-chance you were already considering adding a comment here.

Contributor Author (@claire-simpson):

Yes, great idea! It is definitely not completely intuitive so I'll add several comments
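For anyone else puzzling over it, the grouping trick can be illustrated on a tiny made-up array:

```python
import numpy

lulc = numpy.array([1, 2, 2, 5])
ndvi = numpy.array([0.1, 0.4, 0.6, 0.8])

# return_inverse maps each pixel to its index in the unique-code array:
# unique_lucodes = [1, 2, 5], inverse_indices = [0, 1, 1, 2]
unique_lucodes, inverse_indices = numpy.unique(lulc, return_inverse=True)

# bincount with weights sums NDVI per group; without weights it counts pixels
sums = numpy.bincount(inverse_indices, weights=ndvi)  # [0.1, 1.0, 0.8]
counts = numpy.bincount(inverse_indices)              # [1, 2, 1]
means = sums / counts                                 # [0.1, 0.5, 0.8]
```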

Member @dcdenu4 left a comment:

Hey @claire-simpson , I didn't make it all the way through but wanted to provide some initial feedback in a timely manner!

for input_raster, resample_method in raster_to_method_dict.items():
    if args[input_raster]:
        input_align_list.append(args[input_raster])
        output_align_list.append(file_registry[input_raster+'_aligned'])
Member:

We've generally adopted f-strings for string concatenation. I think it's technically "faster" and can be more readable than using the plus operator.
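For example (hypothetical key, mirroring the snippet above):

```python
input_raster = 'ndvi_base'  # hypothetical registry key for illustration

# plus-operator concatenation, as in the current code
key_plus = input_raster + '_aligned'

# f-string equivalent
key_f = f'{input_raster}_aligned'
```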

args=(args['lulc_attr_csv'],
      file_registry['lulc_to_ndvi_csv'],
      file_registry['lulc_base_aligned'],  # Note: won't get used if `ndvi` column in `lulc_attr_csv`
      file_registry['ndvi_base_aligned']  # TODO: will this get set to None if user doesn't input ndvi_base?
Member:

Is this still a TODO to work through for this PR or later?

Contributor Author (@claire-simpson):

Oops no, I just forgot to delete this. I determined that for args, if the user doesn't enter an optional input, it gets set to None, but the value of file_registry[key] is the file path regardless of whether it's actually used/created (so, not set to None).

for lu, ndvi, exclude in zip(codes, ndvi_means, excludes):
    if bool(exclude):
        value_map[lu] = FLOAT32_NODATA
    elif numpy.isfinite(lu):
Member:

What happens if this fails and there is a positive infinity or NaN value? Given the ModelSpec validation for this should we be worried about it here?

Contributor Author (@claire-simpson):

This is probably unnecessary. I think I added the isfinite check because I was finding that pandas was reading in an empty line at the end of the LULC CSV, but I couldn't figure out why at the time, and now I can't reproduce the behavior.
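If it ever resurfaces, a minimal reproduction and guard might look like this (the CSV content is invented; note that a trailing line containing only a comma, unlike a truly blank line, is not skipped by read_csv):

```python
import io
import pandas

# A trailing line of just "," parses as an all-NaN row
csv_text = "lucode,ndvi\n1,0.2\n2,0.5\n,\n"
df = pandas.read_csv(io.StringIO(csv_text))
assert len(df) == 3  # the all-NaN row is present

# Dropping all-NaN rows guards against it without a per-value isfinite check
df = df.dropna(how='all')
```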

elif base_ndvi_path:
    LOGGER.info("Using NDVI raster to calculate mean NDVI by LULC class.")
    lulc_dict = {lu: ex for lu, ex in zip(codes, excludes)
                 if numpy.isfinite(lu)}
Member:

This is used somewhere else too, but is it necessary to check numpy.isfinite here?

Contributor Author (@claire-simpson):

Same as above - I think I added the isfinite check because I was finding that pandas was reading in an empty line at the end of the lulc csv.

Dict containing the mean NDVI values for each LULC class
"""
# TODO: iterblocks?
Member:

Yep, we'll want to handle this a little differently to avoid loading the entire raster into memory with raster_to_numpy_array. A common way is using iterblocks and keeping a running average as you iterate. Habitat Quality does something like this where it counts the number of unique pixel values. But it could be altered to keep a running average too.

def _raster_pixel_count(raster_path_band):
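A sketch of the running-accumulation idea, with a plain iterable of in-memory (lulc, ndvi) block pairs standing in for what pygeoprocessing.iterblocks would yield (nodata handling omitted for brevity):

```python
import numpy

def mean_ndvi_by_lucode(blocks):
    """Accumulate running per-lucode sums/counts across raster blocks.

    `blocks` is any iterable of (lulc_block, ndvi_block) array pairs.
    """
    sums, counts = {}, {}
    for lulc_block, ndvi_block in blocks:
        unique_lucodes, inverse = numpy.unique(
            lulc_block, return_inverse=True)
        # Per-block grouped sums/counts via the bincount trick
        block_sums = numpy.bincount(
            inverse.ravel(), weights=ndvi_block.ravel())
        block_counts = numpy.bincount(inverse.ravel())
        # Fold this block into the running totals
        for lucode, block_sum, count in zip(
                unique_lucodes, block_sums, block_counts):
            sums[lucode] = sums.get(lucode, 0.0) + block_sum
            counts[lucode] = counts.get(lucode, 0) + count
    return {lucode: sums[lucode] / counts[lucode] for lucode in sums}
```

In the model this loop would be driven by pygeoprocessing.iterblocks over the aligned rasters, so only one block is ever in memory at a time.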

Contributor Author (@claire-simpson):

Ok thanks I've implemented this!

data_type=int,
units=None,
required="scenario=='lulc'",
# Allow lulc_alt for masking if using scenario 3
Member:

Are the scenarios numbered or would it be helpful to use the name descriptor instead?

Member @dcdenu4 left a comment:

Hey @claire-simpson ! I took another pass and had some comments and suggestions.

func=build_lulc_ndvi_table,
args=(args['lulc_attr_csv'],
      file_registry['lulc_to_ndvi_csv'],
      file_registry['lulc_base_aligned'], # Note: won't get used if `ndvi` column in `lulc_attr_csv``
Member:

Suggested change:
- file_registry['lulc_base_aligned'], # Note: won't get used if `ndvi` column in `lulc_attr_csv``
+ file_registry['lulc_base_aligned'], # Note: won't get used if `ndvi` column in `lulc_attr_csv`

args=(file_registry['lulc_base_aligned'],
      file_registry['lulc_to_ndvi_csv'],
      file_registry['ndvi_base_aligned_masked']),
# this outputs raster to use as new NDVI to use going forward
Member:

This comment's wording is a bit unclear, and it may be unnecessary?

file_registry['ndvi_base_aligned_masked'])["datatype"],
'target_nodata': pygeoprocessing.get_raster_info(
file_registry['ndvi_base_aligned_masked'])["nodata"][0]},
dependent_task_list=buffer_base_dependencies+[kernel_task],
Member:

Just a nitpick that this list concatenation has no spaces while line 794 has a space :)

None
"""
lulc_df = pandas.read_csv(lulc_attr_table)
Member:

I think there are some benefits to opening input tables using the MODEL_SPEC.get_input function. Here's an example from the carbon model.

carbon_pool_df = MODEL_SPEC.get_input(

Member:

This also might guard against the behavior you were seeing with trailing empty lines, etc!


create_lulc_ndvi_csv_task = task_graph.add_task(
    func=build_lulc_ndvi_table,
    args=(args['lulc_attr_csv'],
Member:

I commented in the function as well, but the MODEL_SPEC has some handy get_input methods that allow us to open CSV tables in a consistent manner. I gave an example from the carbon model below.

# return_inverse returns, for each element, its index into the unique array
# e.g., for lulc array [1, 2, 2, 5]: unique_lucodes = [1, 2, 5]
# and inverse_indices = [0, 1, 1, 2]
unique_lucodes, inverse_indices = numpy.unique(
Member:

numpy.unique also has a return_counts parameter that might be a little more straightforward than using bincount?
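For example, return_counts could replace the separate bincount for the counts (tiny made-up arrays):

```python
import numpy

lulc = numpy.array([1, 2, 2, 5, 2, 1])
ndvi = numpy.array([0.1, 0.4, 0.6, 0.9, 0.5, 0.3])

# counts comes straight from numpy.unique; bincount is still needed for sums
codes, inverse, counts = numpy.unique(
    lulc, return_inverse=True, return_counts=True)
sums = numpy.bincount(inverse, weights=ndvi)
means = sums / counts
```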

block_counts = numpy.bincount(inverse_indices)

# accumulate into global sums/counts
for lucode, sum, c in zip(unique_lucodes, block_sums, block_counts):
Member:

I think the bincount solution is pretty neat and not one that I would have thought of. For the sake of sharing ideas, since we ultimately need to iterate over the unique codes, another solution would be kind of our typical one:

for lucode in unique_lucodes:
    if lucode == nodata:
        continue    
    ndvi_sum = numpy.sum(ndvi[lulc==lucode])
    sum[lucode] = ndvi_sum
    etc...

Member:

One positive of this approach is that we might be able to reduce some of the above code that flattens the array and creates the nodata masks?

Member:

Although checking nodata blocks and continuing if there's no relevant data seems like a good thing to keep!

Contributor Author (@claire-simpson):

Using return_counts in np.unique is a good idea! However, I do think iterating over the unique codes would be slower because, for each lucode, you're scanning the whole block to build the lucode == nodata mask and compute the sum. So the tradeoff is the current approach being faster versus your proposed approach being more readable. What do you think?

target_nodata = FLOAT32_NODATA

# Create lucode: ndvi dict
lulc_df = pandas.read_csv(mean_ndvi_by_lulc_csv, index_col='lucode')
Member:

For non-model-spec input CSVs, there is a utils function for opening CSVs that does some sanitization. It could be helpful even though this CSV is curated by the model.

def read_csv_to_dataframe(path, **kwargs):

# raise error if user enters lulc_attr_csv without 'ndvi' column and also
# doesn't provide base_ndvi raster
if args['scenario'] == 'lulc' and args.get('lulc_attr_csv'):
    lulc_df = pandas.read_csv(args['lulc_attr_csv'])
Member:

Another use case where using the MODEL_SPEC helper functions to open this could be nice.

@claire-simpson requested a review from dcdenu4 January 9, 2026 22:08