Conversation

@Samia35-2973 Samia35-2973 commented Jul 13, 2025

This system combines DeepForest object detection with SmolLM3-3B and Qwen2.5-VL-3B to provide ecological image analysis. It orchestrates four agents (Memory, DeepForest Detector, Visual, and Ecology) that run sequentially to maintain context from the conversation history, detect birds, trees, livestock, and alive/dead trees in the image, analyze both the original and the annotated image for validation, and synthesize these data into ecological insights for the user.

See the README for detailed workflow diagrams and technical specifications.

Usage

1. Create and activate a Conda environment

conda create -n deepforest_agent python=3.12.11
conda activate deepforest_agent

2. Install dependencies

pip install -r requirements.txt
pip install -e .

3. Configure the HuggingFace token

Create a .env file in the root directory of the deepforest-agent project and add your HuggingFace token as shown below:

HF_TOKEN="your_huggingface_token_here"

You can obtain a token from the HuggingFace Access Tokens page. Make sure the token type is "Write".
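
For reference, a minimal sketch of how a token in .env can be read at runtime, assuming python-dotenv is used (the names here are illustrative; the actual loading code in deepforest_agent may differ):

import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads HF_TOKEN from the .env file in the working directory
hf_token = os.environ.get("HF_TOKEN")
if not hf_token:
    raise RuntimeError("HF_TOKEN not found; check your .env file")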

4. Run the interface

The DeepForest Agent runs through a Gradio web interface. To start the interface, execute:

python -m deepforest_agent.app

A link like http://127.0.0.1:7860 will appear in the terminal. Open it in your browser to interact with the agent. A public Gradio link may also be provided if available.

Hardware Requirements

  • GPU: at least 24 GB of VRAM (recommended for optimal performance)
  • Storage: at least 16 GB of free space for model downloads

@Samia35-2973 (Author) commented:

I've just pushed the latest changes. Could you please take a look when you get a chance @henrykironde @jveitchmichaelis?
I've added the logs for review here: https://drive.google.com/drive/folders/1NJcDphxb_TfUB-ayHJGqPh5TvSVYgtp8?usp=drive_link
You can find the .log file in each image folder along with each turn's annotated image. The response titled "Ecology Agent" is always the final answer returned to the user in each turn of the log.

I've also updated the README to reflect the current workflow.

Could you also please review the following?

  • src/deepforest_agent/prompts/prompt_templates.py: Let me know if anything should be adjusted, particularly for handling hallucinations.
  • The updated README workflow section: We previously discussed adding a check before executing each agent. Do you think the current setup is sufficient, or should we handle this manually for now?

During the last meeting, I set thresh = 0.55 in src/deepforest_agent/conf/config.py, but the DeepForest detection data still looks the same as before.
I tried to test it here: https://colab.research.google.com/drive/1w0O1JvyKfOohi9GBuGQa5RyLfjhq7NqM?usp=sharing
You can check the "Param 'thresh' Not working in bird_predictions" section in that Colab notebook. Am I missing something? Should the threshold be handled in a different part of the code?

Also, a large image (a 172.2 MB .tif) causes a CUDA out-of-memory error during the Visual Agent's Qwen VL generation.
Here's the full error:
[Screenshot: CUDA out-of-memory traceback, captured 2025-08-04]

Is there any way we can handle it?

@jveitchmichaelis commented:

Yes, for big images you'd have to tile; that's how DeepForest and other remote sensing pipelines handle this. It's definitely something the interface should consider, because it's rare that we only work with a single small image. But you could also think about things like asking the model to look for images that have certain features, e.g. looping over a set of images + predictions and trying to answer questions.
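
For illustration, a minimal sketch of tiled prediction with DeepForest's predict_tile (the weights-loading call and parameters are assumptions and vary across DeepForest versions; this is not the project's actual configuration):

from deepforest import main

model = main.deepforest()
model.use_release()  # released tree-crown weights; the API name varies by version

# predict_tile splits a large raster into overlapping patches, runs the
# detector on each patch, and stitches the results into one DataFrame.
predictions = model.predict_tile(
    raster_path="large_orthomosaic.tif",  # hypothetical file name
    patch_size=400,
    patch_overlap=0.05,
)
print(predictions[["xmin", "ymin", "xmax", "ymax", "score", "label"]].head())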

About the scores - if you change the threshold on the (RetinaNet) model object directly, it should work. This is fixed/tested in the latest version of DeepForest though. If you have trouble you could threshold manually.
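
A minimal sketch of both options, assuming the usual DeepForest/torchvision attribute and column names (these may differ between versions):

from deepforest import main

model = main.deepforest()
model.use_release()  # released weights; the API name varies by version

# Option 1: set the score threshold on the underlying torchvision RetinaNet.
model.model.score_thresh = 0.55

# Option 2: threshold manually on the prediction DataFrame's "score" column.
predictions = model.predict_image(path="example.png")  # hypothetical file
predictions = predictions[predictions["score"] >= 0.55]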

@Samia35-2973 Samia35-2973 force-pushed the hf-integration branch 2 times, most recently from bd025e3 to 3f5146d on August 28, 2025 at 11:25
@Samia35-2973 (Author) commented:

This time I'm tiling the images and building a JSON. For the visual analysis agent, I'm tiling with patch size = 1000. If I use patch size = 400, it takes 35 seconds per tile to process and get a response from the visual agent. With a large TIFF, e.g. a 9626x6646 image:

  • A patch size of 400 creates 468 tiles, which takes 3+ hours total.
  • A patch size of 1000 creates 77 tiles, which takes about 1 hour.
  • A patch size of 1500 creates 35 tiles, which takes about 20–30 minutes.

I kept the DeepForest default patch size of 400 as usual.

For each tile there will be a visual analysis, detection data, and a summary.

I thought about distributing detections to tiles because the ecology agent needs to understand spatial relationships. When a user asks "how are birds distributed relative to trees in the north section of the image?", the agent needs to know which detections belong to which spatial areas. Without this mapping, the ecology agent would have all the detections but no way to associate them with the detailed visual analysis of specific image regions. Here's how I'm distributing the DeepForest detections to each tile (a minimal code sketch follows the list):

  • Take each detection from DeepForest (which uses patch size 400 for accuracy)
  • Check every tile to see if the detection's bounding box overlaps with the tile's coordinates
  • Calculate intersection - if a detection at coordinates (xmin, ymin, xmax, ymax) overlaps with a tile at (x, y, width, height), assign it to that tile
  • Mark boundary overlaps - if the detection extends beyond the tile boundaries, flag it as overlapping
  • Generate tile summary for each tile based on all assigned detections
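
A rough sketch of that assignment logic (the function and field names are illustrative, not the actual implementation):

def assign_detections_to_tiles(detections, tiles):
    """detections: list of dicts with xmin/ymin/xmax/ymax;
    tiles: list of dicts with x/y/width/height (pixel coordinates)."""
    for tile in tiles:
        tile["assigned_deepforest_detections"] = []
        tx0, ty0 = tile["x"], tile["y"]
        tx1, ty1 = tx0 + tile["width"], ty0 + tile["height"]
        for det in detections:
            # Intersection of the detection box with the tile extent.
            ix0, iy0 = max(det["xmin"], tx0), max(det["ymin"], ty0)
            ix1, iy1 = min(det["xmax"], tx1), min(det["ymax"], ty1)
            if ix0 < ix1 and iy0 < iy1:  # non-empty overlap
                # Flag detections that extend past the tile boundary.
                spans_boundary = (det["xmin"] < tx0 or det["ymin"] < ty0 or
                                  det["xmax"] > tx1 or det["ymax"] > ty1)
                tile["assigned_deepforest_detections"].append(
                    dict(det, overlaps_tile_boundary=spans_boundary))
    return tiles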

The final JSON would look like:

{
  "session_id": "string",
  "user_query": "string", 
  "image_info": {
    "image_size": [width, height],
    "image_mode": "string",
    "image_file_path": "string"
  },
  "image_quality": {
    "image_quality_for_deepforest": "Yes/No",
    "deepforest_objects_present": ["array of strings"],
    "resolution_info": "object"
  },
  "tiles": [{
    "tile_id": "integer",
    "coordinates": {"x": "int", "y": "int", "width": "int", "height": "int"},
    "metadata": {"patch_size": "int", "overlap": "float"},
    "visual_analysis": "string",
    "additional_objects": "array",
    "tool_call_N": {
      "cache_id": "string",
      "hit_miss": "hit/miss", 
      "tool_arguments": "object",
      "tile_detection_summary": "string"
    },
    "assigned_deepforest_detections": "array"
  }],
  "detection_summary_for_the_whole_image": "string",
  "ecology_response": "string",
  "is_complete": "boolean"
}

This approach might make the JSON quite large since we're adding detection data to every relevant tile. For images with many detections, this could make the context window huge for the ecology agent. Do you think we should only keep the tile_detection_summary in each tile and put all the raw DeepForest detections at the image level outside the tiles? This would keep the spatial summaries but avoid duplicating detection data across multiple tiles.

I'm creating the tile detection summary by analyzing all assigned detections for each tile and including (a small sketch follows this list):

  • Object counts by type (birds, trees, livestock)
  • Classification details (alive/dead trees when enabled)
  • Confidence score ranges: Low (0.0-0.3), Medium (0.3-0.7), High (0.7-1.0)
  • Boundary overlap information for objects spanning multiple tiles
  • Tool call parameters used for that detection
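
A small sketch of how a per-tile summary with those confidence buckets could be built (field names are illustrative, not the actual implementation):

from collections import Counter

def summarize_tile(detections):
    """detections: the tile's assigned detections (dicts with label and score keys)."""
    counts = Counter(det["label"] for det in detections)
    buckets = Counter(
        "Low" if det["score"] < 0.3 else "Medium" if det["score"] < 0.7 else "High"
        for det in detections)
    boundary = sum(det.get("overlaps_tile_boundary", False) for det in detections)
    return (f"Objects: {dict(counts)}; confidence: {dict(buckets)}; "
            f"{boundary} detection(s) span tile boundaries.")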

Since the tiling for visual analysis takes significant time, do you think we should have some logic to automatically choose patch sizes based on image dimensions? Also, do you think we should call the visual agent once at the very beginning, asking "what's happening in the image", and reuse that response, so that follow-up prompts don't run the visual agent again?
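
On the patch-size question, one possible heuristic, purely as a sketch (the target tile count and clamping range are made-up values based on the timings above):

import math

def choose_patch_size(width, height, target_tiles=80):
    """Pick a patch size so a width x height image yields roughly target_tiles tiles."""
    patch = int(math.sqrt((width * height) / target_tiles))
    return max(400, min(1500, patch))  # clamp to the range discussed above

# choose_patch_size(9626, 6646) -> 894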

After implementing this tiling approach, the visual agent runs properly and the JSON structure we're passing works much better than before. For the ecology analysis agent, I'm now using https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct which is answering much better. I'm attaching the log files here https://drive.google.com/drive/folders/1HZiBkCy2W_D9JkSS78yAsoQhr9jqJsvy?usp=drive_link.

@jveitchmichaelis jveitchmichaelis commented Aug 28, 2025

I think we need an option to run the analysis first, e.g. allow the agent to process the results from an existing DeepForest run, since it's not really feasible for users to wait hours for the predictions to run, especially if they want to re-run. This could still be launched via the LLM, i.e. it could suggest the parameters to run based on the current system + data.

Then the Q/A model can load in the results. The file format is something we've been discussing recently because raw JSON isn't necessarily the best for querying if you have an enormous file - you need to index it. So it might be worth thinking about how we'd approach that, some sort of geospatial database?

As for the question of context, I'm less sure - I think the model needs to somehow be able to review at a high level and then zoom in or focus on the areas of the image that are relevant?

You could use a structure like an rtree if you're not already, which can do efficient spatial queries. Or geopandas might have some capability if you retain geographic info in the outputs?
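
A minimal sketch of an R-tree over the detections using the rtree package (geopandas' .sindex offers similar queries if geometry is retained; the detection values below are illustrative):

from rtree import index

# Illustrative detections; in practice these would come from the DeepForest output.
detections = [
    {"xmin": 120, "ymin": 80, "xmax": 160, "ymax": 115, "label": "Bird"},
    {"xmin": 4000, "ymin": 5000, "xmax": 4300, "ymax": 5300, "label": "Tree"},
]

idx = index.Index()
for i, det in enumerate(detections):
    idx.insert(i, (det["xmin"], det["ymin"], det["xmax"], det["ymax"]))

# e.g. everything intersecting the top ("north") half of a 9626x6646 image
north = (0, 0, 9626, 6646 / 2)
hits = [detections[i] for i in idx.intersection(north)]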

@Samia35-2973 Samia35-2973 force-pushed the hf-integration branch 2 times, most recently from 3f94b93 to 5f52f94 on September 7, 2025 at 12:38