Adding DeepForest multi-agent with HuggingFace models implementation #3
base: main
Conversation
9a9b819 to 2ea0460 (Compare)
2ea0460 to e2dbb74 (Compare)
I’ve just pushed the latest changes. Could you please take a look when you get a chance, @henrykironde @jveitchmichaelis? I've also updated the README to reflect the current workflow. Could you also review the following?
During the last meeting, I set thresh = 0.55 in src/deepforest_agent/conf/config.py, but the DeepForest detection data still looks the same as before. Also, a large image (a 172.2 MB .tif) causes a CUDA out-of-memory error during the Visual Agent's Qwen VL generation. Is there any way we can handle it?
Yes, big images you'd have to tile; that's how DeepForest and other remote sensing pipelines handle this. This is definitely something the interface should consider, because it's rare that we only work with a single small image. You could also think about things like asking the model to look for images that have certain features, e.g. looping over a set of images + predictions and trying to answer questions. About the scores: if you change the threshold on the (RetinaNet) model object directly, it should work. This is fixed/tested in the latest version of DeepForest, though. If you have trouble, you could threshold manually.
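A rough sketch of what tiled prediction plus manual thresholding could look like; the loading call and the `score_thresh` attribute vary between DeepForest versions, so treat those as assumptions to verify:

```python
from deepforest import main

# Load the released DeepForest model (the loading API differs slightly by version).
model = main.deepforest()
model.use_release()

# Option 1: tile a large raster instead of predicting on it whole.
# predict_tile crops the image into patches and stitches the results back together.
detections = model.predict_tile(
    raster_path="large_image.tif",  # hypothetical path
    patch_size=1000,
    patch_overlap=0.25,
)

# Option 2: threshold manually on the returned DataFrame, which works
# regardless of whether the model-level threshold is being applied.
detections = detections[detections["score"] >= 0.55]

# Optionally, set the threshold on the underlying RetinaNet directly
# (attribute name is an assumption; check your DeepForest version).
# model.model.score_thresh = 0.55
```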
bd025e3 to 3f5146d (Compare)
This time I’m tiling the images and building a JSON. For the visual analysis agent, I’m tiling with patch size = 1000; if I use patch size = 400, it takes 35 seconds per tile to process and get a response from the visual agent. With large TIFFs, e.g. a 9626x6646 image:
For each tile there will be visual analysis, detection data, and summary. I thought about distributing detections to tiles because the ecology agent needs to understand spatial relationships. When a user asks "how are birds distributed relative to trees in the north section of the image?", the agent needs to know which detections belong to which spatial areas. Without this mapping, the ecology agent would have all the detections but no way to associate them with the detailed visual analysis of specific image regions. Here's how I'm distributing the DeepForest detections to each tile:
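A minimal sketch of the assignment step just described, assuming the detections come back as a pandas DataFrame with xmin/ymin/xmax/ymax columns in full-image pixel coordinates; the actual code in this PR may differ:

```python
def assign_detections_to_tiles(detections, tiles):
    """Assign each detection to the tile containing its box centroid.

    detections: pandas DataFrame with xmin, ymin, xmax, ymax columns.
    tiles: list of dicts like {"tile_id": ..., "x0": ..., "y0": ..., "x1": ..., "y1": ...}
           describing each tile's extent in the same coordinate system.
    """
    assignments = {tile["tile_id"]: [] for tile in tiles}
    for _, det in detections.iterrows():
        cx = (det["xmin"] + det["xmax"]) / 2
        cy = (det["ymin"] + det["ymax"]) / 2
        for tile in tiles:
            if tile["x0"] <= cx < tile["x1"] and tile["y0"] <= cy < tile["y1"]:
                assignments[tile["tile_id"]].append(det.to_dict())
                break
    return assignments
```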
The final JSON would look like:

This approach might make the JSON quite large, since we're adding detection data to every relevant tile. For images with many detections, this could make the context window huge for the ecology agent. Do you think we should only keep the tile_detection_summary in each tile and put all the raw DeepForest detections at the image level, outside the tiles? This would keep the spatial summaries but avoid duplicating detection data across multiple tiles. I'm creating the tile detection summary by analyzing all assigned detections for each tile and including:
Since the tiling for visual analysis takes significant time, do you think we should have some logic to automatically choose patch sizes based on image dimensions? Also, do you think we should call the visual agent once at the very beginning, asking "what's happening in the image", and reuse that response so that follow-up prompts don't run the visual agent again? After implementing this tiling approach, the visual agent runs properly, and the JSON structure we're passing works much better than before. For the ecology analysis agent, I'm now using https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct, which is answering much better. I'm attaching the log files here: https://drive.google.com/drive/folders/1HZiBkCy2W_D9JkSS78yAsoQhr9jqJsvy?usp=drive_link
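One hedged sketch of a patch-size heuristic driven by image dimensions; the breakpoints below are made up and would need tuning against actual per-tile latency of the visual agent:

```python
def choose_patch_size(width, height):
    """Pick a tile size from the image's longest side.

    Breakpoints are illustrative only; tune them against how long the
    visual agent takes per tile on your hardware.
    """
    longest = max(width, height)
    if longest <= 2000:
        return None   # small enough to analyze without tiling
    if longest <= 6000:
        return 800
    return 1000       # very large rasters: fewer, bigger tiles


# Example: a 9626x6646 TIFF would be tiled with patch_size = 1000.
print(choose_patch_size(9626, 6646))
```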
I think we need an option to run the analysis first, e.g. allow the agent to process the results from a DeepForest run, since it's not really feasible for users to wait hours for the predictions to run, especially if they want to re-run. This could still be launched via the LLM, i.e. it suggests the parameters to run based on the current system + data, and then the Q/A model loads in the results. The file format is something we've been discussing recently, because raw JSON isn't necessarily the best for querying if you have an enormous file; you need to index it. So it might be worth thinking about how we'd approach that, some sort of geospatial database? As for the question of context, I'm less sure. I think the model needs to somehow be able to review at a high level and then zoom in or focus on the areas of the image that are relevant? You could use a structure like an rtree if you're not already, which can do efficient spatial queries. Or geopandas might have some capability if you retain geographic info in the outputs?
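A minimal sketch of the kind of spatial query being suggested, using geopandas' built-in R-tree-backed spatial index; the column names and query window are assumptions for illustration:

```python
import geopandas as gpd
from shapely.geometry import box


def detections_to_gdf(df):
    """Build a GeoDataFrame from DeepForest-style detections
    with xmin/ymin/xmax/ymax columns."""
    geometry = [box(r.xmin, r.ymin, r.xmax, r.ymax) for r in df.itertuples()]
    return gpd.GeoDataFrame(df, geometry=geometry)


def detections_in_region(gdf, x0, y0, x1, y1):
    """Return only the detections intersecting a query window,
    using the spatial index geopandas builds lazily (gdf.sindex)."""
    window = box(x0, y0, x1, y1)
    candidate_idx = list(gdf.sindex.intersection(window.bounds))
    candidates = gdf.iloc[candidate_idx]
    return candidates[candidates.intersects(window)]
```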
3f94b93 to 5f52f94 (Compare)
5f52f94 to 641e139 (Compare)

This system combines DeepForest object detection with SmolLM3-3B and Qwen2.5-VL-3B to provide ecological image analysis. The system orchestrates four agents (Memory, DeepForest Detector, Visual, and Ecology) that work sequentially to maintain context from the conversation history, detect birds, trees, livestock, and alive/dead trees in the image, analyze both the original and annotated images for validation, and synthesize these data to provide ecological insights to the user.
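A hedged sketch of the sequential orchestration described above; the class and method names are placeholders, not the actual interfaces in this repository:

```python
class DeepForestAgentPipeline:
    """Illustrative sequential pipeline: Memory -> Detector -> Visual -> Ecology."""

    def __init__(self, memory_agent, detector_agent, visual_agent, ecology_agent):
        self.memory_agent = memory_agent
        self.detector_agent = detector_agent
        self.visual_agent = visual_agent
        self.ecology_agent = ecology_agent

    def answer(self, image_path, user_prompt):
        context = self.memory_agent.run(user_prompt)                    # conversation history
        detections = self.detector_agent.run(image_path)                # DeepForest boxes
        visual_report = self.visual_agent.run(image_path, detections)   # Qwen2.5-VL analysis
        return self.ecology_agent.run(                                  # SmolLM3-3B synthesis
            user_prompt, context, detections, visual_report
        )
```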
See the README for detailed workflow diagrams and technical specifications.
Usage
1. Create and activate a Conda environment
2. Install dependencies
pip install -r requirements.txt
pip install -e .
3. Configure the HuggingFace token
Create a .env file in the root directory of the deepforest-agent project and add your HuggingFace token like below:
HF_TOKEN="your_huggingface_token_here"
You can obtain your token from your HuggingFace Access Tokens page. Make sure the token type is "Write".
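As an illustration of how the token could be picked up at runtime, a sketch assuming python-dotenv and huggingface_hub are used (the project may load it differently):

```python
import os

from dotenv import load_dotenv
from huggingface_hub import login

# Read HF_TOKEN from the .env file in the project root and authenticate
# against the HuggingFace Hub for gated models such as
# meta-llama/Llama-3.2-3B-Instruct.
load_dotenv()
login(token=os.environ["HF_TOKEN"])
```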
4. Run the interface
The DeepForest Agent runs through a Gradio web interface. To start the interface, execute:
A link like http://127.0.0.1:7860 will appear in the terminal. Open it in your browser to interact with the agent. A public Gradio link may also be provided if available.
Hardware Requirements