Hornbill Dispatch #13 | From DeepForest to Detectron2 - Getting a clear view of the trees and the forest
Part 2 of 5: How we taught AI to see trees, pipes, and everything in between
But first, a quick recap’s in order
When we last left off our riveting tale, our seemingly ingenious QR code tree tagging method had failed quite miserably, and we were left with “mud on our face, a big disgrace... QR code debacle put us back into our place!” (Bonus points if you sang that line out in your head while you read it - ha!)
Chucklesome GIFs aside, while our QR code method had failed, we’d chanced upon DeepForest and were now actively using it in combination with drone imagery to count trees.
The first few climbs up our technical mountain
The DeepForest model was trained primarily on high-resolution RGB imagery, at roughly 10 cm/pixel, from the National Ecological Observatory Network (NEON) tree crown dataset.
We had created orthomosaics by stitching together numerous drone images of the same plot, correcting for angles and perspective so that the final image is to scale, as if viewed perpendicularly from above. The base DeepForest model took one look at these Maharashtra orthomosaics and basically shrugged, returning precision and recall values below 0.5 for most images.
Precision measures how many of the objects the model predicted as trees were actually trees, while recall measures how many of the real trees the model was able to find. So values below 0.5 meant the model was missing more than half the real trees, and more than half of what it did call a tree wasn't one.
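For the formula-minded, here's what those two numbers look like on a toy example (the counts below are made up purely for illustration, not from our data):

```python
# Toy detection counts - illustrative only, not our actual numbers
true_positives = 45   # predicted boxes that matched a real tree
false_positives = 15  # predicted boxes with no real tree underneath
false_negatives = 55  # real trees the model never found

precision = true_positives / (true_positives + false_positives)  # 45 / 60 = 0.75
recall = true_positives / (true_positives + false_negatives)     # 45 / 100 = 0.45
```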
The visual characteristics of tree crowns and surrounding landscapes can vary significantly across geographic regions. As a result, the default DeepForest model, which was trained on a specific dataset, might not always perform as well when applied to new environments with different vegetation types, lighting, and background features.
The solution was transfer learning - essentially giving DeepForest a crash course in Indian agroforestry landscapes. We needed to fine-tune this pre-trained model on new data to better suit the new context. DeepForest is particularly useful in this regard, as it offers a ready-to-use model that can be further trained on local imagery to improve accuracy and fit it to our use case.
What followed was the most tedious phase of our technical journey. We had drone imagery from ~80 land parcels that were representative of the kinds of landscapes where the model would ultimately be deployed. For all of these parcels, the drone was flown at an altitude of approximately 250-300 feet, and the resulting overlapping images were stitched together to create georeferenced orthomosaics with a pixel resolution of about 3-4 cm. These orthomosaics were then manually annotated using the labelImg tool to mark individual tree crowns. In total, we annotated approximately 55,000 trees, and this dataset was used to retrain the DeepForest model.
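For the curious, a fine-tuning run with DeepForest looks roughly like the sketch below. The file paths and epoch count are placeholders, and config keys and method names have shifted a little between DeepForest versions, so treat this as a sketch of the approach rather than our production pipeline:

```python
from deepforest import main

# Start from the pre-trained NEON release model
model = main.deepforest()
model.use_release()

# Point the trainer at our annotated imagery: DeepForest expects a CSV of
# image_path, xmin, ymin, xmax, ymax, label rows (labelImg's XML output
# has to be converted into this format first). Paths are placeholders.
model.config["train"]["csv_file"] = "annotations/train.csv"
model.config["train"]["root_dir"] = "annotations/"
model.config["train"]["epochs"] = 20

model.create_trainer()
model.trainer.fit(model)

# Save the fine-tuned weights for later prediction runs
model.trainer.save_checkpoint("deepforest_maharashtra.ckpt")
```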
The images below demonstrate the improvement in the model post training on our local data from Maharashtra.

Our team then experimented with different image sets, probability thresholds, and training epochs to finally achieve precision and recall scores close to a solid 0.9.
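Each experiment was scored with DeepForest's built-in evaluation, roughly as in this continuation of the earlier sketch (the paths and IoU threshold are placeholders, and the exact keys in the returned results can vary by version):

```python
# Evaluate the fine-tuned model against a held-out annotation set
results = model.evaluate(
    csv_file="annotations/test.csv",  # placeholder path
    root_dir="annotations/",
    iou_threshold=0.4,                # how much box overlap counts as a match
)

print(results["box_precision"], results["box_recall"])
```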
But why just stop at tree counting?
Once all the trees in an image are detected, the height and crown size of each tree can also be extracted. And if that's possible, our team wondered, could we also accurately estimate how much carbon each tree is removing from the atmosphere?
The height of the trees comes from the Digital Elevation Model (DEM) derived from the drone imagery we captured. We had been flying DJI's Mavic M series drones, and the overlapping images they capture are processed into elevation maps using the structure-from-motion technique.
The DEM thus gives an elevation value for each pixel in the image. Once a bounding box around each tree is obtained from the detection algorithm, the crown size of the tree can be calculated from the size of the bounding box in pixels and the ground sampling distance. By mapping the same bounding box onto the DEM, the tree's height can also be extracted.
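In code, the geometry is straightforward. The sketch below assumes the elevation data has already been converted into a canopy height model (heights above ground rather than absolute elevation) and loaded as a NumPy array aligned pixel-for-pixel with the orthomosaic; the function name and the ~3.5 cm ground sampling distance are just illustrative:

```python
import numpy as np

def crown_and_height(box, chm, gsd_m=0.035):
    """Estimate crown size and height for one detected tree.

    box   : (xmin, ymin, xmax, ymax) in pixel coordinates
    chm   : canopy height model, a 2D array of metres above ground,
            aligned pixel-for-pixel with the orthomosaic
    gsd_m : ground sampling distance in metres per pixel (~3-4 cm here)
    """
    xmin, ymin, xmax, ymax = box

    # Crown diameter: average of box width and height, converted pixels -> metres
    crown_diameter_m = gsd_m * ((xmax - xmin) + (ymax - ymin)) / 2.0

    # Tree height: a high percentile of heights inside the box, so that a
    # single noisy pixel doesn't set the value
    crown_pixels = chm[int(ymin):int(ymax), int(xmin):int(xmax)]
    height_m = float(np.percentile(crown_pixels, 95))

    return crown_diameter_m, height_m
```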
We were now able to automatically identify trees in drone imagery, geotag each one, measure their height, and calculate crown size - all without setting foot in the field! What once took thousands of hours of manual tagging could now be done in just a couple of hours with drone flights and data processing.
Our field team was thrilled: no more mind-numbing tree counting. And our data team was just as happy: finally, clean, comprehensive data on every single tree, without having to sift through and fix thousands of messy rows.
Beyond Mountains there are Mountains
Just as the Haitian proverb predicts, no sooner had our team solved one problem than quite a few more cropped up.
Now that we were detecting trees, could we also maybe detect other things, like drip irrigation pipes? We spend nearly 50% of our per-plot budget on drip irrigation - snaking networks of pipes that reduce water usage and labor for farmers. But we had no efficient way to verify what was actually installed versus what was gathering dust in warehouses.
If we could identify not only the number of trees planted and surviving on farms, but also the metres of drip irrigation installed, we'd gain a much clearer view of our operations and our inventory:
how many saplings and metres of drip pipe we bought,
how many were deployed on farmers' lands, and
how many lay in stock at our local warehouses.
In essence, we’d be moving from just counting trees to monitoring over 60% of our cost items with unprecedented precision.
Enter Detectron2
Detecting irrigation pipes with bounding boxes is like trying to gift-wrap a garden hose - the box either misses half the pipe or includes half the farmland. Bounding boxes are effective for detecting objects like trees, which have relatively consistent and compact shapes. Drip lines snake across fields in irregular patterns, too long and skinny for standard object detection. We needed pixel-level precision, not approximate boxes.
Turns out, what we needed was a more precise approach called object segmentation, where each pixel of the object is identified and labeled. This method allows for more accurate mapping of a pipe's exact shape and position on the ground, especially when it winds through varied terrain or intersects with other visual elements in the imagery.
Our search for segmentation tools led us to Detectron2 - an open-source computer vision library that seemed perfectly suited for this pixel-level precision work. If it could handle the complex segmentation tasks it was designed for, maybe it could tackle our drip pipe challenge.
Detectron2 is an open-source computer vision library developed by Facebook AI Research (FAIR) that helps computers detect and understand objects in images or videos. It can do things like find people, animals, or cars in a photo, draw boxes around them, and even outline their exact shapes. It's widely used for tasks like object detection, segmentation, and keypoint detection.
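For readers who want a feel for what this looks like in practice, a minimal Detectron2 training setup for a single "drip line" class is sketched below. The dataset name, file paths, and iteration count are placeholders, and it assumes the annotations have been exported in COCO format:

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register our annotated drip-line dataset (placeholder paths, COCO-format JSON)
register_coco_instances("drip_train", {}, "drip/train.json", "drip/images")

# Start from a Mask R-CNN instance-segmentation model pre-trained on COCO
cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
)
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
)
cfg.DATASETS.TRAIN = ("drip_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # just one class: drip line
cfg.SOLVER.MAX_ITER = 3000           # placeholder iteration count

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```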
Detectron2 looked viable for object segmentation (or at least it provided the necessary machinery for the purpose). But when we annotated our full images with long, snake-like polygons tracing our drip lines, the model performed very poorly. The annotations were still too lanky, and each image yielded only a handful of them - one per drip line.
Here, some creative problem solving came to the rescue. What if we broke the image into small square tiles and annotated only the part of the drip line visible in each tile? Those annotations would be much stubbier, and during prediction the model would only have to handle one small tile at a time: easier to predict, easier to annotate, and many more annotated examples per image. This was a real breakthrough, and we were able to detect most of our drip lines even on images that were not used in model training.
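Conceptually, the tiling is nothing fancier than slicing the big raster into fixed-size squares and remembering where each square came from - something like the sketch below (the 512-pixel tile size is an illustrative choice, not necessarily the one we settled on):

```python
def tile_image(image, tile_size=512):
    """Split a large image array into square tiles for annotation or prediction.

    image : H x W x 3 NumPy array (an orthomosaic, or a section of one)
    Yields (row_offset, col_offset, tile) so predictions can be mapped back
    onto the full image afterwards.
    """
    height, width = image.shape[:2]
    for row in range(0, height, tile_size):
        for col in range(0, width, tile_size):
            yield row, col, image[row:row + tile_size, col:col + tile_size]
```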
However, there was another challenge: prediction on a single image was taking 2 hours. First the image was broken into tiles, then prediction ran on each tile, and then the results were recombined to get plot-level detections! Something had to be done to parallelize this on a GPU. Here came another soldier from the world of open-source awesomeness: SAHI, a model-agnostic tiling/slicing library. Now we were able to predict in 2-3 minutes what had been taking us 2 hours, detecting drip line section by section.
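With SAHI, the slice-predict-recombine loop collapses into a couple of calls, roughly as below. This assumes SAHI's Detectron2 integration; the paths, slice size, and thresholds are illustrative rather than our exact settings:

```python
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

# Wrap the trained Detectron2 model (placeholder paths)
detection_model = AutoDetectionModel.from_pretrained(
    model_type="detectron2",
    model_path="output/model_final.pth",
    config_path="output/config.yaml",
    confidence_threshold=0.5,
)

# SAHI slices the orthomosaic, predicts on each slice, and merges the results
result = get_sliced_prediction(
    "plot_orthomosaic.png",
    detection_model,
    slice_height=512,
    slice_width=512,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)

print(len(result.object_prediction_list), "drip-line segments detected")
```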
Using Detectron2 + SAHI for drip detection had revealed something new: we didn't have to train even the tree detection model on large images. We could include only small sections of images in training, run inference on sections, and recombine the results.
Training directly on full orthomosaic images isn't practical - they're extremely large (often 1 GB or more, with millions of pixels) and contain 300 to 800 trees each, all of which would need to be manually annotated. That would take a huge amount of time. But because the trees and backgrounds in each orthomosaic are quite similar and repetitive, annotating just a small portion is usually enough to train an effective model. This greatly reduces the annotation effort and computational load.
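Handily, DeepForest even ships a helper for exactly this: slicing a big georeferenced raster into training-sized windows and remapping the annotations into each window's coordinates. A sketch, with placeholder paths and an illustrative patch size (the output-directory argument is named base_dir or save_dir depending on the DeepForest version):

```python
from deepforest import preprocess

# Cut one orthomosaic into overlapping patches and rewrite the annotation
# CSV so each box is expressed in its patch's pixel coordinates
crop_annotations = preprocess.split_raster(
    annotations_file="annotations/plot_17.csv",  # placeholder path
    path_to_raster="orthomosaics/plot_17.tif",
    patch_size=400,
    patch_overlap=0.05,
    base_dir="training_crops/",
)
```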
It made a lot of sense, then, to move from DeepForest to a single, comprehensive Detectron2 model for small trees, big trees, and drip irrigation alike - and that's exactly what our team did.
Carbon: the final frontier
So we were in a pretty good place now - we were successfully, remotely, and automatically measuring drip pipes, counting trees, and calculating their heights and crown sizes. But something was still missing.
What if we could use the data we have to actually measure the carbon in each and every tree? And what if we could also understand how healthy each of these trees was, so that we could provide precision advisories to our farmers to further improve their tree survival rates and fruit yields?
Just as we’d summited a few mountains, another one, and a fairly massive one at that, had emerged.
In Part 3 of 5 next week, we’ll share how a mathematician came to our rescue, how we taught AI to be honest about what it doesn't know, why it’s important to look beyond what’s visible to the eye, and also…bael candy.