Juan Pablo Lorenzo • 15 DEC 2025
Visual AI for Agriculture: Turning Harvest Footage into Real-Time Insights for G’s Fresh

Visual AI is reshaping how agricultural producers analyze their harvests. By turning raw harvest footage into structured, real-time insights, Visual AI and video analytics AI allow teams to measure size, track individual crops, and understand yield variation directly in the field.
In this project with G’s Fresh, one of the UK’s leading vegetable producers, we built a Visual AI system that detects, segments, tracks, and measures pumpkins during harvest, unlocking a new layer of data-driven decision-making for agriculture.
How Visual AI Transforms Harvest Measurement
G’s Fresh is one of the UK’s largest vegetable producers and the country’s leading pumpkin supplier, growing millions every year.
Until recently, all pumpkins were treated as equals. Without precise measurements, the team couldn’t tell which fields produced larger pumpkins or where smaller ones grew. That meant no accurate pricing strategy and no data to optimize next year’s yields.
Their challenge wasn’t just counting pumpkins; it was seeing the harvest differently: turning raw video footage into structured data that reveals how size, quality, and yield vary across the field.
At the same time, we faced a challenge of our own: making the entire AI pipeline run in real time on an edge device, where computing power and memory are limited. Achieving this required optimizing every stage (detection, segmentation, and tracking) to ensure high speed without sacrificing accuracy or output quality.
That’s where we came in.
Building the Real-Time Harvest Pipeline
To help G’s Fresh make that shift, we implemented a Visual AI system powered by Computer Vision and Object Detection to automatically detect, track, count, and measure pumpkins as they move along the conveyor belt during harvest.

We began by installing a camera above the tractor-mounted conveyor belt. As the machine moves through the field, the camera records continuous video of the pumpkins passing by, and our software turns those frames into measurable data. In this way, a simple harvesting pass becomes valuable information that producers can later analyze and use to improve their yields and operations.

After installing the camera and capturing continuous footage, the first step in the pipeline was detecting pumpkins as they moved along the conveyor belt. This gave us a frame-by-frame view of how many pumpkins appeared and where they were located. But while detection told us what was in each frame, it didn’t yet give us the precision needed for counting or measuring.
Since the goal was to count and measure each pumpkin precisely, simple detection wouldn’t do. We needed to:
- Detect pumpkins in each frame to locate them.
- Track individual pumpkins across frames to avoid counting them twice.
- Segment each pumpkin’s contour to measure its real size accurately.
In other words: detection was just the first step; we had to identify, follow, and measure every pumpkin individually.
Video Analytics AI on the Tractor: Processing Footage in Real Time
As the tractor moves through the fields, a camera mounted above the conveyor belt captures continuous video of pumpkins being harvested. To turn this footage into reliable, real-time measurements, we rely on three core components working together: Roboflow Detection Transformer (RF-DETR) for object detection, Efficient Segment Anything Model (EfficientSAM) for segmentation, and ByteTrack for multi-object tracking.
Each frame is processed in real time:
- RF-DETR detects every pumpkin in the image, locating them precisely.
- EfficientSAM segments each pumpkin’s contour, giving us its exact shape.
- A custom measurement tool draws the longest internal line inside each segmentation mask to estimate size.
- ByteTrack ensures each pumpkin is consistently tracked across frames so it’s only counted once.
By combining detection, segmentation, measurement, and tracking, the system generates a real-time dataset that includes total pumpkin count, individual size measurements, and a complete distribution of sizes across the field.
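To make this concrete, here is a minimal sketch of that per-frame loop. The model objects and the two helpers (segment_with_efficient_sam, longest_internal_line) are illustrative stand-ins for the components described in detail in the sections below:
import supervision as sv

tracker = sv.ByteTrack()  # multi-object tracker bundled with the supervision library

def process_frame(frame):
    # 1. Detect pumpkins; RF-DETR returns supervision-compatible detections
    detections = rfdetr_model.predict(frame, threshold=0.5)
    # 2. Track across frames so each pumpkin keeps a stable ID
    detections = tracker.update_with_detections(detections)
    # 3. Segment each detection (illustrative helper wrapping EfficientSAM)
    masks = segment_with_efficient_sam(frame, detections.xyxy)
    # 4. Measure: the longest internal line of each mask approximates diameter
    sizes = [longest_internal_line(mask) for mask in masks]
    return detections.tracker_id, sizes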
And for producers, this dataset isn’t just information; it’s practical value. With accurate, frame-by-frame measurements, G’s Fresh can see which areas of their fields yield larger pumpkins, adjust fertilizer strategies, optimize planting decisions, and explore pricing models based on size and quality. In other words, a simple harvesting pass becomes a source of measurable insight and operational efficiency.
Segmentation, Detection, and Tracking Models Behind the System
1. RF-DETR – Object Detection.
This model identifies every pumpkin visible in each frame. It focuses on object detection, outlining bounding boxes that locate each pumpkin accurately in the image.
from rfdetr import RFDETRNano

device = 'cuda'

# Initialize model
rfdetr_model = RFDETRNano(device=device)

# Train the model
rfdetr_model.train(
    dataset_dir="pumpkins_detection_coco",  # COCO format dataset
    epochs=30,
    batch_size=16,
    lr=2e-4,
    # ... other hyperparameters
    output_dir="./output",
)

rfdetr_model.optimize_for_inference(batch_size=1)
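Once trained, running the detector on a new frame is a single call. A quick sketch (the 0.5 threshold is an illustrative confidence cutoff):
import cv2

# Detect pumpkins in a single harvest frame
image = cv2.imread("pumpkin.jpg")
detections = rfdetr_model.predict(image, threshold=0.5)
print(f"Detected {len(detections)} pumpkins")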
2. EfficientSAM – Segmentation.
Once detected, the pumpkins are segmented using EfficientSAM, a lightweight version of the popular Segment Anything Model (SAM). This model isolates the precise contours of each pumpkin, allowing the system to understand not just where it is, but how big it is.
Within each segmented area, the system draws the longest possible line across the pumpkin’s surface. This becomes its estimated diameter, a reliable measure of size.
import cv2
import torch
from torchvision.transforms import ToTensor
from efficient_sam.build_efficient_sam import build_efficient_sam

# Load EfficientSAM
efficient_sam_model = build_efficient_sam(
    encoder_patch_embed_dim=192,
    encoder_num_heads=3,
    checkpoint='efficient_sam_vitt.pt',
    dtype=torch.float32,
).eval().to(device)

# Load and process image
image = cv2.imread('pumpkin.jpg')

# Step 1: Run RF-DETR for bounding box detection
detections = rfdetr_model.predict(image)
boxes = detections.xyxy  # [N, 4] in xyxy format

# Step 2: Convert image to tensor
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
image_tensor = ToTensor()(image_rgb).unsqueeze(0).to(device)  # [1, 3, H, W]

# Step 3: Convert bounding boxes to point prompts
# EfficientSAM uses bbox corners as prompts: top-left (x1, y1) and bottom-right (x2, y2)
points, labels = _boxes_to_point_prompts(boxes, device)  # [1, N, 2, 2], [1, N, 2]

# Step 4: Run EfficientSAM inference
with torch.no_grad():
    predicted_logits, predicted_iou = efficient_sam_model(image_tensor, points, labels)

# Step 5: Extract masks (threshold logits at 0)
masks = (predicted_logits[0, :, 0] >= 0).cpu().numpy()  # [N, H, W]
3. ByteTrack – Object Tracking.
To avoid double-counting, we use the ByteTrack algorithm, which follows each pumpkin across frames and recognizes when a detection in a new frame is the same object, just slightly shifted.
When a pumpkin crosses a virtual reference line on the belt, it’s counted as one unique item.
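As a sketch of how that counting logic can be wired up, the supervision library (introduced in the next section) bundles a ByteTrack implementation and a virtual line counter; the line endpoints here are placeholders for the actual belt geometry:
import supervision as sv

tracker = sv.ByteTrack()
# Virtual reference line across the belt; endpoints are placeholder pixel coordinates
line_zone = sv.LineZone(start=sv.Point(0, 400), end=sv.Point(1280, 400))

def count_frame(detections: sv.Detections) -> int:
    # Assign stable tracker IDs to this frame's detections
    tracked = tracker.update_with_detections(detections)
    # Each track increments the counter only once, when it crosses the line
    line_zone.trigger(tracked)
    return line_zone.in_count  # running total of unique pumpkins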
Visualization with Supervision
Once the detection, segmentation, and tracking pipeline is running and generating structured, real-time insights, the next step is making those results easy to interpret. To do that, we use Supervision, a powerful computer vision library that streamlines the visualization of detection and segmentation outputs.
Supervision provides a clean, consistent API for annotating images with bounding boxes, masks, labels, and more, making it ideal for visualizing results from models such as RF-DETR and EfficientSAM.
Beyond visualization, Supervision also integrates with ByteTrack, enabling smooth object tracking across video frames, essential for monitoring each pumpkin’s position over time.
Here’s an example of how detections from RF-DETR can be visualized using Supervision:
import cv2
import supervision as sv
# Run RF-DETR inference
detections = rfdetr_model.predict(image, threshold=0.2)
# Create annotators
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()
# Annotate image with bounding boxes
annotated_image = box_annotator.annotate(image.copy(), detections)
# Add confidence labels
labels = [f"{float(conf):.2f}" for conf in detections.confidence]
annotated_image = label_annotator.annotate(annotated_image, detections, labels=labels)
And with just a bit more effort, adding EfficientSAM to segment the pumpkins lets you go beyond simple detections and build a complete visual analysis, with segmentation masks and pumpkin measurements, using OpenCV to draw the longest line across each pumpkin and estimate its size, as sketched below.
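One simple way to approximate that longest line is to take the mask’s contour and find its two most distant points. A sketch reusing masks and annotated_image from the snippets above (the production measurement tool may differ, and a camera calibration factor is needed to convert pixels to centimeters):
import cv2
import numpy as np

def longest_internal_line(mask):
    # Approximate the pumpkin's diameter as the longest chord of its mask.
    # For a near-convex shape, this is the distance between the two most
    # distant contour points (a simplification of the real measurement).
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    points = max(contours, key=cv2.contourArea).reshape(-1, 2)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(dists), dists.shape)
    p1 = tuple(int(v) for v in points[i])
    p2 = tuple(int(v) for v in points[j])
    return float(dists[i, j]), p1, p2

# Draw the measured line on the annotated frame
length_px, p1, p2 = longest_internal_line(masks[0])
cv2.line(annotated_image, p1, p2, (0, 255, 0), 2)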

And this is the final result of running the whole pipeline in real time:
Optimizing the Pipeline for Real-Time Performance
To achieve real-time processing on an edge device like the Jetson Nano, we applied several performance optimizations across the entire pipeline.
A key improvement was converting the final models to TensorRT, NVIDIA’s platform for accelerating neural networks on GPUs. TensorRT applies techniques such as layer fusion, quantization, and memory optimizations, significantly reducing inference time and enabling high-speed execution on low-power hardware.
We also used Automatic Mixed Precision (AMP) during training and quantized the model weights to FP16 (16-bit floating point). Combined with TensorRT’s optimizations, these enhancements resulted in a 4.1× increase in frames per second (FPS), allowing the system to run detection, segmentation, and multi-object tracking simultaneously directly in the field, without relying on external servers.
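As an illustration, one common route to a TensorRT engine is exporting the PyTorch model to ONNX and then building an FP16 engine with NVIDIA’s trtexec tool on the target device. A sketch with placeholder names and input shape; the exact export flow varies per model:
import torch

# Export the PyTorch model to ONNX with a fixed input shape (placeholder size)
dummy_input = torch.randn(1, 3, 640, 640).to(device)
torch.onnx.export(
    model,              # placeholder for the detector or segmenter being exported
    dummy_input,
    "model.onnx",
    input_names=["images"],
    output_names=["outputs"],
    opset_version=17,
)

# Then, on the Jetson, build an FP16 engine:
#   trtexec --onnx=model.onnx --fp16 --saveEngine=model.engine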
Segmentation Models for Precise Measurement
To measure each pumpkin accurately, detecting them was not enough; we needed to isolate their exact contours. Why? Because bounding boxes only provide location; they are too coarse for precise sizing, especially with rounded, irregular crops like pumpkins.
This accuracy really matters. The quality of the segmentation directly impacts G’s Fresh’s ability to understand yield distribution, evaluate size variations across fields, and make decisions around pricing, fertilizer usage, and future planting strategies.
But working in real-world field conditions makes segmentation challenging. Outdoor lighting changes constantly, pumpkins overlap as they move along the conveyor belt, and frames often include motion blur, all of which require a model robust enough to produce clean masks without slowing down the pipeline. Because the entire system runs on an edge device with limited GPU and memory, we needed segmentation that was both accurate and fast.
For this reason, we chose EfficientSAM, a lightweight member of Meta’s Segment Anything Model family. Unlike the original SAM, which is too computationally heavy for real-time performance, EfficientSAM offers a much smaller, faster architecture while maintaining strong segmentation quality. It also integrates seamlessly with our detector, using RF-DETR’s bounding boxes as prompts, requiring no additional training.
This enabled us to add precise measurement capabilities without compromising the pipeline's real-time throughput. By segmenting each pumpkin and then drawing the longest internal line within the mask, we obtain a consistent and reliable size estimate that feeds directly into the analytics G’s Fresh uses to optimize decision-making across their fields.
Field Insights Unlocked from Real-Time Harvest Data
The outcome went far beyond automation. With accurate counts and size measurements, G’s Fresh can now map which areas of their fields produced larger pumpkins and start exploring differential pricing strategies based on size and yield quality.
This shift from estimation to measurement created a new layer of business intelligence: knowing not just how many pumpkins they harvest, but exactly how they perform across regions.
The results were so positive that G’s Fresh is now expanding the same approach to onions and lettuces, adapting the AI pipeline to different crop types. And this is only the beginning: it marks the start of a more optimized, data-driven production process that can fundamentally transform how G’s Fresh understands, manages, and grows its harvest.
ROI of Real-Time Visual Analytics in Agriculture
Traditional harvest measurement relies heavily on manual counting, visual estimation, and post-harvest sampling. These methods are slow, inconsistent, and difficult to scale across large fields.
By automating detection, segmentation, and measurement directly on the tractor, Visual AI delivers operational and strategic benefits:
- Time savings: no manual counting or sampling required during harvest.
- Higher measurement accuracy: consistent size estimation across every pumpkin.
- Better field decisions: producers can identify underperforming areas and adjust fertilization or planting density.
- Improved pricing strategy: size distribution data enables more accurate grading and potential size-based pricing models.
- Scalable investment: once implemented, the system can be reused for other crops (as G’s Fresh is now doing with onions and lettuces).
Behind the Magic: The Technology Stack
- Visual AI & Computer Vision
- RF-DETR (Object Detection)
- EfficientSAM (Segmentation)
- ByteTrack (Object Tracking Algorithm)
- Custom measurement system for longest-axis estimation
Visual AI in Agriculture
After working on this project, we were able to see the real impact of combining detection, segmentation, and tracking to transform raw harvest video into real-time analytics.
G’s Fresh now knows how many pumpkins were collected, how large they were, and where the biggest yields came from, all without manual work.
In the end, it’s not just smarter counting. It’s data-driven agriculture, powered by Visual AI.

