Building a Smart Surveillance Robot: YOLOv5 Meets ESP32
The Idea
So here's the thing: I wanted to build something that combined hardware and AI in a way that actually solves a real problem. Not just another "hello world" IoT project, but something with practical applications.
The result? A remote-controlled robot with a camera that can detect people automatically, send SMS alerts with GPS coordinates, and stream video in real-time. Think disaster relief, security surveillance, or search and rescue operations.
It's called RRBot (Rescue & Reconnaissance Bot), and it's honestly one of the coolest hardware projects I've built.
Check it out: github.com/abhii2003/RRBOT
What We're Actually Building
Let me break down what this thing does:
- Remote Control: Drive it around like an RC car from a web interface
- Live Video Feed: See what the robot sees in real-time via ESP32-CAM
- AI Person Detection: YOLOv5 running on my laptop processes the video feed
- SMS Alerts: When someone's detected, sends an instant text message via Vonage API
- GPS Tracking: Stores detection coordinates in Supabase
- Pan/Tilt Camera: Two servos let you look around remotely
The Stack (Hardware + Software)
Hardware Side
- ESP32-CAM: The brain and eyes of the robot
- Motor Driver (L298N): Controls the wheels
- DC Motors: For movement
- SG90 Servos (x2): Pan and tilt for the camera
- 18650 Battery Pack: Power source
- GPS Module (optional): For real location tracking
Software Stack
- YOLOv5: State-of-the-art object detection
- Python: For the AI processing
- Vonage API: SMS notifications
- Supabase: Real-time database for tracking detections
- ESP32 Arduino: Firmware for the robot
Setting Up YOLOv5 (The Smart Part)
Why YOLOv5?
I needed object detection that was fast, accurate, and could run on a laptop (not everyone has access to GPU clusters). YOLOv5 checked all the boxes:
- Real-time detection (30+ FPS even on CPU)
- Pre-trained on 80 object classes including "person"
- Multiple model sizes (trade speed for accuracy)
- Easy to customize and integrate
- Excellent documentation from Ultralytics
Getting Started
First, clone the YOLOv5 repo and install dependencies:
git clone https://github.com/ultralytics/yolov5.git
cd yolov5
pip install -r requirements.txt
Then test it out with your webcam to make sure everything works:
python detect.py --source 0 --weights yolov5s.pt --view-img
If you see detection boxes around objects on your webcam feed, you're good to go. The first run downloads the pre-trained weights automatically (about 14 MB for yolov5s).
Customizing Detection for the Robot
The Problem with Vanilla YOLOv5
The default detect.py is great for basic object detection. But I needed more:
- Send SMS alerts when people are detected
- Don't spam me with 30 texts per second
- Store detection events in a database
- Include GPS coordinates with each detection
So I had to roll up my sleeves and customize the detection script.
Key Customizations
1. SMS Integration with Vonage
First, I set up Vonage (formerly Nexmo) for SMS. It's got a generous free tier and the API is straightforward:
import vonage

client = vonage.Client(key="your-api-key", secret="your-api-secret")
sms = vonage.Sms(client)

def send_alert(lat, lon):
    responseData = sms.send_message({
        "from": "RRBot",
        "to": "+1234567890",  # Your phone number
        "text": f"Person detected! Location: {lat}, {lon}"
    })
    if responseData["messages"][0]["status"] == "0":
        print("SMS sent successfully")
    else:
        print(f"SMS failed: {responseData['messages'][0]['error-text']}")
2. Supabase Database Integration
I needed to log every detection with a timestamp and coordinates. Supabase made this stupidly easy:
from datetime import datetime
from supabase import create_client

API_URL = 'https://your-project.supabase.co'
API_KEY = 'your-api-key'
supabase = create_client(API_URL, API_KEY)

def log_detection(lon, lat):
    data = {
        "longitude": lon,
        "latitude": lat,
        "timestamp": datetime.now().isoformat()
    }
    result = supabase.table('detections').insert(data).execute()
    print(f"Logged detection: {result.data}")
3. Smart Cooldown Timer
Here's the thing: YOLOv5 runs at 30 FPS. Without a cooldown, I'd get 30 SMS alerts per second when someone walks in front of the camera. My phone would explode, and my Vonage credits would vanish.
Solution? A simple cooldown timer:
import time

last_detection_time = None
COOLDOWN_SECONDS = 40

# Inside the detection loop
for detection in results:
    if detection.class_name == "person":
        current_time = time.time()
        # Only alert if the cooldown has passed
        if last_detection_time is None or current_time - last_detection_time >= COOLDOWN_SECONDS:
            lat, lon = get_gps_coordinates()
            log_detection(lon, lat)
            send_alert(lat, lon)
            last_detection_time = current_time
            print(f"Alert sent! Next alert available in {COOLDOWN_SECONDS}s")
Now I only get one alert every 40 seconds, no matter how many frames detect a person. Much better.
4. GPS Coordinate Generation
For the demo, I generate random GPS coordinates near a specific location. In a real deployment, you'd hook up an actual GPS module:
import random

def generate_random_location(base_lat=37.7749, base_lon=-122.4194):
    # Small random offset (about 1 km radius)
    lat_offset = random.uniform(-0.01, 0.01)
    lon_offset = random.uniform(-0.01, 0.01)
    return base_lat + lat_offset, base_lon + lon_offset
For production, you'd replace this with actual GPS data from a module like the u-blox NEO-6M.
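If you want to sanity-check that a ±0.01° offset really stays within roughly a kilometer, the haversine formula gives the worst-case corner distance. A quick sketch, nothing robot-specific:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Worst case: both offsets at the +0.01 corner, from the San Francisco base point
d = haversine_km(37.7749, -122.4194, 37.7849, -122.4094)
print(f"{d:.2f} km")  # roughly 1.4 km at this latitude
```

So the "1 km radius" comment is a slight underestimate at the corners; close enough for a demo.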
The ESP32-CAM Setup
Why ESP32-CAM?
This tiny board is a beast. For like $10, you get:
- Dual-core 240MHz processor
- 2MP camera with decent image quality
- WiFi built-in
- GPIO pins for motor control and servos
- Runs on 5V (battery-friendly)
The only downside? No built-in USB programmer. You need an FTDI adapter or another ESP32 to flash it. Minor inconvenience for the price.
Camera Stream Setup
The ESP32-CAM creates a web server that streams JPEG frames. You access it by hitting an HTTP endpoint:
// In Arduino setup()
camera_config_t config;
config.ledc_channel = LEDC_CHANNEL_0;
config.ledc_timer = LEDC_TIMER_0;
config.pin_d0 = Y2_GPIO_NUM;
// ... more pin configurations ...
config.frame_size = FRAMESIZE_VGA; // 640x480
config.jpeg_quality = 10; // 0-63, lower means better
config.fb_count = 2;
esp_err_t err = esp_camera_init(&config);
if (err != ESP_OK) {
    Serial.printf("Camera init failed: 0x%x", err);
    return;
}

// Start web server
startCameraServer();
Then from Python, you can access the stream as a video source:
# Use the ESP32-CAM stream as input
python detect.py --source http://192.168.4.1/Camera --weights yolov5s.pt
Motor Control
Controlling the motors is straightforward with an L298N motor driver. The ESP32 sends PWM signals to control speed and direction:
#define MOTOR_LEFT_FWD 12
#define MOTOR_LEFT_BWD 13
#define MOTOR_RIGHT_FWD 14
#define MOTOR_RIGHT_BWD 15
void moveForward() {
    digitalWrite(MOTOR_LEFT_FWD, HIGH);
    digitalWrite(MOTOR_LEFT_BWD, LOW);
    digitalWrite(MOTOR_RIGHT_FWD, HIGH);
    digitalWrite(MOTOR_RIGHT_BWD, LOW);
}

void moveBackward() {
    digitalWrite(MOTOR_LEFT_FWD, LOW);
    digitalWrite(MOTOR_LEFT_BWD, HIGH);
    digitalWrite(MOTOR_RIGHT_FWD, LOW);
    digitalWrite(MOTOR_RIGHT_BWD, HIGH);
}

void turnLeft() {
    digitalWrite(MOTOR_LEFT_FWD, LOW);
    digitalWrite(MOTOR_LEFT_BWD, HIGH);
    digitalWrite(MOTOR_RIGHT_FWD, HIGH);
    digitalWrite(MOTOR_RIGHT_BWD, LOW);
}

void stop() {
    digitalWrite(MOTOR_LEFT_FWD, LOW);
    digitalWrite(MOTOR_LEFT_BWD, LOW);
    digitalWrite(MOTOR_RIGHT_FWD, LOW);
    digitalWrite(MOTOR_RIGHT_BWD, LOW);
}
Pan/Tilt Servos
Two SG90 servos give the camera two degrees of freedom. You control them with the standard servo library:
#include <ESP32Servo.h>
Servo panServo;
Servo tiltServo;
void setup() {
    panServo.attach(2);   // Pan servo on GPIO 2
    tiltServo.attach(4);  // Tilt servo on GPIO 4
    // Center position
    panServo.write(90);
    tiltServo.write(90);
}

void lookLeft() {
    panServo.write(45);
}

void lookRight() {
    panServo.write(135);
}

void lookUp() {
    tiltServo.write(45);
}

void lookDown() {
    tiltServo.write(135);
}
Putting It All Together
The Complete Workflow
Here's how everything works end-to-end:
- Robot Powers On: ESP32-CAM boots up, connects to WiFi, starts camera server
- Control Interface: User opens web dashboard, sees live video feed
- Remote Control: User drives robot around using arrow keys or on-screen buttons
- Video Processing: Python script grabs frames from ESP32-CAM stream
- Object Detection: YOLOv5 analyzes each frame for people
- Person Detected: If a person is found and cooldown allows:
- Generate/retrieve GPS coordinates
- Log detection to Supabase with timestamp
- Send SMS alert with location
- Start cooldown timer
- Continue Monitoring: Keep processing frames, respect cooldown
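Stitched together, the host-side loop looks roughly like this. A simplified sketch: `detect_people`, `get_gps`, `send_alert`, and `log_detection` stand in for the YOLOv5 inference, GPS, Vonage, and Supabase pieces described above, and are passed in so the loop itself stays testable:

```python
import time

def monitor(frames, detect_people, get_gps, send_alert, log_detection,
            cooldown_s=40, clock=time.time):
    """Run detection over a frame stream, alerting at most once per cooldown."""
    last_alert = None
    alerts = 0
    for frame in frames:
        if not detect_people(frame):       # YOLOv5 inference goes here
            continue
        now = clock()
        if last_alert is None or now - last_alert >= cooldown_s:
            lat, lon = get_gps()           # real GPS module or the random stub
            log_detection(lon, lat)        # Supabase insert
            send_alert(lat, lon)           # Vonage SMS
            last_alert = now
            alerts += 1
    return alerts
```

Injecting a fake clock makes the cooldown behavior verifiable without waiting 40 real seconds.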
Running the Detection
To start the full system:
# 1. Power on the robot and connect to WiFi
# 2. Find the robot's IP address (check serial monitor or router)
# 3. Run the custom detection script
python detect.py --source http://192.168.4.1/Camera --weights yolov5s.pt --conf-thres 0.4
# Options explained:
# --source: ESP32-CAM stream URL
# --weights: YOLOv5 model (s=small, m=medium, l=large, x=extra large)
# --conf-thres: Confidence threshold (0.4 = 40% confidence minimum)
Performance Tuning
Choosing the Right Model
YOLOv5 comes in 5 sizes. Here's what I found in testing:
| Model | Size | FPS (CPU) | Accuracy |
|---|---|---|---|
| yolov5n | 1.9 MB | ~45 FPS | Good |
| yolov5s | 14 MB | ~30 FPS | Better |
| yolov5m | 40 MB | ~20 FPS | Great |
| yolov5l | 89 MB | ~12 FPS | Excellent |
| yolov5x | 166 MB | ~7 FPS | Best |
For real-time robot surveillance, I use yolov5s. It's the sweet spot between speed and accuracy. If you have a GPU, you can easily use yolov5l or yolov5x for better detection.
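One way to make that trade-off explicit in code: pick the heaviest model that still meets your FPS budget. A sketch using the rough CPU numbers from my table (the `pick_model` helper is mine, not part of YOLOv5):

```python
# Approximate CPU throughput from my testing (see table above),
# ordered fastest/least accurate to slowest/most accurate
MODEL_FPS = {
    "yolov5n": 45,
    "yolov5s": 30,
    "yolov5m": 20,
    "yolov5l": 12,
    "yolov5x": 7,
}

def pick_model(min_fps):
    """Return the most accurate model that still hits `min_fps` on CPU."""
    candidates = [m for m, fps in MODEL_FPS.items() if fps >= min_fps]
    if not candidates:
        raise ValueError(f"No model reaches {min_fps} FPS on CPU")
    return candidates[-1]  # last match = largest model meeting the budget

print(pick_model(25))  # yolov5s: the heaviest model still above 25 FPS
```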
Confidence Threshold Tuning
The confidence threshold determines how sure YOLOv5 needs to be before reporting a detection. I found these values work well:
- 0.3: More detections, more false positives (good for not missing anyone)
- 0.4: Balanced (what I use)
- 0.5: High confidence only, fewer false alarms
- 0.6+: Very conservative, might miss some people
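Under the hood, the threshold is just a filter over per-detection confidence scores, so you can see its effect directly. A standalone sketch with made-up scores, not YOLOv5's internal code:

```python
def filter_detections(detections, conf_thres):
    """Keep only detections at or above the confidence threshold."""
    return [d for d in detections if d["conf"] >= conf_thres]

# Hypothetical person detections for one frame
frame = [
    {"label": "person", "conf": 0.92},
    {"label": "person", "conf": 0.45},
    {"label": "person", "conf": 0.31},
]

print(len(filter_detections(frame, 0.3)))  # 3 - catches everyone, more noise
print(len(filter_detections(frame, 0.4)))  # 2 - the balanced setting
print(len(filter_detections(frame, 0.5)))  # 1 - high confidence only
```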
Lessons Learned
1. Hardware is Harder Than You Think
Software bugs? Recompile and redeploy. Hardware bugs? Desolder, rewire, test, repeat. I burned through 3 motor drivers before I realized I was exceeding their current rating. Check your specs, use proper power supplies, and don't skimp on components.
2. Test Components Individually
Don't assemble everything and then wonder why nothing works. Test the camera first. Then add motors. Then servos. Then the detection. Incremental testing saved me hours of debugging.
3. Real-Time Constraints Are Real
When you're processing video at 15-30 FPS, every millisecond counts. I had to learn to write efficient Python code, optimize YOLOv5 inference, and minimize network latency. Profiling tools became my best friends.
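A tiny per-stage timer is often enough to find where the milliseconds go in a frame loop, before reaching for a full profiler. A minimal sketch using only the standard library (the sleeps stand in for real work):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)

@contextmanager
def timed(stage):
    """Accumulate wall-clock time spent in each named stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start

# Inside the frame loop:
with timed("grab"):
    time.sleep(0.01)   # stand-in for reading a frame from the stream
with timed("infer"):
    time.sleep(0.02)   # stand-in for YOLOv5 inference

# Report, slowest stage first
for stage, total in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {total * 1000:.1f} ms")
```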
4. Battery Life is Always Less Than Expected
Datasheets lie. Or rather, they give best-case scenarios. Real-world battery life is always shorter. Plan accordingly and build in buffer capacity.
5. Documentation Matters
Three months after building this, I came back to add a feature and had no idea how anything worked. Document your wiring diagrams, pin assignments, and code logic. Future you will be grateful.
What's Next?
Some improvements I'm planning:
- Autonomous Navigation: Add obstacle avoidance with ultrasonic sensors
- Face Recognition: Not just detect people, but identify specific individuals
- Edge AI: Run YOLOv5 directly on ESP32-CAM using TensorFlow Lite
- Multi-Robot Coordination: Multiple robots covering a larger area
- Better GPS: Replace random coordinates with actual GPS module data
- Voice Commands: Control the robot with Alexa or Google Assistant
- Night Vision: Add IR LEDs for low-light operation
Wrapping Up
Building RRBot was one of those projects that combined everything I love: hardware, software, AI, and solving real problems. It's not perfect - the WiFi range could be better, battery life could be longer, and detection accuracy could improve - but it works.
And that's the point. You don't need a perfect project. You need a working project that you can iterate on.
The beauty of this setup is its modularity. Don't like Vonage? Swap in Twilio. Want better object detection? Train a custom YOLOv5 model. Need faster processing? Add a GPU. The architecture supports all these upgrades without major rewrites.
If you're building something similar, hit me up on GitHub. I'd love to see what you come up with!
GitHub: github.com/abhii2003/RRBOT
Status: Prototype Complete, Continuous Improvements