Building a Smart Surveillance Robot: YOLOv5 Meets ESP32
The Idea
So here's the thing: I wanted to build something that combined hardware and AI in a way that actually solves a real problem. Not just another "hello world" IoT project, but something with practical applications.
The result? A remote-controlled robot with a camera that can detect people automatically, send SMS alerts with GPS coordinates, and stream video in real-time. Think disaster relief, security surveillance, or search and rescue operations.
It's called RRBot (Rescue & Reconnaissance Bot), and it's honestly one of the coolest hardware projects I've built.
Check it out: github.com/abhii2003/RRBOT
What We're Actually Building
Let me break down what this thing does:
- Remote Control: Drive it around like an RC car from a web interface
- Live Video Feed: See what the robot sees in real-time via ESP32-CAM
- AI Person Detection: YOLOv5 running on my laptop processes the video feed
- SMS Alerts: When someone's detected, sends an instant text message via Vonage API
- GPS Tracking: Stores detection coordinates in Supabase
- Pan/Tilt Camera: Two servos let you look around remotely
The Stack (Hardware + Software)
Hardware Side
- ESP32-CAM: The brain and eyes of the robot
- Motor Driver (L298N): Controls the wheels
- DC Motors: For movement
- SG90 Servos (x2): Pan and tilt for the camera
- 18650 Battery Pack: Power source
- GPS Module (optional): For real location tracking
Software Stack
- YOLOv5: State-of-the-art object detection
- Python: For the AI processing
- Vonage API: SMS notifications
- Supabase: Real-time database for tracking detections
- ESP32 Arduino: Firmware for the robot
Setting Up YOLOv5 (The Smart Part)
Why YOLOv5?
I needed object detection that was fast, accurate, and could run on a laptop (not everyone has access to GPU clusters). YOLOv5 checked all the boxes:
- Real-time detection (30+ FPS even on CPU)
- Pre-trained on 80 object classes including "person"
- Multiple model sizes (trade speed for accuracy)
- Easy to customize and integrate
- Excellent documentation from Ultralytics
Getting Started
First, clone the YOLOv5 repo and install dependencies:
git clone https://github.com/ultralytics/yolov5.git
cd yolov5
pip install -r requirements.txt
Then test it out with your webcam to make sure everything works:
python detect.py --source 0 --weights yolov5s.pt --view-img
If you see detection boxes around objects on your webcam feed, you're good to go. The first run downloads the pre-trained weights automatically (about 14 MB for yolov5s).
Customizing Detection for the Robot
The Problem with Vanilla YOLOv5
The default detect.py is great for basic object detection. But I needed more:
- Send SMS alerts when people are detected
- Don't spam me with 30 texts per second
- Store detection events in a database
- Include GPS coordinates with each detection
So I had to roll up my sleeves and customize the detection script.
Key Customizations
1. SMS Integration with Vonage
First, I set up Vonage (formerly Nexmo) for SMS. It's got a generous free tier and the API is straightforward:
import vonage

client = vonage.Client(key="your-api-key", secret="your-api-secret")
sms = vonage.Sms(client)

def send_alert(lat, lon):
    responseData = sms.send_message({
        "from": "RRBot",
        "to": "+1234567890",  # Your phone number
        "text": f"Person detected! Location: {lat}, {lon}"
    })
    if responseData["messages"][0]["status"] == "0":
        print("SMS sent successfully")
    else:
        print(f"SMS failed: {responseData['messages'][0]['error-text']}")
2. Supabase Database Integration
I needed to log every detection with a timestamp and coordinates. Supabase made this stupidly easy:
from datetime import datetime
from supabase import create_client

API_URL = 'https://your-project.supabase.co'
API_KEY = 'your-api-key'
supabase = create_client(API_URL, API_KEY)

def log_detection(lon, lat):
    data = {
        "longitude": lon,
        "latitude": lat,
        "timestamp": datetime.now().isoformat()
    }
    result = supabase.table('detections').insert(data).execute()
    print(f"Logged detection: {result.data}")
3. Smart Cooldown Timer
Here's the thing: YOLOv5 runs at 30 FPS. Without a cooldown, I'd get 30 SMS alerts per second when someone walks in front of the camera. My phone would explode, and my Vonage credits would vanish.
Solution? A simple cooldown timer:
import time

last_detection_time = None
COOLDOWN_SECONDS = 40

# Inside the detection loop
for detection in results:
    if detection.class_name == "person":
        current_time = time.time()
        # Only alert if the cooldown has passed
        if last_detection_time is None or current_time - last_detection_time >= COOLDOWN_SECONDS:
            lat, lon = get_gps_coordinates()
            log_detection(lon, lat)
            send_alert(lat, lon)
            last_detection_time = current_time
            print(f"Alert sent! Next alert available in {COOLDOWN_SECONDS}s")
Now I only get one alert every 40 seconds, no matter how many frames detect a person. Much better.
4. GPS Coordinate Generation
For the demo, I generate random GPS coordinates near a specific location. In a real deployment, you'd hook up an actual GPS module:
import random

def generate_random_location(base_lat=37.7749, base_lon=-122.4194):
    # Small random offset (about 1 km radius)
    lat_offset = random.uniform(-0.01, 0.01)
    lon_offset = random.uniform(-0.01, 0.01)
    return base_lat + lat_offset, base_lon + lon_offset
For production, you'd replace this with actual GPS data from a module like the u-blox NEO-6M.
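If you want to sanity-check that a ±0.01° offset really stays within roughly a kilometer, the haversine formula gives the worst-case corner distance. A quick sketch, nothing robot-specific:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Worst case: both offsets at the +0.01 corner, from the San Francisco base point
d = haversine_km(37.7749, -122.4194, 37.7849, -122.4094)
print(f"{d:.2f} km")  # roughly 1.4 km at this latitude
```

So the "1 km radius" comment is a slight underestimate at the corners; close enough for a demo.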
The ESP32-CAM Setup
Why ESP32-CAM?
This tiny board is a beast. For like $10, you get:
- Dual-core 240MHz processor
- 2MP camera with decent image quality
- WiFi built-in
- GPIO pins for motor control and servos
- Runs on 5V (battery-friendly)
The only downside? No built-in USB programmer. You need an FTDI adapter or another ESP32 to flash it. Minor inconvenience for the price.
Camera Stream Setup
The ESP32-CAM creates a web server that streams JPEG frames. You access it by hitting an HTTP endpoint:
// In Arduino setup()
camera_config_t config;
config.ledc_channel = LEDC_CHANNEL_0;
config.ledc_timer = LEDC_TIMER_0;
config.pin_d0 = Y2_GPIO_NUM;
// ... more pin configurations ...
config.frame_size = FRAMESIZE_VGA; // 640x480
config.jpeg_quality = 10; // 0-63, lower means better
config.fb_count = 2;
esp_err_t err = esp_camera_init(&config);
if (err != ESP_OK) {
    Serial.printf("Camera init failed: 0x%x", err);
    return;
}

// Start web server
startCameraServer();
Then from Python, you can access the stream as a video source:
# Use the ESP32-CAM stream as input
python detect.py --source http://192.168.4.1/Camera --weights yolov5s.pt
Motor Control
Controlling the motors is straightforward with an L298N motor driver. The ESP32 sends PWM signals to control speed and direction:
#define MOTOR_LEFT_FWD 12
#define MOTOR_LEFT_BWD 13
#define MOTOR_RIGHT_FWD 14
#define MOTOR_RIGHT_BWD 15
void moveForward() {
    digitalWrite(MOTOR_LEFT_FWD, HIGH);
    digitalWrite(MOTOR_LEFT_BWD, LOW);
    digitalWrite(MOTOR_RIGHT_FWD, HIGH);
    digitalWrite(MOTOR_RIGHT_BWD, LOW);
}

void moveBackward() {
    digitalWrite(MOTOR_LEFT_FWD, LOW);
    digitalWrite(MOTOR_LEFT_BWD, HIGH);
    digitalWrite(MOTOR_RIGHT_FWD, LOW);
    digitalWrite(MOTOR_RIGHT_BWD, HIGH);
}

void turnLeft() {
    digitalWrite(MOTOR_LEFT_FWD, LOW);
    digitalWrite(MOTOR_LEFT_BWD, HIGH);
    digitalWrite(MOTOR_RIGHT_FWD, HIGH);
    digitalWrite(MOTOR_RIGHT_BWD, LOW);
}

void stop() {
    digitalWrite(MOTOR_LEFT_FWD, LOW);
    digitalWrite(MOTOR_LEFT_BWD, LOW);
    digitalWrite(MOTOR_RIGHT_FWD, LOW);
    digitalWrite(MOTOR_RIGHT_BWD, LOW);
}
Pan/Tilt Servos
Two SG90 servos give the camera two degrees of freedom. You control them with the standard servo library:
#include <ESP32Servo.h>
Servo panServo;
Servo tiltServo;
void setup() {
    panServo.attach(2);   // Pan servo on GPIO 2
    tiltServo.attach(4);  // Tilt servo on GPIO 4
    // Center position
    panServo.write(90);
    tiltServo.write(90);
}

void lookLeft() {
    panServo.write(45);
}

void lookRight() {
    panServo.write(135);
}

void lookUp() {
    tiltServo.write(45);
}

void lookDown() {
    tiltServo.write(135);
}
Putting It All Together
The Complete Workflow
Here's how everything works end-to-end:
- Robot Powers On: ESP32-CAM boots up, connects to WiFi, starts camera server
- Control Interface: User opens web dashboard, sees live video feed
- Remote Control: User drives robot around using arrow keys or on-screen buttons
- Video Processing: Python script grabs frames from ESP32-CAM stream
- Object Detection: YOLOv5 analyzes each frame for people
- Person Detected: If a person is found and cooldown allows:
- Generate/retrieve GPS coordinates
- Log detection to Supabase with timestamp
- Send SMS alert with location
- Start cooldown timer
- Continue Monitoring: Keep processing frames, respect cooldown
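Stitched together, the host-side loop looks roughly like this. A simplified sketch: `detect_people`, `get_gps`, `send_alert`, and `log_detection` stand in for the YOLOv5 inference, GPS, Vonage, and Supabase pieces described above, and are passed in so the loop itself stays testable:

```python
import time

def monitor(frames, detect_people, get_gps, send_alert, log_detection,
            cooldown_s=40, clock=time.time):
    """Run detection over a frame stream, alerting at most once per cooldown."""
    last_alert = None
    alerts = 0
    for frame in frames:
        if not detect_people(frame):       # YOLOv5 inference goes here
            continue
        now = clock()
        if last_alert is None or now - last_alert >= cooldown_s:
            lat, lon = get_gps()           # real GPS module or the random stub
            log_detection(lon, lat)        # Supabase insert
            send_alert(lat, lon)           # Vonage SMS
            last_alert = now
            alerts += 1
    return alerts
```

Injecting a fake clock makes the cooldown behavior verifiable without waiting 40 real seconds.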
Running the Detection
To start the full system:
# 1. Power on the robot and connect to WiFi
# 2. Find the robot's IP address (check serial monitor or router)
# 3. Run the custom detection script
python detect.py --source http://192.168.4.1/Camera --weights yolov5s.pt --conf-thres 0.4
# Options explained:
# --source: ESP32-CAM stream URL
# --weights: YOLOv5 model (s=small, m=medium, l=large, x=extra large)
# --conf-thres: Confidence threshold (0.4 = 40% confidence minimum)
Performance Tuning
Choosing the Right Model
YOLOv5 comes in 5 sizes. Here's what I found in testing:
| Model | Size | FPS (CPU) | Accuracy |
|---|---|---|---|
| yolov5n | 1.9 MB | ~45 FPS | Good |
| yolov5s | 14 MB | ~30 FPS | Better |
| yolov5m | 40 MB | ~20 FPS | Great |
| yolov5l | 89 MB | ~12 FPS | Excellent |
| yolov5x | 166 MB | ~7 FPS | Best |
For real-time robot surveillance, I use yolov5s. It's the sweet spot between speed and accuracy. If you have a GPU, you can easily use yolov5l or yolov5x for better detection.
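One way to make that trade-off explicit in code: pick the heaviest model that still meets your FPS budget. A sketch using the rough CPU numbers from my table (the `pick_model` helper is mine, not part of YOLOv5):

```python
# Approximate CPU throughput from my testing (see table above),
# ordered fastest/least accurate to slowest/most accurate
MODEL_FPS = {
    "yolov5n": 45,
    "yolov5s": 30,
    "yolov5m": 20,
    "yolov5l": 12,
    "yolov5x": 7,
}

def pick_model(min_fps):
    """Return the most accurate model that still hits `min_fps` on CPU."""
    candidates = [m for m, fps in MODEL_FPS.items() if fps >= min_fps]
    if not candidates:
        raise ValueError(f"No model reaches {min_fps} FPS on CPU")
    return candidates[-1]  # last match = largest model meeting the budget

print(pick_model(25))  # yolov5s: the heaviest model still above 25 FPS
```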
Confidence Threshold Tuning
The confidence threshold determines how sure YOLOv5 needs to be before reporting a detection. I found these values work well:
- 0.3: More detections, more false positives (good for not missing anyone)
- 0.4: Balanced (what I use)
- 0.5: High confidence only, fewer false alarms
- 0.6+: Very conservative, might miss some people
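Under the hood, the threshold is just a filter over per-detection confidence scores, so you can see its effect directly. A standalone sketch with made-up scores, not YOLOv5's internal code:

```python
def filter_detections(detections, conf_thres):
    """Keep only detections at or above the confidence threshold."""
    return [d for d in detections if d["conf"] >= conf_thres]

# Hypothetical person detections for one frame
frame = [
    {"label": "person", "conf": 0.92},
    {"label": "person", "conf": 0.45},
    {"label": "person", "conf": 0.31},
]

print(len(filter_detections(frame, 0.3)))  # 3 - catches everyone, more noise
print(len(filter_detections(frame, 0.4)))  # 2 - the balanced setting
print(len(filter_detections(frame, 0.5)))  # 1 - high confidence only
```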
Lessons Learned
1. Hardware is Harder Than You Think
Software bugs? Recompile and redeploy. Hardware bugs? Desolder, rewire, test, repeat. I burned through 3 motor drivers before I realized I was exceeding their current rating. Check your specs, use proper power supplies, and don't skimp on components.
2. Test Components Individually
Don't assemble everything and then wonder why nothing works. Test the camera first. Then add motors. Then servos. Then the detection. Incremental testing saved me hours of debugging.
3. Real-Time Constraints Are Real
When you're processing video at 15-30 FPS, every millisecond counts. I had to learn to write efficient Python code, optimize YOLOv5 inference, and minimize network latency. Profiling tools became my best friends.
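A tiny per-stage timer is often enough to find where the milliseconds go in a frame loop, before reaching for a full profiler. A minimal sketch using only the standard library (the sleeps stand in for real work):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)

@contextmanager
def timed(stage):
    """Accumulate wall-clock time spent in each named stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start

# Inside the frame loop:
with timed("grab"):
    time.sleep(0.01)   # stand-in for reading a frame from the stream
with timed("infer"):
    time.sleep(0.02)   # stand-in for YOLOv5 inference

# Report, slowest stage first
for stage, total in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {total * 1000:.1f} ms")
```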
4. Battery Life is Always Less Than Expected
Datasheets lie. Or rather, they give best-case scenarios. Real-world battery life is always shorter. Plan accordingly and build in buffer capacity.
5. Documentation Matters
Three months after building this, I came back to add a feature and had no idea how anything worked. Document your wiring diagrams, pin assignments, and code logic. Future you will be grateful.
What's Next?
Some improvements I'm planning:
- Autonomous Navigation: Add obstacle avoidance with ultrasonic sensors
- Face Recognition: Not just detect people, but identify specific individuals
- Edge AI: Run YOLOv5 directly on ESP32-CAM using TensorFlow Lite
- Multi-Robot Coordination: Multiple robots covering a larger area
- Better GPS: Replace random coordinates with actual GPS module data
- Voice Commands: Control the robot with Alexa or Google Assistant
- Night Vision: Add IR LEDs for low-light operation
Wrapping Up
Building RRBot was one of those projects that combined everything I love: hardware, software, AI, and solving real problems. It's not perfect - the WiFi range could be better, battery life could be longer, and detection accuracy could improve - but it works.
And that's the point. You don't need a perfect project. You need a working project that you can iterate on.
The beauty of this setup is its modularity. Don't like Vonage? Swap in Twilio. Want better object detection? Train a custom YOLOv5 model. Need faster processing? Add a GPU. The architecture supports all these upgrades without major rewrites.
If you're building something similar, hit me up on GitHub. I'd love to see what you come up with!
GitHub: github.com/abhii2003/RRBOT
Status: Prototype Complete, Continuous Improvements