AUG 2024 • COMPUTER VISION & AI

Industrial AI & Computer Vision Platform

Real-time factory surveillance and automation using YOLO object detection, MediaPipe hand tracking, and multi-stream video processing—deployed at Haig's Quality Printing during Summer 2024 internship.

Tags: YOLOv8 · OpenCV · MediaPipe · PyTorch · RTSP Streaming · Computer Vision

The Problem

Manufacturing floors are complex, dynamic environments where seconds matter. A conveyor belt stalls. A worker signals for assistance. A quality control checkpoint needs immediate attention. Traditional surveillance systems record everything but understand nothing—drowning operators in footage without insight.

At Haig's Quality Printing, a commercial printing facility in Las Vegas, I faced this exact challenge: How do you build intelligence into industrial monitoring? How do you transform passive cameras into active collaborators that recognize machines, track gestures, and alert operators before problems escalate?

The answer wasn't just deploying off-the-shelf AI models—it required building a unified platform that combined real-time video surveillance, custom object detection, gesture recognition, and conversational AI into a system that factory workers could actually use.

System Architecture

The platform integrates four distinct AI capabilities into a modular Python application:

  • Multi-Stream Video Surveillance: RTSP protocol monitoring up to 4 simultaneous camera feeds in real-time 2×2 grid display
  • YOLO Object Detection: Factory-specific machine identification using Ultralytics YOLOv8 nano/small models
  • MediaPipe Hand Tracking: 21-point hand landmark detection for gesture-based control interfaces
  • GPT-2 Chatbot: Natural language interface for system queries and operational insights

Each subsystem runs independently but shares a common data pipeline: video frames flow from RTSP streams through OpenCV capture buffers into inference engines (YOLOv8 or MediaPipe), with results rendered back to display surfaces at target framerates.
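This producer/consumer pattern can be sketched with a bounded queue: the capture thread always keeps the newest frame, dropping stale ones when inference falls behind. The frame source and detector are stubbed here; the real system feeds RTSP frames into YOLOv8 or MediaPipe.

```python
import queue
import threading

def capture_loop(source, buf):
    """Producer: push frames, evicting the oldest when the buffer is full."""
    for frame in source:
        try:
            buf.put_nowait(frame)
        except queue.Full:
            try:
                buf.get_nowait()        # drop the stalest frame...
            except queue.Empty:
                pass
            buf.put_nowait(frame)       # ...so the newest always gets in
    buf.put(None)                       # sentinel: stream ended

def inference_loop(buf, detect):
    """Consumer: run an inference engine on each buffered frame."""
    results = []
    while (frame := buf.get()) is not None:
        results.append(detect(frame))
    return results

# Stubbed demo: integer "frames", a detector that just labels them.
buf = queue.Queue(maxsize=3)
producer = threading.Thread(target=capture_loop, args=(range(10), buf))
producer.start()
detections = inference_loop(buf, lambda f: f"frame-{f}")
producer.join()
```

Because the producer evicts rather than blocks, a slow detector sees fewer frames but never a stale backlog, which is the behavior the surveillance loop needs.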

Custom YOLO Training for Factory Machines

Generic object detectors recognize cats, cars, and people—but they know nothing about industrial printing equipment. I needed a model that could differentiate between a Roland DG press, a paper cutter, and a binding machine in real-time.

Data Collection & Labeling: Captured 1,200 images across different lighting conditions, camera angles, and machine states. Used Roboflow for annotation, creating bounding boxes and class labels for 8 machine types. Applied augmentation (rotation, brightness adjustment, noise injection) to reach 3,000 training examples.

Model Selection & Training: Chose YOLOv8 Nano (6.5MB) for its balance of speed and accuracy. Trained for 100 epochs on NVIDIA GPU with batch size 16, learning rate 0.01, achieving:

  • mAP@0.5: 87.3% (mean Average Precision at 50% IoU threshold)
  • Inference Speed: 100 FPS on RTX 3060, 20 FPS on CPU (Intel i7)
  • Model Size: 6.5MB (deployable on edge devices)

The lightweight architecture enables real-time detection on modest hardware—critical for factory deployment where dedicated AI accelerators aren't always available.
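The training run described above maps onto the Ultralytics Python API roughly as follows. This is a sketch, assuming a Roboflow-exported dataset config named `factory_machines.yaml` (hypothetical; the actual dataset file isn't published here).

```python
# Hyperparameters reported above, collected in one place.
TRAIN_CFG = {
    "data": "factory_machines.yaml",  # hypothetical dataset config path
    "epochs": 100,
    "batch": 16,
    "lr0": 0.01,      # initial learning rate
    "imgsz": 640,     # training image size
}

def train_factory_detector(cfg=TRAIN_CFG):
    """Fine-tune YOLOv8 Nano on the labeled factory-machine dataset."""
    from ultralytics import YOLO    # imported lazily: heavy dependency
    model = YOLO("yolov8n.pt")      # pretrained nano weights (~6.5 MB)
    model.train(**cfg)
    metrics = model.val()           # reports mAP@0.5 among other metrics
    return model, metrics

# train_factory_detector() kicks off the run on a GPU machine.
```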

MediaPipe Hand Tracking for Gesture Recognition

Factory workers wear gloves, work in varying lighting, and need hands-free controls. MediaPipe's hand landmark detection uses a two-stage pipeline:

  1. Palm Detection: Lightweight CNN localizes hand regions in the frame
  2. Landmark Regression: Regresses 21 3-D landmark coordinates (fingertips, knuckles, wrist) with sub-pixel accuracy

I extended the base MediaPipe implementation to export landmark time-series data to CSV format, enabling:

  • Gesture Classification: Train custom models on collected gesture sequences (wave, point, thumbs-up)
  • Ergonomic Analysis: Track repetitive motion patterns for workplace safety assessments
  • Quality Control Signaling: Workers signal pass/fail decisions via standardized hand gestures

The system tracks up to 2 hands simultaneously at ~30 FPS on GPU, robust to partial occlusions and rapid movements.
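The CSV export extension can be sketched as a pure flattening step: one row per frame, a timestamp plus 63 coordinate values (21 landmarks × x, y, z). In the live loop these rows would come from `mp.solutions.hands` results; the writer is shown standalone here.

```python
import csv

NUM_LANDMARKS = 21  # MediaPipe hand model: wrist, knuckles, fingertips

def landmarks_to_row(timestamp, landmarks):
    """Flatten one hand's 21 (x, y, z) landmarks into a single CSV row."""
    assert len(landmarks) == NUM_LANDMARKS
    row = [timestamp]
    for x, y, z in landmarks:
        row.extend([x, y, z])
    return row  # 1 timestamp + 63 coordinate values

CSV_HEADER = ["t"] + [f"{axis}{i}" for i in range(NUM_LANDMARKS) for axis in "xyz"]

def export_session(path, rows):
    """Write a gesture session to disk for later classifier training."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(CSV_HEADER)
        writer.writerows(rows)
```

A fixed-width row per frame is what makes the downstream gesture classifiers straightforward: a wave or thumbs-up becomes a short time series of 63-dimensional vectors.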

Multi-Stream RTSP Surveillance

Industrial cameras communicate via RTSP (Real Time Streaming Protocol), a session-control protocol that negotiates and controls video delivery (the media itself typically travels over RTP). The challenge: synchronize 4 independent streams, decode them in real time, and composite them into a unified display without frame drops or latency spikes.

Implementation: Used OpenCV's VideoCapture with threading—each stream runs in a dedicated thread with its own frame buffer. A main rendering loop polls buffers at 30Hz, resizes frames to 640×480, and arranges them in a 2×2 grid (1280×960 output).

Critical optimizations:

  • Buffer Management: Circular buffers (size 3) bound memory growth when a stream stalls or reconnects
  • Frame Dropping: If decode time exceeds inter-frame interval, skip frames rather than queuing—maintaining real-time responsiveness
  • FPS Overlay: Display per-stream framerates for immediate diagnostics
  • Automated Recording: Trigger recording on motion detection or manual command, storing H.264-encoded MP4 files
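The 2×2 compositing step reduces to stacking pre-resized frames with NumPy. A minimal sketch, assuming each frame has already been resized to 640×480 (the real loop does this with `cv2.resize`) and substituting a black cell for any dead feed:

```python
import numpy as np

def compose_grid(frames, cell=(480, 640)):
    """Arrange up to four 640x480 frames into the 2x2 (1280x960) display."""
    h, w = cell
    blank = np.zeros((h, w, 3), dtype=np.uint8)       # placeholder for a dead feed
    cells = [f if f is not None else blank for f in frames[:4]]
    cells += [blank] * (4 - len(cells))               # pad if fewer than 4 streams
    top = np.hstack(cells[:2])
    bottom = np.hstack(cells[2:])
    return np.vstack([top, bottom])                   # shape (960, 1280, 3)
```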

Conversational AI for Factory Insights

Beyond visual AI, I integrated a GPT-2 based chatbot for natural language queries: "What machines are running?", "Show me today's error log", "How many units produced this shift?"

The chatbot uses Hugging Face Transformers with a fine-tuned GPT-2 small model (124M parameters), trained on factory-specific terminology and common operator questions. Deployed locally to avoid cloud latency and maintain data privacy—critical for industrial IP protection.
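A local inference sketch using Hugging Face's `pipeline` API, assuming a fine-tuned checkpoint saved at `./gpt2-factory` (hypothetical path) and an illustrative prompt format; the deployment's actual prompt template and generation settings may differ.

```python
def build_prompt(question, context_lines):
    """Prepend recent system state so the model answers about *this* factory."""
    context = "\n".join(f"- {line}" for line in context_lines)
    return f"Factory status:\n{context}\nOperator: {question}\nAssistant:"

def answer(question, context_lines, model_dir="./gpt2-factory"):
    """Generate a reply from the locally fine-tuned GPT-2 checkpoint."""
    from transformers import pipeline  # lazy import: avoids loading torch up front
    generator = pipeline("text-generation", model=model_dir)
    prompt = build_prompt(question, context_lines)
    out = generator(prompt, max_new_tokens=60, do_sample=False)
    return out[0]["generated_text"][len(prompt):].strip()
```

Running everything from a local checkpoint directory is what keeps inference on-premises, with no factory data leaving the network.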

Performance Metrics & Real-World Deployment

Hardware Setup:

  • Intel i7-10700K (8 cores, 16 threads)
  • NVIDIA RTX 3060 (12GB VRAM)
  • 32GB DDR4 RAM
  • 1TB NVMe SSD (for video buffering)

Achieved Performance:

  • Multi-stream surveillance: 30 FPS per stream (4 streams simultaneous)
  • YOLO object detection: 100 FPS (GPU), 20 FPS (CPU)
  • Hand tracking: 30 FPS (GPU), 15 FPS (CPU)
  • End-to-end latency: 45ms (camera → detection → display)

Operational Impact: Deployed for 6 weeks during summer internship, monitoring 4 production lines across 12-hour shifts. Key outcomes:

  • Reduced machine downtime identification from minutes to seconds
  • Automated quality control checkpoints with gesture-based signaling
  • Provided historical video archives for incident investigation
  • Enabled remote monitoring for shift supervisors via network streams

Engineering Challenges & Solutions

Challenge 1: Network Bandwidth Saturation
Four 1080p RTSP streams consumed 40 Mbps—overloading factory WiFi. Solution: Negotiated lower resolution (720p) and compression (H.264 CRF 28), reducing bandwidth to 12 Mbps with acceptable quality loss.
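The bandwidth math is back-of-envelope: roughly 10 Mbps per 1080p H.264 stream before the change and about 3 Mbps per 720p stream after, per-stream figures implied by the totals above.

```python
def total_mbps(per_stream_mbps, n_streams=4):
    """Aggregate uplink needed for n simultaneous camera streams."""
    return per_stream_mbps * n_streams

assert total_mbps(10) == 40   # 1080p at ~10 Mbps/stream saturated the WiFi
assert total_mbps(3) == 12    # 720p at CRF 28 brought it down to ~12 Mbps
```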

Challenge 2: Lighting Variability
Factory lighting changes dramatically between day/night shifts, degrading detection accuracy. Solution: Trained YOLO model with synthetic lighting augmentation and deployed real-time histogram equalization pre-processing.
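Histogram equalization spreads the image's intensity CDF across the full 8-bit range, which flattens day/night lighting differences before frames reach the detector. A pure-NumPy sketch of the transform (the deployed pre-processing would use OpenCV's `cv2.equalizeHist` or CLAHE, which implement the same idea):

```python
import numpy as np

def equalize_gray(img):
    """Histogram-equalize an 8-bit grayscale frame via its intensity CDF."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]              # first nonzero CDF value
    if cdf[-1] == cdf_min:                 # flat image: nothing to stretch
        return img.copy()
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[img]                        # remap every pixel through the LUT
```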

Challenge 3: Occlusion Handling
Workers frequently block camera views. Solution: Implemented multi-camera fusion—if one view is occluded, prioritize detections from alternate angles with higher confidence scores.
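The fusion rule can be sketched as a per-class argmax over camera views; the detection dict format below is illustrative, not the project's actual schema.

```python
def fuse_detections(per_camera):
    """For each machine class, keep the single highest-confidence detection
    across all camera views (occluded views yield low-confidence or no boxes)."""
    best = {}
    for camera_id, detections in per_camera.items():
        for det in detections:   # det: {"cls": str, "conf": float, "box": tuple}
            cls = det["cls"]
            if cls not in best or det["conf"] > best[cls]["conf"]:
                best[cls] = {**det, "camera": camera_id}
    return best
```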

Challenge 4: Model Drift
New machine models introduced mid-deployment weren't recognized. Solution: Built data collection pipeline—operators flag misclassifications, creating labeled examples for incremental retraining.
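A minimal sketch of the flagging side of that pipeline: each operator correction is appended as one JSON line, giving a labeled example for the next retraining run (the file layout and field names here are assumptions, not the project's actual format).

```python
import json
import time

def flag_misclassification(log_path, image_path, predicted, corrected):
    """Append an operator-flagged correction to a JSONL log; each row
    becomes a labeled example for incremental retraining."""
    record = {
        "ts": time.time(),
        "image": image_path,
        "predicted": predicted,
        "corrected": corrected,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Append-only JSONL keeps the flagging step cheap on the factory floor; a periodic job can then fold the logged frames back into the Roboflow dataset.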

Technical Resources

Full source code, trained YOLO models, and deployment instructions available on GitHub:

→ View on GitHub

Repository includes:

  • haigs_app/ — Main surveillance and detection application
  • object_detection/ — YOLO training scripts and pre-trained weights
  • video_streaming/ — RTSP stream testing utilities
  • chat_bot/ — GPT-2 chatbot integration
  • requirements.txt — Python dependencies (OpenCV, ultralytics, mediapipe, etc.)

Future Directions

This project demonstrated the feasibility of integrated AI systems in industrial environments. Potential extensions:

  • Predictive Maintenance: Use time-series analysis on machine vibration patterns (via video motion amplification) to predict failures
  • Automated Defect Detection: Train semantic segmentation models to identify product defects in real-time
  • Worker Safety Monitoring: Detect PPE compliance (helmets, gloves) and unsafe behaviors (proximity to machinery)
  • Production Analytics: Integrate with MES (Manufacturing Execution Systems) for automated throughput tracking
  • Edge Deployment: Port to NVIDIA Jetson Nano for decentralized processing at each camera node