
ViewR - AWS Rekognition and RTSP

Full deep-dive on the ViewR vision and camera stack - RTSP frame sampling with FFmpeg, local pre-filtering with MediaPipe, Rekognition Collections for matching, the de-dup and threshold logic, and the ONVIF PTZ implementation.

The pipeline at a glance

The ViewR pipeline took raw RTSP frames from IP cameras at the customer site and turned them into "this person walked through this door at this time" events on a dashboard. End to end the chain was -

RTSP capture from the camera over LAN.
FFmpeg decode and frame sampling at 2 FPS.
Local face detection via MediaPipe to filter out empty frames.
Crop the face, send to Rekognition for matching against the customer's collection.
De-duplicate matches, apply per-customer thresholds.
Write the entry event to DynamoDB and push to the dashboard.

The whole pipeline ran inside the Electron app on a Windows desktop at the customer site. The only cloud calls were the Rekognition matching and the event writes. That kept latency low and bandwidth manageable, and it avoided pushing raw video over the customer's office uplink.

RTSP - the protocol

RTSP (Real Time Streaming Protocol) is the standard for IP camera streaming. RFC 2326 defines version 1.0, which is what most cameras speak, and RFC 7826 defines 2.0. It is also a mess. Every vendor implements it slightly differently, authentication is basic-auth or digest-auth over an unencrypted connection (basic-auth credentials show up in plain text in pcaps), and the actual video payload is RTP over UDP (or interleaved over the RTSP TCP connection, depending on the camera).

I did not write an RTSP client. FFmpeg has one that is more battle-tested than anything I could write in a weekend. The Electron app spawned an FFmpeg child process per camera with arguments like -

Input: the RTSP URL.
Codec: copy or decode depending on what we needed.
Output: piped raw frames at a reduced frame rate to the parent process.

ffmpeg -i rtsp://user:pass@cam.local/stream1 -vf fps=2 -f image2pipe -vcodec mjpeg pipe:1

That command samples the camera at 2 FPS, encodes each frame as JPEG, and pipes it to stdout. The Electron parent process reads the JPEG stream off the child's stdout, splits it on the JPEG SOI marker, and processes each frame.
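
A minimal sketch of the frame-splitting step, not the original implementation. The class name JpegFrameSplitter and helper findMarker are hypothetical. It assumes every frame starts with SOI (FF D8) and ends with EOI (FF D9), and that frames carry no embedded thumbnails (which would contain their own markers).

```typescript
// Hypothetical helper: buffers stdout chunks and emits complete JPEG frames.
// Assumption: each frame runs from SOI (FF D8) to the next EOI (FF D9).
class JpegFrameSplitter {
  private buffer: Uint8Array = new Uint8Array(0);

  // Feed one chunk; returns every complete frame available so far.
  push(chunk: Uint8Array): Uint8Array[] {
    const merged = new Uint8Array(this.buffer.length + chunk.length);
    merged.set(this.buffer, 0);
    merged.set(chunk, this.buffer.length);
    this.buffer = merged;

    const frames: Uint8Array[] = [];
    for (;;) {
      const start = findMarker(this.buffer, 0xd8, 0);
      if (start < 0) {
        // No SOI yet: keep only a trailing 0xFF, it may be half of a marker.
        const last = this.buffer.length - 1;
        this.buffer =
          last >= 0 && this.buffer[last] === 0xff
            ? this.buffer.slice(last)
            : new Uint8Array(0);
        break;
      }
      const end = findMarker(this.buffer, 0xd9, start + 2);
      if (end < 0) {
        // Frame is incomplete: drop bytes before SOI and wait for more data.
        this.buffer = this.buffer.slice(start);
        break;
      }
      frames.push(this.buffer.slice(start, end + 2));
      this.buffer = this.buffer.slice(end + 2);
    }
    return frames;
  }
}

// Index of the two-byte marker FF <second> at or after `from`, or -1.
function findMarker(data: Uint8Array, second: number, from: number): number {
  for (let i = from; i + 1 < data.length; i++) {
    if (data[i] === 0xff && data[i + 1] === second) return i;
  }
  return -1;
}
```

In Node, a Buffer is a Uint8Array, so each chunk from the FFmpeg child's stdout "data" event could be passed straight to push(). Chunk boundaries do not line up with frame boundaries, which is why the splitter keeps leftover bytes between calls.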

Why 2 FPS, not 30? Two reasons. First, we do not need 30 FPS for face detection at a door: a person walking through is in frame for 1 to 3 seconds, so 2 FPS captures 2 to 6 frames of them. Second, Rekognition bills per call. 30 FPS across 10 cameras is 300 calls per second, which would melt the budget. 2 FPS across 10 cameras is 20 calls per second worst case, which is manageable.

Local pre-filtering

Most frames from an office camera have no face in them. The hallway is empty. Sending every frame to Rekognition would be wasteful.

I ran a cheap local face detector on every sampled frame using MediaPipe in a Web Worker. MediaPipe's face detector is fast (a few ms per frame on a modern CPU) and small (no GPU required). If MediaPipe found no face, we dropped the frame. If it found a face, we cropped to the bounding box (with 20 percent padding) and sent only the crop to Rekognition.
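
The padded crop can be sketched as below. This is an illustration under assumptions: the detector is taken to return a pixel-space box, "20 percent padding" is read as 20 percent of the box size added on each side, and the names Box and padAndClamp are hypothetical.

```typescript
// Pixel-space rectangle; (x, y) is the top-left corner.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Expand a face box by `padding` (fraction of the box size, per side)
// and clamp the result so it never leaves the frame.
function padAndClamp(
  face: Box,
  frameWidth: number,
  frameHeight: number,
  padding: number = 0.2
): Box {
  const padX = face.width * padding;
  const padY = face.height * padding;
  const left = Math.max(0, Math.floor(face.x - padX));
  const top = Math.max(0, Math.floor(face.y - padY));
  const right = Math.min(frameWidth, Math.ceil(face.x + face.width + padX));
  const bottom = Math.min(frameHeight, Math.ceil(face.y + face.height + padY));
  return { x: left, y: top, width: right - left, height: bottom - top };
}
```

Clamping matters for faces near the frame edge, where the padded box would otherwise reach outside the image and the crop call would fail or return garbage.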

This cut Rekognition calls by 70-80 percent on typical office camera placements. Some cameras (entry vestibules with high foot traffic) saw smaller savings. The cost win was real.


Rekognition Collections

A Rekognition Collection is a set of face embeddings tied to external IDs. The two operations that mattered -

IndexFaces - takes an image and an external ID, extracts the face, stores the embedding under that ID. This is enrollment.
SearchFacesByImage - takes a query image, returns the top K matching faces with similarity scores.

One collection per customer organization. The collection size was small (hundreds to low thousands of faces). At that scale, Rekognition's search latency was tens of milliseconds, well under the budget for a "person walks through door, alert appears on dashboard" interaction.
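
A sketch of the request shapes for the two operations and of picking the top match from a response. The field names (CollectionId, Image.Bytes, ExternalImageId, MaxFaces, FaceMatchThreshold, FaceMatches, Similarity) follow the Rekognition API. The local interfaces and the functions enrollmentParams, searchParams, and bestMatch are hypothetical stand-ins so the example runs without the AWS SDK, and the MaxFaces values are assumptions.

```typescript
// Local mirrors of the Rekognition request/response fields used here.
interface ImageBytes { Bytes: Uint8Array }
interface IndexFacesParams {
  CollectionId: string;
  Image: ImageBytes;
  ExternalImageId: string;
  MaxFaces: number;
}
interface SearchFacesByImageParams {
  CollectionId: string;
  Image: ImageBytes;
  FaceMatchThreshold: number;
  MaxFaces: number;
}
interface FaceMatch {
  Similarity?: number;
  Face?: { FaceId?: string; ExternalImageId?: string };
}
interface Identified { userId: string; similarity: number }

// Enrollment: one face per photo, stored under the user's ID.
function enrollmentParams(collectionId: string, userId: string, photo: Uint8Array): IndexFacesParams {
  return { CollectionId: collectionId, Image: { Bytes: photo }, ExternalImageId: userId, MaxFaces: 1 };
}

// Matching: query with the face crop; ask for a handful of candidates.
function searchParams(collectionId: string, faceCrop: Uint8Array, threshold: number): SearchFacesByImageParams {
  return { CollectionId: collectionId, Image: { Bytes: faceCrop }, FaceMatchThreshold: threshold, MaxFaces: 5 };
}

// Highest-similarity match that carries an external ID, or null if none.
function bestMatch(matches: FaceMatch[]): Identified | null {
  let best: Identified | null = null;
  for (const m of matches) {
    const userId = m.Face?.ExternalImageId;
    const similarity = m.Similarity;
    if (userId === undefined || similarity === undefined) continue;
    if (best === null || similarity > best.similarity) best = { userId, similarity };
  }
  return best;
}
```

With the AWS SDK for JavaScript v3, objects of these shapes would be passed to IndexFacesCommand and SearchFacesByImageCommand from @aws-sdk/client-rekognition, and bestMatch would read the FaceMatches array of the search response. Taking the best match across all returned faces also handles people enrolled with several photos under the same external ID.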

Enrollment had its own UX. The customer admin uploaded a photo per person. We ran IndexFaces and tagged the face with the user ID. We also stored the original photo in S3 in case we needed to re-enroll later with better parameters. For people who would be seen from many camera angles, we encouraged 2-3 enrollment photos taken from different angles.

De-duplication

Without de-duplication, a person standing near a door would generate dozens of "entry events" as they were detected on every sampled frame. The dashboard would be unusable.

The rule I implemented - any match for the same person within a 30-second window at the same camera is the same event. The first match sets the event timestamp (it is the cleanest one). Subsequent matches within the window update the event with the highest-confidence frame seen so far and extend the window if needed.

I held the de-duplication state in memory per camera, with a TTL cache. On app restart, recent state was lost, which was fine - at worst it caused a duplicate event right after restart, never a missed one.
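
A minimal in-memory sketch of that rule. The names EntryEvent and Deduplicator are hypothetical, and "extend the window" is interpreted here as a sliding window measured from the most recent match, which is an assumption about the original behaviour.

```typescript
interface EntryEvent {
  personId: string;
  cameraId: string;
  firstSeenMs: number;    // timestamp of the first match; never changes
  lastSeenMs: number;     // most recent match; the window slides from here
  bestSimilarity: number; // highest similarity seen within the event
}

class Deduplicator {
  private open = new Map<string, EntryEvent>();
  private windowMs: number;

  constructor(windowMs: number = 30000) {
    this.windowMs = windowMs;
  }

  // Record a match. isNew is true only when a fresh event was opened.
  record(personId: string, cameraId: string, similarity: number, nowMs: number): { event: EntryEvent; isNew: boolean } {
    const key = cameraId + ":" + personId;
    const existing = this.open.get(key);
    if (existing !== undefined && nowMs - existing.lastSeenMs <= this.windowMs) {
      existing.lastSeenMs = nowMs;
      existing.bestSimilarity = Math.max(existing.bestSimilarity, similarity);
      return { event: existing, isNew: false };
    }
    const event: EntryEvent = {
      personId,
      cameraId,
      firstSeenMs: nowMs,
      lastSeenMs: nowMs,
      bestSimilarity: similarity,
    };
    this.open.set(key, event);
    return { event, isNew: true };
  }

  // TTL cleanup: forget events whose window has expired.
  sweep(nowMs: number): void {
    this.open.forEach((e, key) => {
      if (nowMs - e.lastSeenMs > this.windowMs) this.open.delete(key);
    });
  }
}
```

Passing the timestamp in, rather than reading the clock inside record(), keeps the logic testable and lets the caller use the frame capture time instead of the processing time.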

Threshold tuning

Rekognition returns a similarity score from 0 to 100. The question is - what is the cutoff for "this is a match" versus "this is an unknown"?

I started at 90 percent globally. Some customers needed 95 percent because they had a lot of physically similar people (a school class with similar age and ethnicity will have lower differentiation than a mixed office). Some customers worked at 85 percent because their populations were diverse.

The tuning workflow - for each new customer, run the system for a week in "shadow mode" (matches recorded but no alerts), have the customer manually classify a sample as match or not match, fit the threshold to their data. After a week of shadow mode I had per-customer thresholds I could trust.
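
The source does not say how the threshold was fitted to the labelled sample. One simple option, shown here purely as an assumption, is to pick the cutoff that minimizes the total number of false accepts plus false rejects. The names LabeledMatch and fitThreshold are hypothetical.

```typescript
// One shadow-mode match that the customer has manually classified.
interface LabeledMatch {
  similarity: number;   // Rekognition similarity, 0 to 100
  isTrueMatch: boolean; // customer's verdict
}

// Try each observed score as the cutoff and keep the one with the fewest
// misclassifications. Ties resolve to the lowest such cutoff.
function fitThreshold(samples: LabeledMatch[], fallback: number = 90): number {
  if (samples.length === 0) return fallback;
  const candidates = samples.map((s) => s.similarity).sort((a, b) => a - b);
  let best = fallback;
  let bestErrors = Infinity;
  for (const t of candidates) {
    let errors = 0;
    for (const s of samples) {
      const accepted = s.similarity >= t;
      if (accepted !== s.isTrueMatch) errors++;
    }
    if (errors < bestErrors) {
      bestErrors = errors;
      best = t;
    }
  }
  return best;
}
```

A deployment that cares more about false accepts than false rejects could weight the two error types differently instead of counting them equally.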

ONVIF and PTZ

ONVIF (Open Network Video Interface Forum) is the standard for IP camera control. Profile S covers basic streaming and PTZ. The protocol is SOAP over HTTP, which is the kind of sentence that should not exist in 2025 but here we are.

I wrote a thin ONVIF client. Not a full SDK - the existing Node ONVIF libraries had heavy dependencies and inconsistent vendor support. My client supported the 3 specific vendors we had in the field and the 4 operations that mattered - GetDeviceInformation, GetStreamUri, GetPTZConfiguration, ContinuousMove.
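
To give a sense of what "SOAP over HTTP" means in practice, here is a sketch of building a ContinuousMove request body. It is not the original client. The function names are hypothetical, authentication (for example a WS-Security UsernameToken header) is deliberately omitted, and velocities are assumed to use the normalized -1 to 1 range.

```typescript
// Keep velocities inside the assumed normalized range.
const clampUnit = (v: number): number => Math.max(-1, Math.min(1, v));

// Escape characters that would break the XML document.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a SOAP 1.2 envelope for the ONVIF PTZ ContinuousMove operation.
function continuousMoveEnvelope(profileToken: string, pan: number, tilt: number, zoom: number): string {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<s:Envelope xmlns:s="http://www.w3.org/2003/05/soap-envelope"',
    '            xmlns:tptz="http://www.onvif.org/ver20/ptz/wsdl"',
    '            xmlns:tt="http://www.onvif.org/ver10/schema">',
    "  <s:Body>",
    "    <tptz:ContinuousMove>",
    `      <tptz:ProfileToken>${escapeXml(profileToken)}</tptz:ProfileToken>`,
    "      <tptz:Velocity>",
    `        <tt:PanTilt x="${clampUnit(pan)}" y="${clampUnit(tilt)}"/>`,
    `        <tt:Zoom x="${clampUnit(zoom)}"/>`,
    "      </tptz:Velocity>",
    "    </tptz:ContinuousMove>",
    "  </s:Body>",
    "</s:Envelope>",
  ].join("\n");
}
```

The resulting string would be POSTed to the camera's PTZ service address with a SOAP content type. A thin client is mostly a handful of templates like this plus response parsing, which is where vendor quirks tend to show up.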

The PTZ math was the unexpected hard part. The customer wanted operators to click a face thumbnail on the dashboard and have the camera physically point at where that face was. That meant -

Given a face bounding box in the original frame, compute the (x, y) center.
Given the camera's known field of view and current pan-tilt, compute the pan and tilt delta to center the face.
Compute a zoom delta based on the face size relative to the frame (small face = zoom in more).
Send the deltas as a relative PTZ move.
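
The first three steps can be sketched with a pinhole camera model. This is an illustration, not the original math: it assumes no lens distortion, a known field of view at the current zoom level, and a first-order treatment of pan and tilt as independent. The names FaceBox, PtzDelta, and ptzDeltaForFace, and the 0.3 target face fraction, are hypothetical.

```typescript
interface FaceBox { x: number; y: number; width: number; height: number }
interface PtzDelta { panDeg: number; tiltDeg: number; zoomFactor: number }

const toRad = (deg: number): number => (deg * Math.PI) / 180;
const toDeg = (rad: number): number => (rad * 180) / Math.PI;

// Angular offsets that would center the face, plus a zoom multiplier that
// would make the face fill `targetFaceFraction` of the frame height.
function ptzDeltaForFace(
  face: FaceBox,
  frameWidth: number,
  frameHeight: number,
  hfovDeg: number,
  vfovDeg: number,
  targetFaceFraction: number = 0.3
): PtzDelta {
  // Offset of the face center from the image center, in [-0.5, 0.5].
  const u = (face.x + face.width / 2) / frameWidth - 0.5;
  // Image y grows downward, tilt grows upward, so flip the sign.
  const v = 0.5 - (face.y + face.height / 2) / frameHeight;

  // Pinhole model: angle = atan(2 * offset * tan(fov / 2)).
  const panDeg = toDeg(Math.atan(2 * u * Math.tan(toRad(hfovDeg) / 2)));
  const tiltDeg = toDeg(Math.atan(2 * v * Math.tan(toRad(vfovDeg) / 2)));

  // Small face means a large factor, that is, zoom in more.
  const zoomFactor = targetFaceFraction / (face.height / frameHeight);
  return { panDeg, tiltDeg, zoomFactor };
}
```

A face at the right edge of a 60 degree horizontal field of view comes out as roughly a 30 degree pan, which matches the intuition that the edge of the frame sits half a field of view away from the center. Converting these degrees and the zoom factor into the camera's own PTZ coordinate space is vendor-specific and is not shown.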

I learned a lot of trigonometry I had forgotten. The math worked. PTZ responses from cameras took 200ms to a few seconds depending on the camera, which set the UX expectation. The operator clicked, the dashboard showed "zooming...", the camera moved.

RTSP security and the LAN constraint

A note on security. RTSP authentication is weak. The credentials travel in the URL or in HTTP-style basic-auth or digest-auth headers, depending on the camera. RTSP over the public internet is a bad idea - credentials can leak, the protocol is not encrypted by default, and the bandwidth would kill the customer's office uplink.

My constraint - the Electron app always ran on the same LAN as the cameras. The only thing that crossed the public internet was the API calls to AWS (Rekognition, DynamoDB), which were authenticated with IAM and went over HTTPS. The video stayed local.

For customers who wanted remote monitoring (a security manager checking the dashboard from home), I shipped a web view of the dashboard hosted in AWS that read from DynamoDB. The video itself never left the customer's network.

Costs

Rekognition pricing is per API call plus per stored face. For our pilot scale (a few thousand faces total, a few thousand calls per day per customer) the cost was a few dollars per customer per month. Negligible compared to our pricing.

If we had scaled to dozens of cameras per customer and tens of thousands of daily entries, the cost picture would have changed and we would have needed to look harder at local inference. For pilots, hosted Rekognition was the right call.

What I would do differently

I would have added a recorder service from day one. We stored events but not the original frames that triggered events. When a customer disputed a match, we had no evidence. I added frame storage in month 4 - I should have done it in week 1.

I would have built a proper enrollment UI earlier. The first version was a CSV upload of (user ID, photo path) pairs, which was developer-friendly and customer-hostile. The second version was a drag-and-drop UI with face preview and angle suggestions. Customers loved it. Should have shipped that first.

What this taught me

Computer vision in production is 10 percent the model and 90 percent everything around the model - capture, sampling, pre-filtering, thresholding, de-duplication, alert routing, enrollment UX, threshold tuning, dispute handling. The model is a commodity now. The product is what you build on top.

That lesson applies to LLMs too. The frontier model is rented. The product is the structured workflow on top.

Learn more

AWS Rekognition Collections docs (AWS)
AWS Rekognition pricing (AWS)
ONVIF Profile S spec (ONVIF)
FFmpeg docs (FFmpeg)
RFC 7826 - RTSP 2.0 (IETF)
MediaPipe face detection (Google)