The other day when I was sitting in my home office, I got an alert from my Nest Doorbell that a package had been delivered — and right from my phone, I could see it sitting on the porch. Moments later, my neighbor dropped by to return a piece of mail that had accidentally gone to her — and again, my Doorbell alerted me. But this time, it alerted me that someone (rather than something) was at the door.
When I opened my door and saw my neighbor standing next to the package, I wondered…how does that little camera understand the world around it?
For an answer, I turned to Yoni Ben-Meshulam, a Staff Software Engineer who works on the Nest team.
Before I ask you how the camera knows what’s a person and what’s a vehicle, I want to get into how it detects anything at all.
Our cameras run something called a perception algorithm which detects objects (people, animals, vehicles, and packages) that show up in the live video stream. For example, if a package is delivered within one of your Activity Zones, like your porch, the camera will track the movement of the delivery person and the package, and analyze all of this to give you a package delivery notification. If you have Familiar Face Alerts on and the camera detects a face, it analyzes the face on-device and checks whether it matches anyone you have identified as a Familiar Face. And the camera recognizes new faces as you identify and label them.
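As a rough illustration of the Activity Zone idea (not Nest's actual code), the sketch below checks whether a detected package falls inside a zone. It assumes boxes and zones are rectangles written as (x_min, y_min, x_max, y_max) in pixels, and it looks at a single frame rather than tracking movement over time. The function names and the "center inside the zone" rule are hypothetical.

```python
def box_center(box):
    """Return the (x, y) center of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def in_zone(box, zone):
    """True if the box center lies inside the rectangular zone (assumed rule)."""
    cx, cy = box_center(box)
    zx_min, zy_min, zx_max, zy_max = zone
    return zx_min <= cx <= zx_max and zy_min <= cy <= zy_max


def package_alerts(detections, zone):
    """Keep only package boxes that fall inside the zone."""
    return [box for label, box in detections
            if label == "package" and in_zone(box, zone)]


# Hypothetical porch zone and one frame of (label, box) detections.
porch_zone = (100, 250, 500, 480)
detections = [
    ("person", (150, 100, 250, 400)),
    ("package", (200, 350, 260, 400)),   # center inside the porch zone
    ("package", (550, 350, 600, 400)),   # center outside the porch zone
]
alerts = package_alerts(detections, porch_zone)
```

Only the first package would trigger an alert here, because the second one sits outside the zone.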
The camera also learns what its surroundings look like. For example, if you have a Nest Cam in your living room, the camera runs an algorithm that can identify where there is likely a TV, so that the camera won’t think the people on the screen are in your home.
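One simple way to picture the TV case is to drop person detections that sit almost entirely inside a region already marked as a screen. This is only a sketch under assumptions: the article does not say how the camera finds or uses the TV region, and the 0.9 overlap threshold and all names below are made up for illustration.

```python
def area(box):
    """Area of a box (x_min, y_min, x_max, y_max); zero if the box is empty."""
    x_min, y_min, x_max, y_max = box
    return max(0, x_max - x_min) * max(0, y_max - y_min)


def intersection(a, b):
    """Overlapping rectangle of two boxes (may be empty)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def fraction_inside(box, region):
    """Share of the box's area that lies inside the region."""
    box_area = area(box)
    if box_area == 0:
        return 0.0
    return area(intersection(box, region)) / box_area


def drop_on_screen(person_boxes, tv_region, max_inside=0.9):
    """Discard person boxes that are (almost) entirely within the TV region."""
    return [b for b in person_boxes
            if fraction_inside(b, tv_region) < max_inside]


tv_region = (300, 100, 500, 220)            # hypothetical learned TV location
people = [
    (320, 110, 380, 210),                   # a person shown on the screen
    (50, 80, 200, 460),                     # a person standing in the room
]
real_people = drop_on_screen(people, tv_region)
```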
Perception algorithms sound a little like machine learning. Is ML involved in this process?
Yes. Nest cameras actually have multiple machine learning models running inside of them. One is an object detector that takes in video frames and outputs a bounding box around each object of interest, like a package or vehicle. This object detector was trained on millions of examples to learn that task.
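To make "outputs a bounding box" concrete, here is a minimal sketch of what a detector's per-frame output might look like and how low-confidence results could be filtered out. The `Detection` fields, the confidence score, and the 0.5 threshold are assumptions for illustration. The article does not describe the real output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str     # e.g. "person", "package", "vehicle"
    score: float   # assumed model confidence in [0, 1]
    box: tuple     # (x_min, y_min, x_max, y_max) in pixels


def keep_confident(detections, min_score=0.5):
    """Return detections whose score meets a hypothetical threshold."""
    return [d for d in detections if d.score >= min_score]


# Made-up output for one video frame.
frame_detections = [
    Detection("package", 0.91, (120, 300, 220, 380)),
    Detection("vehicle", 0.32, (400, 50, 640, 200)),
]
confident = keep_confident(frame_detections)
```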
Is there a difference between creating an algorithm for a security camera versus a “regular” camera?
Yes! A security camera is a different domain. Generally, the pictures you take on your phone are closer and the object of interest is better-focused. For a Nest camera, the environment is harder to control.
Objects may appear blurry due to lighting, weather or camera positioning. People usually aren’t posing or smiling for a security camera, and sometimes only part of an object, like a person’s arm, is in the frame. And Nest Cams analyze video in real time, versus some photo applications, which may have an entire video to analyze from start to finish.
Cameras also see the world in 2D but they need to understand it in 3D. That’s why a Nest Cam may occasionally mistake a picture on your T-shirt for a real event. Finally, a lot of what a security camera sees is boring because our doorsteps and backyards are mostly quiet, and there are fewer examples of activity. That means you may occasionally get alerts where nothing actually happened. In order for security cameras to become more accurate, we need more high-quality data to train the ML models on, and that’s one of the biggest challenges.
On the left, an image of a dog from a Nest Cam feed on a Nest Hub. On the right, a photo of a dog taken with a Pixel phone.
So basically…it’s harder to detect people with a security camera than with a handheld camera, like a phone?
In a word…yes. A model used for Google Image Search or Photos won’t perform well on Nest Cameras because the images used to train it were probably taken on handheld cameras, and those images are mostly centered and well-lit, unlike the ones a Nest Camera has to analyze.
Here’s an example of a synthesized image, with bounding boxes around synthetic cats.
So, we increased the size and diversity of our datasets that were appropriate for security cameras. Then, we added synthesized data — which ranges from creating a fully simulated world to putting synthetic objects on real backgrounds. With full simulation, we were able to create a game-like world where we could manipulate room layout, object placement, lighting conditions, camera placement, and more to account for the many settings our cameras are installed in. Over the course of this project, we created millions of images — including 2.5 million synthetic cats!
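The "synthetic objects on real backgrounds" approach can be sketched in a few lines: paste an object cutout onto a background at a random position and record where it landed. Because the position is chosen by the program, the bounding-box label is known without manual annotation. This toy version uses nested lists as images and `None` for transparent pixels. All of the names are hypothetical and it is not the pipeline described in the interview.

```python
import random


def paste_object(background, sprite, rng):
    """Paste sprite onto a copy of background at a random spot.

    Returns (image, box) where box is (x_min, y_min, x_max, y_max).
    """
    bg_h, bg_w = len(background), len(background[0])
    sp_h, sp_w = len(sprite), len(sprite[0])
    top = rng.randint(0, bg_h - sp_h)      # randint is inclusive at both ends
    left = rng.randint(0, bg_w - sp_w)
    image = [row[:] for row in background]  # leave the original untouched
    for dy in range(sp_h):
        for dx in range(sp_w):
            pixel = sprite[dy][dx]
            if pixel is not None:           # None marks a transparent pixel
                image[top + dy][left + dx] = pixel
    box = (left, top, left + sp_w, top + sp_h)
    return image, box


background = [[0] * 8 for _ in range(6)]    # stand-in for a real photo
cat_sprite = [[None, 9, None],
              [9, 9, 9]]                    # stand-in for a cat cutout
rng = random.Random(0)                      # seeded for repeatability
image, box = paste_object(background, cat_sprite, rng)
```

Varying the seed, the sprite, and the background would yield many labeled images from a small set of inputs.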
We also use common-sense rules when developing and tuning our algorithms — for example, heads are attached to people!
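A rule like "heads are attached to people" could be expressed as a consistency check between two kinds of detections. The sketch below keeps a head box only if it lies inside some person box. Treating "attached" as "contained in" is an assumption made for illustration, as are the function names.

```python
def contains(outer, inner):
    """True if the inner box lies entirely within the outer box."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def plausible_heads(head_boxes, person_boxes):
    """Keep head detections that belong to at least one detected person."""
    return [h for h in head_boxes
            if any(contains(p, h) for p in person_boxes)]


people = [(100, 50, 200, 400)]
heads = [
    (130, 55, 170, 100),    # inside the person box, so it is kept
    (400, 60, 440, 100),    # floating on its own, so it is dropped
]
kept = plausible_heads(heads, people)
```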
Our new cameras and doorbells also have hardware that can make the most of the improved software and they do machine learning on-device, rather than in the cloud, for added privacy. They have a Tensor Processing Unit (TPU) with 170 times more compute than our older devices—a fancy way of saying that the new devices have more accurate, reliable and timely alerts.
So, does this mean Nest Cam notifications are accurate 100% of the time?
No. We use machine learning to make Nest Cam notifications very accurate, but the technology isn’t perfect. Sometimes a camera could mistake a child crawling around on all fours for an animal or a statue for a real person, and sometimes the camera will miss things. The new devices have a significantly improved ability to catch previously missed events, but improving our models over time is a forever project.
One thing we’re working on is making sure our camera algorithms take data diversity into account across different genders, ages and skin tones with larger, more diverse training datasets. We’ve also built hardware that can accommodate these datasets, and process frames on-device for added privacy. We treat all of this very seriously across Google ML projects, and Nest is committed to the same.