Stephen Miller: How To Leverage Mobile Phones And 3D Data To Build Robust Computer Vision Systems


Podcast: Play in new window | Download

Subscribe: Google Podcasts | Spotify | Stitcher | TuneIn | RSS

Stephen Miller is the Cofounder and SVP Engineering at Fyusion Inc. He conducted research in 3D perception and computer vision with Profs. Sebastian Thrun and Vladlen Koltun while at Stanford University. He specializes in AI and robotics, including two years of undergraduate research with Prof. Pieter Abbeel.

Please support this podcast by checking out our sponsors:

Episode Links:

Stephen Miller’s LinkedIn:

Stephen Miller’s Twitter:

Stephen Miller’s Website:

Podcast Details:

Podcast website:

Apple Podcasts:



YouTube Full Episodes:

YouTube Clips:

Support and Social Media:

– Check out the sponsors above; it’s the best way to support this podcast

– Support on Patreon:

– Twitter:

– Instagram:

– LinkedIn:

– Facebook:

– HumAIn Website Articles:


Here are the timestamps for the episode:

(00:00) – Introduction

(01:42) – Started in robotics around 2010, training robots to perform human tasks (surgical suturing, laundry folding). The clearest bottleneck was not “How do we get the robot to move properly?” but “How do we get the robot to understand the 3D space it operates in?”

(04:05) – The Deep Learning revolution around that era was very focused on 2D images. But it wasn’t always easy to translate those successes into real world systems: the world is not made up of pixels; it’s made up of physical objects in space.

(06:57) – When the Microsoft Kinect came out, I became excited about the democratization of 3D, and the possibility that better data was available to the masses. Intuitive data can help us build solutions more confidently: it’s easier to validate when something fails, and easier to give consistent results.
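As an illustration of why per-pixel depth from a sensor like the Kinect is so much more intuitive than raw 2D images, here is a minimal sketch of back-projecting a depth map into a 3D point cloud with the standard pinhole camera model. The function name and the intrinsics (`fx`, `fy`, `cx`, `cy`) are illustrative, not taken from any Fyusion or Kinect API.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (meters) into 3D points.

    Pinhole model: X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy.
    (All names here are illustrative, not a real sensor API.)
    """
    h, w = depth.shape
    # Pixel coordinate grids: u varies along columns, v along rows
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    # Stack into an (H*W, 3) array and drop invalid (zero-depth) pixels
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]

# Toy example: a flat wall 2 m away, seen by a 4x4 "camera"
depth = np.full((4, 4), 2.0)
cloud = depth_to_point_cloud(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(cloud.shape)  # (16, 3); every point sits at z == 2.0
```

Each pixel becomes a physical point in space, which is exactly the shift Miller describes: reasoning about objects in 3D rather than pixels in 2D.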

(09:20) – Academia is a vital engine for moving technology forward. In hindsight, for instance, those early days of Deep Learning — one or two layers, evaluating on simple datasets — were crucial to ultimately advancing the state of the art we see today.

(14:48) – Now that Machine Learning is becoming increasingly commodified, we are starting to see a growing demand for people who can bridge that gap on both sides: conferences requiring code submissions alongside a paper, companies encouraging their engineers to take online ML courses, etc.

(17:41) – As we finally start to see real-time computer vision productized for mobile phones, it raises the question: won’t this exacerbate the digital divide? Flagship devices, always-on network connectivity: whether computing on the edge or in the cloud, there is going to be a disparity.

(20:33) – Because of this, I think the ideal model is to treat AI as one tool among many in a hybrid system. Think smart autocomplete, as opposed to automatic novel writing. AI as an assistant to a human expert: freeing them from the minutia so they can focus on high-level questions; aggregating noise so they can be more consistent and efficient.

(23:08) – Computer Vision has gone through a number of hype cycles in the last decade: real-time recognition, real-time reconstruction, etc. But the showiest of these ideas rarely seem to leave the realm of gaming or tech demos. I suspect this is because many of these ideas require a certain level of perfection to be valuable. It’s easy to imagine replacing my eyes with something that works 100% of the time. But what about 90%? At what point does the hassle of figuring out whether I’m in the 10% bucket or the 90% bucket outweigh the convenience?