I've been working on AR and related technologies for almost a decade, and I was part of the first handful of people working on Google Glass. Bottom line: I've seen a lot of promising AR technologies come and go.
My personal take on this is that they may indeed have some very good, if not revolutionary, display technology. However: the big, big obstacle to delivering credible AR is latency. Unlike VR, true see-through AR needs total latencies (device motion --> display photon hits the retina) of no more than 10-15 ms. The reason is that in see-through AR you're essentially competing against the human visual system in latency, and the HVS is very fast.
Moreover, the HVS is also extremely good at separating visual content into "layers". Whenever two things in your field of view don't move in perfect continuity with their surroundings (as is the case when AR content is overlaid with latency), your brain will immediately separate them from one another, creating the impression of layers and, in the case of see-through AR, breaking the AR illusion.
So right now I'm a semi-believer. Iff they can sort out the latency problem and deliver stable yet ultrafast tracking in a wide variety of conditions (also far from a trivial problem), then this has a bright future.
The first iteration of a good AR system could simply sidestep the latency issue by embracing layers.
Magic Leap should skip the fancy stuff (mixing virtual scenes with the real world), at least at first, and focus on the many other useful features of a great head-mounted display system - think mobile notifications, video calls, a web browser, etc.
It could easily replace smart watches and later cell phones and computer monitors without solving the latency issue.
Would it be possible to artificially delay the world by 15-ish ms? A person would have to wear a full headset (so it'd be more like VR than AR), but perhaps it could deliver a time-delayed view of the world only once the augmented pieces are ready to render.
Edit: you'd still have the motion-sickness challenge, but perhaps at least the 'layers', so-to-speak, wouldn't appear separately.
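For what it's worth, here's a minimal sketch of that delayed-compositing idea, assuming a pass-through (camera-based) headset; all the names and the 15 ms figure are placeholders of mine, not any real device API:

    import collections
    import time

    import numpy as np

    DELAY_S = 0.015                    # artificial delay applied to the real-world feed
    frame_queue = collections.deque()  # (capture_time, camera_frame) pairs, oldest first

    def render_ar_layer(capture_time):
        # Stand-in renderer: returns an RGBA overlay matching the capture timestamp.
        return np.zeros((480, 640, 4), dtype=np.uint8)

    def composite(frame, overlay):
        # Alpha-blend the AR overlay onto the (delayed) camera frame.
        alpha = overlay[..., 3:4] / 255.0
        return (frame * (1.0 - alpha) + overlay[..., :3] * alpha).astype(np.uint8)

    def push_camera_frame(frame):
        frame_queue.append((time.monotonic(), frame))

    def pop_displayable_frame():
        # Release a frame only once its delay has elapsed, so the AR layer rendered
        # for that same timestamp can be composited onto it; real and virtual content
        # then move together, at the cost of delaying the whole view.
        if frame_queue and time.monotonic() - frame_queue[0][0] >= DELAY_S:
            capture_time, frame = frame_queue.popleft()
            return composite(frame, render_ar_layer(capture_time))
        return None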
No. The important thing is keeping your sensory inputs in sync with your vestibular system. There were some research questions about hacking the vestibular system a few years ago.
But in VR we can have even lower latencies for synthetic content.
Because we have the head tracker's recent history, we use prediction on the pose trajectory and can effectively know where the head pose will be at the time the currently rendered frame will be displayed, and then use that predicted pose to render the scene. That type of optimization won't be possible with see-through VR or AR.
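A toy version of that prediction step, just to illustrate the idea; real trackers filter many samples (Kalman or similar) and work with full quaternion orientation, so this constant-velocity, yaw-only version is only a sketch:

    import numpy as np

    def predict_pose(t0, pos0, yaw0, t1, pos1, yaw1, t_display):
        # (t0, pos0, yaw0) and (t1, pos1, yaw1) are the two most recent tracker
        # samples (t1 > t0); t_display is when the frame will actually be lit up.
        dt = t1 - t0
        lin_vel = (pos1 - pos0) / dt      # m/s
        ang_vel = (yaw1 - yaw0) / dt      # rad/s
        lead = t_display - t1             # how far into the future we must predict
        return pos1 + lin_vel * lead, yaw1 + ang_vel * lead

    # Tracker samples 5 ms apart, frame reaches the display ~20 ms after the newest
    # sample: render with the predicted pose, not the latest measured one.
    pred_pos, pred_yaw = predict_pose(0.000, np.array([0.0, 1.6, 0.0]), 0.10,
                                      0.005, np.array([0.001, 1.6, 0.0]), 0.11,
                                      0.025)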
The second optimization is timewarp, where the rendered scene is distorted in screen space after the fact, based on post-render tracker data (just a few ms before display). I wonder if that type of optimization would create artifacts in AR.
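A rotation-only, small-angle sketch of what timewarp does; real implementations reproject with a homography on the GPU and handle both axes plus lens distortion, and none of the numbers here come from any actual headset:

    import numpy as np

    H_FOV_RAD = np.deg2rad(90.0)   # assumed horizontal field of view

    def timewarp_yaw(image, yaw_at_render, yaw_latest):
        # Shift the already-rendered frame sideways to compensate for the yaw change
        # measured between render time and just before scan-out. The sign of the
        # shift depends on your camera/axis convention.
        height, width = image.shape[:2]
        pixels_per_radian = width / H_FOV_RAD
        shift = int(np.clip((yaw_latest - yaw_at_render) * pixels_per_radian,
                            -width, width))
        warped = np.zeros_like(image)
        if shift >= 0:
            warped[:, shift:] = image[:, :width - shift]
        else:
            warped[:, :width + shift] = image[:, -shift:]
        return warped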
Since you're an expert: what about these videos is hard? The things that jumped out at me are:
1 - the robot moving behind the table leg (ie you have to do depth recognition of objects in the scene)
2 - the user's hand interacting with the artificial elements in the scene. Some code had to recognize a hand and figure out which element it was touching.
What strikes you as the hard parts of those videos besides the real-time requirement?
Well, the second video is a mock-up. In the first video, notice that a) the virtual objects are floating in space and b) the camera motion is very smooth. This is how they sidestep the "layering problem" in the video. The desk leg occluding the robot is probably done using a depth sensor.
These two things are non-trivial, but not particularly hard in themselves. Doing them at ultra-low latency, however, becomes quite a challenge. Doing anything at ultra-low latency is already a challenge, but especially so when what you're trying to do is run a deep neural net for entity recognition or gesture recognition.
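On the desk-leg point, a minimal sketch of depth-based occlusion, assuming you have a per-pixel real-world depth map from the sensor and a depth buffer for the rendered content (units and array layouts are my own assumptions):

    import numpy as np

    def composite_with_occlusion(real_rgb, real_depth, virt_rgba, virt_depth):
        # Draw a virtual pixel only where it is both non-transparent and closer to
        # the viewer than the measured real-world surface, so nearer real geometry
        # (e.g. a table leg) hides the virtual robot behind it.
        visible = (virt_rgba[..., 3] > 0) & (virt_depth < real_depth)
        out = real_rgb.copy()
        out[visible] = virt_rgba[..., :3][visible]
        return out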
Training an ANN is computationally intensive; using a trained ANN is not. No context switching for system calls, no memory management, just matrix math.
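To make the "just matrix math" point concrete, here's a toy fully connected forward pass; the sizes are arbitrary and the weights are random stand-ins for trained ones:

    import numpy as np

    rng = np.random.default_rng(0)
    # Pretend these came out of training; at inference time they are just constants.
    W1, b1 = rng.standard_normal((256, 784)), rng.standard_normal(256)
    W2, b2 = rng.standard_normal((10, 256)), rng.standard_normal(10)

    def infer(x):
        h = np.maximum(W1 @ x + b1, 0.0)   # hidden layer: matrix multiply + ReLU
        return W2 @ h + b2                 # output layer: matrix multiply + bias

    scores = infer(rng.standard_normal(784))   # one fixed-size input in, ten scores out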
Well, first you need to know which image regions to feed to the ANN, and that can involve some segmentation and pre-recognition; otherwise you're going to be evaluating the net at every feasible subwindow, and that's a LOT of matrix math. A very big GPU can help, but GPUs have latency of their own, and FPGAs at that performance level are inordinately expensive.
Done at scale, though, ASICs seem like the sure-to-work way.
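Back-of-envelope for the "LOT of matrix math" point; every number here is assumed for illustration, not measured:

    # Sliding-window evaluation over one camera frame.
    frame_w, frame_h = 640, 480
    win, stride = 64, 8                 # assumed window size and step, in pixels
    flops_per_window = 50e6             # assumed cost of a small CNN on a 64x64 crop
    windows = ((frame_w - win) // stride + 1) * ((frame_h - win) // stride + 1)
    gflops_per_frame = windows * flops_per_window / 1e9
    print(windows, "windows,", gflops_per_frame, "GFLOPs per frame before any pruning")

Under those assumptions that's roughly 3,900 windows and ~190 GFLOPs per frame, i.e. on the order of 10 TFLOP/s at 60 fps, which is why you either prune candidate regions first or throw dedicated silicon at it.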
I'd be very surprised if a modern CPU couldn't handle the task, especially if you were clever about detecting regions of interest, predicting head movement, and cache maintenance. But I'd also be surprised if they went to market with an x86 under the hood.
I remember reading a while ago about how smart TVs were using ANNs for upscaling, so it has been done at scale. rimshot
(1) TVs don't have strict latency requirements. I've heard latencies of 100 ms are common.
(2) Upscaling ANNs process a rather small image neighborhood radius, and the required processing power is on the order of O(r² * log r). If a minimally recognizable cat is 50x50 px and for upscaling you use a very large window of 16x16, that's 14 times the work already.
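Spelling out where that "14 times" comes from under the r² * log r scaling above (base-2 logs assumed):

    # Per-window cost ratio: 50 px recognition window vs 16 px upscaling window.
    from math import log2
    ratio = (50**2 * log2(50)) / (16**2 * log2(16))   # ~13.8, i.e. roughly 14x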
Latencies of 100 ms may be common because TVs don't have strict latency requirements.
16x16 is a very small window; I have no idea what they're using for TVs, but 128 isn't uncommon in post-production ANN upscaling. Also consider that ANNs have not received anywhere close to the level of optimization attention that compilers have, so there is a lot of potential slack to be taken up if real-time processing demands it.
1 - or have a premade 3D environment model and do accurate position tracking. Position tracking is a LOT easier to do in real time.
2 - bullshit CGI "this is how we hope it would look if it were real" demo
A few months ago their apparatus was one color only, stationary, and the size of a desk. Now all of a sudden it can be strapped to a camera and does color? Color me sceptical :(