HUD Companion

Well, I’ve been busy reading Szeliski, refreshing calculus and linear algebra. The good old 3Blue1Brown has fantastic videos on all these topics, perfect for convincing yourself that this time you really got it. As I go through the book, ideas keep popping up. It’s also hilarious how every chapter follows the same formula:

  • Dozens of pages of complicated math and smart classical methods
  • Deep learning shows up and solves it with good vibes.

What a time to be alive.

But this post isn’t about that.

I stumbled on this gem:

This is ridiculously cool. Naturally, I was tempted to reproduce it!


And so...

Attempt 1: Track the Apple Watch

I hadn’t touched ARKit in a few years, but surely object detection could find a rectangular glowing device strapped to my wrist?

Nope.

Turns out:

  • The Apple Watch is too reflective and glossy.
  • I occlude it with my hand all the time.
  • Apple has been busy with Apple Intelligence.

Image detection was also a failure. My watch stubbornly refused to identify itself.

Attempt 2: Hand tracking

Fine. If I can’t reliably detect the watch, I’ll surely be able to detect the hand the watch is attached to, right? Right?

Well, it turns out that ARKit on iOS does not give you 6DoF hand skeletons (visionOS does, because of course it does).

Buut! iOS does have the Vision hand-pose estimation API. The Vision framework is Apple’s on-device CV toolkit for tasks like detection; in particular, VNDetectHumanHandPoseRequest runs a model that identifies human hands in a frame and returns up to 21 labeled 2D landmarks per hand (4 per finger, plus the wrist).
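
A minimal sketch of what a per-frame call looks like (the request and joint names are from the Vision API; the orientation and the 0.3 confidence cutoff are my assumptions, not anything from the original app):

```swift
import Vision

// Sketch: detect one hand in a camera frame and return the thumb IP joint.
// `pixelBuffer` would come from ARFrame.capturedImage in a real session.
func detectThumbIP(in pixelBuffer: CVPixelBuffer) throws -> CGPoint? {
    let request = VNDetectHumanHandPoseRequest()
    request.maximumHandCount = 1  // we only care about the hand wearing the watch

    // Orientation is an assumption for a portrait session; adjust as needed.
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .right)
    try handler.perform([request])

    guard let hand = request.results?.first else { return nil }
    // Landmarks come back in normalized image coordinates (origin bottom-left).
    let thumbIP = try hand.recognizedPoint(.thumbIP)
    guard thumbIP.confidence > 0.3 else { return nil }
    return thumbIP.location
}
```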


So the pipeline became:

  1. Run hand-pose detection on every frame (2D keypoints).
  2. Take the relevant joint pixel position.
  3. Unproject into 3D using the ARKit camera intrinsics.
  4. Discover that the depth estimation is… enthusiastic, but not correct.
  5. Realize that ARKit gives you LiDAR depth per frame.
  6. Sample the depth at the joint locations instead.
  7. Magic happens. Things stop vibrating like jaws in Berghain at 5am.
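
Steps 2–6 boil down to pinhole unprojection: take the joint’s pixel (u, v), read the LiDAR depth d at that pixel, and map it through the intrinsics to the camera-space point ((u - cx) · d / fx, (v - cy) · d / fy, d). A portable sketch (the intrinsic values in the usage example are stand-ins for what ARCamera.intrinsics would give you, and it ignores the image-vs-camera y-axis flip for simplicity):

```swift
// Pinhole camera intrinsics: in ARKit these come from ARCamera.intrinsics,
// and `depth` is sampled from the sceneDepth (LiDAR) map at the same pixel.
struct Intrinsics {
    let fx: Double, fy: Double  // focal lengths in pixels
    let cx: Double, cy: Double  // principal point in pixels
}

// Unproject a 2D pixel plus a metric depth into a 3D camera-space point.
func unproject(u: Double, v: Double, depth: Double, k: Intrinsics)
    -> (x: Double, y: Double, z: Double) {
    let x = (u - k.cx) / k.fx * depth
    let y = (v - k.cy) / k.fy * depth
    return (x, y, depth)
}

// Example with made-up 1080p-ish intrinsics: a pixel at the principal
// point unprojects straight down the optical axis.
let k = Intrinsics(fx: 1000, fy: 1000, cx: 960, cy: 540)
let p = unproject(u: 960, v: 540, depth: 2.0, k: k)
```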

I originally tried using the wrist joint, but switched to the thumb (IP) joint so I could hold the device:

Pretty good, no?

After rewatching the dmvrg video, I'm pretty sure he's also tracking the interphalangeal thumb joint.

His demo hides it with excellent skill.



This could get interesting with AR glasses

The more I played with this, the more I realized:

This would be ridiculously cool paired with Vision Pro or any AR glasses.

Tiny HUDs anchored to real world objects.

"ObjectHUD" Protocol

A discovery + pose + HUD system for arbitrary real-world objects


We already have human-pose estimation, hand tracking, and environment mapping. But what if objects could advertise HUDs?

Your AR glasses would:

  1. Discover objects on the local network

    via mDNS or BLE advertisements: "I support ObjectHUD"

  2. Confirm identity invisibly

    using UWB, BLE direction-finding, IR-reflective markers, RFID signatures...

    basically anything invisible to humans but readable by sensors.

  3. Compute object pose

    by fusing:

    • UWB distance and angle
    • SLAM camera pose
  4. Spawn a contextual HUD when you glance at the object

    The HUD is literally a webpage in a 3D plane, anchored to the object’s pose.
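
Step 1 could be as small as a Bonjour browse. A sketch of the glasses side using Apple’s Network framework (the service type "_objecthud._tcp" is invented for this hypothetical protocol; nothing registers it today):

```swift
import Network

// Sketch: AR glasses browsing the local network for ObjectHUD advertisers.
// "_objecthud._tcp" is a made-up Bonjour service type for this protocol idea.
let browser = NWBrowser(for: .bonjour(type: "_objecthud._tcp", domain: nil),
                        using: .tcp)
browser.browseResultsChangedHandler = { results, _ in
    for result in results {
        // Each result is an object claiming "I support ObjectHUD".
        // Identity confirmation (UWB / IR / RFID) would happen after this.
        print("found:", result.endpoint)
    }
}
browser.start(queue: .main)
```

The advertising side is symmetric: the object runs an NWListener whose `service` is an `NWListener.Service(name:type:)` with the same made-up type, and could carry its HUD URL in the service’s TXT record.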


Examples:

  • Look at your Apple Watch → notifications, timers, heart rate.
  • Look at your keyboard → battery level.
  • Look at your router → red screen (internet in SF is bad).
  • Look at your fridge → "pls don't open me again, there are no snacks"

Properly designing and implementing this protocol is for another time.
